Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
451 views
in Technique[技术] by (71.8m points)

dplyr - Using boot::boot() function with grouped variables in R

This is a question both about using the boot() function with grouped variables, but also about passing multiple columns of data into boot. Almost all examples of the boot() function seem to pass a single column of data to calculate a simple bootstrap of the mean.

My specific analysis is trying to use the stats::weighted.mean(x,w) function which takes a vector 'x' of values to calculate the mean and a second vector 'w' for weights. The main point is that I need two inputs into this function - and I'm hoping the solution will generalize to any function that takes multiple arguments.

I'm also looking for a solution to use this weighted.means function in a dplyr style workflow with group_by() variables. If the answer is that "it can't be done with dplyr", that's fine, I'm just trying to figure it out.

Below I simulate a dataset with three groups (A,B,C) that each have different ranges of counts. I also attempt to come up with a function "my.function" that will be used to bootstrap the weighted average. Here might be my first mistake: is this how I would set up a function to pass in the 'count' and 'weight' columns of data into each bootstrapped sample? Is there some other way to index the data?

Inside the summarise() call, I reference the original data with "." - Possibly another mistake?

The end result shows that I was able to achieve appropriately grouped calculations using mean() and weighted.mean(), but the calls for confidence intervals using boot() have instead calculated the 95% confidence interval around the global mean of the dataset.

Suggestions on what I'm doing wrong? Why is the boot() function referencing the entire dataset and not the grouped subsets?

library(tidyverse)
library(boot)


set.seed(20)

sample.data = data.frame(letter = rep(c('A','B','C'),each = 50) %>% as.factor(),
                         counts = c(runif(50,10,30), runif(50,40,60), runif(50,60,100)),
                         weights = sample(10,150, replace = TRUE))



##Define function to bootstrap
  ##I'm using stats::weighted.mean() which needs to take in two arguments

##############
my.function = function(data,index){

  d = data[index,]  #create bootstrap sample of all columns of original data?
  return(weighted.mean(d$counts, d$weights))  #calculate weighted mean using 'counts' and 'weights' columns
  
}

##############

## group by 'letter' and calculate weighted mean, and upper/lower 95% CI limits

## I pass data to boot using "." thinking that this would only pass each grouped subset of data 
  ##(e.g., only letter "A") to boot, but instead it seems to pass the entire dataset. 

sample.data %>% 
  group_by(letter) %>% 
  summarise(avg = mean(counts),
            wtd.avg = weighted.mean(counts, weights),
            CI.LL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[4],
            CI.UL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[5])

summarize results

And below I've calculated a rough estimate of 95% confidence intervals around the global mean to show that this is what was going on with boot() in my summarise() call above

#Here is a rough 95% confidence interval estimate as +/-  1.96* Standard Error


mean(sample.data$counts) + c(-1,1) * 1.96 * sd(sample.data$counts)/sqrt(length(sample.data[,1]))



enter image description here


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The following base R solution solves the problem of bootstrapping by groups. Note that boot::boot is only called once.

library(boot)

sp <- split(sample.data, sample.data$letter)
y <- lapply(sp, function(x){
  wtd.avg <- weighted.mean(x$counts, x$weights)
  basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
  CI.LL <- basic[4]
  CI.UL <- basic[5]
  data.frame(wtd.avg, CI.LL, CI.UL)
})

do.call(rbind, y)
#   wtd.avg    CI.LL    CI.UL
#A 19.49044 17.77139 21.16161
#B 50.49048 48.79029 52.55376
#C 82.36993 78.80352 87.51872

Final clean-up:

rm(sp)

A dplyr solution could be the following. It also calls map_dfr from package purrr.

library(boot)
library(dplyr)

sample.data %>%
  group_split(letter) %>% 
  purrr::map_dfr(
    function(x){
      wtd.avg <- weighted.mean(x$counts, x$weights)
      basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
      CI.LL <- basic[4]
      CI.UL <- basic[5]
      data.frame(wtd.avg, CI.LL, CI.UL)
    }
  )
#   wtd.avg    CI.LL    CI.UL
#1 19.49044 17.77139 21.16161
#2 50.49048 48.79029 52.55376
#3 82.36993 78.80352 87.51872

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
...