Relevant reading for this problem set: ModernDive Chapter 8: Bootstrapping and Confidence Intervals.
Please indicate who you collaborated with on this problem set:
In this problem set we will use a small sample of data from the General Social Survey. The survey is designed to monitor changes in both social characteristics and attitudes. You will work with a sample from one neighborhood. The full neighborhood of ALL individuals is the population. For this problem set we do not know the true population parameters for any of the variables, because we do not have data on every person in the neighborhood.
First load the necessary packages
# Recall that loading the tidyverse "umbrella" package loads ggplot2, dplyr, and
# readr all at once. Feel free to load these packages any way you choose.
library(tidyverse)
library(moderndive)
Next load the data set, from where it is stored on the web:
gss_sample <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSypSoDCMH2N76Vo2dZRPkw2q3t1mbvAXlOtgPDIsHg4NclAQFmER-BdvXH9_lrT40UQCVdPXOi_NMJ/pub?gid=257689625&single=true&output=csv")
Be sure to take a look at the data in the viewer. Each row in the data set is a person that was surveyed (100 rows or cases in total). The variables in the data set include each respondent’s age
, race
, and number of hours of TV watched a day tvhours
.
Setting a seed: We will take some random samples and build sampling distributions in this lab. In order to make sure R takes the same random sample every time you run your code, you can do what is called “setting a seed”. Do this in any code chunk that you take a random sample!
You can set a seed like so. Any number will do. (You do not need to run this right now…just showing you how)
set.seed(45)
The following code tells R to take 1000 bootstrap resamples from the gss_sample
data. You can set the seed to whatever value you like!
set.seed(42)
boot_samp_1000 <- gss_sample %>%
rep_sample_n(size = 100, reps = 1000, replace = TRUE)
Note a few important details about the rep_sample_n
function, and bootstrap sampling in general:
size = 100
tells R that each bootstrap resample we take has 100 cases… the size of the original samplereps = 1000
tells R to take 1000 bootstrap resamples (each of size 100).replace = TRUE
argument tells R that in each bootstrap resample, we can include a row from gss_sample
multiple times. So if for instance, respondent # 12 is the first random resample taken here, respondent 12 is still available to be resampled again at random. Thus, some people may appear multiple times in our bootstrap resample, and some people from the original data set may not appear at all.boot_samp_1000
.Take a look at the boot_samp_1000
data frame we just generated in RStudio’s data viewer. Note that the replicate
column labels each bootstrap resample (the first 100 rows are labeled 1, the next 100 rows are labeled 2, etc.)
How many rows does boot_samp_1000
have? Why?
Answer:
Let’s say we want to use the bootstrap resample that we just generated to calculate a confidence interval for the population mean \(\mu_{tv}\) of tvhours
. To do so, we need to know the sample mean \(\bar{x}\) of tvhours
for each of the 1000 bootstrap resamples. In this case, the sample mean \(\bar{x}\) of tvhours
for each bootstrap resample is our BOOTSTRAP STATISTIC. We can calculate that with two lines of code, like so:
boot_distrib_tv <- boot_samp_1000 %>%
group_by(replicate) %>%
summarize(boot_stat = mean(tvhours))
Note that:
group_by()
argument tells R to take the sample mean of tvhours
separately for each different replicate
in the bootstrap resample.boot_stat
This is the bootstrap distribution for the mean of
tvhours
!
Take a look at the boot_distrib_tv
we just created in RStudio’s data viewer.
How many values of the bootstrap statistic boot_stat
are there? Please explain why there are this many values of the bootstrap statistic.
Answer:.
The bootstrap distribution is shown in the figure below. This is a histogram of the boot_stat
values from boot_distrib_tv
.
We can now use the bootstrap distribution for the sample mean tvhours
\(\bar{x}\) to calculate a 95% confidence interval for the population mean tvhours
\(\mu_{tv}\), using the “95% rule for bell shaped distributions”, which states that the middle 95% of values of a bell/normal shaped distribution are between
\[\text{mean} \pm 1.96 \cdot SD\]
We can thus apply the 95% rule, like so:
boot_distrib_tv %>%
summarize(mean = mean(boot_stat),
se = sd(boot_stat),
lower_ci = mean - (1.96 * se),
upper_ci = mean + (1.96 * se))
mean | se | lower_ci | upper_ci |
---|---|---|---|
3.14418 | 0.3651603 | 2.428466 | 3.859894 |
You can also calculate a 95% confidence interval using the percentile method. The logic goes like this:
Since our bootstrap resample had 1000 values of boot_stat
:
boot_stat
values fall inside this 95% confidence interval, i.e. 95%totaling 100%. We can use the quantiles of the bootstrap distribution to find these values like so:
prct_ci_tv <- boot_distrib_tv %>%
summarize(lower_ci = quantile(boot_stat, 0.025),
upper_ci = quantile(boot_stat, 0.975))
prct_ci_tv
lower_ci | upper_ci |
---|---|
2.51 | 3.89 |
This method
boot_stat
fall (or 25 cases in this example… 25/1000 = 0.025)boot_stat
fall (or 25 cases in this example 975/1000 = 0.975)Based on these results, we are 95% confident that the true mean hours of TV watched \(\mu_{tv}\) in the population is between the upper and lower CI we just calculated.
The bootstrap distribution and the 95% confidence intervals we just calculated are shown in the figure below. This is a histogram of the boot_stat
values from boot_distrib_tv
. The green line is the lower bound of the 95% CI, and the blue line is the upper bound. 950 of the 1000 bootstrap resamples had a mean for tvhours
that fell between the green and blue lines…25 of the samples had a mean above the blue line, and 25 of the samples had a mean below the green line.
If we calculated a 90% confidence interval for the mean of tvhours
using this same bootstrap resample and the percentile method, roughly how many of the 1000 values of tv_mean
would fall between the green and blue lines?
Answer:
Use the bootstrap resampling distribution for tvhours
generated above (boot_distrib_tv
) and the percentile method to calculate a 99% confidence interval for the mean tvhours
.
Which confidence interval is WIDER: the 95% confidence interval or the 99% confidence interval for the population mean tvhours
\(\mu_{tv}\)? Why is this the case? Answer in terms of the trade-off between confidence level and confidence interval width.
Answer:
Use the bootstrap resample we generated above (boot_samp_1000
), to generate a bootstrap distribution for the sample mean respondent age
\(\bar{x}\) instead of tvhours
. Please be sure to name it something different than the bootstrap distribution for the sample mean of tvhours
Calculate a 95% confidence interval for the population mean respondent age
\(\mu_{age}\) using the 95% rule method.
Calculate a 95% confidence interval for the population mean respondent age
\(\mu_{age}\) using the percentile method.
How do the 95% confidence intervals you calculated in 6b and 6c compare? i.e. are the 95% CI values similar or are they pretty different?
Answer:
Use the bootstrap resampling distribution for the sample mean respondent age
and the percentile method to calculate an 80% confidence interval for the population mean respondent age \(\mu_{age}\).
The procedure for generating a bootstrap sampling distribution is VERY similar for categorical data. As an example we will generate a bootstrap sampling distribution for the proportion of respondents that identified as a Person of Color.
We already did this above! We can use the same boot_samp_1000
as before.
boot_distrib_POC <- boot_samp_1000 %>%
group_by(replicate) %>%
summarize(n = n(),
POC_count = sum(race == "POC"),
boot_stat = POC_count/n)
Note that with a categorical variable, the code differs in two important respects now:
race == "POC"
in each replicate.The following will calculate the 95% confidence interval for the proportion of people that identified as POC
using the 95% rule.
boot_distrib_POC %>%
summarize(mean = mean(boot_stat),
se = sd(boot_stat),
lower_ci = mean - (1.96 * se),
upper_ci = mean + (1.96 * se))
mean | se | lower_ci | upper_ci |
---|---|---|---|
0.24053 | 0.0421354 | 0.1579446 | 0.3231154 |
The following will calculate the 95% confidence interval for the proportion of people that identified as “POC” using the percentile method.
boot_distrib_POC %>%
summarize(lower_ci = quantile(boot_stat, 0.025),
upper_ci = quantile(boot_stat, 0.975))
lower_ci | upper_ci |
---|---|
0.16 | 0.33 |
Calculate a 95% CI for the population proportion of respondents \(p\) that identified as White using EITHER the percentile or 95% rule method. Note that you will first need to generate the bootstrap distribution for the proportion of respondents that identified as White
.
As described in moderndive chapter 8.7.2, not only can we generate confidence intervals using a computer/resampling as we’ve been doing until now, in many cases there also exists a mathematical formula! This however necessitates a little mathematical/probability theory; a topic we leave to a more advanced statistics class.
To generate a 95% confidence interval based on the theoretical normal distribution, we can use the following formula:
\[ \text{point estimate} \pm 1.96 \cdot \widehat{SE}\]
So, for instance if we wanted to calculate the 95% confidence interval for the population mean of tvhours
\(\mu_{tv}\) that respondents watched based on our sample:
tvhours
from the sample: \(\bar{x}\)\[\widehat{SE} \approx \frac{s}{\sqrt{n}}\]
where \(s\) is the sample standard deviation, which is a point estimate of the population standard deviation \(\sigma\).
Thus a 95% CI would be
\[ \bar{x} \pm 1.96 \cdot \widehat{SE} = \bar{x} \pm 1.96 \cdot \frac{s}{\sqrt{n}} \]
We can make these calculations in R like so:
gss_sample %>%
summarize(x_bar = mean(tvhours),
sd = sd(tvhours),
n = n(),
se = sd/sqrt(n),
lower_ci = x_bar - 1.96 * se,
upper_ci = x_bar + 1.96 * se)
x_bar | sd | n | se | lower_ci | upper_ci |
---|---|---|---|---|---|
3.14 | 3.592979 | 100 | 0.3592979 | 2.435776 | 3.844224 |
Write down the three 95% confidence intervals for the population mean of tvhours
\(\mu_{tv}\) you’ve computed in this problem set. Do this by replacing X, Y, A, B, P, and Q with the appropriate values you’ve computed.
When you are done, make sure all the |
in the table still line up so your results print out in a table!
CI construction method | lower value | upper value |
---|---|---|
Using boostrap: 95% rule | X | Y |
Using boostrap: percentile rule | A | B |
Using mathematical formula | P | Q |
In your opinion, would you say these three confidence intervals are similar?
Answer: