Relevant reading for this problem set: ModernDive Chapter 9: Hypothesis Testing.
Please indicate who you collaborated with on this problem set:
First load the necessary packages:
library(tidyverse)
library(infer)
For this Problem Set you will work with some grade-point-average (GPA) data for college freshmen. The following will read in the data:
sat_gpa <- read_csv("https://rudeboybert.github.io/SDS220/static/PS/sat_gpa.csv")
Each row or case in this data frame is a student. The data includes:
We will use hypothesis testing to answer the following questions:
Note, if you get stuck as you are working through this, it will be helpful to review Chapter 9 in ModernDive.
For this question, let’s use a pre-determined \(\alpha\) significance-level of 0.05.
Calculate the mean GPA score for each gender, using the group_by and summarize commands from the dplyr package.
Questions:
Answers:
Generate a data visualization that displays the GPAs of the two groups. Be sure to include a title and label your axes.
We will now test the null hypothesis that there’s no difference in mean GPA between the genders at the population level. We can write this out in mathematical notation:
\[\begin{aligned} H_0:&\mu_{female} = \mu_{male} \\ \mbox{vs }H_A:& \mu_{female} \neq \mu_{male} \end{aligned}\]
or expressed differently, that the difference is 0 or not:
\[\begin{aligned} H_0:&\mu_{female} - \mu_{male} = 0 \\ \mbox{vs }H_A:& \mu_{female} - \mu_{male} \neq 0 \end{aligned}\]
Here’s how we use the infer package to conduct this hypothesis test:
Note that the order we choose (female then male) only determines the sign of the difference, so either order would work. But since we used order = c("Female", "Male") here, we should use the same order in subsequent calculations!
obs_diff_gpa_sex <- sat_gpa %>%
specify(gpa_fy ~ sex) %>%
calculate(stat = "diff in means", order = c("Female", "Male"))
obs_diff_gpa_sex
stat |
---|
0.1485209 |
Note that this is the difference in the group means we calculated earlier!
2.544587 - 2.396066
## [1] 0.148521
This step involves generating simulated values as if we lived in a world where there’s no difference between the two groups. Going back to the ideas of permutation and tactile sampling, this is akin to shuffling the GPA scores between male and female labels (i.e. removing the structure from the data), just as we could have done with index cards.
gpas_in_null_world <- sat_gpa %>%
specify(gpa_fy ~ sex) %>%
hypothesize(null = "independence") %>%
generate(reps = 5000, type = 'permute')
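To make the shuffling idea concrete, here is a minimal base R sketch of the same kind of permutation, done by hand. The toy data frame and the diff_in_means helper are made up for illustration and are not part of sat_gpa or the infer package; the number of shuffles (1000) is also an arbitrary choice.

```r
# Toy data, invented for illustration only (not the sat_gpa data)
set.seed(42)
toy <- data.frame(
  sex = rep(c("Female", "Male"), each = 6),
  gpa = c(3.1, 2.8, 3.4, 2.9, 3.0, 3.3, 2.7, 2.5, 3.0, 2.6, 2.9, 2.4)
)

# Hypothetical helper: difference in mean GPA, Female minus Male
diff_in_means <- function(gpa, sex) {
  mean(gpa[sex == "Female"]) - mean(gpa[sex == "Male"])
}

# Observed difference with the real labels
obs_diff <- diff_in_means(toy$gpa, toy$sex)

# Each replicate shuffles the sex labels and keeps the GPA values fixed,
# like dealing shuffled index cards into two piles
null_diffs <- replicate(1000, diff_in_means(toy$gpa, sample(toy$sex)))

obs_diff
mean(null_diffs)  # shuffled differences should average out near zero
```

Because the labels are assigned at random in every replicate, any link between sex and GPA is broken, which is what living in the "null world" means here.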
Question:
gpas_in_null_world data frame?
Answer:
The following calculates the differences in mean GPA for males and females for “shuffled” (permuted) data.
gpa_diff_under_null <- gpas_in_null_world %>%
calculate(stat = "diff in means", order = c("Female", "Male"))
gpa_diff_under_null %>%
slice(1:5)
replicate | stat |
---|---|
1 | -0.0225343 |
2 | 0.0044534 |
3 | 0.0204698 |
4 | -0.0005518 |
5 | -0.0045158 |
Question:
Answer:
The following plots the \(\delta\) values we calculated for each of the different “shuffled” replicates. This is the null distribution of \(\delta\). The red line shows the observed difference between female and male mean GPA in the data (0.1485209, female minus male) from step 1.
visualize(gpa_diff_under_null) +
shade_p_value(obs_stat = obs_diff_gpa_sex, direction = "both") +
labs(x = "Difference in mean GPA for males and females", y = "Count",
title = "Null distribution of differences in male and female GPAs",
subtitle = "Actual difference observed in the data is marked in red"
)
Note that zero is the center of this null distribution. The null hypothesis is that there is no difference between males and females in GPA score. In the permutations, zero was the most common difference, because observed GPA values were re-assigned to males and females at random. Differences as large as ~ 0.1 and -0.1 occurred, but much less frequently, because they are just not as likely when structure is removed from the data.
gpa_diff_under_null %>%
get_pvalue(obs_stat = obs_diff_gpa_sex, direction = "both")
p_value |
---|
0.002 |
This result indicates that there is a 0.2% chance (very low) that we would see a difference of 0.15 in GPA scores between males and females (or a bigger difference) if in fact there was truly no difference between the sexes in GPA scores in the population.
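As a rough illustration of where such a number comes from, a two-sided p-value can be computed directly from a null distribution as the share of simulated statistics at least as extreme as the observed one. The sketch below uses made-up values (null_stats and obs_stat are invented, not taken from the analysis above) and one common convention, doubling the smaller tail proportion; the exact calculation inside get_pvalue may differ in detail.

```r
# Made-up null distribution and observed statistic, for illustration only
set.seed(1)
null_stats <- rnorm(5000, mean = 0, sd = 0.05)
obs_stat <- 0.15

# Share of simulated statistics in each tail relative to the observed value
p_right <- mean(null_stats >= obs_stat)
p_left <- mean(null_stats <= obs_stat)

# Two-sided p-value: double the smaller tail, capped at 1
p_two_sided <- min(1, 2 * min(p_left, p_right))
p_two_sided
```

A small value means that very few of the shuffled worlds produced a difference as far from zero as the one observed.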
Fill in the blanks below to write up the results & conclusions for this test:
The mean GPA score for females in our sample (\(\bar{x}\) = ______) was greater than that of males (\(\bar{x}\) = ______). This difference (was/was not) statistically significant at \(\alpha = 0.05\), (p = _______). Given this I (would/would not) reject the Null hypothesis and conclude that _____ have higher GPAs than _____ at the population level.
The following will allow us to calculate a 95% confidence interval for the difference between mean GPA scores for males and females.
ci_diff_gpa_means <- sat_gpa %>%
specify(gpa_fy ~ sex) %>%
generate(reps = 5000, type = "bootstrap") %>%
calculate(stat = "diff in means", order = c("Female", "Male")) %>%
get_confidence_interval(level = 0.95)
ci_diff_gpa_means
lower_ci | upper_ci |
---|---|
0.0564837 | 0.2352381 |
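For intuition, a percentile bootstrap interval can also be sketched by hand in base R. The two vectors below are invented toy data, and this sketch resamples within each group, which is a simplification and may not match exactly how generate(type = "bootstrap") draws its resamples.

```r
# Toy GPA values, invented for illustration only
set.seed(7)
female <- c(3.1, 2.8, 3.4, 2.9, 3.0, 3.3, 3.2, 2.7)
male <- c(2.7, 2.5, 3.0, 2.6, 2.9, 2.4, 2.8, 2.3)

# Each replicate resamples every group with replacement and
# recomputes the difference in means (Female minus Male)
boot_diffs <- replicate(2000, {
  mean(sample(female, replace = TRUE)) - mean(sample(male, replace = TRUE))
})

# Percentile method: the middle 95% of the bootstrap differences
ci <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci
```

If the resulting interval excludes zero, that is consistent with rejecting the null hypothesis of no difference at the matching significance level.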
Note that all the above steps can be done with one line of code if a slew of assumptions like normality and equal variance of the groups are met.
t.test(gpa_fy ~ sex, var.equal = TRUE, data = sat_gpa)
##
## Two Sample t-test
##
## data: gpa_fy by sex
## t = 3.1828, df = 998, p-value = 0.001504
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.05695029 0.24009148
## sample estimates:
## mean in group Female mean in group Male
## 2.544587 2.396066
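To see what the equal-variance t-test computes, the statistic can be rebuilt by hand from the group means and a pooled variance. The sketch below uses simulated toy data (the group sizes, means, and standard deviations are arbitrary assumptions) and checks the hand calculation against t.test.

```r
# Simulated toy data, for illustration only
set.seed(3)
female <- rnorm(30, mean = 2.6, sd = 0.7)
male <- rnorm(30, mean = 2.4, sd = 0.7)

n1 <- length(female)
n2 <- length(male)
deg_free <- n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances
pooled_var <- ((n1 - 1) * var(female) + (n2 - 1) * var(male)) / deg_free

# t statistic: difference in means divided by its standard error
t_stat <- (mean(female) - mean(male)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), deg_free)

# Compare with the built-in test
builtin <- t.test(female, male, var.equal = TRUE)
all.equal(unname(builtin$statistic), t_stat)
all.equal(builtin$p.value, p_value)
```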
For this analysis, sat_total is the outcome variable and gpa_hs is the predictor variable, with two levels: “low” and “high”. For this question, let’s use a pre-determined \(\alpha\) significance-level of 0.10. This is considered more liberal than 0.05: p-values will have an easier time being less than \(\alpha\), and thus we are likely to reject the null hypothesis \(H_0\) more often.
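A small simulation can illustrate why a larger \(\alpha\) leads to more rejections. In the sketch below, both groups are drawn from the same distribution, so the null hypothesis is true by construction; the group size (20) and number of repetitions (2000) are arbitrary choices for illustration.

```r
# Simulate many two-group comparisons where the null hypothesis is true
set.seed(11)
p_values <- replicate(2000, {
  group_a <- rnorm(20)
  group_b <- rnorm(20)
  t.test(group_a, group_b, var.equal = TRUE)$p.value
})

# Share of simulated tests that reject H_0 at each significance level
reject_05 <- mean(p_values < 0.05)
reject_10 <- mean(p_values < 0.10)
c(alpha_05 = reject_05, alpha_10 = reject_10)
```

Every test rejected at 0.05 is also rejected at 0.10, so the second proportion can never be smaller than the first. When the null is true, each proportion should land near its own \(\alpha\).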
We can first calculate the mean total SAT score for each group (i.e., students with a low and high GPA), using the group_by and summarize commands from the dplyr package.
avg_sat_gpa <- sat_gpa %>%
group_by(gpa_hs) %>%
summarize(sat_total = mean(sat_total))
avg_sat_gpa
gpa_hs | sat_total |
---|---|
high | 108.67828 |
low | 98.23047 |
We will next generate a data visualization that displays the total SAT scores of the two groups. Be sure to include a title and label your axes.
ggplot(sat_gpa, aes(x = gpa_hs, y = sat_total)) +
geom_boxplot(fill = "darkgreen") +
labs(title = "SAT scores based on high school GPA scores",
x = "GPA ranking", y = "SAT score")
State the null hypothesis that you are testing (using either words or symbols)
Answer:
Calculate the observed difference between the mean total SAT scores of the low and high GPA high-school students.
# you finish this code....
# obs_diff_sat_hs_gpa <- sat_gpa %>%
Generate the null distribution of \(\delta\). Here you need to generate simulated values as if we lived in a world where there’s no difference in SAT scores between high school students with low and high GPAs.
# you finish this code....
# sat_in_null_world <- sat_gpa
Calculate the differences in mean SAT scores between students with low and high GPA scores under the Null. Note…you should use whatever order you chose above…i.e. order = c("low", "high") or order = c("high", "low").
# you finish this code....
# sat_diff_under_null <-
Visualize how the observed difference compares to the null distribution of \(\delta\). Generate a histogram of the null distribution, with a vertical red line showing the observed difference in SAT scores between high school students with a high and low GPA.
# you finish this code....
# sat_diff_under_null %>%
Calculate a p-value
Answer:
Write up the results & conclusions for this hypothesis test. Note, p-values less than 0.001 are often reported as p < 0.001.
Answer:
Calculate a confidence interval for the difference in total SAT scores for students with high and low high-school GPA scores. Note…you should use whatever order you chose above…i.e. order = c("low", "high") or order = c("high", "low").
# you finish this code....
# ci_diff_sat_means <- sat_gpa %>%
Use a t-test to test the null hypothesis that total SAT scores do not differ between students with high and low high school GPA scores at the population level.