# Load all packages here:
library(readr)
library(dplyr)
library(ggplot2)
library(janitor)
# Set seed value of random number generator to get "replicable" random numbers.
# The choice of seed value of 76 was an arbitrary one on my part.
set.seed(76)
Include the code to load your data here. If your data is neither confidential nor private, consider publishing it as a .csv file on Google Sheets, as in the code chunk below; instructions on how to do this are in Steps 1-6 here. If the data shouldn’t be published online, please submit the spreadsheet file on Moodle.
ma_schools <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrWSNyNqRVA950sdYa1QazAT-l0T7dl6pE5Ewvt7LkSm9LXmeVNbCbqEcrbygFmFyK4B6VQQGebuk9/pub?gid=1469057204&single=true&output=csv")
Pipe your data frame into the clean_names() function from the janitor package. This will clean your variable names, making them easier to work with.
ma_schools <- ma_schools %>%
  clean_names()
Complete your data wrangling here:
Recall from our data proposal that we didn’t find a categorical variable to our liking in the original ma_schools data. So we create a categorical explanatory variable school_size with three levels (small, medium, large) based on the values of total_enrollment.
Furthermore, we limit our analysis to high schools, that is, schools with 11th and 12th grade enrollments greater than 0, since younger grades do not take the SAT.
# This converts the numerical variable total_enrollment into a categorical one,
# school_size, by cutting it into three groups with roughly equal numbers of
# schools:
ma_schools <- ma_schools %>%
  mutate(school_size = cut_number(total_enrollment, n = 3))
# For aesthetic purposes we changed the levels of the school_size variable to be
# small, medium, and large. The interval labels on the left must match the
# levels produced by cut_number() above exactly.
ma_schools <- ma_schools %>%
  mutate(school_size = recode_factor(school_size,
                                     "[0,341]" = "small",
                                     "(341,541]" = "medium",
                                     "(541,4.26e+03]" = "large"))
# Next we filtered to only include schools that had 11th and 12th grade
# students. We do this because students in the 11th and 12th grade take the math
# SAT.
ma_schools <- ma_schools %>%
  filter(x11_enrollment > 0 & x12_enrollment > 0)
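The recoding step above relies on typing out the interval labels that cut_number() generated. As a possible alternative, here is a minimal sketch that names the levels at the moment the variable is cut, using the labels argument that cut_number() forwards to base R's cut(). The toy_enrollment values are hypothetical and only stand in for total_enrollment.

```r
library(ggplot2)  # cut_number() lives in ggplot2

# Hypothetical toy enrollments, NOT the real ma_schools values
toy_enrollment <- c(120, 250, 340, 400, 480, 540, 700, 1500, 4200)

# labels is passed through cut_number() to base::cut(), so the three
# groups are named directly and no interval strings need to be retyped
toy_size <- cut_number(toy_enrollment, n = 3,
                       labels = c("small", "medium", "large"))

# Count how many toy schools fall in each group
table(toy_size)
```

With this approach the level names would not depend on the exact cut points, which could change if the data were updated.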
select() the following variables in this order and drop all others. Eliminating all unnecessary variables will make visually exploring the raw values less taxing mentally, as we’ll have less data to look at.
ma_schools <- ma_schools %>%
  # Note the order in which we select
  select(school_name, average_sat_math, percent_economically_disadvantaged, school_size)
Look at your data using the glimpse() function.
glimpse(ma_schools)
## Observations: 390
## Variables: 4
## $ school_name <chr> "Abington High", "Agawam High…
## $ average_sat_math <int> 516, 514, 534, NA, 581, 592, …
## $ percent_economically_disadvantaged <dbl> 21.5, 22.7, 14.6, 74.2, 6.3, …
## $ school_size <fct> medium, large, large, small, …
Look at your data another way by displaying a random sample of 5 rows of your data frame by piping it into the sample_n(5) function from the dplyr package.
ma_schools %>%
  sample_n(5)
school_name | average_sat_math | percent_economically_disadvantaged | school_size |
---|---|---|---|
Newton South High | 620 | 8.8 | large |
The Gateway to College | NA | 30.4 | small |
Hudson High | 512 | 19.3 | large |
Horace Mann School for the Deaf | NA | 73.3 | small |
Center For Technical Education Innovation | 555 | 36.5 | large |
Let’s do a little exploratory data analysis.
Address missing values.
Using the trick in Jenny and Albert’s “Become an R Data Ninja!” file, we get a sense of how many missing values we have in our data.
colSums(is.na(ma_schools))
## school_name average_sat_math
## 0 58
## percent_economically_disadvantaged school_size
## 0 0
We see that even after removing non-high schools, we still have 58 missing values for average_sat_math. We’ll be sure to mention this in our ultimate project resubmission, since if there is a systematic reason the schools have a missing value, then just ignoring these schools may bias our results.
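One way to get a first impression of whether the missing values are systematic is to compare the other variables between schools with and without a score. The following is only a sketch under that assumption, not part of the original analysis; toy_schools is a hypothetical stand-in for ma_schools, and the same pipeline would need to run before the rows with missing scores are dropped.

```r
library(dplyr)

# Hypothetical toy rows standing in for ma_schools
toy_schools <- data.frame(
  average_sat_math = c(516, NA, 534, NA, 581, 592),
  percent_economically_disadvantaged = c(21.5, 74.2, 14.6, 73.3, 6.3, 9.0)
)

# Flag rows with a missing score, then compare the two groups
missing_summary <- toy_schools %>%
  mutate(sat_missing = is.na(average_sat_math)) %>%
  group_by(sat_missing) %>%
  summarize(n = n(),
            mean_pct_disadvantaged = mean(percent_economically_disadvantaged))

missing_summary
```

A large gap between the two groups would suggest the scores are not missing at random.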
ma_schools <- ma_schools %>%
  filter(!is.na(average_sat_math))
Compute some quick summary statistics of the outcome variable and comment.
ma_schools %>%
  group_by(school_size) %>%
  summarize(n = n(),
            correlation = cor(average_sat_math, percent_economically_disadvantaged),
            mean = mean(average_sat_math, na.rm = TRUE),
            median = median(average_sat_math, na.rm = TRUE),
            sd = sd(average_sat_math, na.rm = TRUE))
school_size | n | correlation | mean | median | sd |
---|---|---|---|---|---|
small | 28 | -0.8318577 | 478.0357 | 476 | 77.39197 |
medium | 69 | -0.8534511 | 483.2899 | 502 | 58.66658 |
large | 235 | -0.8095469 | 517.5021 | 521 | 56.15344 |
It appears our data consists mostly of large high schools. In all three school size cases, we observe a very strong negative correlation between average math SAT score and percentage of economically disadvantaged students.
Visualize the distribution of the outcome variable using a histogram and comment.
ggplot(ma_schools, aes(x = average_sat_math)) +
  geom_histogram(binwidth = 30, color = "white", fill = "steelblue") +
  labs(x = "Average Math SAT Score", y = "Number of Schools with the Score")
These data seem roughly bell-shaped, with no obvious skew. There is an outlier average Math SAT score of around 750 or so. This school is the MA Academy for Math and Science School.
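To put a name to an outlier like this one, the data frame can be sorted by the outcome variable. The sketch below shows one way to do that; toy_schools is a hypothetical stand-in for ma_schools that reuses three rows shown earlier in this document.

```r
library(dplyr)

# Three rows seen earlier in this document, standing in for ma_schools
toy_schools <- data.frame(
  school_name = c("Abington High", "Newton South High", "Hudson High"),
  average_sat_math = c(516, 620, 512)
)

# Sort from highest to lowest score and keep the top row
top_school <- toy_schools %>%
  arrange(desc(average_sat_math)) %>%
  head(1)

top_school
```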
Visualize the relationship of the outcome variable and the numerical explanatory variable using a scatterplot and comment.
ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score",
       x = "Percentage of Economically Disadvantaged Students")
There appears to be a reasonably strong negative relationship between average math SAT scores and percent of economically disadvantaged students across all high schools in MA. In other words, as the percent of economically disadvantaged students of a high school increases, there is an associated decrease in average math SAT scores. This is an unfortunate, but alas expected, relationship.
Visualize the relationship of the outcome variable and the categorical explanatory variable using a boxplot and comment.
ggplot(ma_schools, aes(x = school_size, y = average_sat_math, fill = school_size)) +
  geom_boxplot() +
  labs(y = "Math SAT Score", x = "School Size")
The Math SAT scores look to be the greatest at larger schools, and the lowest at smaller schools, though the differences do not seem to be extreme. There appear to be some potential outliers in SAT scores. Again, observe the outlier small school, which was the MA Academy for Math and Science School.
Visualize the relationship of the outcome variable and both explanatory variables using a colored scatterplot and comment.
ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math, color = school_size)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score", x = "Percentage of Economically Disadvantaged Students")
The negative relationship between average math SAT scores and percent of economically disadvantaged students across all high schools in MA we observed earlier seems to hold even when distinguishing between schools of different size using colors. There does not appear to be an interaction effect as all three slopes seem roughly equal.
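The visual impression of roughly equal slopes could be checked numerically by fitting a parallel slopes model and an interaction model and comparing them. The sketch below is an illustration only, not a result from this project: the toy data frame, its column names (pct, size, sat), and the simulated values are all hypothetical, and the data are simulated with one common slope by assumption.

```r
set.seed(76)

# Hypothetical simulated data: one common slope for all three sizes
toy <- data.frame(
  pct = runif(90, min = 0, max = 90),
  size = factor(rep(c("small", "medium", "large"), each = 30),
                levels = c("small", "medium", "large"))
)
toy$sat <- 580 - 2 * toy$pct + rnorm(90, sd = 20)

# Parallel slopes: one slope for pct, a separate intercept per size
parallel_model <- lm(sat ~ pct + size, data = toy)

# Interaction: a separate slope AND intercept per size
interaction_model <- lm(sat ~ pct * size, data = toy)

# The pct:size coefficients are the slope differences from the baseline
coef(interaction_model)

# F-test of whether the extra slope terms improve the fit
anova(parallel_model, interaction_model)
```

With the real data, the same two lm() calls would use average_sat_math, percent_economically_disadvantaged, and school_size in place of the toy columns.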