# Load all packages here:
library(readr)
library(dplyr)
library(ggplot2)
library(janitor)

# Set seed value of random number generator to get "replicable" random numbers.
# The choice of seed value of 76 was an arbitrary one on my part.
set.seed(76)

Data

Load data into R

Include the code to load your data here. If your data is neither confidential nor private, consider publishing it as a .csv file on Google Sheets as in the code chunk below; instructions on how to do this are in Steps 1-6 here. If the data shouldn’t be published online, then please submit the spreadsheet file on Moodle.

ma_schools <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrWSNyNqRVA950sdYa1QazAT-l0T7dl6pE5Ewvt7LkSm9LXmeVNbCbqEcrbygFmFyK4B6VQQGebuk9/pub?gid=1469057204&single=true&output=csv")

Clean variable names

Pipe your data frame into the clean_names() function from the janitor package. This will clean your variable names, making them easier to work with.

ma_schools <- ma_schools %>% 
  clean_names()

Data wrangling

Complete your data wrangling here:

Recall from our data proposal that we didn’t find a categorical variable to our liking in the original ma_schools data. So we create a categorical explanatory variable school_size with three levels (small, medium, large) based on the values of total_enrollment.

Furthermore, we want to limit our analysis to high schools only, since younger grades do not take the SAT. In other words, we keep only those schools with 11th and 12th grade enrollments greater than 0.

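# This converts the numerical variable total_enrollment into a categorical one,
# school_size, by cutting it into three groups with roughly equal numbers of
# schools. Note that the cut points are computed over all schools, before the
# high school filter below is applied.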
ma_schools <- ma_schools %>% 
  mutate(school_size = cut_number(total_enrollment, n = 3))

# For aesthetic purposes we changed the levels of the school_size variable to be
# small, medium, and large
ma_schools <- ma_schools %>%
  mutate(school_size = recode_factor(school_size, 
                                     "[0,341]" = "small", 
                                     "(341,541]" = "medium", 
                                     "(541,4.26e+03]" = "large"))

# Next we filtered to only include schools that had 11th and 12th grade
# students. We do this because students in the 11th and 12th grade take the math
# SAT.
ma_schools <- ma_schools %>%
  filter(x11_enrollment > 0 & x12_enrollment > 0)
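
The interval strings passed to recode_factor() above depend on the exact break points that cut_number() computes, so they would no longer match if the underlying data changed. Here is a minimal sketch of one alternative, assuming a small made-up data frame (toy_schools and its values are hypothetical, not the real data). It relies on cut_number() passing extra arguments on to cut(), so the labels can be supplied directly.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for ma_schools
toy_schools <- tibble(
  school_name = paste("School", 1:6),
  total_enrollment = c(100, 200, 300, 400, 500, 600)
)

# cut_number() forwards `labels` to cut(), so the three equal-count groups
# are named directly, without referring to the computed interval strings
toy_schools <- toy_schools %>%
  mutate(school_size = cut_number(total_enrollment, n = 3,
                                  labels = c("small", "medium", "large")))

# Count the schools in each group
toy_schools %>%
  count(school_size)
```

Whether this is preferable to recoding afterwards is a judgment call. The two-step version in this document has the advantage of showing the actual cut points.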

Preview of data

Pare down variables

select() the following variables in this order and drop all others. Eliminating all unnecessary variables will make visually exploring the raw values less mentally taxing, as we’ll have less data to look at.

First: The identification variable (if any)
Second: The outcome variable \(y\)
Third: The numerical explanatory variable
Fourth: The categorical explanatory variable
Rest: any other variable you find interesting

ma_schools <- ma_schools %>% 
  # Note the order in which we select
  select(school_name, average_sat_math, percent_economically_disadvantaged, school_size)

Look at your data using glimpse

Look at your data using the glimpse() function.

glimpse(ma_schools)

## Observations: 390
## Variables: 4
## $ school_name                        <chr> "Abington High", "Agawam High…
## $ average_sat_math                   <int> 516, 514, 534, NA, 581, 592, …
## $ percent_economically_disadvantaged <dbl> 21.5, 22.7, 14.6, 74.2, 6.3, …
## $ school_size                        <fct> medium, large, large, small, …

Show a preview of your data

Look at your data another way by displaying a random sample of 5 rows of your data frame by piping it into the sample_n(5) function from the dplyr package.

ma_schools %>% 
  sample_n(5)

school_name	average_sat_math	percent_economically_disadvantaged	school_size
Newton South High	620	8.8	large
The Gateway to College	NA	30.4	small
Hudson High	512	19.3	large
Horace Mann School for the Deaf	NA	73.3	small
Center For Technical Education Innovation	555	36.5	large

Exploratory data analysis

Let’s do a little exploratory data analysis.

Inspect for missing values

Address missing values.

Using the trick in Jenny and Albert’s “Become an R Data Ninja!” file, we get a sense of how many missing values we have in our data.

colSums(is.na(ma_schools))

##                        school_name                   average_sat_math 
##                                  0                                 58 
## percent_economically_disadvantaged                        school_size 
##                                  0                                  0

We see that even after removing non-high schools, we still have 58 missing values for average_sat_math. We’ll be sure to mention this in our ultimate project resubmission, since if there is a systematic reason the schools have a missing value, then just ignoring these schools may bias our results. For now, we remove these schools from the data:

ma_schools <- ma_schools %>%
  filter(!is.na(average_sat_math))
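
One rough way to look for a systematic pattern in the missing values is to compare schools with and without a reported score on the other variables. The sketch below assumes a small made-up data frame (toy_schools, sat_missing, and missing_check are hypothetical names and values). On the real data, a check like this would need to run before the filter above drops the incomplete rows.

```r
library(dplyr)

# Hypothetical toy data; the real check would use ma_schools before filtering
toy_schools <- tibble(
  average_sat_math = c(520, NA, 480, NA, 610, 555),
  percent_economically_disadvantaged = c(20, 70, 35, 80, 10, 25)
)

# Compare schools with and without a reported math SAT score
missing_check <- toy_schools %>%
  mutate(sat_missing = is.na(average_sat_math)) %>%
  group_by(sat_missing) %>%
  summarize(n = n(),
            mean_pct_disadvantaged = mean(percent_economically_disadvantaged))

missing_check
```

A large gap between the two group means would suggest the scores are not missing at random, which is the situation in which dropping these schools could bias the results.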

Summary statistics

Compute some quick summary statistics of the outcome variable and comment.

ma_schools %>% 
  group_by(school_size) %>% 
  summarize(n = n(), 
            correlation = cor(average_sat_math, percent_economically_disadvantaged),
            mean = mean(average_sat_math, na.rm = TRUE), 
            median = median(average_sat_math, na.rm = TRUE), 
            sd = sd(average_sat_math, na.rm = TRUE))

school_size	n	correlation	mean	median	sd
small	28	-0.8318577	478.0357	476	77.39197
medium	69	-0.8534511	483.2899	502	58.66658
large	235	-0.8095469	517.5021	521	56.15344

It appears our data consists of mostly large high schools. In all three school size cases, we observe a very strong negative correlation between average math SAT score and percentage of economically disadvantaged students.

Histogram of outcome variable

Visualize the distribution of the outcome variable using a histogram and comment.

ggplot(ma_schools, aes(x = average_sat_math)) +
  geom_histogram(binwidth = 30, color = "white", fill = "steelblue") +
  labs(x = "Average Math SAT Score", y = "Number of Schools with the Score")

Figure 1. Distribution of average math SAT scores for MA high schools in 2017

These data seem roughly bell-shaped, with no obvious skew. There is an outlier average Math SAT score of around 750 or so. This school is the MA Academy for Math and Science School.
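
An outlier like this one can be identified by sorting on the outcome variable. This is a minimal sketch on made-up data (toy_schools, top_school, and the school names are hypothetical), not necessarily how the school above was found.

```r
library(dplyr)

# Hypothetical toy data with one unusually high score
toy_schools <- tibble(
  school_name = c("School A", "School B", "School C", "School D"),
  average_sat_math = c(510, 495, 750, 530)
)

# Sort from highest to lowest score and keep the top row
top_school <- toy_schools %>%
  arrange(desc(average_sat_math)) %>%
  head(1)

top_school
```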

Scatterplot

Visualize the relationship of the outcome variable and the numerical explanatory variable using a scatterplot and comment.

ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score", 
       x = "Percentage of Economically Disadvantaged Students")

Figure 2. Relationship between average math SAT score and percentage of economically disadvantaged students for MA high schools in 2017

There appears to be a reasonably strong negative relationship between average math SAT scores and percent of economically disadvantaged students across all high schools in MA. In other words, as the percent of economically disadvantaged students of a high school increases, there is an associated decrease in average math SAT scores. This is an unfortunate, but alas expected, relationship.

Boxplot

Visualize the relationship of the outcome variable and the categorical explanatory variable using a boxplot and comment.

ggplot(ma_schools, aes(x = school_size, y = average_sat_math, fill = school_size)) +
  geom_boxplot() +
  labs(y = "Math SAT Score", x = "School Size")

Figure 3. Relationship between average math SAT score and school size for MA high schools in 2017

The Math SAT scores look to be the greatest at larger schools, and the lowest at smaller schools, though the differences do not seem to be extreme. There appear to be some potential outliers in SAT scores. Again, observe the outlier small school, which was the MA Academy for Math and Science School.

Colored scatterplot

Visualize the relationship of the outcome variable and both explanatory variables using a colored scatterplot and comment.

ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math, color = school_size)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score", x = "Percentage of Economically Disadvantaged Students")

Figure 4. Relationship between average math SAT score, percentage of economically disadvantaged students, and school size for MA high schools in 2017

The negative relationship between average math SAT scores and percent of economically disadvantaged students across all high schools in MA we observed earlier seems to hold even when distinguishing between schools of different size using colors. There does not appear to be an interaction effect as all three slopes seem roughly equal.
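
The visual impression of roughly equal slopes could later be checked numerically by fitting a model with an interaction term. The sketch below uses simulated data in which all three groups share the same true slope (toy_schools, its column names, and the simulation settings are all hypothetical). It illustrates the technique and is not an analysis of the real data.

```r
# Hypothetical simulated data: three size groups sharing the same true slope
set.seed(76)
toy_schools <- data.frame(
  pct_disadvantaged = runif(90, min = 0, max = 90),
  school_size = factor(rep(c("small", "medium", "large"), each = 30),
                       levels = c("small", "medium", "large"))
)
toy_schools$sat_math <- 600 - 2 * toy_schools$pct_disadvantaged +
  rnorm(90, mean = 0, sd = 20)

# `*` fits a separate intercept and slope for each school size
interaction_model <- lm(sat_math ~ pct_disadvantaged * school_size,
                        data = toy_schools)

# The pct_disadvantaged:school_size rows are the slope differences
# relative to the baseline level (small)
coef(summary(interaction_model))
```

Interaction estimates that are small relative to their standard errors would be consistent with roughly parallel slopes, though they would not prove that no interaction exists.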

SDS/MTH 220 Project Proposal Example

Jenny Smetzer & Albert Y. Kim

Last updated on 2019-02-27