# Load all packages here:
library(readr)
library(dplyr)
library(ggplot2)
library(janitor)
# Set seed value of random number generator to get "replicable" random numbers.
# The choice of seed value of 76 was an arbitrary one on my part.
set.seed(76)
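As a minimal sketch of what setting the seed buys you, the base R snippet below draws the same five numbers twice. The names first_draw and second_draw are illustrative only and are not part of the project.

```r
# Resetting the seed to the same value replays the same "random" draws.
set.seed(76)
first_draw <- sample(1:100, size = 5)

set.seed(76)
second_draw <- sample(1:100, size = 5)

# TRUE when both draws contain the same values in the same order
identical(first_draw, second_draw)
```

Because both draws start from the same seed, the comparison should come back TRUE; changing either seed would normally give different draws.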
Your data proposal may not follow the exact same steps in the example below. Jenny and Albert are merely providing one example to give you an overall sense of where you are heading; the details may change from project to project.
What is your research question?
Overall, we are hoping to better understand how and if school conditions influence student performance. Our question specifically is whether percentage of economically disadvantaged students and school size predict the average math SAT score in Massachusetts high schools.
Please give a very short description of the data set along with its original source.
We have data from all schools in Massachusetts from 2017. The data set contains SAT scores and some information about school demographics. While the data were downloaded from Kaggle, the original source is Massachusetts Department of Education reports.
Include the code to load your data here. If your data is neither confidential nor private in nature, consider publishing it as a .csv file on Google Sheets as in the code chunk below; instructions on how to do this are in Steps 1-6 here. If the data shouldn’t be published online, then please submit the spreadsheet file on Moodle.
ma_schools <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrWSNyNqRVA950sdYa1QazAT-l0T7dl6pE5Ewvt7LkSm9LXmeVNbCbqEcrbygFmFyK4B6VQQGebuk9/pub?gid=1469057204&single=true&output=csv")
Piping your data frame into the clean_names() function from the janitor package will clean your variable names, making them easier to work with.
ma_schools <- ma_schools %>%
  clean_names()
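To illustrate what clean_names() does, here is a small sketch on a toy data frame. The raw column names below are hypothetical and chosen only for illustration; they are not claimed to be the original names in ma_schools.

```r
library(dplyr)
library(janitor)

# Hypothetical raw column names, for illustration only
raw_example <- tibble(
  `School Name` = c("School A", "School B"),
  `TOTAL_Enrollment` = c(372, 841)
)

# clean_names() converts the names to lower-case snake_case
cleaned_example <- raw_example %>%
  clean_names()

names(cleaned_example)
```

After cleaning, the names contain only lower-case letters, digits, and underscores, so they can be typed without backticks.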
Be sure to explore your data. Note that eval=FALSE is set so that R Markdown doesn’t “evaluate” this code chunk, i.e. it will ignore it in the ultimate .html report. You should run this code on your own, but not in the ultimate .html report.
glimpse(ma_schools)
What is your identification (ID) variable (if you have one)?
Our identification variable is school_name; it allows us to uniquely identify each school.
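One way to check that an ID variable really is unique is to count how often each value appears. The sketch below uses a toy data frame, toy_schools, built from the five schools shown in the random sample in this proposal; whether school_name is unique across the full data set is not verified here.

```r
library(dplyr)

# Toy data: the five schools from the random sample shown in this proposal
toy_schools <- tibble(
  school_name = c(
    "Peter Fitzpatrick School", "Saltonstall School", "Spark Academy",
    "Madison Park High", "Pawtucketville Memorial"
  ),
  total_enrollment = c(10, 372, 458, 841, 505)
)

# Any name that appears more than once would be listed here
duplicated_names <- toy_schools %>%
  count(school_name) %>%
  filter(n > 1)

nrow(duplicated_names)
```

Zero rows in duplicated_names means every school_name in the checked data identifies exactly one row. Running the same pipeline on the full ma_schools data frame would show whether that holds for all schools.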
What is your outcome variable \(y\)? What are its units of measurement?
The outcome variable is called average_sat_math: the average Math SAT score for a school, measured in points. We did not have data on overall SAT scores for each school, so we decided to focus solely on the math SAT score in order to make this report more concise.
What is your numeric explanatory variable? What are its units of measurement?
The numerical explanatory variable is called percent_economically_disadvantaged and its unit of measurement is percentage points between 0 and 100. It represents the percentage of students in a school who are considered “economically disadvantaged”.
What is your categorical explanatory variable? Does it have between 3 and 5 levels? Please list the different levels.
Since we didn’t feel any of the existing categorical variables in ma_schools were appropriate for our analysis, we hope to create a new categorical explanatory variable called size, which will be based on the numerical variable total_enrollment. We’ll divide this numerical variable into three equally sized groups, and hence our levels will be: small, medium, and large. We will do this in the next step of the project once we are more experienced with data wrangling.
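As a rough sketch of how such a three-level variable could be built, the code below uses ntile() from dplyr, which splits rows into groups with (nearly) equal counts. The enrollment values are hypothetical toy numbers, and the column name school_size is only one possible choice; this is not necessarily the approach the project will end up using.

```r
library(dplyr)

# Hypothetical toy enrollments, for illustration only
toy_schools <- tibble(total_enrollment = c(120, 950, 430, 1800, 60, 610))

toy_schools <- toy_schools %>%
  mutate(
    # ntile() ranks the values and assigns bucket 1, 2, or 3
    size_group = ntile(total_enrollment, 3),
    # Turn the bucket numbers into labeled levels
    school_size = factor(
      size_group,
      levels = 1:3,
      labels = c("small", "medium", "large")
    )
  )

toy_schools %>%
  count(school_size)
```

Here "equally sized" is read as an equal number of schools per group. If groups spanning equal ranges of enrollment were wanted instead, cut() with explicit break points would be the alternative.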
What is the observational unit of your data? In other words, what does each row in your data represent?
Each row/case in our data set will be a high school in Massachusetts (see below):
How many rows/cases are in the data i.e. what is the sample size? Is the sample size at least 50?
For now there are 1861 rows in our original data set ma_schools. However, eventually we would like to limit our analysis to only high schools, as younger grades do not take the SAT. We will do this in the next step of the project once we are more experienced with data wrangling.
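One possible way to restrict the data, sketched below on toy rows, is to keep only schools that report a math SAT average. This rests on the assumption that a missing average_sat_math marks a school without SAT takers, which is only a proxy for "high school"; the data frame toy_schools is an illustrative stand-in for ma_schools.

```r
library(dplyr)

# Toy data: the five schools from the random sample shown in this proposal
toy_schools <- tibble(
  school_name = c(
    "Peter Fitzpatrick School", "Saltonstall School", "Spark Academy",
    "Madison Park High", "Pawtucketville Memorial"
  ),
  average_sat_math = c(NA, NA, NA, 358, NA)
)

# Assumption: a missing math SAT average marks a school without SAT takers
schools_with_sat <- toy_schools %>%
  filter(!is.na(average_sat_math))

nrow(schools_with_sat)
```

In this toy example only Madison Park High remains. On the full data, the resulting row count would also show whether the sample size still exceeds 50 after filtering.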
select() the following variables in this order and drop all others. Eliminating all unnecessary variables will make visually exploring the raw values less mentally taxing, as we’ll have less data to look at.
Recall that we didn’t find a categorical variable to our liking in the original ma_schools data. So eventually we’ll create a categorical explanatory variable school_size with three levels: small, medium, large based on the values of total_enrollment. We will do this in the next step of the project once we are more experienced with data wrangling.
ma_schools <- ma_schools %>%
  select(school_name, average_sat_math, percent_economically_disadvantaged, total_enrollment)
Display a random sample of 5 rows of your data frame by piping it into the sample_n(5) function from the dplyr package. You’ll get the same 5 rows every time you knit this document, and hence replicable results, because we set the seed value of the random number generator in the first code chunk above.
We see we have missing values below. Some of these are missing because some of the schools are not high schools so the students don’t take the Math SAT. We will address these missing values in the next step of the project once we are more experienced with data wrangling.
ma_schools %>%
  sample_n(5)
school_name | average_sat_math | percent_economically_disadvantaged | total_enrollment |
---|---|---|---|
Peter Fitzpatrick School | NA | 80.0 | 10 |
Saltonstall School | NA | 33.6 | 372 |
Spark Academy | NA | 68.3 | 458 |
Madison Park High | 358 | 68.1 | 841 |
Pawtucketville Memorial | NA | 45.5 | 505 |
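To quantify the missing values rather than spotting them by eye, a short sketch like the following counts the NA entries in each column. It uses only the five sampled rows above, stored in an illustrative data frame called sampled_rows; the counts for the full data set will differ.

```r
library(dplyr)

# The numeric columns of the five sampled rows shown above
sampled_rows <- tibble(
  average_sat_math = c(NA, NA, NA, 358, NA),
  percent_economically_disadvantaged = c(80.0, 33.6, 68.3, 68.1, 45.5),
  total_enrollment = c(10, 372, 458, 841, 505)
)

# Count the missing values in every column
missing_counts <- sampled_rows %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

missing_counts
```

In these five rows, four of the math SAT averages are missing and the other two columns are complete.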