# Load all packages here:
library(readr)
library(dplyr)
library(ggplot2)
library(janitor)

# Set seed value of random number generator to get "replicable" random numbers.
# The choice of seed value of 76 was an arbitrary one on my part.
set.seed(76)

Please note

Your data proposal may not follow the exact same steps in the example below. Jenny and Albert are merely providing one example to give you an overall sense of where you are heading; the details may change from project to project.

Big-picture

Research question

What is your research question?

Overall, we are hoping to better understand how and if school conditions influence student performance. Our question specifically is whether percentage of economically disadvantaged students and school size predict the average math SAT score in Massachusetts high schools.

Description of data

Please give a very short description of the data set along with its original source.

We have data from all schools in Massachusetts from 2017. The data set contains SAT scores and some information about school demographics. While the data was downloaded from Kaggle here, the original source of the data is Massachusetts Department of Education reports.

Load data into R

Include the code to load your data here. If your data is neither confidential nor private in nature, consider publishing it as a .csv file on Google Sheets as in the code chunk below; instructions on how to do this are in Steps 1-6 here. If the data shouldn’t be published online, then please submit the spreadsheet file on Moodle.

ma_schools <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrWSNyNqRVA950sdYa1QazAT-l0T7dl6pE5Ewvt7LkSm9LXmeVNbCbqEcrbygFmFyK4B6VQQGebuk9/pub?gid=1469057204&single=true&output=csv")

Clean variable names

Piping your data frame into the clean_names() function from the janitor package will clean your variable names, making them easier to work with.

ma_schools <- ma_schools %>% 
  clean_names()
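As a rough illustration of what clean_names() does, here is a minimal sketch on a small made-up data frame. The data frame messy and its column names are hypothetical stand-ins, not the actual columns of the Kaggle file.

```r
library(janitor)

# Hypothetical data frame with awkward column names (spaces, capitals, symbols).
# check.names = FALSE stops data.frame() from altering the names itself.
messy <- data.frame(
  "School Name" = c("A High", "B High"),
  "Average SAT_Math" = c(510, 480),
  "% Economically Disadvantaged" = c(12.5, 40.1),
  check.names = FALSE
)

# Original names, as they might come out of a spreadsheet
names(messy)

# clean_names() converts the names to lower-case snake_case
cleaned <- clean_names(messy)
names(cleaned)
```

Comparing the two names() outputs side by side shows why the cleaned names are easier to type in later wrangling code.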

Explore your data

Be sure to explore your data. Note that eval=FALSE is set so that R Markdown doesn’t “evaluate” this code chunk, i.e. it will skip it when building the final .html report. You should run this code on your own, but its output should not appear in the final .html report.

glimpse(ma_schools)

Variables

Identification variable

What is your identification (ID) variable (if you have one)?

Our identification variable is school_name; it allows us to uniquely identify each school.
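One way to check that an ID variable really is unique is to count how often each value appears. The sketch below uses a small made-up data frame, toy_schools, which deliberately contains one repeated name; it is not the real ma_schools data, and we have not verified uniqueness on the full data set here.

```r
library(dplyr)

# Hypothetical toy data: "B High" appears twice on purpose
toy_schools <- data.frame(
  school_name = c("A High", "B High", "C Elementary", "B High"),
  total_enrollment = c(900, 450, 300, 610)
)

# Count rows per name and keep only names that occur more than once
duplicated_names <- toy_schools %>%
  count(school_name) %>%
  filter(n > 1)
duplicated_names

# TRUE only when every row has a distinct school_name
is_unique <- n_distinct(toy_schools$school_name) == nrow(toy_schools)
is_unique
```

If duplicated_names comes back with any rows, the variable on its own does not identify each row, and a second variable (for example a district or school code, if one is available) would be needed alongside it.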

Outcome variable

What is your outcome variable \(y\)? What are its units of measurement?

The outcome variable is called average_sat_math: the average Math SAT score for a school measured in points. We did not have data on overall SAT scores for each school, so we decided to focus solely on the math SAT score in order to make this report more concise.

Numerical explanatory variable

What is your numeric explanatory variable? What are its units of measurement?

The numerical explanatory variable is called percent_economically_disadvantaged and its units of measurement are percentage points between 0 and 100. It represents the percentage of students in a school that are considered “economically disadvantaged”.

Categorical explanatory variable

What is your categorical explanatory variable? Does it have between 3 and 5 levels? Please list the different levels.

Since we didn’t feel any of the existing categorical variables in ma_schools were appropriate for our analysis, we hope to create a new categorical explanatory variable called size, which will be based on the numerical variable total_enrollment. We’ll divide this numerical variable into three equally sized groups, and hence our levels will be: small, medium, and large. We will do this in the next step of the project once we are more experienced with data wrangling.
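For a sense of how a numerical variable can be split into three groups, here is a minimal sketch using ntile() from dplyr on made-up data. It assumes “equally sized” means groups with equal numbers of schools rather than equal-width enrollment ranges; the data frame toy_schools and its values are hypothetical, and this is not necessarily the approach the final project will take.

```r
library(dplyr)

# Hypothetical toy data with six schools
toy_schools <- data.frame(
  school_name = paste("School", 1:6),
  total_enrollment = c(120, 1500, 430, 980, 260, 700)
)

# ntile() ranks the values and assigns each row to one of 3 buckets
# holding (as close as possible to) the same number of rows.
# factor() then attaches readable, ordered labels to buckets 1, 2, 3.
toy_schools <- toy_schools %>%
  mutate(
    size = factor(
      ntile(total_enrollment, 3),
      levels = 1:3,
      labels = c("small", "medium", "large")
    )
  )

# Number of schools in each level
toy_schools %>%
  count(size)
```

If equal-width ranges of enrollment were wanted instead, cut() from base R would be the more natural tool.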

Rows/observations

Observational units

What is the observational unit of your data? In other words, what does each row in your data represent?

Each row/case in our data set will be a high school in Massachusetts (see below):

Sample size

How many rows/cases are in the data, i.e. what is the sample size? Is the sample size at least 50?

For now there are 1861 rows in our original data set ma_schools. However, eventually we would like to limit our analysis to only high schools as younger grades do not take the SAT. We will do this in the next step of the project once we are more experienced with data wrangling.
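To give a rough idea of how such a restriction changes the sample size, the sketch below drops rows with a missing math SAT score from a made-up data frame. It assumes a missing score can stand in for “not a high school”, which may not hold for every school; toy_schools and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two schools have no math SAT score
toy_schools <- data.frame(
  school_name = c("A High", "B Elementary", "C High", "D Middle"),
  average_sat_math = c(510, NA, 470, NA)
)

# How many rows are missing the outcome variable?
n_missing <- sum(is.na(toy_schools$average_sat_math))
n_missing

# Keep only the rows where the outcome variable is present
toy_high_schools <- toy_schools %>%
  filter(!is.na(average_sat_math))

# Sample size after the restriction
nrow(toy_high_schools)
```

The same nrow() call on the filtered version of the real data would show whether the sample size stays above 50.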

Preview of data

Pare down variables

select() the following variables in this order and drop all others. Eliminating all unnecessary variables will make visually exploring the raw values less taxing mentally, as we’ll have less data to look at.

The identification variable
The outcome variable \(y\)
The numerical explanatory variable
The categorical explanatory variable
Optional: any other variable you find interesting

Recall that we didn’t find a categorical variable to our liking in the original ma_schools data. So eventually we’ll create a categorical explanatory variable school_size with three levels: small, medium, large based on the values of total_enrollment. We will do this in the next step of the project once we are more experienced with data wrangling.

ma_schools <- ma_schools %>% 
  select(school_name, average_sat_math, percent_economically_disadvantaged, total_enrollment)

Preview data

Display a random sample of 5 rows of your data frame by piping it into the sample_n(5) function from the dplyr package. You’ll get the same 5 rows every time you knit this document, and hence replicable results, because we set the seed value of the random number generator in the first code chunk above.

We see we have missing values below. Some of these are missing because some of the schools are not high schools so the students don’t take the Math SAT. We will address these missing values in the next step of the project once we are more experienced with data wrangling.

ma_schools %>% 
  sample_n(5)

school_name	average_sat_math	percent_economically_disadvantaged	total_enrollment
Peter Fitzpatrick School	NA	80.0	10
Saltonstall School	NA	33.6	372
Spark Academy	NA	68.3	458
Madison Park High	358	68.1	841
Pawtucketville Memorial	NA	45.5	505
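The role of the seed value in making a random sample replicable can be seen in a minimal sketch on made-up data. The data frame toy is hypothetical; the seed value of 76 matches the one set at the top of this document.

```r
library(dplyr)

# Hypothetical toy data with 20 rows
toy <- data.frame(id = 1:20)

# Resetting the seed to the same value before each draw
# makes the random number generator repeat itself
set.seed(76)
first_draw <- sample_n(toy, 5)

set.seed(76)
second_draw <- sample_n(toy, 5)

# Compare the sampled ids from the two draws
identical(first_draw$id, second_draw$id)
```

Without the second set.seed(76) call, the two draws would generally differ, which is why the seed is set once at the start of the document before any random sampling.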

SDS/MTH 220 Data Proposal Example

Jenny Smetzer & Albert Y. Kim

Last updated on 2019-02-27