Relevant reading for this problem set: ModernDive Chapter 6: Multiple Regression.
Please indicate who you collaborated with on this problem set:
We will again use the hate crimes data we used in Problem Set 05. The FiveThirtyEight article about those data is the Jan. 23, 2017 article “Higher Rates Of Hate Crimes Are Tied To Income Inequality”. This week, we will use these data to run regression models with a single categorical predictor (explanatory) variable and a single numeric predictor (explanatory) variable.
Remember that you can knit this file to see the instructions. As before, type your answers directly into this file, then knit the final draft with all your answers. Please submit the PDF file on Gradescope.
First, load the necessary packages:
library(ggplot2)
library(dplyr)
library(moderndive)
library(readr)
Copy, paste, and run the following in a code chunk to read in the data:
hate_crimes <- read_csv("http://bit.ly/2ItxYg3")
Next, let’s explore the hate_crimes data set using the glimpse() function from the dplyr package:
glimpse(hate_crimes)
You should also take a look at the data in the data viewer.
Each case/row in these data is a state in the US. This week we will consider the response variable income, which is the numeric variable of median income of households in each state.
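We will use two explanatory variables:
- urbanization: level of urbanization in a region
- hs: the percentage of adults 25 and older with a high school degree
We will start by modeling the relationship between:
- median household income, as contained in the income variable
- the percentage of adults 25 and older with a high school degree, as contained in the hs variable
- the level of urbanization in a state, low or high, as contained in the variable urbanization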
Create a data visualization comparing median household income at “low” and “high” levels of urbanization (you do not need to include the hs variable in this plot). Please include axis labels and a title.
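As a rough illustration of comparing a numeric variable across two categories, here is a minimal sketch. It deliberately uses the built-in mtcars data as a stand-in rather than hate_crimes, so the data frame name cars, the variables mpg and am, and all labels are assumptions made for the example only.

```r
library(ggplot2)

# Stand-in data (not the assigned data): mtcars ships with R.
# am is coded 0 (automatic) or 1 (manual), so turn it into a labelled factor.
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# One box per category, with axis labels and a title
p_box <- ggplot(cars, aes(x = am, y = mpg)) +
  geom_boxplot() +
  labs(x = "Transmission", y = "Miles per gallon",
       title = "Fuel efficiency by transmission type")
print(p_box)
```

A boxplot is only one reasonable choice here; any plot that shows the distribution of the numeric variable within each category would serve the same purpose.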
Answer:
Next, let’s add the high-school degree variable into the mix by creating a scatterplot showing:
- income on the \(y\) axis
- hs on the \(x\) axis
- the points grouped (for example, colored) by urbanization
- a regression line for each level of urbanization (one for “low”, one for “high”)
For this question, add the regression lines to the plot using the geom_parallel_slopes function from the moderndive package. This function will draw the regression lines based on fitting a regression model with parallel slopes (i.e., with no interaction between hs and urbanization).
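To show how a parallel slopes plot is put together, here is a minimal sketch on stand-in data. It uses the built-in mtcars data instead of hate_crimes, so cars, mpg, wt, and am are assumptions for illustration, and mapping the group to color is one possible choice.

```r
library(ggplot2)
library(moderndive)

# Stand-in data (not the assigned data)
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Mapping color to the categorical variable gives one line per group;
# geom_parallel_slopes() forces those lines to share a single slope.
p_parallel <- ggplot(cars, aes(x = wt, y = mpg, color = am)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Transmission", title = "Parallel slopes model")
print(p_parallel)
```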
Do you think the relationship between hs and income is strong or weak? Linear or non-linear?
Answer:
Which regression line (high urbanization or low urbanization) appears to have the larger intercept?
Answer:
Now let’s create a second scatterplot using the same variables, but this time draw the regression lines using geom_smooth, which will allow for separate, non-parallel slopes for each urbanization group.
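For comparison, a sketch of the non-parallel version might look like the following. As before, it runs on the built-in mtcars data as a stand-in, so the data frame and variable names are assumptions rather than the ones from this problem set.

```r
library(ggplot2)

# Stand-in data (not the assigned data)
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# With color mapped to the group, geom_smooth(method = "lm") fits a separate
# least-squares line within each group, so the slopes are free to differ.
p_separate <- ggplot(cars, aes(x = wt, y = mpg, color = am)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Transmission", title = "Separate slopes for each group")
print(p_separate)
```

Setting se = FALSE only hides the confidence bands; it does not change the fitted lines.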
How do the slopes shown in the plot above compare to the slopes in Question 2? Are the two slopes fairly similar (parallel) for the two levels of urbanization, or do they differ now?
Answer:
Based on visually comparing the two models shown in Question 2 and Question 3, do you think it would be best to run a “parallel slopes” model (i.e. a model that estimates one shared slope for the two levels of urbanization), or a more complex “interaction model” (i.e. a model that estimates a separate slope for the two levels of urbanization)?
Answer:
Fit the following two regression models that examine the relationship between household income (as the response variable), and high-school education (hs) and urbanization as explanatory variables:
- a parallel slopes model (i.e., with no interaction between hs and urbanization)
- a non-parallel slopes model (i.e., allow hs and urbanization to interact in your model)
Be sure to save the output from the lm function for each model.
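The difference between the two model types comes down to the formula passed to lm. The sketch below shows this on the built-in mtcars data, used here as a stand-in; the object names fit_parallel and fit_interaction are hypothetical.

```r
# Stand-in data (not the assigned data)
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# "+" includes both variables without an interaction:
# one shared slope for wt, and a separate intercept for each group.
fit_parallel <- lm(mpg ~ wt + am, data = cars)

# "*" expands to wt + am + wt:am, adding an interaction term
# so that each group gets its own slope as well as its own intercept.
fit_interaction <- lm(mpg ~ wt * am, data = cars)

names(coef(fit_parallel))
names(coef(fit_interaction))
```

The interaction model has exactly one more coefficient than the parallel slopes model here, because the categorical variable has two levels.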
Use the get_regression_summaries function to find the unadjusted proportion of variance in income accounted for by each model, and report the value for each model.
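A sketch of pulling these summary statistics out of two fitted models is shown below, again on the built-in mtcars data as a stand-in (the model and variable names are assumptions). The r_squared column holds the unadjusted proportion of variance explained and adj_r_squared the adjusted one.

```r
library(moderndive)

# Stand-in data and models (not the assigned data)
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
fit_parallel    <- lm(mpg ~ wt + am, data = cars)
fit_interaction <- lm(mpg ~ wt * am, data = cars)

# One-row summaries for each model
s_parallel    <- get_regression_summaries(fit_parallel)
s_interaction <- get_regression_summaries(fit_interaction)

s_parallel[, c("r_squared", "adj_r_squared")]
s_interaction[, c("r_squared", "adj_r_squared")]
```

Because the parallel slopes model is nested inside the interaction model, the unadjusted value can never be lower for the interaction model. The adjusted value penalizes the extra term, which makes it the more useful of the two for comparing models.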
Answer:
Compare the adjusted proportion of variance accounted for by each model. Based on this comparison, which model do you prefer? Does your preference here agree or disagree with your earlier preference based on visualizing the predictions of each model?
Answer:
For Questions 5 through 10, base your answers on the model you’ve selected using visual and quantitative comparisons in Questions 3 and 4.
Generate the regression table for your preferred model using the get_regression_table() function from the moderndive package. Is the intercept the same for the states with a “low” and “high” level of urbanization? Is the slope the same?
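Reading group-specific intercepts and slopes off a regression table takes a little arithmetic, sketched below for an interaction model on the built-in mtcars data (a stand-in; all names are assumptions). R treats the first factor level as the baseline group, and for a character variable that is the alphabetically first value.

```r
library(moderndive)

# Stand-in data and model (not the assigned data)
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
fit_interaction <- lm(mpg ~ wt * am, data = cars)

# Estimates: baseline intercept and slope, plus offsets for the other group
get_regression_table(fit_interaction)

b <- coef(fit_interaction)

# Baseline group ("automatic", the first factor level)
intercept_automatic <- b[["(Intercept)"]]
slope_automatic     <- b[["wt"]]

# Other group: baseline value plus the corresponding offset
intercept_manual <- b[["(Intercept)"]] + b[["ammanual"]]
slope_manual     <- b[["wt"]] + b[["wt:ammanual"]]
```

In a parallel slopes model there is no interaction term, so only the intercept gets an offset and both groups share the single slope.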
Answer:
What is the slope for the regression line of the states with a “high” level of urbanization? What is the intercept?
Answer:
What is the slope for the regression line of the states with a “low” level of urbanization? What is the intercept?
Answer:
Based on your regression table output (and the data visualizations), is median household income greater in states that have lower or higher levels of urbanization? By how much?
Answer:
For every 1 percentage point increase in the share of high-school-educated adults in a state, what is the associated average increase in the median household income?
Answer:
What would you predict as the median household income for a state with a high level of urbanization, in which 85% of adults have a high school degree? Careful with rounding!
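One way to guard against rounding problems is to let R compute the prediction from the unrounded coefficients. The sketch below does this with predict() on the built-in mtcars data as a stand-in, and checks the result against the regression equation written out by hand; the model, the new observation, and all names are assumptions.

```r
# Stand-in data and model (not the assigned data)
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
fit_parallel <- lm(mpg ~ wt + am, data = cars)

# A hypothetical new observation: a manual car weighing 3 (thousand lbs)
new_car <- data.frame(wt = 3, am = factor("manual", levels = levels(cars$am)))
pred <- predict(fit_parallel, newdata = new_car)

# The same prediction by hand: intercept + slope * wt + offset for "manual"
b <- coef(fit_parallel)
by_hand <- b[["(Intercept)"]] + b[["wt"]] * 3 + b[["ammanual"]]

c(predict = unname(pred), by_hand = by_hand)
```

Plugging in coefficients copied from a rounded regression table can give a slightly different number than either value above, which is worth keeping in mind when reporting a prediction.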
Answer:
What would you predict as the median household income for a state with a low level of urbanization, in which 85% of the adults have a high school degree?
Answer:
What would you predict as the median household income for a state with a low level of urbanization in which 30% of adults have a high school degree?
Answer:
What was the observed income value for Maine (row 2)? What was the prediction for Maine according to your model? What is the residual? Did our model over- or underestimate the median income for this state?
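Observed values, fitted values, and residuals for every row can be listed side by side with get_regression_points() from moderndive. The sketch below uses the built-in mtcars data as a stand-in, so the model and the column names mpg and mpg_hat are specific to this example.

```r
library(moderndive)

# Stand-in data and model (not the assigned data)
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
fit_parallel <- lm(mpg ~ wt + am, data = cars)

# One row per observation: observed outcome, fitted value (mpg_hat), residual
pts <- get_regression_points(fit_parallel)

# Inspect a single row, here the second one
pts[2, ]
```

The residual is the observed value minus the fitted value, so a positive residual means the model underestimated that observation and a negative one means it overestimated.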
Answer:
You will now use the tools you have learned, along with a new data set, to solve a conservation problem.
Wildlife biologists are interested in managing/protecting habitats for a declining species of vole, but are not sure about what habitats it prefers. Two things that biologists can easily control with management are the percent cover of vegetation and where habitat improvements occur (i.e., whether it is important to create/protect habitat in moist or dry sites, etc.). To help inform habitat management of this vole species, the researchers in this study counted the number of voles at 56 random study sites. At each site, they measured percent cover of vegetation, and recorded whether a site had moist or dry soil.
The data can be read in like so:
vole_trapping <- read_csv("http://bit.ly/2IgDF0E")
The data contain the variables:
- site for the id of each random study site (each case or row is a survey/trapping site)
- voles for the vole count at each site
- veg for the percent cover of vegetation at each site
- soil identifying a site as “moist” or “dry”
Generate an appropriate regression model with voles as the response variable y, and veg and soil as explanatory variables. Use the results of the model to answer the following questions based on the available data. A data visualization will probably also help you.
Would protecting a site with high vegetation cover be a more effective way to preserve the vole population than a site with low vegetation cover? Why?
Answer:
Dry sites typically cost a lot less to purchase and maintain for conservation organizations. Thus, if a conservation organization decides to purchase a few dry sites, roughly what percent cover of vegetation do they need to maintain on these sites (at a minimum) to support a population of about 30 voles at the site?
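A question like this runs the regression equation backwards: given a target value of the response, solve for the numeric explanatory variable. The sketch below illustrates the algebra on the built-in mtcars data as a stand-in; the model, the target of 25, and all names are assumptions and say nothing about the vole data.

```r
# Stand-in data and model (not the assigned data)
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
fit_parallel <- lm(mpg ~ wt + am, data = cars)
b <- coef(fit_parallel)

# Hypothetical target value of the response
target_mpg <- 25

# For the "manual" group the fitted line is
#   mpg_hat = (intercept + manual offset) + slope * wt
# so solving for wt gives:
wt_needed <- (target_mpg - b[["(Intercept)"]] - b[["ammanual"]]) / b[["wt"]]

# Check: predicting at the solved value should return the target
check <- predict(
  fit_parallel,
  newdata = data.frame(wt = wt_needed,
                       am = factor("manual", levels = levels(cars$am)))
)
c(wt_needed = wt_needed, check = unname(check))
```

Whether the solved value acts as a minimum or a maximum depends on the sign of the slope, and the answer is only trustworthy if it falls inside the range of values observed in the data.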
Answer:
The Nature Conservancy is looking at purchasing a site for this species (in the same study area) that has moist soil and 40% vegetation cover. Using the regression equation, what would you predict as the possible vole population the site might be able to support?
Answer: