Assignment 4 Answer Key
Question 1
Download the data:
diam <- read.csv("https://rtgodwin.com/data/diamond.csv")
(a)
mod1 <- lm(price ~ carat + I(carat^2) + colour + clarity, data = diam)
summary(mod1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2028.6 224.2 9.049 < 2e-16 ***
carat 4351.3 687.6 6.328 9.16e-10 ***
I(carat^2) 6455.1 522.5 12.353 < 2e-16 ***
colourE -1261.1 169.3 -7.448 1.04e-12 ***
colourF -1679.2 158.9 -10.566 < 2e-16 ***
colourG -2046.8 162.8 -12.574 < 2e-16 ***
colourH -2543.0 164.8 -15.434 < 2e-16 ***
colourI -3225.5 173.0 -18.646 < 2e-16 ***
clarityVS1 -1168.6 121.0 -9.661 < 2e-16 ***
clarityVS2 -1561.8 131.5 -11.875 < 2e-16 ***
clarityVVS1 -287.1 130.3 -2.203 0.0284 *
clarityVVS2 -836.9 120.8 -6.927 2.69e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 577.3 on 296 degrees of freedom
Multiple R-squared: 0.9723, Adjusted R-squared: 0.9712
F-statistic: 943 on 11 and 296 DF, p-value: < 2.2e-16
colour
and clarity
are categorical variables - R has automatically created a system of dummy variables for us.
(b)
The effect of a 0.1 increase in carat
will have a non-constant effect on price
(that is the appeal of the polynomial model). In a polynomial model, we can interpret the estimated results by considering specific scenarios. For example, let’s get the predicted increase in price
due to an increase in carat
of 0.1, when the diamond has a size of 0.1:
predict(mod1, data.frame(carat = 0.2, colour = "D", clarity = "VS2")) - predict(
mod1, data.frame(carat = 0.1, colour = "D", clarity = "VS2"))
628.7867
This has taken the difference in the predicted price of a diamond of size carat = 0.2
and carat = 0.1
. The predicted effect is $628.79.
When using the predict
function, we must choose values for all of the $X$ variables. I arbitrarily chose colour = "D"
and clarity = "VS2"
. As we will see in the next part, these choices do not matter for the predicted effect.
The whole point of the non-linear model is that the effect of carat
on price
is non-constant. To illustrate that you understand this, you need to get the predicted effect of an increase of carats of 0.1, for a different starting value. For example, I will compare the effect of a 0.1 increase for when carat = 0.5
:
predict(mod1, data.frame(carat = 0.6, colour = "D", clarity = "VS2")) - predict(
mod1, data.frame(carat = 0.5, colour = "D", clarity = "VS2"))
1145.196
The same increase in size of 0.1 has almost double the effect when the diamond is 0.5 carats vs. 0.1 carats.
Question 2
Load the data:
cps <- read.csv("http://rtgodwin.com/data/cps1985.csv")
(a)
Estimate the model using the lm()
function:
cps.mod <- lm(log(wage) ~ education + gender + age + experience, data = cps)
summary(cps.mod)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.89621 0.69229 1.295 0.196
education 0.17746 0.11371 1.561 0.119
gendermale 0.25736 0.03948 6.519 1.66e-10 ***
age -0.07961 0.11365 -0.700 0.484
experience 0.09234 0.11375 0.812 0.417
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4525 on 529 degrees of freedom
Multiple R-squared: 0.2703, Adjusted R-squared: 0.2648
F-statistic: 48.99 on 4 and 529 DF, p-value: < 2.2e-16
(b)
The estimated effect of education on wage has a percentage change interpretation, since wage is in logs. Each additional year of education is associated with an approximate 17.8% increase in wage. However, with a p-value of 0.119 the effect is not significant.
(c)
If there is heteroskedasticity (instead of homoskedasticity), then the standard errors, t-statistics, and p-values, are all wrong.
We need to install two packages before we can estimate the “robust” standard errors (and associated t-statistics and p-values):
install.packages("lmtest")
library(lmtest)
install.packages("sandwich")
library(sandwich)
coeftest(cps.mod, vcov = vcovHC(cps.mod, "HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.896211 0.135704 6.6041 9.753e-11 ***
education 0.177461 0.011424 15.5345 < 2.2e-16 ***
gendermale 0.257361 0.039388 6.5340 1.507e-10 ***
age -0.079611 0.011266 -7.0662 5.047e-12 ***
experience 0.092341 0.012371 7.4644 3.467e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There are some very important differences from the output in part (a). Under heteroskedasticity, all of the variables are now statistically significant!
Question 3
(a)
did <- read.csv("https://rtgodwin.com/data/card.csv")
did.mod <- lm(EMP ~ STATE + TIME + STATE * TIME + CO_OWNED, data = did)
summary(did.mod)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.3025 1.1169 21.759 < 2e-16 ***
STATE -2.9418 1.2141 -2.423 0.015625 *
TIME -2.2833 1.5403 -1.482 0.138637
CO_OWNED -2.6611 0.7141 -3.727 0.000208 ***
STATE:TIME 2.7500 1.7170 1.602 0.109659
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.432 on 763 degrees of freedom
Multiple R-squared: 0.02533, Adjusted R-squared: 0.02022
F-statistic: 4.957 on 4 and 763 DF, p-value: 0.0005991
(b)
The DiD estimate is the coefficient on the interaction term: 2.75.