Download a CPS data set

Download the data from the website using:

cps <- read.csv("https://rtgodwin.com/data/cps1985.csv")

This is a sub-sample from the 1985 wave of the “Current Population Survey” in the US, which contains information on individual wages and demographic characteristics. Take a good look at the data set either by clicking on the spreadsheet icon next to its object name in the top-right window, or by using the command View(cps). I’ll just take a look at the first 6 observations for each variable:

head(cps)
##    wage education experience age ethnicity region gender occupation
## 1  5.10         8         21  35  hispanic  other female     worker
## 2  4.95         9         42  57      cauc  other female     worker
## 3  6.67        12          1  19      cauc  other   male     worker
## 4  4.00        12          4  22      cauc  other   male     worker
## 5  7.50        12         17  35      cauc  other   male     worker
## 6 13.07        13          9  28      cauc  other   male     worker
##          sector union married
## 1 manufacturing    no     yes
## 2 manufacturing    no     yes
## 3 manufacturing    no      no
## 4         other    no      no
## 5         other    no     yes
## 6         other   yes      no

Notice that many of the variables do not contain numbers, but instead contain characters (words). For example, the ethnicity variable takes on the values “hispanic”, “cauc”, and “other”. In order to use variables such as ethnicity, region, and gender, we need to create dummy variables from them. From the ethnicity variable, for example, we would create only 2 dummies for the 3 categories (in order to avoid the dummy variable trap).
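To see exactly which dummy columns R builds from a factor, you can inspect the design matrix on a toy example (this small data frame is made up for illustration; it is not part of the CPS data):

```r
# Toy data frame, just to see how R expands a factor into dummies:
# model.matrix() keeps an intercept and drops one category
# ("cauc", first alphabetically) as the base.
toy <- data.frame(ethnicity = c("hispanic", "cauc", "other", "cauc"))
model.matrix(~ ethnicity, data = toy)
```

The result has columns (Intercept), ethnicityhispanic, and ethnicityother: 2 dummies for 3 categories, with “cauc” absorbed into the intercept.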

R will recognize that variables like ethnicity contain several categories, and will automatically create the system of dummies for us. To see this, try:

summary(lm(wage ~ ethnicity, data = cps))
## Call:
## lm(formula = wage ~ ethnicity, data = cps)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.278 -3.765 -1.278  2.205 35.222 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         9.2779     0.2439  38.032   <2e-16 ***
## ethnicityhispanic  -1.9946     1.0146  -1.966   0.0498 *  
## ethnicityother     -1.2196     0.6711  -1.817   0.0697 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.117 on 531 degrees of freedom
## Multiple R-squared:  0.01227,    Adjusted R-squared:  0.008545 
## F-statistic: 3.297 on 2 and 531 DF,  p-value: 0.03776

How would you interpret the estimated coefficients?

Answer: According to this data, Hispanic workers earn 1.9946 less per hour on average than Caucasian workers (the omitted base category), and workers of other ethnicities earn 1.2196 less per hour on average than Caucasian workers.
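A way to convince yourself of this interpretation: with a single factor regressor, each dummy coefficient equals the difference between that group's mean and the base group's mean. A toy illustration (synthetic data, not the CPS sample):

```r
# Synthetic data: two groups, "a" (the base) and "b".
toy <- data.frame(wage  = c(5, 7, 4, 6, 9),
                  group = c("a", "a", "b", "b", "b"))
means <- tapply(toy$wage, toy$group, mean)
# The slope on the "b" dummy equals means["b"] - means["a"].
coef(lm(wage ~ group, data = toy))["groupb"]
```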

Estimate the returns to education

Let’s try some models in order to estimate the returns to education. Start with a simple model:

$wage = \beta_0 + \beta_1 education + \beta_2 gender + \epsilon$

Estimate this population model, and view the results, using:

model1 <- lm(wage ~ education + gender, data = cps)
summary(model1)
## Call:
## lm(formula = wage ~ education + gender, data = cps)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.888 -2.997 -0.709  2.255 35.888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.90623    1.04354  -1.827   0.0683 .  
## education    0.75128    0.07682   9.779  < 2e-16 ***
## gendermale   2.12406    0.40283   5.273 1.96e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.639 on 531 degrees of freedom
## Multiple R-squared:  0.1884, Adjusted R-squared:  0.1853 
## F-statistic: 61.62 on 2 and 531 DF,  p-value: < 2.2e-16

What are the estimated returns to education?

Answer: An additional year of education is associated with an additional $0.75 in wages.

LS predicted values for a representative case

One use of the estimated model above (model1) is to create an OLS predicted value for a “representative case”. That is, we could use the model to predict the wage of an individual with certain characteristics. How much (according to the estimated model) can a woman with 12 years of education expect to make? We can get this number with the following code:

predict(model1, data.frame(education = 12, gender = "female"))
##        1 
## 7.109175

This is the same as $-1.906 + 0.751(12) + 2.124(0) \approx 7.11$ (the small difference from 7.109 comes from rounding the coefficients).
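The same “plugging in” can be done with the stored coefficients, which avoids the rounding error (this assumes model1 from above is still in the workspace):

```r
# Reproduce the predicted value by hand from the stored coefficients:
# intercept + 12 years of education + 0 for the male dummy (female).
b <- coef(model1)
b["(Intercept)"] + b["education"] * 12 + b["gendermale"] * 0
```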

We just “plug” the numbers into the estimated equation. Using models to make predictions is very useful: how much will a house sell for? How much product will you sell at a certain price? How much GDP will be lost if the temperature rises by 3$^{\circ}$C?
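A point prediction on its own hides the uncertainty around it. predict() can also report an interval (this again assumes model1 is in the workspace):

```r
# Add a 95% prediction interval around the point prediction.
predict(model1, data.frame(education = 12, gender = "female"),
        interval = "prediction", level = 0.95)
```

The interval here is wide, reflecting how much individual wages vary around the regression line.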

Avoiding OVB - estimate a bigger model

In order to avoid omitted variable bias, we need to include variables that are correlated with education, and are determinants of wage. Notice that things like age and experience are correlated with education (think about why):

cor(cps$education, cps$experience)
## [1] -0.3526764
cor(cps$education, cps$age)
## [1] -0.1500195
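One reason to expect a negative correlation: in many CPS extracts, “experience” is *potential* experience, defined as age minus years of education minus 6 (schooling starts around age 6), so more education mechanically means less experience at a given age. You can check whether that identity holds in this sample (assumes cps is loaded):

```r
# Check the potential-experience identity (assumption about how this
# particular data set was constructed; returns TRUE if it holds).
all(cps$experience == cps$age - cps$education - 6)
```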

Hence, when we add these variables to the regression, we should expect the estimate for the returns to education to change. Estimate a model using all of the variables available in the data set:

model2 <- lm(wage ~ education + experience + age + ethnicity + region + gender 
             + occupation + sector + union + married, data = cps)
summary(model2)
## Call:
## lm(formula = wage ~ education + experience + age + ethnicity + 
##     region + gender + occupation + sector + union + married, 
##     data = cps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.409  -2.486  -0.631   1.872  35.021 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.6508     6.7203   0.246  0.80605    
## education             0.8128     1.0869   0.748  0.45491    
## experience            0.2448     1.0818   0.226  0.82103    
## age                  -0.1580     1.0809  -0.146  0.88382    
## ethnicityhispanic    -0.6065     0.8696  -0.697  0.48584    
## ethnicityother       -0.8379     0.5745  -1.458  0.14532    
## regionsouth          -0.5627     0.4198  -1.340  0.18070    
## gendermale            1.9425     0.4194   4.631 4.60e-06 ***
## occupationoffice     -3.2682     0.7626  -4.286 2.17e-05 ***
## occupationsales      -4.0638     0.9159  -4.437 1.12e-05 ***
## occupationservices   -3.9754     0.8108  -4.903 1.26e-06 ***
## occupationtechnical  -1.3336     0.7289  -1.829  0.06791 .  
## occupationworker     -3.2905     0.8005  -4.111 4.59e-05 ***
## sectormanufacturing   0.5635     0.9915   0.568  0.57008    
## sectorother          -0.4774     0.9661  -0.494  0.62141    
## unionyes              1.6017     0.5127   3.124  0.00188 ** 
## marriedyes            0.3005     0.4112   0.731  0.46523    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.282 on 517 degrees of freedom
## Multiple R-squared:  0.3265, Adjusted R-squared:  0.3056 
## F-statistic: 15.66 on 16 and 517 DF,  p-value: < 2.2e-16

Look at the results, and interpret the estimates. Which variables are statistically significant?

Answer: gender, occupation, and union are the only variables that appear to be statistically significant (at the 5% level).

Note: We can use the code lm(wage ~ ., data = cps) to estimate the above model, instead of typing every variable out. The . tells R to include every variable in the data set on the right-hand-side of the population model.
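You can verify that the shorthand fits the same model (this assumes cps and model2 from above are in the workspace; note that in this data set every column other than wage is a regressor, so the two formulas coincide):

```r
# The "." shorthand should produce identical coefficient estimates.
model2.dot <- lm(wage ~ ., data = cps)
all.equal(coef(model2), coef(model2.dot))
```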

Adjusted R-squared

What percentage of the variance in wage can be explained using the model? Now that we are working with a multiple regression model, we need to look at the “Adjusted R-squared” instead of the “Multiple R-squared”. The answer is 30.56%.

Review: single hypothesis test (t-test)

Using the above estimated model, test the hypothesis that the returns to education are zero. The null and alternative hypotheses are:

$H_0: \beta_{educ} = 0$

$H_A: \beta_{educ} \neq 0$

The t-statistic associated with this test is:

$t = \frac{b_{educ} - 0}{s.e.(b_{educ})}$

Take another look at the summary for model2, and note that $b_{educ} = 0.8128$ and that $s.e.(b_{educ}) = 1.0869$. The t-statistic is then:

test1 <- 0.8128 / 1.0869
test1
## [1] 0.7478149

Notice that test1 pops up in the top-right environment window. Find this value in the summary(model2) table.

We need a p-value for this t-statistic, but first we need to know the degrees of freedom for this t-distribution. The sample size is 534 (see in the top-right), and we have estimated 17 $\beta$s (count them), so that the degrees of freedom are $534 - 17 = 517$. Find the degrees of freedom in the summary(model2) table.

Now our p-value is found by:

(1 - pt(test1, 517)) * 2
## [1] 0.4549118

The pt() command gives the probability to the left of the t-statistic. Since our t-statistic is greater than zero, we want the area to the right, so we use 1 - pt(), and since our alternative hypothesis is two-sided we multiply by two: * 2. Be sure that you can find this p-value in the summary(model2) table.
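An equivalent form that works whether the t-statistic is positive or negative is to double the upper-tail area of its absolute value:

```r
# Two-sided p-value via the upper tail of |t|; this matches the
# (1 - pt(test1, 517)) * 2 calculation above.
test1 <- 0.8128 / 1.0869
2 * pt(abs(test1), df = 517, lower.tail = FALSE)
```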

Notice that all tests of “no effect” have been automatically performed by R, but that any test where the $\beta$ takes a non-zero value under the null hypothesis needs to be calculated manually.

Multiple hypothesis test (F-test)

Education, experience, and age are variables that are highly multicollinear (see Section 6.4.2 in the textbook on imperfect multicollinearity). They appear to be statistically insignificant (according to their large p-values). If we want to drop them from the model, we need to test to see if they are jointly insignificant:

$H_0: \beta_{education} = 0 \text{ and } \beta_{experience} = 0 \text{ and } \beta_{age} = 0$

$H_A: \text{not } H_0$

  • Since it is a joint hypothesis (the hypothesis includes multiple restrictions on the $\beta$), we need to use an F-test.
  • To calculate the F-statistic, we can estimate two different models, and compare them using the anova() function.
  • One model we need to estimate is the unrestricted model (the model under the alternative hypothesis).
  • The other model we need to estimate is the restricted model (under the null hypothesis).

Unrestricted model

This has already been estimated as model2.

Restricted model

We take the unrestricted model, and substitute in the values for the $\beta$ that have been specified in $H_0$:

model2.restricted <- lm(wage ~ ethnicity + region + gender + occupation + sector
                        + union + married, data = cps)

anova()

We put both estimated models into the anova() function:

anova(model2, model2.restricted)
Analysis of Variance Table

Model 1: wage ~ education + experience + age + ethnicity + region + gender + 
    occupation + sector + union + married
Model 2: wage ~ ethnicity + region + gender + occupation + sector + union + 
    married
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    517  9480.8                                  
2    520 10396.0 -3   -915.19 16.636 2.481e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The F-statistic is 16.636. According to the table below, the 5% critical value is 2.60 (we have three restrictions in $H_0$, so $q = 3$). Since the F-statistic is greater than 2.60, we reject the null that education, experience, and age have no effect on wage.

We can also look at the p-value to make our “reject” or “fail to reject” decision. The p-value for this test is 2.481e-10 or 0.0000000002481. We reject the null at any common significance level.
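The F-statistic can also be computed by hand from the two residual sums of squares (RSS) reported in the anova() table:

```r
# F = [(RSS_restricted - RSS_unrestricted) / q] / [RSS_unrestricted / df]
rss.r <- 10396.0   # RSS of the restricted model (from the anova table)
rss.u <- 9480.8    # RSS of the unrestricted model
q  <- 3            # number of restrictions in H0
df <- 517          # residual degrees of freedom, unrestricted model
((rss.r - rss.u) / q) / (rss.u / df)   # ≈ 16.64
```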

The results of this test indicate that we should not drop these three variables from the model.

  q   5% critical value
  1   3.84
  2   3.00
  3   2.60
  4   2.37
  5   2.21
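Rather than using the table, R can compute F critical values directly with qf(); the table's values correspond to a very large denominator degrees of freedom:

```r
# 5% critical values for an F-test with q = 3 restrictions:
qf(0.95, df1 = 3, df2 = Inf)   # the table's large-sample value (≈ 2.60)
qf(0.95, df1 = 3, df2 = 517)   # exact value for our residual df
```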

Another F-test

Test the joint significance of the ethnicity, sector, and married variables (which all appear to be statistically insignificant according to the t-tests):

model3 <- lm(wage ~ education + experience + age + region + gender 
             + occupation + union, data = cps)
anova(model2, model3)
Analysis of Variance Table

Model 1: wage ~ education + experience + age + ethnicity + region + gender + 
    occupation + sector + union + married
Model 2: wage ~ education + experience + age + region + gender + occupation + 
    union
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    517 9480.8                           
2    522 9599.2 -5   -118.44 1.2918 0.2659

With a p-value of 0.2659, we fail to reject the null that these variables have no effect on wage. This suggests that we might drop the variables from the model. The final estimated model is:

summary(model3)
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)           0.7047     6.6569   0.106 0.915739    
education             0.7682     1.0865   0.707 0.479852    
experience            0.1896     1.0818   0.175 0.860961    
age                  -0.0958     1.0808  -0.089 0.929402    
regionsouth          -0.6878     0.4155  -1.655 0.098503 .  
gendermale            1.8471     0.4161   4.439 1.11e-05 ***
occupationoffice     -3.3463     0.7608  -4.398 1.32e-05 ***
occupationsales      -3.9753     0.9151  -4.344 1.68e-05 ***
occupationservices   -4.1478     0.8061  -5.145 3.78e-07 ***
occupationtechnical  -1.2713     0.7287  -1.745 0.081651 .  
occupationworker     -2.7993     0.7573  -3.697 0.000242 ***
unionyes              1.5180     0.5089   2.983 0.002988 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.288 on 522 degrees of freedom
Multiple R-squared:  0.3181,	Adjusted R-squared:  0.3037 
F-statistic: 22.13 on 11 and 522 DF,  p-value: < 2.2e-16