Lab 4
Download the data
This lab uses life expectancy and health care expenditure data from Greene, example 6.10 (Greene, W. (2012). H.(2012): Econometric analysis. Journal of Boston: Pearson Education, 803-806). Download the data from the website using:
health <- read.csv("https://rtgodwin.com/data/health.csv")
Explore the data
The data is from 1997, and contains information on 191 countries with:
Variable | Description |
---|---|
DALE | Disability adjusted life expectancy, in years |
HEXP | Per capita health care expenditure |
HC3 | Average years of education |
OECD | A dummy variable =1 if country is in OECD, 0 otherwise |
GINI | The GINI coefficient of income inequality |
GEFF | An index measuring government effectiveness |
VOICE | An index measuring the level of democracy |
TROPICS | A dummy variable =1 if country is in the tropics, 0 otherwise |
POPDEN | A measure of population density |
PUBTHE | Proportion of health expenditure paid by bublic authorities |
GDPC | GDP per capita |
In this lab, we want to investigate whether OECD countries are more efficient in their delivery of health care (DALE), through spending and education. That is, we want to see if the effect of HEXP and HC3 on DALE is different for OECD vs. non-OECD countries.
Before we can do this, we need to specify the right functional form. We need to see if there are non-linearities between the variables of interest.
Create a plot of life expectancy and education:
plot(health$HC3, health$DALE)
There appears to be a possible non-linear (concave) relationship between education and life expectancy. To capture this, I will use a polynomial of HC3. Next, let’s look at a plot of life expectancy and health spending:
plot(health$HEXP, health$DALE)
There is definitely a non-linear relationship here. It is possible that, instead of a dollar amount change in spending, a proportional (percentage) change in spending has a constant effect of life expectancy. To capture this idea, we take the log of spending:
plot(log(health$HEXP), health$DALE)
We have “linearized” the relationship! Before we start modelling, let’s colour-code the data point by OECD status:
health$col <- "red"
health$col[health$OECD == 1] <- "blue"
plot(log(health$HEXP), health$DALE,
main = "Health care expenditure and life expectancy by OECD status",
xlab = "log health expenditure", ylab = "disability adjusted life expectancy",
pch = 16, cex = 1, col = health$col)
legend("bottomright", c("non OECD", "OECD"), pch = 16, col = c("red", "blue"))
From the plot, it is difficult to tell whether the effect of spending on life expectancy differs between OECD and non-OECD countries.
Estimate a model with interaction terms
In order to allow the effect of education and spending to differ by OECD status, we need to allow the OECD dummy to interact with the education and spending. We also need to allow for the non-linear relationship between the variables. To accomplish all this, we’ll estimate the model:
\[DALE = \beta_0 + \beta_1 \log(HEXP) + \beta_2 HC3 + \beta_3HC3^2 + \beta_4HC3^3\] \[+ \beta_5OECD + \beta_{13}[OECD \times log(HEXP)] + \beta_{14}[OECD \times HC3] + \beta_{15}[OECD \times HC3^2] + \beta_{16}[OECD \times HC3^3]\] \[+ \beta_6GINI + \beta_7TROPICS + \beta_8POPDEN + \beta_9PUBTHE + \beta_{10}GDPC + \beta_{11}VOICE + \beta_{12}GEFF + \epsilon\]In the model above, the 1st line contains the main variables of interest (health expenditure and education), the 2nd line contains the OECD dummy interacting with the main variables of interest, and the 3rd line contains the “controls” (variables that we may need to avoid omitted variable bias).
To estimate this model in R, we can use:
mod1 <- lm(DALE ~ log(HEXP) + HC3 + I(HC3^2) + I(HC3^3) + OECD + OECD*log(HEXP)
+ OECD*HC3 + OECD*I(HC3^2) + OECD*I(HC3^3) + GINI + TROPICS + POPDEN
+ PUBTHE + GDPC + VOICE + GEFF, data = health)
summary(mod1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.596e+01 6.262e+00 4.145 5.30e-05 ***
log(HEXP) 5.364e+00 1.098e+00 4.885 2.33e-06 ***
HC3 3.779e+00 3.015e+00 1.253 0.21177
I(HC3^2) -1.373e-01 5.940e-01 -0.231 0.81748
I(HC3^3) -6.023e-03 3.545e-02 -0.170 0.86531
OECD 1.616e+01 8.266e+01 0.195 0.84525
GINI -2.475e+01 7.091e+00 -3.491 0.00061 ***
TROPICS -2.336e+00 1.164e+00 -2.007 0.04625 *
POPDEN 1.082e-04 1.813e-04 0.597 0.55148
PUBTHE -2.528e-03 2.597e-02 -0.097 0.92258
GDPC -1.834e-04 1.921e-04 -0.955 0.34097
VOICE 5.840e-01 8.521e-01 0.685 0.49406
GEFF 9.972e-02 1.091e+00 0.091 0.92731
log(HEXP):OECD -7.010e-01 2.527e+00 -0.277 0.78177
HC3:OECD -1.119e+00 3.541e+01 -0.032 0.97484
I(HC3^2):OECD -1.506e-01 4.516e+00 -0.033 0.97344
I(HC3^3):OECD 1.238e-02 1.870e-01 0.066 0.94730
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.17 on 174 degrees of freedom
Multiple R-squared: 0.7694, Adjusted R-squared: 0.7481
F-statistic: 36.28 on 16 and 174 DF, p-value: < 2.2e-16
Model selection
Order of the polynomial
The polynomial in HC3 is order 3 (it goes up to a cubed term), but it looks like the cubed term may be insignificant. To test to see if the cubed term is needed (if it’s insignificant), we actually need to perform a joint hypothesis test to see if both $HC3^3$ and $OECD \times HC^3$ are jointly insignificant. To do this, we can use the F-test. Estimate a model without the cubed terms (the restricted model), and compare the two models using the anova()
function:
mod2 <- lm(DALE ~ log(HEXP) + HC3 + I(HC3^2) + OECD + OECD*log(HEXP)
+ OECD*HC3 + OECD*I(HC3^2) + GINI + TROPICS + POPDEN
+ PUBTHE + GDPC + VOICE + GEFF, data = health)
anova(mod1, mod2)
Model 1: DALE ~ log(HEXP) + HC3 + I(HC3^2) + I(HC3^3) + OECD + OECD *
log(HEXP) + OECD * HC3 + OECD * I(HC3^2) + OECD * I(HC3^3) +
GINI + TROPICS + POPDEN + PUBTHE + GDPC + VOICE + GEFF
Model 2: DALE ~ log(HEXP) + HC3 + I(HC3^2) + OECD + OECD * log(HEXP) +
OECD * HC3 + OECD * I(HC3^2) + GINI + TROPICS + POPDEN +
PUBTHE + GDPC + VOICE + GEFF
Res.Df RSS Df Sum of Sq F Pr(>F)
1 174 6624.4
2 176 6625.6 -2 -1.1719 0.0154 0.9847
The p-value is very large (0.9847), so we fail to reject the null (the restricted model) that the cubed terms have no effect. Our new model is quadratic in education (goes up to the squared term):
summary(mod2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.531e+01 5.025e+00 5.037 1.16e-06 ***
log(HEXP) 5.372e+00 1.079e+00 4.979 1.52e-06 ***
HC3 4.253e+00 1.100e+00 3.865 0.000156 ***
I(HC3^2) -2.367e-01 9.325e-02 -2.539 0.011999 *
OECD 1.964e+01 2.497e+01 0.786 0.432680
GINI -2.482e+01 7.029e+00 -3.531 0.000528 ***
TROPICS -2.336e+00 1.157e+00 -2.020 0.044924 *
POPDEN 1.090e-04 1.800e-04 0.605 0.545702
PUBTHE -1.865e-03 2.513e-02 -0.074 0.940942
GDPC -1.835e-04 1.873e-04 -0.980 0.328595
VOICE 5.744e-01 8.454e-01 0.679 0.497767
GEFF 1.169e-01 1.079e+00 0.108 0.913862
log(HEXP):OECD -6.868e-01 2.278e+00 -0.301 0.763393
HC3:OECD -2.808e+00 5.986e+00 -0.469 0.639549
I(HC3^2):OECD 1.040e-01 3.580e-01 0.290 0.771842
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.136 on 176 degrees of freedom
Multiple R-squared: 0.7693, Adjusted R-squared: 0.751
F-statistic: 41.92 on 14 and 176 DF, p-value: < 2.2e-16
Next, wee see if the squared terms are needed. We drop them from the model, and again perform an F-test where we compare model 2 and model 3:
mod3 <- lm(DALE ~ log(HEXP) + HC3 + OECD + OECD*log(HEXP)
+ OECD*HC3 + GINI + TROPICS + POPDEN
+ PUBTHE + GDPC + VOICE + GEFF, data = health)
anova(mod2, mod3)
Model 1: DALE ~ log(HEXP) + HC3 + I(HC3^2) + OECD + OECD * log(HEXP) +
OECD * HC3 + OECD * I(HC3^2) + GINI + TROPICS + POPDEN +
PUBTHE + GDPC + VOICE + GEFF
Model 2: DALE ~ log(HEXP) + HC3 + OECD + OECD * log(HEXP) + OECD * HC3 +
GINI + TROPICS + POPDEN + PUBTHE + GDPC + VOICE + GEFF
Res.Df RSS Df Sum of Sq F Pr(>F)
1 176 6625.6
2 178 6876.0 -2 -250.46 3.3265 0.03819 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
With a p-value of 0.03819 we reject the null hypothesis (that the squared terms are not needed), and decide to keep the model with the squared terms (model 2).
Dropping insignificant variables
Several of the controls look insignificant. We test the null hypothesis that they are not needed by dropping them all (restricted model under the null), and then comparing them to model 2 (unrestricted model under the alternative hypothesis).
mod4 <- lm(DALE ~ log(HEXP) + HC3 + I(HC3^2) + OECD + OECD*log(HEXP)
+ OECD*HC3 + OECD*I(HC3^2) + GINI + TROPICS, data = health)
anova(mod2, mod4)
Model 1: DALE ~ log(HEXP) + HC3 + I(HC3^2) + OECD + OECD * log(HEXP) +
OECD * HC3 + OECD * I(HC3^2) + GINI + TROPICS + POPDEN +
PUBTHE + GDPC + VOICE + GEFF
Model 2: DALE ~ log(HEXP) + HC3 + I(HC3^2) + OECD + OECD * log(HEXP) +
OECD * HC3 + OECD * I(HC3^2) + GINI + TROPICS
Res.Df RSS Df Sum of Sq F Pr(>F)
1 176 6625.6
2 181 6709.3 -5 -83.706 0.4447 0.8167
With a p-value of 0.8167 we favour the restricted model (we fail to reject the null). The controls that we dropped are jointly insignificant. We now have the model:
summary(mod4)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.34268 3.62726 6.987 5.20e-11 ***
log(HEXP) 4.81602 0.60615 7.945 2.00e-13 ***
HC3 4.42646 1.04809 4.223 3.81e-05 ***
I(HC3^2) -0.24207 0.08979 -2.696 0.007684 **
OECD 22.82463 23.61794 0.966 0.335128
GINI -22.58137 6.74951 -3.346 0.000998 ***
TROPICS -2.20459 1.10996 -1.986 0.048522 *
log(HEXP):OECD -1.60578 2.01375 -0.797 0.426259
HC3:OECD -2.14219 5.86605 -0.365 0.715400
I(HC3^2):OECD 0.06755 0.35046 0.193 0.847381
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.088 on 181 degrees of freedom
Multiple R-squared: 0.7664, Adjusted R-squared: 0.7548
F-statistic: 65.98 on 9 and 181 DF, p-value: < 2.2e-16
Test the hypothesis that OECD countries are more efficient at producing life expectancy
We want to test if the effect of health care expenditure, and education, is different between OECD and non-OECD countries. To test this hypothesis, we can estimate two models. The unrestricted model mod4
has already been estimated above. The restricted model is obtained by dropping all of the interaction terms. The restricted model does not have interaction variables that allow OECD to have differeing effects:
mod5 <- lm(DALE ~ log(HEXP) + HC3 + I(HC3^2) + OECD + GINI + TROPICS,
data = health)
anova(mod4, mod5)
Model 1: DALE ~ log(HEXP) + HC3 + I(HC3^2) + OECD + OECD * log(HEXP) +
OECD * HC3 + OECD * I(HC3^2) + GINI + TROPICS
Model 2: DALE ~ log(HEXP) + HC3 + I(HC3^2) + OECD + GINI + TROPICS
Res.Df RSS Df Sum of Sq F Pr(>F)
1 181 6709.3
2 184 6814.3 -3 -105.02 0.9444 0.4204
With a p-value of 0.4204, we fail to reject the null hypothesis (we choose mod5
in favour of mod4
. We cpnclude that OECD countries are not more efficient than non-OECD countries, in terms of increasing life expectancy through education and health care spending.
Interpreting the coefficients
As an aside, let’s interpret some of the estimated coefficients from mod5
.
summary(mod5)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.85363 3.56998 6.962 5.74e-11 ***
log(HEXP) 4.60608 0.57572 8.001 1.35e-13 ***
HC3 5.07654 0.90346 5.619 7.01e-08 ***
I(HC3^2) -0.30281 0.07346 -4.122 5.67e-05 ***
OECD -1.27083 1.79905 -0.706 0.48084
GINI -22.09751 6.65420 -3.321 0.00108 **
TROPICS -2.31260 1.10483 -2.093 0.03770 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.086 on 184 degrees of freedom
Multiple R-squared: 0.7627, Adjusted R-squared: 0.755
F-statistic: 98.59 on 6 and 184 DF, p-value: < 2.2e-16
The estimated coefficient 4.60608
is interpreted as: a 1\% increase in per capita health care spending is associated with an increase in life expectancy of 0.046 years (about 17 days). Education has a positive (the sign on HC3
is positive) but diminishing (the sign on HC3sq
is negative) effect on life expectancy. For countries with low levels of education the effect is large:
life.for.3.edu <- predict(mod5, data.frame(HEXP=100, HC3=3, OECD=0, GINI=.5, TROPICS=0))
life.for.2.edu <- predict(mod5, data.frame(HEXP=100, HC3=2, OECD=0, GINI=.5, TROPICS=0))
life.for.3.edu - life.for.2.edu
3.562519
For countries with only 2 years of average education, an extra year of education increases life expectancy by 3.5 years on average. In the above R code, we used mod5
to get LS predicted values for two representative countries (one with 3 years and one with 2 years), and took the difference. The other values that we chose for the “x” variables could be anything; the effects cancel out when we take differences.
We repeat the same procedure for a country with 8 years and 7 years of average education:
life.for.8.edu <- predict(mod5, data.frame(HEXP=100, HC3=8, OECD=0, GINI=.5, TROPICS=0))
life.for.7.edu <- predict(mod5, data.frame(HEXP=100, HC3=7, OECD=0, GINI=.5, TROPICS=0))
life.for.8.edu - life.for.7.edu
0.5344686
The effect has diminished to 0.5 years.