Chapter 17: Limited Dependent Variable Models and Sample Selection Corrections
Example 17.1
First, look at the linear model.
data("mroz")
lm.17.1 <- lm(inlf ~ nwifeinc + educ + exper + expersq + age +
kidslt6 + kidsge6, data = mroz)
summary(lm.17.1)
##
## Call:
## lm(formula = inlf ~ nwifeinc + educ + exper + expersq + age +
## kidslt6 + kidsge6, data = mroz)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93432 -0.37526 0.08833 0.34404 0.99417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5855192 0.1541780 3.798 0.000158 ***
## nwifeinc -0.0034052 0.0014485 -2.351 0.018991 *
## educ 0.0379953 0.0073760 5.151 3.32e-07 ***
## exper 0.0394924 0.0056727 6.962 7.38e-12 ***
## expersq -0.0005963 0.0001848 -3.227 0.001306 **
## age -0.0160908 0.0024847 -6.476 1.71e-10 ***
## kidslt6 -0.2618105 0.0335058 -7.814 1.89e-14 ***
## kidsge6 0.0130122 0.0131960 0.986 0.324415
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4271 on 745 degrees of freedom
## Multiple R-squared: 0.2642, Adjusted R-squared: 0.2573
## F-statistic: 38.22 on 7 and 745 DF, p-value: < 2.2e-16
To calculated the percentage share of correct predictions I create dummy variables, which equal unity when the fitted values are above or equal to 0.5 and zero else. Next, I create a vector of dummies that are unity when both the fitted prediction and the observation of the dependent variable are unity. If they have different values, i.e. (1,0) or (0,1), the dummy is zero. The mean of this latter vector gives the percentage share of correctly predicted observations.
lm.fitted <- as.numeric(lm.17.1$fitted.values >= .5)
corr.pred.lm <- as.numeric(lm.fitted == mroz$inlf)
mean(corr.pred.lm)
## [1] 0.7343958
To estimate the Logit model, we cannot use the lm
function. Instead, we have to use the glm
command. Specify the estimation model, the dataset and the family of the estimation procedure, which in our case is logit. Estimate the model, save it and use the summary
command to get an overview of the results.
logit.17.1 <- glm(inlf ~ nwifeinc + educ + exper + expersq + age +
kidslt6 + kidsge6,
data = mroz,
family = binomial(link = "logit"))
summary(logit.17.1)
##
## Call:
## glm(formula = inlf ~ nwifeinc + educ + exper + expersq + age +
## kidslt6 + kidsge6, family = binomial(link = "logit"), data = mroz)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1770 -0.9063 0.4473 0.8561 2.4032
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.425452 0.860365 0.495 0.62095
## nwifeinc -0.021345 0.008421 -2.535 0.01126 *
## educ 0.221170 0.043439 5.091 3.55e-07 ***
## exper 0.205870 0.032057 6.422 1.34e-10 ***
## expersq -0.003154 0.001016 -3.104 0.00191 **
## age -0.088024 0.014573 -6.040 1.54e-09 ***
## kidslt6 -1.443354 0.203583 -7.090 1.34e-12 ***
## kidsge6 0.060112 0.074789 0.804 0.42154
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1029.75 on 752 degrees of freedom
## Residual deviance: 803.53 on 745 degrees of freedom
## AIC: 819.53
##
## Number of Fisher Scoring iterations: 4
The share of correctly predicted observations is retrieved in the same way as it was done with the linear model.
logit.fitted <- as.numeric(logit.17.1$fitted.values >= .5)
corr.pred.logit <- as.numeric(logit.fitted==mroz$inlf)
mean(corr.pred.logit)
## [1] 0.7357238
To get the log-likelihood, there is simple command, which just needs the estimated model output.
logLik(logit.17.1)
## 'log Lik.' -401.7652 (df=8)
Probit estimation works in the same way as Logit, except that you have to specify a different estimation procedure in the family argument.
probit.17.1 <- glm(inlf ~ nwifeinc + educ + exper + expersq + age +
kidslt6 + kidsge6,
data = mroz,
family = binomial(link = "probit"))
summary(probit.17.1)
##
## Call:
## glm(formula = inlf ~ nwifeinc + educ + exper + expersq + age +
## kidslt6 + kidsge6, family = binomial(link = "probit"), data = mroz)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2156 -0.9151 0.4315 0.8653 2.4553
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2700736 0.5080782 0.532 0.59503
## nwifeinc -0.0120236 0.0049392 -2.434 0.01492 *
## educ 0.1309040 0.0253987 5.154 2.55e-07 ***
## exper 0.1233472 0.0187587 6.575 4.85e-11 ***
## expersq -0.0018871 0.0005999 -3.145 0.00166 **
## age -0.0528524 0.0084624 -6.246 4.22e-10 ***
## kidslt6 -0.8683247 0.1183773 -7.335 2.21e-13 ***
## kidsge6 0.0360056 0.0440303 0.818 0.41350
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1029.7 on 752 degrees of freedom
## Residual deviance: 802.6 on 745 degrees of freedom
## AIC: 818.6
##
## Number of Fisher Scoring iterations: 4
probit.fitted <- as.numeric(probit.17.1$fitted.values >= .5)
corr.pred.probit <- as.numeric(probit.fitted==mroz$inlf)
mean(corr.pred.probit)
## [1] 0.7343958
Log-likelihood
logLik(probit.17.1)
## 'log Lik.' -401.3022 (df=8)
Example 17.2
data("mroz")
lm.17.2 <- lm.1 <- lm(hours ~ nwifeinc + educ + exper + expersq + age +
kidslt6 + kidsge6,
data = mroz)
summary(lm.17.2)
##
## Call:
## lm(formula = hours ~ nwifeinc + educ + exper + expersq + age +
## kidslt6 + kidsge6, data = mroz)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1511.3 -537.8 -146.9 538.1 3555.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1330.4824 270.7846 4.913 1.10e-06 ***
## nwifeinc -3.4466 2.5440 -1.355 0.1759
## educ 28.7611 12.9546 2.220 0.0267 *
## exper 65.6725 9.9630 6.592 8.23e-11 ***
## expersq -0.7005 0.3246 -2.158 0.0312 *
## age -30.5116 4.3639 -6.992 6.04e-12 ***
## kidslt6 -442.0899 58.8466 -7.513 1.66e-13 ***
## kidsge6 -32.7792 23.1762 -1.414 0.1577
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 750.2 on 745 degrees of freedom
## Multiple R-squared: 0.2656, Adjusted R-squared: 0.2587
## F-statistic: 38.5 on 7 and 745 DF, p-value: < 2.2e-16
For some reason, my results for sigma are slightly different from the results in Wooldridge (2013).
sd(lm.17.2$residuals)
## [1] 746.6789
To estimate a Tobit model, we have to load the AER
package. The rest is similar to linear regression methods. Just enter the formula and the data set and you will get your estimates.
tobit.17.2 <- tobit(hours ~ nwifeinc + educ + exper + expersq + age +
kidslt6 + kidsge6,
data = mroz)
summary(tobit.17.2)
##
## Call:
## tobit(formula = hours ~ nwifeinc + educ + exper + expersq + age +
## kidslt6 + kidsge6, data = mroz)
##
## Observations:
## Total Left-censored Uncensored Right-censored
## 753 325 428 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 965.30528 446.43614 2.162 0.030599 *
## nwifeinc -8.81424 4.45910 -1.977 0.048077 *
## educ 80.64561 21.58324 3.736 0.000187 ***
## exper 131.56430 17.27939 7.614 2.66e-14 ***
## expersq -1.86416 0.53766 -3.467 0.000526 ***
## age -54.40501 7.41850 -7.334 2.24e-13 ***
## kidslt6 -894.02174 111.87804 -7.991 1.34e-15 ***
## kidsge6 -16.21800 38.64139 -0.420 0.674701
## Log(scale) 7.02289 0.03706 189.514 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Scale: 1122
##
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4
## Log-likelihood: -3819 on 9 Df
## Wald-statistic: 253.9 on 7 Df, p-value: < 2.22e-16
To obtain an R-squared value, calculate fitted values by inserting the (scaled 1122) linear predictors that are contained in the estimation results into equation 17.25 and calculate the square of the correlation between those fitted values and the real observations.
fitted <- pnorm(tobit.17.2$linear.predictors / 1122) * tobit.17.2$linear.predictors +
1122 * dnorm(tobit.17.2$linear.predictors / 1122)
cor(mroz$hours,fitted)^2
## [1] 0.274244
To get the average partial effect calculate the mean of the vector, which contains the values of the cumulative normal distribution function depending on the scaled linear predictors of the estimation:
ape <- mean(pnorm(tobit.17.2$linear.predictors / 1122))
ape * 80.65
## [1] 47.4758
For the partial effect at the average we generate a vector with unity as the first value and the means of all the independent variables in the model. Then, multiply this vector with the coefficients of the model to get linear predictions, scale them and calculate the value of the cumulative normal distribution function.
# PEA
means <- c(1, mean(mroz$nwifeinc), mean(mroz$educ), mean(mroz$exper), mean(mroz$exper)^2,
mean(mroz$age), mean(mroz$kidslt6), mean(mroz$kidsge6))
pea <- pnorm(tobit.17.2$coefficients %*% means / 1122)
pea * 80.65
## [,1]
## [1,] 52.03955
Example 17.3
Estimate the linear model
data("crime1")
lm.17.3 <- lm(narr86 ~ pcnv + avgsen + tottime + ptime86 +
qemp86 + inc86 + black + hispan + born60,
data = crime1)
summary(lm.17.3)
##
## Call:
## lm(formula = narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 +
## inc86 + black + hispan + born60, data = crime1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0210 -0.4544 -0.2389 0.2686 11.5207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.576566 0.037894 15.215 < 2e-16 ***
## pcnv -0.131886 0.040404 -3.264 0.001111 **
## avgsen -0.011332 0.012241 -0.926 0.354693
## tottime 0.012069 0.009436 1.279 0.201003
## ptime86 -0.040874 0.008813 -4.638 3.69e-06 ***
## qemp86 -0.051310 0.014486 -3.542 0.000404 ***
## inc86 -0.001462 0.000343 -4.261 2.10e-05 ***
## black 0.327010 0.045426 7.199 7.83e-13 ***
## hispan 0.193809 0.039716 4.880 1.12e-06 ***
## born60 -0.022465 0.033294 -0.675 0.499902
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8287 on 2715 degrees of freedom
## Multiple R-squared: 0.07248, Adjusted R-squared: 0.0694
## F-statistic: 23.57 on 9 and 2715 DF, p-value: < 2.2e-16
Again, the standard error deviates from the book.
sd(lm.17.3$residuals)
## [1] 0.8273599
To estimate the Poisson model we use the glm
function and specify the formula, dataset and the family of the estimation procedure, i.e. Poisson. After the estimation we get the summary of the results and the log-likelihood. The R-squared of the estimation is the squared correlation of the Poisson model’s fitted values and the real observations.
poisson.17.3 <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 +
qemp86 + inc86 + black + hispan + born60,
data = crime1,
family = poisson)
summary(poisson.17.3)
##
## Call:
## glm(formula = narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 +
## inc86 + black + hispan + born60, family = poisson, data = crime1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5731 -0.9076 -0.6651 0.2183 7.4609
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.599589 0.067250 -8.916 < 2e-16 ***
## pcnv -0.401571 0.084971 -4.726 2.29e-06 ***
## avgsen -0.023772 0.019946 -1.192 0.2333
## tottime 0.024490 0.014750 1.660 0.0969 .
## ptime86 -0.098558 0.020695 -4.763 1.91e-06 ***
## qemp86 -0.038019 0.029024 -1.310 0.1902
## inc86 -0.008081 0.001041 -7.762 8.34e-15 ***
## black 0.660838 0.073834 8.950 < 2e-16 ***
## hispan 0.499813 0.073927 6.761 1.37e-11 ***
## born60 -0.051029 0.064052 -0.797 0.4256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3208.5 on 2724 degrees of freedom
## Residual deviance: 2822.2 on 2715 degrees of freedom
## AIC: 4517.5
##
## Number of Fisher Scoring iterations: 6
Log-likelihood
logLik(poisson.17.3)
## 'log Lik.' -2248.761 (df=10)
cor(poisson.17.3$fitted.values, crime1$narr86)^2
## [1] 0.07700391
The standard errors can be scaled either manually by multiplying the variance of the coefficient (squared standard error) by the scale factor and taking the square root or by specifying the dispersion factor in the summary
command.
(0.014750^2 * 1.232)^0.5
## [1] 0.01637184
summary(poisson.17.3, dispersion = 1.232)
##
## Call:
## glm(formula = narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 +
## inc86 + black + hispan + born60, family = poisson, data = crime1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5731 -0.9076 -0.6651 0.2183 7.4609
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.599589 0.074645 -8.033 9.54e-16 ***
## pcnv -0.401571 0.094314 -4.258 2.06e-05 ***
## avgsen -0.023772 0.022139 -1.074 0.283
## tottime 0.024490 0.016372 1.496 0.135
## ptime86 -0.098558 0.022970 -4.291 1.78e-05 ***
## qemp86 -0.038019 0.032216 -1.180 0.238
## inc86 -0.008081 0.001155 -6.993 2.68e-12 ***
## black 0.660838 0.081953 8.064 7.40e-16 ***
## hispan 0.499813 0.082055 6.091 1.12e-09 ***
## born60 -0.051029 0.071095 -0.718 0.473
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1.232)
##
## Null deviance: 3208.5 on 2724 degrees of freedom
## Residual deviance: 2822.2 on 2715 degrees of freedom
## AIC: 4517.5
##
## Number of Fisher Scoring iterations: 6
Example 17.4
For censored regression models we have to use the survival
package. The command Surv
prepares the dependent variable to be used in survreg
. It takes the variable “ldurat” as the interval. The argument event
tells Surv
whether a command is censored or not. I had to revert the variable cens, which is why I generated a further vector of dummies. The last option in Surv
tells if the data are censored to the left or to the right.
The independent variables and the data set are specified in the usual way. The distribution usually used is the Gaussian (referred to as “normal”) distribution.
data("recid")
surv.17.4 <- survreg(Surv(ldurat, event = as.numeric(cens!=1), type = "right") ~
workprg + priors + tserved + felon + alcohol + drugs +
black + married + educ + age,
dist = "gaussian",
data = recid)
summary(surv.17.4)
##
## Call:
## survreg(formula = Surv(ldurat, event = as.numeric(cens != 1),
## type = "right") ~ workprg + priors + tserved + felon + alcohol +
## drugs + black + married + educ + age, data = recid, dist = "gaussian")
## Value Std. Error z p
## (Intercept) 4.099386 0.347535 11.80 < 2e-16
## workprg -0.062572 0.120037 -0.52 0.6022
## priors -0.137253 0.021459 -6.40 1.6e-10
## tserved -0.019331 0.002978 -6.49 8.5e-11
## felon 0.443995 0.145087 3.06 0.0022
## alcohol -0.634909 0.144217 -4.40 1.1e-05
## drugs -0.298160 0.132736 -2.25 0.0247
## black -0.542718 0.117443 -4.62 3.8e-06
## married 0.340684 0.139843 2.44 0.0148
## educ 0.022920 0.025397 0.90 0.3668
## age 0.003910 0.000606 6.45 1.1e-10
## Log(scale) 0.593586 0.034412 17.25 < 2e-16
##
## Scale= 1.81
##
## Gaussian distribution
## Loglik(model)= -1597.1 Loglik(intercept only)= -1680.4
## Chisq= 166.74 on 10 degrees of freedom, p= 1.3e-30
## Number of Newton-Raphson Iterations: 4
## n= 1445
Example 17.5
Estimate the OLS model
data("mroz")
lm.17.5 <- lm(lwage ~ educ + exper + expersq, data = mroz)
summary(lm.17.5)
##
## Call:
## lm(formula = lwage ~ educ + exper + expersq, data = mroz)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.08404 -0.30627 0.04952 0.37498 2.37115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5220406 0.1986321 -2.628 0.00890 **
## educ 0.1074896 0.0141465 7.598 1.94e-13 ***
## exper 0.0415665 0.0131752 3.155 0.00172 **
## expersq -0.0008112 0.0003932 -2.063 0.03974 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6664 on 424 degrees of freedom
## (325 observations deleted due to missingness)
## Multiple R-squared: 0.1568, Adjusted R-squared: 0.1509
## F-statistic: 26.29 on 3 and 424 DF, p-value: 1.302e-15
To estimate the Heckit model we have to install the sampleSelection
package. It contains a neat function called heckit
, where we specify the formula for the model describing the causes of sample selection (first line) and the formula for the model of interest. Next, we specify the method used for estimation (two steps) and the sample.
heckit.17.5 <- heckit(inlf ~ educ + exper + expersq + nwifeinc + age + kidslt6 + kidsge6,
lwage ~ educ + exper + expersq,
method = "2step",
data = mroz)
summary(heckit.17.5)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 753 observations (325 censored and 428 observed)
## 15 free parameters (df = 739)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.270077 0.508593 0.531 0.59556
## educ 0.130905 0.025254 5.183 2.81e-07 ***
## exper 0.123348 0.018716 6.590 8.34e-11 ***
## expersq -0.001887 0.000600 -3.145 0.00173 **
## nwifeinc -0.012024 0.004840 -2.484 0.01320 *
## age -0.052853 0.008477 -6.235 7.61e-10 ***
## kidslt6 -0.868328 0.118522 -7.326 6.21e-13 ***
## kidsge6 0.036005 0.043477 0.828 0.40786
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5781032 0.3050062 -1.895 0.05843 .
## educ 0.1090655 0.0155230 7.026 4.83e-12 ***
## exper 0.0438873 0.0162611 2.699 0.00712 **
## expersq -0.0008591 0.0004389 -1.957 0.05068 .
## Multiple R-Squared:0.1569, Adjusted R-Squared:0.149
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## invMillsRatio 0.03226 0.13362 0.241 0.809
## sigma 0.66363 NA NA NA
## rho 0.04861 NA NA NA
## --------------------------------------------