Chapter 17: Limited Dependent Variable Models and Sample Selection Corrections

Franz X. Mohr, Created: October 7, 2018, Last update: October 7, 2018

Example 17.1

First, look at the linear model.

data("mroz")

lm.17.1 <- lm(inlf ~ nwifeinc + educ + exper + expersq + age +
                kidslt6 + kidsge6, data = mroz)
summary(lm.17.1)

## 
## Call:
## lm(formula = inlf ~ nwifeinc + educ + exper + expersq + age + 
##     kidslt6 + kidsge6, data = mroz)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93432 -0.37526  0.08833  0.34404  0.99417 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.5855192  0.1541780   3.798 0.000158 ***
## nwifeinc    -0.0034052  0.0014485  -2.351 0.018991 *  
## educ         0.0379953  0.0073760   5.151 3.32e-07 ***
## exper        0.0394924  0.0056727   6.962 7.38e-12 ***
## expersq     -0.0005963  0.0001848  -3.227 0.001306 ** 
## age         -0.0160908  0.0024847  -6.476 1.71e-10 ***
## kidslt6     -0.2618105  0.0335058  -7.814 1.89e-14 ***
## kidsge6      0.0130122  0.0131960   0.986 0.324415    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4271 on 745 degrees of freedom
## Multiple R-squared:  0.2642, Adjusted R-squared:  0.2573 
## F-statistic: 38.22 on 7 and 745 DF,  p-value: < 2.2e-16

To calculated the percentage share of correct predictions I create dummy variables, which equal unity when the fitted values are above or equal to 0.5 and zero else. Next, I create a vector of dummies that are unity when both the fitted prediction and the observation of the dependent variable are unity. If they have different values, i.e. (1,0) or (0,1), the dummy is zero. The mean of this latter vector gives the percentage share of correctly predicted observations.

lm.fitted <- as.numeric(lm.17.1$fitted.values >= .5)
corr.pred.lm <- as.numeric(lm.fitted == mroz$inlf)

mean(corr.pred.lm)

## [1] 0.7343958

To estimate the Logit model, we cannot use the lm function. Instead, we have to use the glm command. Specify the estimation model, the dataset and the family of the estimation procedure, which in our case is logit. Estimate the model, save it and use the summary command to get an overview of the results.

logit.17.1 <- glm(inlf ~ nwifeinc + educ + exper + expersq + age +
                    kidslt6 + kidsge6,
                  data = mroz,
                  family = binomial(link = "logit"))
summary(logit.17.1)

## 
## Call:
## glm(formula = inlf ~ nwifeinc + educ + exper + expersq + age + 
##     kidslt6 + kidsge6, family = binomial(link = "logit"), data = mroz)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1770  -0.9063   0.4473   0.8561   2.4032  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.425452   0.860365   0.495  0.62095    
## nwifeinc    -0.021345   0.008421  -2.535  0.01126 *  
## educ         0.221170   0.043439   5.091 3.55e-07 ***
## exper        0.205870   0.032057   6.422 1.34e-10 ***
## expersq     -0.003154   0.001016  -3.104  0.00191 ** 
## age         -0.088024   0.014573  -6.040 1.54e-09 ***
## kidslt6     -1.443354   0.203583  -7.090 1.34e-12 ***
## kidsge6      0.060112   0.074789   0.804  0.42154    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1029.75  on 752  degrees of freedom
## Residual deviance:  803.53  on 745  degrees of freedom
## AIC: 819.53
## 
## Number of Fisher Scoring iterations: 4

The share of correctly predicted observations is retrieved in the same way as it was done with the linear model.

logit.fitted <- as.numeric(logit.17.1$fitted.values >= .5)
corr.pred.logit <- as.numeric(logit.fitted==mroz$inlf)

mean(corr.pred.logit)

## [1] 0.7357238

To get the log-likelihood, there is simple command, which just needs the estimated model output.

logLik(logit.17.1)

## 'log Lik.' -401.7652 (df=8)

Probit estimation works in the same way as Logit, except that you have to specify a different estimation procedure in the family argument.

probit.17.1 <- glm(inlf ~ nwifeinc + educ + exper + expersq + age +
                     kidslt6 + kidsge6,
                   data = mroz,
                   family = binomial(link = "probit"))
summary(probit.17.1)

## 
## Call:
## glm(formula = inlf ~ nwifeinc + educ + exper + expersq + age + 
##     kidslt6 + kidsge6, family = binomial(link = "probit"), data = mroz)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2156  -0.9151   0.4315   0.8653   2.4553  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.2700736  0.5080782   0.532  0.59503    
## nwifeinc    -0.0120236  0.0049392  -2.434  0.01492 *  
## educ         0.1309040  0.0253987   5.154 2.55e-07 ***
## exper        0.1233472  0.0187587   6.575 4.85e-11 ***
## expersq     -0.0018871  0.0005999  -3.145  0.00166 ** 
## age         -0.0528524  0.0084624  -6.246 4.22e-10 ***
## kidslt6     -0.8683247  0.1183773  -7.335 2.21e-13 ***
## kidsge6      0.0360056  0.0440303   0.818  0.41350    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1029.7  on 752  degrees of freedom
## Residual deviance:  802.6  on 745  degrees of freedom
## AIC: 818.6
## 
## Number of Fisher Scoring iterations: 4

probit.fitted <- as.numeric(probit.17.1$fitted.values >= .5)
corr.pred.probit <- as.numeric(probit.fitted==mroz$inlf)

mean(corr.pred.probit)

## [1] 0.7343958

Log-likelihood

logLik(probit.17.1)

## 'log Lik.' -401.3022 (df=8)

Example 17.2

data("mroz")

lm.17.2 <- lm.1 <- lm(hours ~ nwifeinc + educ + exper + expersq + age +
                        kidslt6 + kidsge6,
                      data = mroz)
summary(lm.17.2)

## 
## Call:
## lm(formula = hours ~ nwifeinc + educ + exper + expersq + age + 
##     kidslt6 + kidsge6, data = mroz)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1511.3  -537.8  -146.9   538.1  3555.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1330.4824   270.7846   4.913 1.10e-06 ***
## nwifeinc      -3.4466     2.5440  -1.355   0.1759    
## educ          28.7611    12.9546   2.220   0.0267 *  
## exper         65.6725     9.9630   6.592 8.23e-11 ***
## expersq       -0.7005     0.3246  -2.158   0.0312 *  
## age          -30.5116     4.3639  -6.992 6.04e-12 ***
## kidslt6     -442.0899    58.8466  -7.513 1.66e-13 ***
## kidsge6      -32.7792    23.1762  -1.414   0.1577    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 750.2 on 745 degrees of freedom
## Multiple R-squared:  0.2656, Adjusted R-squared:  0.2587 
## F-statistic:  38.5 on 7 and 745 DF,  p-value: < 2.2e-16

For some reason, my results for sigma are slightly different from the results in Wooldridge (2013).

sd(lm.17.2$residuals)

## [1] 746.6789

To estimate a Tobit model, we have to load the AER package. The rest is similar to linear regression methods. Just enter the formula and the data set and you will get your estimates.

tobit.17.2 <- tobit(hours ~ nwifeinc + educ + exper + expersq + age +
                      kidslt6 + kidsge6,
                    data = mroz)
summary(tobit.17.2)

## 
## Call:
## tobit(formula = hours ~ nwifeinc + educ + exper + expersq + age + 
##     kidslt6 + kidsge6, data = mroz)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##            753            325            428              0 
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  965.30528  446.43614   2.162 0.030599 *  
## nwifeinc      -8.81424    4.45910  -1.977 0.048077 *  
## educ          80.64561   21.58324   3.736 0.000187 ***
## exper        131.56430   17.27939   7.614 2.66e-14 ***
## expersq       -1.86416    0.53766  -3.467 0.000526 ***
## age          -54.40501    7.41850  -7.334 2.24e-13 ***
## kidslt6     -894.02174  111.87804  -7.991 1.34e-15 ***
## kidsge6      -16.21800   38.64139  -0.420 0.674701    
## Log(scale)     7.02289    0.03706 189.514  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 1122 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4 
## Log-likelihood: -3819 on 9 Df
## Wald-statistic: 253.9 on 7 Df, p-value: < 2.22e-16

To obtain an R-squared value, calculate fitted values by inserting the (scaled 1122) linear predictors that are contained in the estimation results into equation 17.25 and calculate the square of the correlation between those fitted values and the real observations.

fitted <- pnorm(tobit.17.2$linear.predictors / 1122) * tobit.17.2$linear.predictors +
  1122 * dnorm(tobit.17.2$linear.predictors / 1122)

cor(mroz$hours,fitted)^2

## [1] 0.274244

To get the average partial effect calculate the mean of the vector, which contains the values of the cumulative normal distribution function depending on the scaled linear predictors of the estimation:

ape <- mean(pnorm(tobit.17.2$linear.predictors / 1122))
ape * 80.65

## [1] 47.4758

For the partial effect at the average we generate a vector with unity as the first value and the means of all the independent variables in the model. Then, multiply this vector with the coefficients of the model to get linear predictions, scale them and calculate the value of the cumulative normal distribution function.

# PEA
means <- c(1, mean(mroz$nwifeinc), mean(mroz$educ), mean(mroz$exper), mean(mroz$exper)^2,
           mean(mroz$age), mean(mroz$kidslt6), mean(mroz$kidsge6))
pea <- pnorm(tobit.17.2$coefficients %*% means / 1122)
pea * 80.65

##          [,1]
## [1,] 52.03955

Example 17.3

Estimate the linear model

data("crime1")

lm.17.3 <- lm(narr86 ~ pcnv + avgsen + tottime + ptime86 + 
                qemp86 + inc86 + black + hispan + born60,
              data = crime1)
summary(lm.17.3)

## 
## Call:
## lm(formula = narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 + 
##     inc86 + black + hispan + born60, data = crime1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0210 -0.4544 -0.2389  0.2686 11.5207 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.576566   0.037894  15.215  < 2e-16 ***
## pcnv        -0.131886   0.040404  -3.264 0.001111 ** 
## avgsen      -0.011332   0.012241  -0.926 0.354693    
## tottime      0.012069   0.009436   1.279 0.201003    
## ptime86     -0.040874   0.008813  -4.638 3.69e-06 ***
## qemp86      -0.051310   0.014486  -3.542 0.000404 ***
## inc86       -0.001462   0.000343  -4.261 2.10e-05 ***
## black        0.327010   0.045426   7.199 7.83e-13 ***
## hispan       0.193809   0.039716   4.880 1.12e-06 ***
## born60      -0.022465   0.033294  -0.675 0.499902    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8287 on 2715 degrees of freedom
## Multiple R-squared:  0.07248,    Adjusted R-squared:  0.0694 
## F-statistic: 23.57 on 9 and 2715 DF,  p-value: < 2.2e-16

Again, the standard error deviates from the book.

sd(lm.17.3$residuals)

## [1] 0.8273599

To estimate the Poisson model we use the glm function and specify the formula, dataset and the family of the estimation procedure, i.e. Poisson. After the estimation we get the summary of the results and the log-likelihood. The R-squared of the estimation is the squared correlation of the Poisson model’s fitted values and the real observations.

poisson.17.3 <- glm(narr86 ~ pcnv + avgsen + tottime + ptime86 + 
                      qemp86 + inc86 + black + hispan + born60,
                    data = crime1,
                    family = poisson)
summary(poisson.17.3)

## 
## Call:
## glm(formula = narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 + 
##     inc86 + black + hispan + born60, family = poisson, data = crime1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5731  -0.9076  -0.6651   0.2183   7.4609  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.599589   0.067250  -8.916  < 2e-16 ***
## pcnv        -0.401571   0.084971  -4.726 2.29e-06 ***
## avgsen      -0.023772   0.019946  -1.192   0.2333    
## tottime      0.024490   0.014750   1.660   0.0969 .  
## ptime86     -0.098558   0.020695  -4.763 1.91e-06 ***
## qemp86      -0.038019   0.029024  -1.310   0.1902    
## inc86       -0.008081   0.001041  -7.762 8.34e-15 ***
## black        0.660838   0.073834   8.950  < 2e-16 ***
## hispan       0.499813   0.073927   6.761 1.37e-11 ***
## born60      -0.051029   0.064052  -0.797   0.4256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3208.5  on 2724  degrees of freedom
## Residual deviance: 2822.2  on 2715  degrees of freedom
## AIC: 4517.5
## 
## Number of Fisher Scoring iterations: 6

Log-likelihood

logLik(poisson.17.3)

## 'log Lik.' -2248.761 (df=10)

cor(poisson.17.3$fitted.values, crime1$narr86)^2

## [1] 0.07700391

The standard errors can be scaled either manually by multiplying the variance of the coefficient (squared standard error) by the scale factor and taking the square root or by specifying the dispersion factor in the summary command.

(0.014750^2 * 1.232)^0.5

## [1] 0.01637184

summary(poisson.17.3, dispersion = 1.232)

## 
## Call:
## glm(formula = narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86 + 
##     inc86 + black + hispan + born60, family = poisson, data = crime1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5731  -0.9076  -0.6651   0.2183   7.4609  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.599589   0.074645  -8.033 9.54e-16 ***
## pcnv        -0.401571   0.094314  -4.258 2.06e-05 ***
## avgsen      -0.023772   0.022139  -1.074    0.283    
## tottime      0.024490   0.016372   1.496    0.135    
## ptime86     -0.098558   0.022970  -4.291 1.78e-05 ***
## qemp86      -0.038019   0.032216  -1.180    0.238    
## inc86       -0.008081   0.001155  -6.993 2.68e-12 ***
## black        0.660838   0.081953   8.064 7.40e-16 ***
## hispan       0.499813   0.082055   6.091 1.12e-09 ***
## born60      -0.051029   0.071095  -0.718    0.473    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1.232)
## 
##     Null deviance: 3208.5  on 2724  degrees of freedom
## Residual deviance: 2822.2  on 2715  degrees of freedom
## AIC: 4517.5
## 
## Number of Fisher Scoring iterations: 6

Example 17.4

For censored regression models we have to use the survival package. The command Surv prepares the dependent variable to be used in survreg. It takes the variable “ldurat” as the interval. The argument event tells Surv whether a command is censored or not. I had to revert the variable cens, which is why I generated a further vector of dummies. The last option in Surv tells if the data are censored to the left or to the right.

The independent variables and the data set are specified in the usual way. The distribution usually used is the Gaussian (referred to as “normal”) distribution.

data("recid")

surv.17.4 <- survreg(Surv(ldurat, event = as.numeric(cens!=1), type = "right") ~ 
                       workprg + priors + tserved + felon + alcohol + drugs +
                       black + married + educ + age,
                     dist = "gaussian",
                     data = recid)
summary(surv.17.4)

## 
## Call:
## survreg(formula = Surv(ldurat, event = as.numeric(cens != 1), 
##     type = "right") ~ workprg + priors + tserved + felon + alcohol + 
##     drugs + black + married + educ + age, data = recid, dist = "gaussian")
##                 Value Std. Error     z       p
## (Intercept)  4.099386   0.347535 11.80 < 2e-16
## workprg     -0.062572   0.120037 -0.52  0.6022
## priors      -0.137253   0.021459 -6.40 1.6e-10
## tserved     -0.019331   0.002978 -6.49 8.5e-11
## felon        0.443995   0.145087  3.06  0.0022
## alcohol     -0.634909   0.144217 -4.40 1.1e-05
## drugs       -0.298160   0.132736 -2.25  0.0247
## black       -0.542718   0.117443 -4.62 3.8e-06
## married      0.340684   0.139843  2.44  0.0148
## educ         0.022920   0.025397  0.90  0.3668
## age          0.003910   0.000606  6.45 1.1e-10
## Log(scale)   0.593586   0.034412 17.25 < 2e-16
## 
## Scale= 1.81 
## 
## Gaussian distribution
## Loglik(model)= -1597.1   Loglik(intercept only)= -1680.4
##  Chisq= 166.74 on 10 degrees of freedom, p= 1.3e-30 
## Number of Newton-Raphson Iterations: 4 
## n= 1445

Example 17.5

Estimate the OLS model

data("mroz")

lm.17.5 <- lm(lwage ~ educ + exper + expersq, data = mroz)
summary(lm.17.5)

## 
## Call:
## lm(formula = lwage ~ educ + exper + expersq, data = mroz)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.08404 -0.30627  0.04952  0.37498  2.37115 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5220406  0.1986321  -2.628  0.00890 ** 
## educ         0.1074896  0.0141465   7.598 1.94e-13 ***
## exper        0.0415665  0.0131752   3.155  0.00172 ** 
## expersq     -0.0008112  0.0003932  -2.063  0.03974 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6664 on 424 degrees of freedom
##   (325 observations deleted due to missingness)
## Multiple R-squared:  0.1568, Adjusted R-squared:  0.1509 
## F-statistic: 26.29 on 3 and 424 DF,  p-value: 1.302e-15

To estimate the Heckit model we have to install the sampleSelection package. It contains a neat function called heckit, where we specify the formula for the model describing the causes of sample selection (first line) and the formula for the model of interest. Next, we specify the method used for estimation (two steps) and the sample.

heckit.17.5 <- heckit(inlf ~ educ + exper + expersq + nwifeinc + age + kidslt6 + kidsge6,
                      lwage ~ educ + exper + expersq,
                      method = "2step",
                      data = mroz)
summary(heckit.17.5)

## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 753 observations (325 censored and 428 observed)
## 15 free parameters (df = 739)
## Probit selection equation:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.270077   0.508593   0.531  0.59556    
## educ         0.130905   0.025254   5.183 2.81e-07 ***
## exper        0.123348   0.018716   6.590 8.34e-11 ***
## expersq     -0.001887   0.000600  -3.145  0.00173 ** 
## nwifeinc    -0.012024   0.004840  -2.484  0.01320 *  
## age         -0.052853   0.008477  -6.235 7.61e-10 ***
## kidslt6     -0.868328   0.118522  -7.326 6.21e-13 ***
## kidsge6      0.036005   0.043477   0.828  0.40786    
## Outcome equation:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5781032  0.3050062  -1.895  0.05843 .  
## educ         0.1090655  0.0155230   7.026 4.83e-12 ***
## exper        0.0438873  0.0162611   2.699  0.00712 ** 
## expersq     -0.0008591  0.0004389  -1.957  0.05068 .  
## Multiple R-Squared:0.1569,   Adjusted R-Squared:0.149
##    Error terms:
##               Estimate Std. Error t value Pr(>|t|)
## invMillsRatio  0.03226    0.13362   0.241    0.809
## sigma          0.66363         NA      NA       NA
## rho            0.04861         NA      NA       NA
## --------------------------------------------