Statistical Inference
Before we start, let's clear the workspace, set the working directory and load the wooldridge data package.
rm(list = ls())
setwd("C:/path/to/the/working/directory")
library(wooldridge)
data(wage1)
lm.1 <- lm(lwage ~ educ + exper + tenure, data = wage1)
s.1 <- summary(lm.1)
s.1
##
## Call:
## lm(formula = lwage ~ educ + exper + tenure, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05802 -0.29645 -0.03265 0.28788 1.42809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.284360 0.104190 2.729 0.00656 **
## educ 0.092029 0.007330 12.555 < 2e-16 ***
## exper 0.004121 0.001723 2.391 0.01714 *
## tenure 0.022067 0.003094 7.133 3.29e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4409 on 522 degrees of freedom
## Multiple R-squared: 0.316, Adjusted R-squared: 0.3121
## F-statistic: 80.39 on 3 and 522 DF, p-value: < 2.2e-16
In order to use the coefficient values for further calculations, you could copy and paste the numbers provided in the summary s.1. Alternatively, and for the sake of more precision, you can access the coefficients directly. If you run names(s.1), you will notice a list element called coefficients, which contains the coefficients, standard errors, t-statistics and p-values of the regression. Entering s.1$coefficients or coefficients(s.1) returns those values for all coefficients, but you can also extract them separately using the [row, column] bracket notation, where the first value gives the row index and the second the column index. If no value is specified for a row or column, R uses all available values. This happens in the third command of the following code, where the t-values are calculated from the betas and their respective standard errors. The fourth command specifies a particular row, so the t-value is calculated only for the parameter listed third in the coefficient table.
names(s.1)
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
s.1$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.284359541 0.104190379 2.729230 6.562466e-03
## educ 0.092028988 0.007329923 12.555246 8.824197e-32
## exper 0.004121109 0.001723277 2.391437 1.713562e-02
## tenure 0.022067218 0.003093649 7.133070 3.294407e-12
s.1$coefficients[, 1] / s.1$coefficients[, 2]
## (Intercept) educ exper tenure
## 2.729230 12.555246 2.391437 7.133070
s.1$coefficients[3, 1] / s.1$coefficients[3, 2]
## [1] 2.391437
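The p-values in the last column of the summary can be reproduced from these t-values. A minimal sketch, using the t-values and the 522 residual degrees of freedom printed above:

```r
# Two-sided p-values: twice the upper-tail probability of the
# t-distribution with 522 residual degrees of freedom.
t.values <- c(2.729230, 12.555246, 2.391437, 7.133070)
2 * pt(-abs(t.values), df = 522)
```

The results match the Pr(>|t|) column of s.1$coefficients, up to the rounding of the t-values typed in here.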
The following code illustrates the differences in statistical significance that can result from using logged instead of level variables.
Example 4.2
data("meap93")
lm.1 <- lm(math10 ~ totcomp + staff + enroll, data = meap93)
summary(lm.1)
##
## Call:
## lm(formula = math10 ~ totcomp + staff + enroll, data = meap93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.235 -7.008 -0.807 6.097 40.689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2740209 6.1137938 0.372 0.710
## totcomp 0.0004586 0.0001004 4.570 6.49e-06 ***
## staff 0.0479199 0.0398140 1.204 0.229
## enroll -0.0001976 0.0002152 -0.918 0.359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.24 on 404 degrees of freedom
## Multiple R-squared: 0.05406, Adjusted R-squared: 0.04704
## F-statistic: 7.697 on 3 and 404 DF, p-value: 5.179e-05
lm.2 <- lm(math10 ~ ltotcomp + lstaff + lenroll, data = meap93)
summary(lm.2)
##
## Call:
## lm(formula = math10 ~ ltotcomp + lstaff + lenroll, data = meap93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.735 -6.838 -0.835 6.139 39.718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -207.6648 48.7031 -4.264 2.50e-05 ***
## ltotcomp 21.1550 4.0555 5.216 2.92e-07 ***
## lstaff 3.9800 4.1897 0.950 0.3427
## lenroll -1.2680 0.6932 -1.829 0.0681 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.18 on 404 degrees of freedom
## Multiple R-squared: 0.06538, Adjusted R-squared: 0.05844
## F-statistic: 9.42 on 3 and 404 DF, p-value: 4.974e-06
Example 4.3
data("gpa1")
lm.1 <- lm(colGPA ~ hsGPA + ACT + skipped, data = gpa1)
summary(lm.1)
##
## Call:
## lm(formula = colGPA ~ hsGPA + ACT + skipped, data = gpa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85698 -0.23200 -0.03935 0.24816 0.81657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.38955 0.33155 4.191 4.95e-05 ***
## hsGPA 0.41182 0.09367 4.396 2.19e-05 ***
## ACT 0.01472 0.01056 1.393 0.16578
## skipped -0.08311 0.02600 -3.197 0.00173 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3295 on 137 degrees of freedom
## Multiple R-squared: 0.2336, Adjusted R-squared: 0.2168
## F-statistic: 13.92 on 3 and 137 DF, p-value: 5.653e-08
The next examples illustrate how to calculate the t-statistic in R when you want to test whether a coefficient differs from some value other than zero. This works by obtaining the coefficient from the regression summary, subtracting the value against which it is tested, and dividing the result by the standard error.
Example 4.4
data("campus")
lm.1 <- lm(lcrime ~ lenroll, data = campus)
s.1 <- summary(lm.1)
s.1
##
## Call:
## lm(formula = lcrime ~ lenroll, data = campus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5136 -0.3858 0.1174 0.4363 2.5782
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.6314 1.0335 -6.416 5.44e-09 ***
## lenroll 1.2698 0.1098 11.567 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8946 on 95 degrees of freedom
## Multiple R-squared: 0.5848, Adjusted R-squared: 0.5804
## F-statistic: 133.8 on 1 and 95 DF, p-value: < 2.2e-16
(s.1$coefficients[2,1] - 1) / s.1$coefficients[2,2]
## [1] 2.45737
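Whether this statistic is large enough to reject H0: beta = 1 against the one-sided alternative H1: beta > 1 can be checked with pt(). A sketch, using the 95 residual degrees of freedom printed above:

```r
# One-sided p-value for H0: beta = 1 vs. H1: beta > 1,
# with the t-statistic computed above and 95 degrees of freedom.
pt(2.45737, df = 95, lower.tail = FALSE)
```

The resulting p-value is comfortably below 0.05, so H0 is rejected at the 5% level.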
Example 4.5
data("hprice2")
names(hprice2)
## [1] "price" "crime" "nox" "rooms" "dist" "radial"
## [7] "proptax" "stratio" "lowstat" "lprice" "lnox" "lproptax"
ldist <- log(hprice2$dist)
lm.1 <- lm(lprice ~ lnox + ldist + rooms + stratio, data = hprice2)
s.1 <- summary(lm.1)
s.1
##
## Call:
## lm(formula = lprice ~ lnox + ldist + rooms + stratio, data = hprice2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.05890 -0.12427 0.02128 0.12882 1.32531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.083862 0.318111 34.843 < 2e-16 ***
## lnox -0.953539 0.116742 -8.168 2.57e-15 ***
## ldist -0.134339 0.043103 -3.117 0.00193 **
## rooms 0.254527 0.018530 13.736 < 2e-16 ***
## stratio -0.052451 0.005897 -8.894 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.265 on 501 degrees of freedom
## Multiple R-squared: 0.584, Adjusted R-squared: 0.5807
## F-statistic: 175.9 on 4 and 501 DF, p-value: < 2.2e-16
(s.1$coefficients[2, 1] + 1) / s.1$coefficients[2, 2]
## [1] 0.3979827
Example 4.6
data("k401k")
lm.1 <- lm(prate ~ mrate + age + totemp, data = k401k)
summary(lm.1)
##
## Call:
## lm(formula = prate ~ mrate + age + totemp, data = k401k)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.698 -8.074 4.716 12.505 30.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.029e+01 7.777e-01 103.242 < 2e-16 ***
## mrate 5.442e+00 5.244e-01 10.378 < 2e-16 ***
## age 2.692e-01 4.514e-02 5.963 3.07e-09 ***
## totemp -1.291e-04 3.666e-05 -3.521 0.000443 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.88 on 1530 degrees of freedom
## Multiple R-squared: 0.09954, Adjusted R-squared: 0.09778
## F-statistic: 56.38 on 3 and 1530 DF, p-value: < 2.2e-16
The following code limits the data to those observations where the year is 1987 and the value of union is zero. The data = argument tells R to take only those observations into consideration that meet these conditions. You can read this command as: take the data from jtrain, keeping only the rows for which both conditions hold and, since the column part of the [row, column] brackets is empty, all columns.
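As a minimal sketch of this subsetting logic with a small made-up data frame (df, year, union and x are purely illustrative names):

```r
# Made-up data frame to illustrate logical row selection.
df <- data.frame(year  = c(1987, 1987, 1988),
                 union = c(0, 1, 0),
                 x     = 1:3)
# Keep the rows where both conditions hold; the empty column part
# of [row, column] keeps all columns.
df[df$year == 1987 & df$union == 0, ]
```

Only the first row satisfies both conditions and survives the subsetting.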
Example 4.7
data("jtrain")
lm.1 <- lm(lscrap ~ hrsemp + lsales + lemploy, data = jtrain[jtrain$year==1987 & jtrain$union==0,])
summary(lm.1)
##
## Call:
## lm(formula = lscrap ~ hrsemp + lsales + lemploy, data = jtrain[jtrain$year ==
## 1987 & jtrain$union == 0, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6301 -0.7523 -0.4016 0.8697 2.8273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.45837 5.68677 2.191 0.0380 *
## hrsemp -0.02927 0.02280 -1.283 0.2111
## lsales -0.96203 0.45252 -2.126 0.0436 *
## lemploy 0.76147 0.40743 1.869 0.0734 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.376 on 25 degrees of freedom
## (97 observations deleted due to missingness)
## Multiple R-squared: 0.2624, Adjusted R-squared: 0.1739
## F-statistic: 2.965 on 3 and 25 DF, p-value: 0.05134
The next example shows how to calculate confidence intervals from estimation results. For this purpose you can use the confint function, which takes the stored regression results (here lm.1) and the confidence level specified by the level argument.
Example 4.8
data("rdchem")
lm.1 <- lm(lrd ~ lsales + profmarg, data = rdchem)
summary(lm.1)
##
## Call:
## lm(formula = lrd ~ lsales + profmarg, data = rdchem)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97681 -0.31502 -0.05828 0.39020 1.21783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.37827 0.46802 -9.355 2.93e-10 ***
## lsales 1.08422 0.06020 18.012 < 2e-16 ***
## profmarg 0.02166 0.01278 1.694 0.101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5136 on 29 degrees of freedom
## Multiple R-squared: 0.918, Adjusted R-squared: 0.9123
## F-statistic: 162.2 on 2 and 29 DF, p-value: < 2.2e-16
# 90% confidence interval
confint(lm.1, level = .90)
## 5 % 95 %
## (Intercept) -5.173496e+00 -3.583051
## lsales 9.819409e-01 1.186499
## profmarg -6.362296e-05 0.043375
# 95% confidence interval
confint(lm.1, level = .95)
## 2.5 % 97.5 %
## (Intercept) -5.335478679 -3.42106808
## lsales 0.961107260 1.20733251
## profmarg -0.004487725 0.04779911
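These intervals can also be computed by hand as estimate plus or minus the t critical value times the standard error. A sketch for the lsales coefficient, using the (rounded) values printed in the summary above and 29 residual degrees of freedom:

```r
b  <- 1.08422                 # lsales estimate from the summary
se <- 0.06020                 # its standard error
t.crit <- qt(0.975, df = 29)  # critical value for a 95% interval
c(lower = b - t.crit * se, upper = b + t.crit * se)
```

This reproduces the 2.5% and 97.5% bounds reported by confint(lm.1, level = .95), up to the rounding of the printed estimate and standard error.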
Equations 4.21 & 4.27
data("twoyear")
# model with jc and univ entered separately
lm.1 <- lm(lwage ~ jc + univ + exper, data = twoyear)
s.1 <- summary(lm.1)
# reparameterized model: totcoll = jc + univ, so the standard error on jc
# now refers to the difference between the jc and univ coefficients
lm.2 <- lm(lwage ~ jc + totcoll + exper, data = twoyear)
s.2 <- summary(lm.2)
# t-statistic for H0: beta_jc = beta_univ
(s.1$coefficients[2, 1] - s.1$coefficients[3, 1]) / s.2$coefficients[2, 2]
## [1] -1.467657
Next, we want to use the F-test to evaluate the joint significance of a group of independent variables. For this purpose you have to run the regressions of the restricted and the unrestricted model and store the results. Then pass both fitted models to the anova function and let R do the rest.
Equations 4.31 & 4.33
data("mlb1")
# unrestricted model
lm.1 <- lm(lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr, data = mlb1)
# restricted model
lm.2 <- lm(lsalary ~ years + gamesyr, data = mlb1)
# f-test
anova(lm.1, lm.2)
## Analysis of Variance Table
##
## Model 1: lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr
## Model 2: lsalary ~ years + gamesyr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 347 183.19
## 2 350 198.31 -3 -15.125 9.5503 4.474e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
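The F-statistic in the table can be reproduced from the residual sums of squares via F = ((RSS_r - RSS_ur)/q) / (RSS_ur/df_ur). A sketch with the (rounded) numbers from the anova output above:

```r
rss.ur <- 183.19   # RSS of the unrestricted model (347 df)
rss.r  <- 198.31   # RSS of the restricted model
q      <- 3        # number of restrictions (bavg, hrunsyr, rbisyr)
df.ur  <- 347      # residual df of the unrestricted model
((rss.r - rss.ur) / q) / (rss.ur / df.ur)
```

This gives roughly the 9.55 reported above; the small discrepancy comes from the rounded RSS values.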
Example 4.9
data("bwght")
lm.1 <- lm(bwght ~ cigs + parity + faminc + motheduc + fatheduc, data = bwght)
# restricted model, fit on the same sample (rows without missing motheduc or fatheduc)
lm.2 <- lm(bwght ~ cigs + parity + faminc, data = bwght[!is.na(bwght$motheduc) & !is.na(bwght$fatheduc), ])
anova(lm.1, lm.2)
## Analysis of Variance Table
##
## Model 1: bwght ~ cigs + parity + faminc + motheduc + fatheduc
## Model 2: bwght ~ cigs + parity + faminc
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1185 464041
## 2 1187 465167 -2 -1125.7 1.4373 0.238
Equation 4.48
data("hprice1")
lm.1 <- lm(lprice ~ lassess + llotsize + lsqrft + bdrms, data = hprice1)
# restricted model: impose a unit coefficient on lassess by moving it
# to the left-hand side of the regression
lprice <- hprice1$lprice - hprice1$lassess
lm.2 <- lm(lprice ~ 1)
anova(lm.1, lm.2)
## Analysis of Variance Table
##
## Model 1: lprice ~ lassess + llotsize + lsqrft + bdrms
## Model 2: lprice ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 83 1.8215
## 2 87 1.8801 -4 -0.05862 0.6678 0.6162
Example 4.10
data("meap93")
bs <- meap93$benefits / meap93$salary
lm.1 <- lm(lsalary ~ bs, data = meap93)
lm.2 <- lm(lsalary ~ bs + lenroll + lstaff, data = meap93)
lm.3 <- lm(lsalary ~ bs + lenroll + lstaff + droprate + gradrate, data = meap93)
summary(lm.1)
##
## Call:
## lm(formula = lsalary ~ bs, data = meap93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43386 -0.10537 -0.00963 0.09673 0.51864
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.52318 0.04156 253.203 < 2e-16 ***
## bs -0.82540 0.19989 -4.129 4.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1514 on 406 degrees of freedom
## Multiple R-squared: 0.0403, Adjusted R-squared: 0.03794
## F-statistic: 17.05 on 1 and 406 DF, p-value: 4.422e-05
summary(lm.2)
##
## Call:
## lm(formula = lsalary ~ bs + lenroll + lstaff, data = meap93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29251 -0.08885 -0.01611 0.07210 0.39151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.843840 0.251643 43.092 < 2e-16 ***
## bs -0.604772 0.165368 -3.657 0.000289 ***
## lenroll 0.087397 0.007346 11.897 < 2e-16 ***
## lstaff -0.222033 0.050077 -4.434 1.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1246 on 404 degrees of freedom
## Multiple R-squared: 0.3527, Adjusted R-squared: 0.3479
## F-statistic: 73.39 on 3 and 404 DF, p-value: < 2.2e-16
summary(lm.3)
##
## Call:
## lm(formula = lsalary ~ bs + lenroll + lstaff + droprate + gradrate,
## data = meap93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28711 -0.08902 -0.01949 0.07255 0.39754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.7384651 0.2582651 41.579 < 2e-16 ***
## bs -0.5893195 0.1648739 -3.574 0.000394 ***
## lenroll 0.0881204 0.0073240 12.032 < 2e-16 ***
## lstaff -0.2182779 0.0499503 -4.370 1.58e-05 ***
## droprate -0.0002826 0.0016145 -0.175 0.861113
## gradrate 0.0009674 0.0006625 1.460 0.145032
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1241 on 402 degrees of freedom
## Multiple R-squared: 0.361, Adjusted R-squared: 0.3531
## F-statistic: 45.43 on 5 and 402 DF, p-value: < 2.2e-16