Applied econometrics: Econ 508
logo

Applied Econometrics 
Econ 508 - Fall 2014

Professor: Roger Koenker 

TA: Nicolas Bottan 

Welcome to a new issue of e-Tutorial. This e-TA will focus on Binary Data Models, with special enphasis on Logit and Probit regression models.1

Data

You can download the data set, called weco14.txt from the Econ 508 web site. Save it in your preferred directory.

Then you can load it in R. If you saved it in your hard drive in a folder called econ508 you can load it by typing:

weco<-read.table("C:/econ508/weco14.txt", header=T, sep="")

or you can retrieve it from the web

weco<-read.table("http://www.econ.uiuc.edu/~econ508/data/weco14.txt", header=T, sep="")
head(weco)
         y sex dex lex  kwit job_tenure status
1 13.72579   0  38  10 FALSE        277   TRUE
2 17.15373   1  55  11  TRUE        173   TRUE
3 13.62992   1  45  12 FALSE        410   TRUE
4 13.03998   1  41  11 FALSE        247   TRUE
5 13.19510   1  42  10 FALSE        340   TRUE
6 16.43010   0  38  12 FALSE        517   TRUE

Logit estimation

Here we estimate the same logit model proposed in question 3 of PS5

\[ logit(P(quit=1))= \beta_{0} + \beta_{1} sex + \beta_{2} dex +\beta_{3} lex + \beta_{4} lex^2 \]

where \(quit=1\) if the worker wuit withing the first 6 mothns after employment, and 0 otherwise.

To estimate this model first we need to create \(lex^2\), and it may be also conveniente to transform the kwit and status variables to a binary variable.

weco$lex2<-weco$lex^2
weco$kwit<-as.numeric(weco$kwit)
weco$status<-as.numeric(weco$status)

then we are ready to estimate the model. To do so we use the glm function and specify the link function we want to use, in this case logit

logit <- glm(kwit~sex+dex+lex+lex2,data=weco, family=binomial(link="logit"))
summary(logit)

Call:
glm(formula = kwit ~ sex + dex + lex + lex2, family = binomial(link = "logit"), 
    data = weco)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3503  -0.7475  -0.5651  -0.3256   2.4173  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 14.98417    4.08046   3.672  0.00024 ***
sex          0.46552    0.19641   2.370  0.01778 *  
dex         -0.10201    0.01475  -6.915 4.69e-12 ***
lex         -1.96747    0.63987  -3.075  0.00211 ** 
lex2         0.07997    0.02555   3.130  0.00175 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 745.97  on 682  degrees of freedom
Residual deviance: 678.42  on 678  degrees of freedom
AIC: 688.42

Number of Fisher Scoring iterations: 4

This is equivalent to estimate:

\[ Pr(kwit=1)=\frac{exp(x_{j} \beta)}{(1+exp(x_{j}\beta))} \]

The results above show that, coeteris paribus, workers with higher dexterity are less likely to quit, while males (sex=1) have a bigger tendency to quit than females (sex=0). In other words, positive coefficients contribute to increase the probability of quitting, while negative coefficients, to reduce it.

To draw a picture of the probability of quitting as a function of years of education, holding everything else constant (e.g., at their mean value), you need first to ask for mean of dexterity for the pooled sample, and then you can do it for males and females.

mean_dex<-mean((weco$dex))
mean_dex_male<-mean((weco$dex[weco$sex==1]))
mean_dex_fem<-mean((weco$dex[weco$sex==0]))

Thus, let’s ask for the expected value of the probability of quitting conditioned to dexterity being hold at the pooled mean value:

weco2<-data.frame(mean(weco$sex),mean(weco$dex),weco$lex, weco$lex2)
names(weco2)<-c("sex","dex","lex","lex2")
p_hat<-predict(logit, weco2, type="response")
plot(weco$lex,p_hat, type="p")

You can do this for males and females also. Next you are asked to examine better the effect of gender. A first suggestion is to tabulate sex and kwit:

table(weco$sex,weco$kwit)
   
      0   1
  0 235  61
  1 287 100

In the table above you can see that 61 out of the existing 926 females in this sample are quitters, while 100 out of 387 males are quitters. So, the proportion of male quitters (56.6%) is greater than the proportion of female quitters (43.3%).

Next you can draw graphs of the expected probability of quitting for each gender, using their respective dexterity means. In R, we can ask for the graphs as follows:

weco_male<-data.frame(1,mean(weco$dex[weco$sex==1]),weco$lex, weco$lex2)
names(weco_male)<-c("sex","dex","lex","lex2")
p_hat_m<-predict(logit, weco_male, type="response")
weco_fem<-data.frame(0,mean(weco$dex[weco$sex==0]),weco$lex, weco$lex2)
names(weco_fem)<-c("sex","dex","lex","lex2")
p_hat_fem<-predict(logit, weco_fem, type="response")

plot(weco$lex,p_hat_fem, type="p", col="red")
points(weco$lex,p_hat_m, type="p", col="blue")

Besides the graphical analysis, you can also test for the shape of the education effect, by introducing the variables \(sex*lex\) and \(sex*lex2\) in the model, and checking their significance.

Next you are asked to evaluate the Logit specification by computing the Pregibon diagnostic. The first step is to generate the predicted probabilities of quitting, called p, and compute \(g_a\) (parameter that controls the fatness of tails) and \(g_d\) (parameter that controls symmetry):

p<-predict(logit, weco, type="response")
weco$ga<-.5*(log(p)*log(p) - log(1-p)*log(1-p)) 
weco$gd<- -.5*(log(p)*log(p) + log(1-p)*log(1-p))

Finally you run an extended Logit model, including the variables \(g_a\) and \(g_d\):

logit_preg <- glm(kwit~sex+dex+lex+lex2+ga+gd,data=weco, family=binomial(link="logit"))
summary(logit_preg)

Call:
glm(formula = kwit ~ sex + dex + lex + lex2 + ga + gd, family = binomial(link = "logit"), 
    data = weco)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6683  -0.7188  -0.5783  -0.3633   2.3293  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.084292  17.749214  -0.061    0.951
sex          0.104829   0.513235   0.204    0.838
dex         -0.027108   0.106906  -0.254    0.800
lex          0.185746   2.322956   0.080    0.936
lex2        -0.008141   0.094513  -0.086    0.931
ga          -2.348950   1.946927  -1.206    0.228
gd          -2.097210   1.422104  -1.475    0.140

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 745.97  on 682  degrees of freedom
Residual deviance: 674.23  on 676  degrees of freedom
AIC: 688.23

Number of Fisher Scoring iterations: 5

And test their joint significance:

anova(logit,logit_preg, test="Chisq")
Analysis of Deviance Table

Model 1: kwit ~ sex + dex + lex + lex2
Model 2: kwit ~ sex + dex + lex + lex2 + ga + gd
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       678     678.42                     
2       676     674.23  2   4.1966   0.1227

Or with the lmtest package

require(lmtest)
Loading required package: lmtest
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
lrtest(logit,logit_preg)
Likelihood ratio test

Model 1: kwit ~ sex + dex + lex + lex2
Model 2: kwit ~ sex + dex + lex + lex2 + ga + gd
  #Df  LogLik Df  Chisq Pr(>Chisq)
1   5 -339.21                     
2   7 -337.11  2 4.1966     0.1227

The interpretation of the results is left to the reader.


  1. Please send comments to bottan2@illinois.edu or srmntbr2@illinois.edu