Applied Econometrics at the University of Illinois: e-Tutorial 16: Binary Data Models

Applied Econometrics
Econ 508 - Fall 2007

e-Tutorial 16: Binary Data Models

Welcome to the sixteenth issue of e-Tutorial. This time we focus on Binary Data Models, with special attention to Logit and Probit regressions.

Data

You can download the data from the Econ 508 web page (here) and save the file in your preferred directory (I'll save mine as "C:\weco.dat"). Then open STATA and type:

infile y sex dex lex kwit tenure censored using "C:\weco.dat"

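Drop the first line of the data set, which contains only missing values (the first row of the file holds the variable labels, which cannot be read as numbers), for example with the command drop in 1.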

Next you generate the variable lex squared:
gen lex2=lex^2

Then save the file in STATA format (I'll save mine as "C:\weco.dta").

Question 3:

For part (a), you need to estimate a simple Logit model:

logit(P(kwit=1)) = b₀ + b₁*sex + b₂*dex + b₃*lex + b₄*lex2

In STATA, I will use a subsample of the data set to demonstrate how to obtain the main results. My subsample contains only 257 observations, obtained by dropping the observations with lex==12, so my results will generally differ from those you obtain with the original data set in PS5:

logit kwit sex dex lex lex2
Iteration 0:   log likelihood = -150.50058
Iteration 1:   log likelihood = -140.12819
Iteration 2:   log likelihood = -139.86199
Iteration 3:   log likelihood = -139.86135
Logit estimates                                   Number of obs   =        257
                                                  LR chi2(4)      =      21.28
                                                  Prob > chi2     =     0.0003
Log likelihood = -139.86135                       Pseudo R2       =     0.0707
------------------------------------------------------------------------------
    kwit |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     sex |   .7543254   .3063485      2.462   0.014       .1538934    1.354757
     dex |   -.079547   .0209625     -3.795   0.000      -.1206327   -.0384612
     lex | -.6918966   .7252067     -0.954   0.340      -2.113276    .7294824
    lex2 |   .0271202   .0290853      0.932   0.351      -.0298859    .0841263
   _cons |   6.371123   4.581511      1.391   0.164      -2.608474    15.35072
------------------------------------------------------------------------------

This is equivalent to estimating Pr(kwit=1)=exp(x_ib)/(1+exp(x_ib)). The results above show that, ceteris paribus, workers with higher dexterity are less likely to quit, while males (sex=1) are more likely to quit than females (sex=0). In other words, a positive coefficient means that an increase in the regressor raises the probability of quitting, while a negative coefficient means that it lowers it. Neither schooling term (lex, lex2) is individually significant at conventional levels.
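As a rough check on the output above, the following Python sketch (standard library only; the variable names are mine, not the tutorial's) recomputes the LR chi2 statistic and the pseudo R2 from the first and last log likelihoods in the iteration log, and converts the sex and dex coefficients into odds ratios. It assumes the usual definitions LR = 2(lnL_full - lnL_0) and pseudo R2 = 1 - lnL_full/lnL_0, with iteration 0 taken as the constant-only model.

```python
import math

# Log likelihoods copied from the iteration log above.
ll_null = -150.50058  # iteration 0, assumed to be the constant-only model
ll_full = -139.86135  # final iteration

# Likelihood-ratio statistic and McFadden-style pseudo R2.
lr_chi2 = 2.0 * (ll_full - ll_null)
pseudo_r2 = 1.0 - ll_full / ll_null

# Coefficients copied from the output table.
b_sex = 0.7543254
b_dex = -0.079547

# exp(b) is the multiplicative change in the odds of quitting
# for a one-unit increase in the regressor, other things equal.
or_sex = math.exp(b_sex)
or_dex = math.exp(b_dex)

print(f"LR chi2(4)      = {lr_chi2:.2f}")
print(f"Pseudo R2       = {pseudo_r2:.4f}")
print(f"odds ratio, sex = {or_sex:.3f}")
print(f"odds ratio, dex = {or_dex:.3f}")
```

An odds ratio above one (sex) corresponds to a positive coefficient, and one below one (dex) to a negative coefficient, which is the sign interpretation used in the text.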

To draw a picture of the probability of quitting as a function of years of education, holding everything else constant, you first need the mean of dexterity for the pooled sample, and then for females and males separately:

summarize dex

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
     dex |     257     44.6537   7.609662         23         64

summarize dex if sex==0

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
     dex |     119    43.90756   7.466299         23         61

summarize dex if sex==1

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
     dex |     138     45.2971   7.700042         25         64

Ok. Then you need to calculate the expected probability of quitting, Prob(kwit=1)=exp(x_ib)/(1+exp(x_ib)). This can be obtained in STATA using the command predict, which after logit returns the predicted probability by default:

predict p
summarize kwit p

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
    kwit |     257    .2723735   .4460497          0          1
       p |     257    .2723735   .1270982   .0525215   .7170406

In the table above, kwit is the binary dependent variable, and p is its predicted probability. But we need to draw a graph of the expected probability of quitting at different years of education, holding the remaining explanatory variables fixed (e.g., at their mean values). Thus, let's compute the expected probability of quitting conditional on dexterity being held at its pooled mean:

replace dex= 44.6537
predict kpool
graph kpool lex, title(Expected Probability of Quitting at Dex=Pooled Mean) ylab

In the graph above you see two different curves: one for males (upper curve) and another for females (lower curve). This happens because we held the other explanatory variables constant at their means (in our case, only dex was fixed) but left sex at its original values, since the average of a dummy variable does not make much sense in this case.
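The two curves can also be traced numerically from the reported coefficients. The Python sketch below is only an illustration under stated assumptions: it plugs in the point estimates from the 257-observation output, fixes dex at the pooled mean 44.6537, and evaluates the logit probability on a grid of schooling years that I chose for illustration (the grid is not taken from the data).

```python
import math

# Point estimates copied from the logit output (257-observation subsample).
B = {"_cons": 6.371123, "sex": 0.7543254, "dex": -0.079547,
     "lex": -0.6918966, "lex2": 0.0271202}
DEX_POOLED_MEAN = 44.6537  # from "summarize dex"


def quit_prob(sex, lex, dex=DEX_POOLED_MEAN):
    """Logit probability exp(xb)/(1+exp(xb)) at the given covariate values."""
    xb = (B["_cons"] + B["sex"] * sex + B["dex"] * dex
          + B["lex"] * lex + B["lex2"] * lex ** 2)
    return 1.0 / (1.0 + math.exp(-xb))


# Schooling level at which the quadratic index in lex reaches its minimum.
lex_turning_point = -B["lex"] / (2.0 * B["lex2"])

# Hypothetical grid of schooling years, chosen only for illustration.
print("lex  female   male")
for lex in range(10, 17):
    print(f"{lex:3d}  {quit_prob(0, lex):.3f}   {quit_prob(1, lex):.3f}")
print(f"index minimum at lex = {lex_turning_point:.1f}")
```

Under these point estimates the index is U-shaped in lex, with its minimum at -b₃/(2b₄), roughly 12.8 years. Since neither schooling coefficient is individually significant, that turning point is imprecisely estimated.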

For part (b), you are asked to examine the effect of gender more closely. A first suggestion is to tabulate sex and kwit:

tabulate sex kwit
           |         kwit
       sex |         0          1 |     Total
-----------+----------------------+----------
         0 |        94         25 |       119
         1 |        93         45 |       138
-----------+----------------------+----------
     Total |       187         70 |       257

In the table above you can see that 25 out of the existing 119 females in this subsample are quitters, while 45 out of 138 males are quitters. So, the proportion of male quitters (33%) is greater than the proportion of female quitters (21%).
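The proportions quoted above follow directly from the cell counts. As a small sketch (Python, standard library only; the dictionary layout and function names are my own), the quit shares and the unadjusted male-to-female odds ratio can be recomputed like this:

```python
import math

# Cell counts copied from the tabulate output: (group, kwit) -> count.
counts = {("female", 0): 94, ("female", 1): 25,
          ("male", 0): 93, ("male", 1): 45}


def quit_share(group):
    """Fraction of workers in the group with kwit == 1."""
    stay, leave = counts[(group, 0)], counts[(group, 1)]
    return leave / (stay + leave)


def quit_odds(group):
    """Odds of quitting in the group: quitters divided by stayers."""
    return counts[(group, 1)] / counts[(group, 0)]


odds_ratio = quit_odds("male") / quit_odds("female")
log_odds_ratio = math.log(odds_ratio)

print(f"female quit share = {quit_share('female'):.3f}")
print(f"male quit share   = {quit_share('male'):.3f}")
print(f"odds ratio (male vs female) = {odds_ratio:.3f}")
print(f"log odds ratio              = {log_odds_ratio:.3f}")
```

The unadjusted odds ratio is about 1.82, somewhat below the exp(0.754) ≈ 2.13 implied by the sex coefficient in the logit above, which conditions on dexterity and schooling.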

Next you can draw graphs of the expected probability of quitting for each gender, using their respective dexterity means. From the results above we know that the dexterity mean for females is 43.90756, and the dexterity mean for males is 45.2971. In STATA, we can ask for the graphs as follows:

replace dex= 43.90756 if sex==0
replace dex=45.2971 if sex==1
predict kfemale if sex==0
label var kfemale "Women"
predict kmale if sex==1
label var kmale "Men"
graph kmale kfemale lex, c(ss) ylab title(Expected Probability of Quitting by Gender)

Besides the graphical analysis, you can also test whether the shape of the education effect differs by gender, by introducing the interaction variables sex*lex and sex*lex2 in the model and checking their significance.

For part (c), you need to evaluate the Logit specification by computing the Pregibon diagnostic. The first thing I recommend is to open the data set again, given that you have replaced dex by its mean values in the construction of the graphs above:

use "C:\weco.dta", clear

And then re-run the original Logit model:

logit kwit sex dex lex lex2

Next you generate the predicted probabilities of quitting, called p, and compute the constructed variables g^a (associated with the parameter that controls the fatness of the tails) and g^d (associated with the parameter that controls symmetry):

predict p
gen ga=.5*(log(p)*log(p) - log(1-p)*log(1-p))
gen gd=- .5*(log(p)*log(p) + log(1-p)*log(1-p))
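To see what the two constructed variables look like, here is a small Python sketch (standard library only; the function name is hypothetical) that evaluates the same two formulas as the gen commands above at a few illustrative probabilities. It only mirrors the arithmetic and is not a substitute for the STATA step.

```python
import math


def pregibon_terms(p):
    """Return (ga, gd) for a fitted probability p, as in the gen commands."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    log_p_sq = math.log(p) ** 2
    log_q_sq = math.log(1.0 - p) ** 2
    ga = 0.5 * (log_p_sq - log_q_sq)
    gd = -0.5 * (log_p_sq + log_q_sq)
    return ga, gd


# Illustrative probabilities, not values from the data set.
for p in (0.1, 0.5, 0.9):
    ga, gd = pregibon_terms(p)
    print(f"p = {p:.1f}   ga = {ga: .4f}   gd = {gd: .4f}")
```

Note that ga changes sign when p is replaced by 1-p and vanishes at p = 0.5, whereas gd takes the same negative value at p and 1-p. This is in line with the description of the first as a tail-weight term and the second as a symmetry term.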

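Finally you run an extended Logit model, including the variables g^a and g^d. Note that the output below reports 683 observations, so it appears to come from the full data set rather than the 257-observation subsample used earlier: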

logit kwit sex dex lex lex2 ga gd
Iteration 0:   log likelihood = -396.95597
Iteration 1:   log likelihood = -362.24889
Iteration 2:   log likelihood =   -361.753
Iteration 3:   log likelihood = -361.75274
Iteration 4:   log likelihood = -361.75274
Logit estimates                                   Number of obs   =        683
                                                  LR chi2(6)      =      70.41
                                                  Prob > chi2     =     0.0000
Log likelihood = -361.75274                       Pseudo R2       =     0.0887
------------------------------------------------------------------------------
    kwit |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     sex |   .7707522    .696531      1.107   0.268      -.5944234    2.135928
     dex | -.1642033   .1325885     -1.238   0.216      -.4240719    .0956653
     lex | -.8514418   1.021855     -0.833   0.405      -2.854241    1.151358
    lex2 |   .0335875   .0406379      0.827   0.409      -.0460612    .1132363
      ga | -.8567156   2.755189     -0.311   0.756      -6.256788    4.543356
      gd | -1.837657   1.998901     -0.919   0.358       -5.75543    2.080117
   _cons |   9.853968   10.55188      0.934   0.350      -10.82733    30.53527
------------------------------------------------------------------------------

And test their joint significance:

test ga gd
( 1) ga = 0.0
( 2) gd = 0.0
chi2( 2) = 13.01
Prob > chi2 = 0.0015

From the results above we can see that the two parameters are jointly significantly different from zero (p-value 0.0015), even though neither is individually significant. Hence, the test rejects the Logit specification for this data set.
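The reported p-value can be verified by hand, because the upper-tail probability of a chi-squared variable with 2 degrees of freedom has the closed form exp(-x/2). A minimal Python check (standard library only; the function name is mine):

```python
import math


def chi2_upper_tail_2df(x):
    """P(X > x) for X distributed chi-squared with 2 degrees of freedom."""
    if x < 0.0:
        raise ValueError("x must be non-negative")
    return math.exp(-x / 2.0)


# Test statistic copied from the "test ga gd" output above.
p_value = chi2_upper_tail_2df(13.01)
print(f"Prob > chi2 = {p_value:.4f}")
```

The closed form applies only to 2 degrees of freedom. For other degrees of freedom you would need the incomplete gamma function or a statistics library.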

For part (d), you will provide your own economic assessment of the problem.

Last update: November 28, 2007