Econ 508
Econometrics Group
Home | Faculty | Students | Alumni | Courses | Research | Reproducibility | Lab | Seminars | Economics | Statistics | Fame
Applied Econometrics
Econ 508 - Fall 2007

e-Tutorial 17: Count Data Models

Welcome to the seventeenth issue of e-Tutorial. This time we focus on Count Data models, with special focus to Poisson and Negative Binomial regression. 

Data

The data set used here comes from A. Colin Cameron and Per Johansson, "Count Data Regression Using Series Expansion:  With Applications", Journal of Applied Econometrics, Vol. 12, No. 3, 1997, pp. 203-224. I already saved the original data in STATA format, so you can download it here (or visit the Econ 508 web site). For R users, the Appendix A introduces the package "foreign", which is very helpful for the translation from STATA (and others) into plain ascii format.

According to the authors, the data set is based on the 1977-78 Australian Health Survey. It contains 5190 observations on the following variables:
 
Dependent Variable: Meaning:
NONDOCCO  Number of consultations in the past four week with non-doctor health professionals (chemist, optician, physiotherapist, etc.)
Regressors:
SEX  Gender of patient (female=1)
AGE  Age of patient (in years)
INCOME  Patient's annual income (in hundreds of dollars) 
LEVYPLUS  Dummy for private insurance coverage (=1)
FREEPOOR  Dummy for free government insurance coverage due to low income (=1)
FREEREPA  Dummy for free government insurance coverage due to old age, disability, or veteran status (=1)
ILLNESS  Number of illnesses in past two weeks
ACTDAYS  Number of days of reduced activity in past two weeks due to illness or injury
HSCORE  Health questionnaire score (high score=bad health)
CHCOND1  Dummy for chronic condition not limiting activity (=1)
CHCOND2  Dummy for chronic condition limiting activity (=1)

According to the authors, the data is overdispersed (the sample mean of the dependent variable is 0.215 and the sample variance of it is 0.932).  This might be an indicator that the Poisoness property (mean equals variance) may be violated, and a Negative Binomial Regression might be necessary. 

Let's try to reproduce Cameron and Johansson (1997) main results. 
 

In STATA:

You need to generate the variable age-squared:
gen AGE2=AGE^2

Poisson Regression

You can run a Poisson Regression as follows:

poisson  NONDOCCO SEX AGE AGE2 INCOME LEVYPLUS FREEPOOR FREEREPA ILLNESS ACTDAYS HSCORE CHCOND1 CHCOND2

Iteration 0:   log likelihood = -4205.1938  
Iteration 1:   log likelihood = -3382.4299  
Iteration 2:   log likelihood = -3112.8639  
Iteration 3:   log likelihood = -3109.3816  
Iteration 4:   log likelihood = -3109.3722  
Iteration 5:   log likelihood = -3109.3722  

Poisson regression                                Number of obs   =       5190
                                                  LR chi2(12)     =    1086.94
                                                  Prob > chi2     =     0.0000
Log likelihood = -3109.3722                       Pseudo R2       =     0.1488

------------------------------------------------------------------------------
NONDOCCO |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     SEX |   .3316144   .0696475      4.761   0.000       .1951079    .4681209
     AGE |  -3.307805   1.228064     -2.694   0.007      -5.714766   -.9008437
    AGE2 |    4.39034    1.30202      3.372   0.001       1.838428    6.942252
  INCOME |  -.0352855   .1113467     -0.317   0.751      -.2535211    .1829501
LEVYPLUS |   .3278973    .097685      3.357   0.001       .1364383    .5193564
FREEPOOR |   .0154283   .2110056      0.073   0.942      -.3981352    .4289917
FREEREPA |   .4820755   .1160326      4.155   0.000       .2546559    .7094952
 ILLNESS |    .054726   .0215542      2.539   0.011       .0124806    .0969714
 ACTDAYS |   .0979188   .0061003     16.052   0.000       .0859625    .1098751
  HSCORE |   .0447936   .0116531      3.844   0.000       .0219539    .0676332
 CHCOND1 |   .5186225    .087033      5.959   0.000        .348041     .689204
 CHCOND2 |   1.078644   .0983912     10.963   0.000       .8858014    1.271488
   _cons |  -2.443619   .2401184    -10.177   0.000      -2.914242   -1.972995
------------------------------------------------------------------------------

You can check the "Poisoness" property by typing:

poisgof
         Goodness of fit chi-2 =  5040.935
         Prob > chi2(5177)     =    0.9103

The null hypothesis (of Poisoness) can not be rejected in the test above, meaning that a Poisson Regression is fine for this data. Nevertheless, I will show how to compute the Negative Binomial Regression anyway.
 

Negative Binomial Regression

You can run a Negative Binomial Regression as follows:

nbreg  NONDOCCO SEX AGE AGE2 INCOME LEVYPLUS FREEPOOR FREEREPA ILLNESS ACTDAYS HSCORE CHCOND1 CHCOND2

Fitting comparison Poisson model:

Iteration 0:   log likelihood = -4205.1938  
Iteration 1:   log likelihood = -3382.4299  
Iteration 2:   log likelihood = -3112.8639  
Iteration 3:   log likelihood = -3109.3816  
Iteration 4:   log likelihood = -3109.3722  
Iteration 5:   log likelihood = -3109.3722  

Fitting constant-only model:

Iteration 0:   log likelihood =  -2940.014  (not concave)
Iteration 1:   log likelihood = -2313.0946  
Iteration 2:   log likelihood =  -2312.632  
Iteration 3:   log likelihood =  -2312.632  

Fitting full model:

Iteration 0:   log likelihood = -2214.1623  
Iteration 1:   log likelihood = -2205.8669  
Iteration 2:   log likelihood = -2161.0568  
Iteration 3:   log likelihood = -2160.4958  
Iteration 4:   log likelihood = -2160.4953  
Iteration 5:   log likelihood = -2160.4953  

Negative binomial regression                      Number of obs   =       5190
                                                  LR chi2(12)     =     304.27
                                                  Prob > chi2     =     0.0000
Log likelihood = -2160.4953                       Pseudo R2       =     0.0658

------------------------------------------------------------------------------
NONDOCCO |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     SEX |   .2308069   .1246663      1.851   0.064      -.0135347    .4751484
     AGE |  -2.675554   2.430949     -1.101   0.271      -7.440127     2.08902
    AGE2 |   3.854298    2.62543      1.468   0.142      -1.291451    9.000047
  INCOME |  -.0621359   .1906016     -0.326   0.744      -.4357083    .3114364
LEVYPLUS |   .2986752   .1583588      1.886   0.059      -.0117024    .6090528
FREEPOOR |   -.196811   .3515133     -0.560   0.576      -.8857645    .4921425
FREEREPA |   .5877179   .2185872      2.689   0.007       .1592949    1.016141
 ILLNESS |   .1443791   .0467823      3.086   0.002       .0526875    .2360708
 ACTDAYS |   .1370558   .0170753      8.027   0.000       .1035888    .1705228
  HSCORE |   .0739655   .0279625      2.645   0.008       .0191601    .1287709
 CHCOND1 |   .4115285    .142977      2.878   0.004       .1312987    .6917583
 CHCOND2 |   1.124148   .1830997      6.140   0.000       .7652796    1.483017
   _cons |  -2.783845   .4351314     -6.398   0.000      -3.636687   -1.931003
---------+--------------------------------------------------------------------
/lnalpha |   2.187067   .0758164                          2.038469    2.335664
---------+--------------------------------------------------------------------
   alpha |    8.90904   .6754515     13.190   0.000       7.678844    10.33632
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0:    chi2(1) =  1897.75   Prob > chi2 = 0.0000
 
 

Appendix A: Count Data Models in R

The first thing you need to do is to download the package "foreign", such that you can translate the data set (helth.dta) into ascii format. You can do that as follows:

install.packages("foreign")
library(foreign)
health<-read.dta("C:/health.dta")
summary(health)

Next you need to extract the variables from the data frame, and generate the variable age-squared:

NONDOCCO<-health$NONDOCCO
SEX<-health$SEX
AGE<-health$AGE
INCOME<-health$INCOME
LEVYPLUS<-health$LEVYPLUS
FREEPOOR<-health$FREEPOOR
FREEREPA<-health$FREEREPA
ILLNESS<-health$ILLNESS
ACTDAYS<-health$ACTDAYS
HSCORE<-health$HSCORE
CHCOND1<-health$CHCOND1
CHCOND2<-health$CHCOND2
AGE2<-AGE^2

Finally you can run the generalized linear models of your choice. Here I will show how to run Poisson Regression and Negative Binomial Regression:
 

Poisson Regression

help(glm)
model1<-glm(NONDOCCO~SEX+AGE+AGE2+INCOME+LEVYPLUS+FREEPOOR+FREEREPA+
ILLNESS+ACTDAYS+HSCORE+CHCOND1+CHCOND2, family=poisson)

summary(model1)

Call:
glm(formula = NONDOCCO ~ SEX + AGE + AGE2 + INCOME + LEVYPLUS + 
    FREEPOOR + FREEREPA + ILLNESS + ACTDAYS + HSCORE + CHCOND1 + 
    CHCOND2, family = poisson)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6134  -0.6432  -0.4722  -0.3720   7.6330  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.443790   0.238863 -10.231  < 2e-16 ***
SEX          0.331540   0.069182   4.792 1.65e-06 ***
AGE         -3.306751   1.220280  -2.710 0.006732 ** 
AGE2         4.388976   1.293001   3.394 0.000688 ***
INCOME      -0.035329   0.110904  -0.319 0.750065    
LEVYPLUS     0.327930   0.097397   3.367 0.000760 ***
FREEPOOR     0.015368   0.210538   0.073 0.941810    
FREEREPA     0.482330   0.115504   4.176 2.97e-05 ***
ILLNESS      0.054750   0.021383   2.560 0.010455 *  
ACTDAYS      0.097933   0.006041  16.212  < 2e-16 ***
HSCORE       0.044825   0.011521   3.891 0.000100 ***
CHCOND1      0.518584   0.086754   5.978 2.26e-09 ***
CHCOND2      1.078660   0.097878  11.021  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 6127.9  on 5189  degrees of freedom
Residual deviance: 5040.9  on 5177  degrees of freedom
AIC: 6244.7

Number of Fisher Scoring iterations: 5
 

Negative Binomial Regression

To run a Negative Binomial, you need to download the library MASS:

library(MASS)
help(glm.nb)
model2<-glm.nb(NONDOCCO~SEX+AGE+AGE2+INCOME+LEVYPLUS+FREEPOOR+FREEREPA+ILLNESS
+ACTDAYS+HSCORE+CHCOND1+CHCOND2)
summary(model2)

Call:
glm.nb(formula = NONDOCCO ~ SEX + AGE + AGE2 + INCOME + LEVYPLUS + 
    FREEPOOR + FREEREPA + ILLNESS + ACTDAYS + HSCORE + CHCOND1 + 
    CHCOND2, init.theta = 0.112257226738908, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9973  -0.4692  -0.3647  -0.2960   3.5625  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.78263    0.42559  -6.538 6.22e-11 ***
SEX          0.23084    0.12586   1.834  0.06663 .  
AGE         -2.68034    2.32877  -1.151  0.24974    
AGE2         3.85835    2.55362   1.511  0.13081    
INCOME      -0.06209    0.19609  -0.317  0.75153    
LEVYPLUS     0.29899    0.15630   1.913  0.05577 .  
FREEPOOR    -0.19661    0.34860  -0.564  0.57275    
FREEREPA     0.58787    0.21242   2.767  0.00565 ** 
ILLNESS      0.14428    0.04471   3.227  0.00125 ** 
ACTDAYS      0.13702    0.01661   8.251  < 2e-16 ***
HSCORE       0.07390    0.02667   2.771  0.00559 ** 
CHCOND1      0.41161    0.14116   2.916  0.00355 ** 
CHCOND2      1.12361    0.18578   6.048 1.47e-09 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

(Dispersion parameter for Negative Binomial(0.1123) family taken to be 1)

    Null deviance: 1783.3  on 5189  degrees of freedom
Residual deviance: 1417.4  on 5177  degrees of freedom
AIC: 10414

Number of Fisher Scoring iterations: 1

Correlation of Coefficients:
         ( S AGE AGE2 IN L FREEP FREER IL AC H CHCOND1
SEX        1                                          
AGE      +   1                                        
AGE2     +   B   1                                    
INCOME                1                               
LEVYPLUS                 1                            
FREEPOOR                   1                          
FREEREPA                 ,       1                    
ILLNESS                                1              
ACTDAYS                                   1           
HSCORE                                       1        
CHCOND1                                        1      
CHCOND2                                        .      
attr(,"legend")
[1] 0 ` ' 0.3 `.' 0.6 `,' 0.8 `+' 0.9 `*' 0.95 `B' 1
 

              Theta:  0.11226 
          Std. Err.:  0.00848 

 2 x log-likelihood:  -4320.99100 
 

That's all for today. I hope it helps.

 

 Last update: November 28, 2007