Welcome to a new issue of e-Tutorial, where we’ll focus on Count Data models, with special focus to Poisson and Negative Binomial regression. ¹

Data

You can download the data set, called weco14.txt from the Econ 536 web site. Save it in your preferred directory.

The data set used here comes from A. Colin Cameron and Per Johansson, “Count Data Regression Using Series Expansion: With Applications”, Journal of Applied Econometrics, Vol. 12, No. 3, 1997, pp. 203-224.

The data is in STATA format, and you can download it from the Econ 536 web site. According to the authors, the data set is based on the 1977-78 Australian Health Survey. It contains 5190 observations on the following variables:

NONDOCCO: Number of consultations in the past four week with non-doctor health professionals (chemist, optician, physiotherapist, etc.)
SEX Gender of patient (female=1)
AGE Age of patient (in years)
INCOME Patient’s annual income (in hundreds of dollars)
LEVYPLUS Dummy for private insurance coverage (=1)
FREEPOOR Dummy for free government insurance coverage due to low income (=1)
FREEREPA Dummy for free government insurance coverage due to old age, disability, or veteran status (=1)
ILLNESS Number of illnesses in past two weeks
ACTDAYS Number of days of reduced activity in past two weeks due to illness or injury
HSCORE Health questionnaire score (high score=bad health)
CHCOND1 Dummy for chronic condition not limiting activity (=1)
CHCOND2 Dummy for chronic condition limiting activity (=1)

According to the authors, the data is overdispersed (the sample mean of the dependent variable is 0.215 and the sample variance of it is 0.932). This might be an indicator that the Poisoness property (mean equals variance) may be violated, and a Negative Binomial Regression might be necessary.

The aim of the e-TA will be to try to reproduce Cameron and Johansson (1997) main results.

First we load the data. Since it is in STATA format we will use the foreign package to read it into R

require(foreign)

Loading required package: foreign

health<-read.dta("C:/econ536/health.dta")

or you can retrieve it from the web

health<-read.dta("http://www.econ.uiuc.edu/~econ536/Data/health.dta")
head(health)

  NONDOCCO SEX  AGE INCOME LEVYPLUS FREEPOOR FREEREPA ILLNESS ACTDAYS
1        0   1 0.19   0.55        1        0        0       1       4
2        0   1 0.19   0.45        1        0        0       1       2
3        0   0 0.19   0.90        0        0        0       3       0
4        0   0 0.19   0.15        0        0        0       1       0
5        0   0 0.19   0.45        0        0        0       2       5
6        0   1 0.19   0.35        0        0        0       5       1
  HSCORE CHCOND1 CHCOND2
1      1       0       0
2      1       0       0
3      0       0       0
4      0       0       0
5      1       1       0
6      9       1       0

Next we can extract the variables from the data frame, and generate the variable age-squared:

NONDOCCO<-health$NONDOCCO 
SEX<-health$SEX 
AGE<-health$AGE 
INCOME<-health$INCOME 
LEVYPLUS<-health$LEVYPLUS 
FREEPOOR<-health$FREEPOOR 
FREEREPA<-health$FREEREPA 
ILLNESS<-health$ILLNESS 
ACTDAYS<-health$ACTDAYS 
HSCORE<-health$HSCORE 
CHCOND1<-health$CHCOND1 
CHCOND2<-health$CHCOND2 
AGE2<-AGE^2

Finally you can run the generalized linear models of your choice. Here we will focus on how to run Poisson Regression and Negative Binomial Regression Models

Poisson Regression

model1<-glm(NONDOCCO~SEX+AGE+AGE2+INCOME+LEVYPLUS+FREEPOOR+FREEREPA+ 
ILLNESS+ACTDAYS+HSCORE+CHCOND1+CHCOND2, family=poisson, data=health)

summary(model1)


Call:
glm(formula = NONDOCCO ~ SEX + AGE + AGE2 + INCOME + LEVYPLUS + 
    FREEPOOR + FREEREPA + ILLNESS + ACTDAYS + HSCORE + CHCOND1 + 
    CHCOND2, family = poisson, data = health)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6128  -0.6432  -0.4722  -0.3720   7.6335  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.44362    0.24012 -10.177  < 2e-16 ***
SEX          0.33161    0.06965   4.761 1.92e-06 ***
AGE         -3.30781    1.22806  -2.694 0.007070 ** 
AGE2         4.39034    1.30202   3.372 0.000746 ***
INCOME      -0.03528    0.11135  -0.317 0.751322    
LEVYPLUS     0.32790    0.09768   3.357 0.000789 ***
FREEPOOR     0.01543    0.21101   0.073 0.941712    
FREEREPA     0.48208    0.11603   4.155 3.26e-05 ***
ILLNESS      0.05473    0.02155   2.539 0.011117 *  
ACTDAYS      0.09792    0.00610  16.052  < 2e-16 ***
HSCORE       0.04479    0.01165   3.844 0.000121 ***
CHCOND1      0.51862    0.08703   5.959 2.54e-09 ***
CHCOND2      1.07865    0.09839  10.963  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 6127.9  on 5189  degrees of freedom
Residual deviance: 5040.9  on 5177  degrees of freedom
AIC: 6244.7

Number of Fisher Scoring iterations: 7

Negative Binomial Regression

To run a Negative Binomial, you need first to call the package MASS, where the glm.nb function is located:

require(MASS)

Loading required package: MASS

model2<-glm.nb(NONDOCCO~SEX+AGE+AGE2+INCOME+LEVYPLUS+FREEPOOR+FREEREPA+ILLNESS 
+ACTDAYS+HSCORE+CHCOND1+CHCOND2, data=health) 
summary(model2)


Call:
glm.nb(formula = NONDOCCO ~ SEX + AGE + AGE2 + INCOME + LEVYPLUS + 
    FREEPOOR + FREEREPA + ILLNESS + ACTDAYS + HSCORE + CHCOND1 + 
    CHCOND2, data = health, init.theta = 0.1122455352, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9974  -0.4691  -0.3647  -0.2959   3.5623  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.78385    0.42564  -6.540 6.13e-11 ***
SEX          0.23081    0.12587   1.834  0.06671 .  
AGE         -2.67555    2.32887  -1.149  0.25061    
AGE2         3.85430    2.55376   1.509  0.13123    
INCOME      -0.06214    0.19611  -0.317  0.75136    
LEVYPLUS     0.29868    0.15632   1.911  0.05604 .  
FREEPOOR    -0.19681    0.34861  -0.565  0.57237    
FREEREPA     0.58772    0.21244   2.767  0.00567 ** 
ILLNESS      0.14438    0.04471   3.229  0.00124 ** 
ACTDAYS      0.13706    0.01661   8.253  < 2e-16 ***
HSCORE       0.07397    0.02667   2.774  0.00555 ** 
CHCOND1      0.41153    0.14118   2.915  0.00356 ** 
CHCOND2      1.12415    0.18580   6.050 1.44e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.1122) family taken to be 1)

    Null deviance: 1783.0  on 5189  degrees of freedom
Residual deviance: 1417.1  on 5177  degrees of freedom
AIC: 4349

Number of Fisher Scoring iterations: 1

              Theta:  0.11225 
          Std. Err.:  0.00848 

 2 x log-likelihood:  -4320.99100

As we know the negative binomial models does not assume that conditional means are equal to the conditional variances. This inequality is captured by estimating a dispersion parameter, which is not shown in the R output, Moreover, this parameter is constant in the poisson model. So we can use a likelihood ratio test to test:

pchisq(2 * (logLik(model2) - logLik(model1)), df = 1, lower.tail = FALSE)

'log Lik.' 0 (df=14)

What does this evidence suggest?

Please send comments to hrtdmrt2@illinois.edu or srmntbr2@illinois.edu ↩

e-TA 16: Count Data Models

Data

Poisson Regression

Negative Binomial Regression