Welcome to a new issue of e-Tutorial, where we’ll focus on Count Data models, with special focus to Poisson and Negative Binomial regression. 1
You can download the data set, called weco14.txt from the Econ 536 web site. Save it in your preferred directory.
The data set used here comes from A. Colin Cameron and Per Johansson, “Count Data Regression Using Series Expansion: With Applications”, Journal of Applied Econometrics, Vol. 12, No. 3, 1997, pp. 203-224.
The data is in STATA format, and you can download it from the Econ 536 web site. According to the authors, the data set is based on the 1977-78 Australian Health Survey. It contains 5190 observations on the following variables:
According to the authors, the data is overdispersed (the sample mean of the dependent variable is 0.215 and the sample variance of it is 0.932). This might be an indicator that the Poisoness property (mean equals variance) may be violated, and a Negative Binomial Regression might be necessary.
The aim of the e-TA will be to try to reproduce Cameron and Johansson (1997) main results.
First we load the data. Since it is in STATA format we will use the foreign
package to read it into R
require(foreign)
Loading required package: foreign
health<-read.dta("C:/econ536/health.dta")
or you can retrieve it from the web
health<-read.dta("http://www.econ.uiuc.edu/~econ536/Data/health.dta")
head(health)
NONDOCCO SEX AGE INCOME LEVYPLUS FREEPOOR FREEREPA ILLNESS ACTDAYS
1 0 1 0.19 0.55 1 0 0 1 4
2 0 1 0.19 0.45 1 0 0 1 2
3 0 0 0.19 0.90 0 0 0 3 0
4 0 0 0.19 0.15 0 0 0 1 0
5 0 0 0.19 0.45 0 0 0 2 5
6 0 1 0.19 0.35 0 0 0 5 1
HSCORE CHCOND1 CHCOND2
1 1 0 0
2 1 0 0
3 0 0 0
4 0 0 0
5 1 1 0
6 9 1 0
Next we can extract the variables from the data frame, and generate the variable age-squared:
NONDOCCO<-health$NONDOCCO
SEX<-health$SEX
AGE<-health$AGE
INCOME<-health$INCOME
LEVYPLUS<-health$LEVYPLUS
FREEPOOR<-health$FREEPOOR
FREEREPA<-health$FREEREPA
ILLNESS<-health$ILLNESS
ACTDAYS<-health$ACTDAYS
HSCORE<-health$HSCORE
CHCOND1<-health$CHCOND1
CHCOND2<-health$CHCOND2
AGE2<-AGE^2
Finally you can run the generalized linear models of your choice. Here we will focus on how to run Poisson Regression and Negative Binomial Regression Models
model1<-glm(NONDOCCO~SEX+AGE+AGE2+INCOME+LEVYPLUS+FREEPOOR+FREEREPA+
ILLNESS+ACTDAYS+HSCORE+CHCOND1+CHCOND2, family=poisson, data=health)
summary(model1)
Call:
glm(formula = NONDOCCO ~ SEX + AGE + AGE2 + INCOME + LEVYPLUS +
FREEPOOR + FREEREPA + ILLNESS + ACTDAYS + HSCORE + CHCOND1 +
CHCOND2, family = poisson, data = health)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6128 -0.6432 -0.4722 -0.3720 7.6335
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.44362 0.24012 -10.177 < 2e-16 ***
SEX 0.33161 0.06965 4.761 1.92e-06 ***
AGE -3.30781 1.22806 -2.694 0.007070 **
AGE2 4.39034 1.30202 3.372 0.000746 ***
INCOME -0.03528 0.11135 -0.317 0.751322
LEVYPLUS 0.32790 0.09768 3.357 0.000789 ***
FREEPOOR 0.01543 0.21101 0.073 0.941712
FREEREPA 0.48208 0.11603 4.155 3.26e-05 ***
ILLNESS 0.05473 0.02155 2.539 0.011117 *
ACTDAYS 0.09792 0.00610 16.052 < 2e-16 ***
HSCORE 0.04479 0.01165 3.844 0.000121 ***
CHCOND1 0.51862 0.08703 5.959 2.54e-09 ***
CHCOND2 1.07865 0.09839 10.963 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 6127.9 on 5189 degrees of freedom
Residual deviance: 5040.9 on 5177 degrees of freedom
AIC: 6244.7
Number of Fisher Scoring iterations: 7
To run a Negative Binomial, you need first to call the package MASS
, where the glm.nb
function is located:
require(MASS)
Loading required package: MASS
model2<-glm.nb(NONDOCCO~SEX+AGE+AGE2+INCOME+LEVYPLUS+FREEPOOR+FREEREPA+ILLNESS
+ACTDAYS+HSCORE+CHCOND1+CHCOND2, data=health)
summary(model2)
Call:
glm.nb(formula = NONDOCCO ~ SEX + AGE + AGE2 + INCOME + LEVYPLUS +
FREEPOOR + FREEREPA + ILLNESS + ACTDAYS + HSCORE + CHCOND1 +
CHCOND2, data = health, init.theta = 0.1122455352, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9974 -0.4691 -0.3647 -0.2959 3.5623
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.78385 0.42564 -6.540 6.13e-11 ***
SEX 0.23081 0.12587 1.834 0.06671 .
AGE -2.67555 2.32887 -1.149 0.25061
AGE2 3.85430 2.55376 1.509 0.13123
INCOME -0.06214 0.19611 -0.317 0.75136
LEVYPLUS 0.29868 0.15632 1.911 0.05604 .
FREEPOOR -0.19681 0.34861 -0.565 0.57237
FREEREPA 0.58772 0.21244 2.767 0.00567 **
ILLNESS 0.14438 0.04471 3.229 0.00124 **
ACTDAYS 0.13706 0.01661 8.253 < 2e-16 ***
HSCORE 0.07397 0.02667 2.774 0.00555 **
CHCOND1 0.41153 0.14118 2.915 0.00356 **
CHCOND2 1.12415 0.18580 6.050 1.44e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(0.1122) family taken to be 1)
Null deviance: 1783.0 on 5189 degrees of freedom
Residual deviance: 1417.1 on 5177 degrees of freedom
AIC: 4349
Number of Fisher Scoring iterations: 1
Theta: 0.11225
Std. Err.: 0.00848
2 x log-likelihood: -4320.99100
As we know the negative binomial models does not assume that conditional means are equal to the conditional variances. This inequality is captured by estimating a dispersion parameter, which is not shown in the R
output, Moreover, this parameter is constant in the poisson model. So we can use a likelihood ratio test to test:
pchisq(2 * (logLik(model2) - logLik(model1)), df = 1, lower.tail = FALSE)
'log Lik.' 0 (df=14)
What does this evidence suggest?
Please send comments to hrtdmrt2@illinois.edu or srmntbr2@illinois.edu↩