e-TA 16: Count Data Models
Welcome to a new issue of e-Tutorial, where we’ll focus on Count Data models, with special focus to Poisson and Negative Binomial regression. 1
Data
You can download the data set, called weco14.txt from the Econ 508 web site. Save it in your preferred directory.
The data set used here comes from A. Colin Cameron and Per Johansson, “Count Data Regression Using Series Expansion: With Applications”, Journal of Applied Econometrics, Vol. 12, No. 3, 1997, pp. 203-224.
The data is in STATA format, and you can download it from the Econ 508 web site. According to the authors, the data set is based on the 1977-78 Australian Health Survey. It contains 5190 observations on the following variables:
- NONDOCCO: Number of consultations in the past four week with non-doctor health professionals (chemist, optician, physiotherapist, etc.)
- SEX Gender of patient (female=1)
- AGE Age of patient (in years)
- INCOME Patient’s annual income (in hundreds of dollars)
- LEVYPLUS Dummy for private insurance coverage (=1)
- FREEPOOR Dummy for free government insurance coverage due to low income (=1)
- FREEREPA Dummy for free government insurance coverage due to old age, disability, or veteran status (=1)
- ILLNESS Number of illnesses in past two weeks
- ACTDAYS Number of days of reduced activity in past two weeks due to illness or injury
- HSCORE Health questionnaire score (high score=bad health)
- CHCOND1 Dummy for chronic condition not limiting activity (=1)
- CHCOND2 Dummy for chronic condition limiting activity (=1)
According to the authors, the data is overdispersed (the sample mean of the dependent variable is 0.215 and the sample variance of it is 0.932). This might be an indicator that the Poisoness property (mean equals variance) may be violated, and a Negative Binomial Regression might be necessary.
The aim of the e-TA will be to try to reproduce Cameron and Johansson (1997) main results.
First we load the data. Since it is in STATA format we will use the foreign
package to read it into R
require(foreign)
Loading required package: foreign
health<-read.dta("C:/econ508/health.dta")
or you can retrieve it from the web
health<-read.dta("http://www.econ.uiuc.edu/~econ508/data/health.dta")
head(health)
NONDOCCO SEX AGE INCOME LEVYPLUS FREEPOOR FREEREPA ILLNESS ACTDAYS
1 0 1 0.19 0.55 1 0 0 1 4
2 0 1 0.19 0.45 1 0 0 1 2
3 0 0 0.19 0.90 0 0 0 3 0
4 0 0 0.19 0.15 0 0 0 1 0
5 0 0 0.19 0.45 0 0 0 2 5
6 0 1 0.19 0.35 0 0 0 5 1
HSCORE CHCOND1 CHCOND2
1 1 0 0
2 1 0 0
3 0 0 0
4 0 0 0
5 1 1 0
6 9 1 0
Next we can extract the variables from the data frame, and generate the variable age-squared:
NONDOCCO<-health$NONDOCCO
SEX<-health$SEX
AGE<-health$AGE
INCOME<-health$INCOME
LEVYPLUS<-health$LEVYPLUS
FREEPOOR<-health$FREEPOOR
FREEREPA<-health$FREEREPA
ILLNESS<-health$ILLNESS
ACTDAYS<-health$ACTDAYS
HSCORE<-health$HSCORE
CHCOND1<-health$CHCOND1
CHCOND2<-health$CHCOND2
AGE2<-AGE^2
Finally you can run the generalized linear models of your choice. Here we will focus on how to run Poisson Regression and Negative Binomial Regression Models
Poisson Regression
model1<-glm(NONDOCCO~SEX+AGE+AGE2+INCOME+LEVYPLUS+FREEPOOR+FREEREPA+
ILLNESS+ACTDAYS+HSCORE+CHCOND1+CHCOND2, family=poisson, data=health)
summary(model1)
Call:
glm(formula = NONDOCCO ~ SEX + AGE + AGE2 + INCOME + LEVYPLUS +
FREEPOOR + FREEREPA + ILLNESS + ACTDAYS + HSCORE + CHCOND1 +
CHCOND2, family = poisson, data = health)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6128 -0.6432 -0.4722 -0.3720 7.6335
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.44362 0.24012 -10.177 < 2e-16 ***
SEX 0.33161 0.06965 4.761 1.92e-06 ***
AGE -3.30781 1.22806 -2.694 0.007070 **
AGE2 4.39034 1.30202 3.372 0.000746 ***
INCOME -0.03528 0.11135 -0.317 0.751322
LEVYPLUS 0.32790 0.09768 3.357 0.000789 ***
FREEPOOR 0.01543 0.21101 0.073 0.941712
FREEREPA 0.48208 0.11603 4.155 3.26e-05 ***
ILLNESS 0.05473 0.02155 2.539 0.011117 *
ACTDAYS 0.09792 0.00610 16.052 < 2e-16 ***
HSCORE 0.04479 0.01165 3.844 0.000121 ***
CHCOND1 0.51862 0.08703 5.959 2.54e-09 ***
CHCOND2 1.07865 0.09839 10.963 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 6127.9 on 5189 degrees of freedom
Residual deviance: 5040.9 on 5177 degrees of freedom
AIC: 6244.7
Number of Fisher Scoring iterations: 7
Negative Binomial Regression
To run a Negative Binomial, you need first to call the package MASS
, where the glm.nb
function is located:
require(MASS)
Loading required package: MASS
model2<-glm.nb(NONDOCCO~SEX+AGE+AGE2+INCOME+LEVYPLUS+FREEPOOR+FREEREPA+ILLNESS
+ACTDAYS+HSCORE+CHCOND1+CHCOND2, data=health)
summary(model2)
Call:
glm.nb(formula = NONDOCCO ~ SEX + AGE + AGE2 + INCOME + LEVYPLUS +
FREEPOOR + FREEREPA + ILLNESS + ACTDAYS + HSCORE + CHCOND1 +
CHCOND2, data = health, init.theta = 0.1122455352, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9974 -0.4691 -0.3647 -0.2959 3.5623
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.78385 0.42564 -6.540 6.13e-11 ***
SEX 0.23081 0.12587 1.834 0.06671 .
AGE -2.67555 2.32887 -1.149 0.25061
AGE2 3.85430 2.55376 1.509 0.13123
INCOME -0.06214 0.19611 -0.317 0.75136
LEVYPLUS 0.29868 0.15632 1.911 0.05604 .
FREEPOOR -0.19681 0.34861 -0.565 0.57237
FREEREPA 0.58772 0.21244 2.767 0.00567 **
ILLNESS 0.14438 0.04471 3.229 0.00124 **
ACTDAYS 0.13706 0.01661 8.253 < 2e-16 ***
HSCORE 0.07397 0.02667 2.774 0.00555 **
CHCOND1 0.41153 0.14118 2.915 0.00356 **
CHCOND2 1.12415 0.18580 6.050 1.44e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(0.1122) family taken to be 1)
Null deviance: 1783.0 on 5189 degrees of freedom
Residual deviance: 1417.1 on 5177 degrees of freedom
AIC: 4349
Number of Fisher Scoring iterations: 1
Theta: 0.11225
Std. Err.: 0.00848
2 x log-likelihood: -4320.99100
As we know the negative binomial models does not assume that conditional means are equal to the conditional variances. This inequality is captured by estimating a dispersion parameter, which is not shown in the R
output, Moreover, this parameter is constant in the poisson model. So we can use a likelihood ratio test to test:
pchisq(2 * (logLik(model2) - logLik(model1)), df = 1, lower.tail = FALSE)
'log Lik.' 0 (df=14)
What does this evidence suggest?
Please send comments to bottan2@illinois.edu or srmntbr2@illinois.edu↩