Welcome
to the seventeenth issue of e-Tutorial. This time we focus on Count Data
models, with special focus to Poisson and Negative Binomial regression.
Data
The data set used here comes
from A. Colin Cameron and Per Johansson, "Count
Data Regression Using Series Expansion: With Applications", Journal
of Applied Econometrics, Vol. 12, No. 3, 1997, pp. 203-224. I already
saved the original data in STATA format, so you can download it here
(or visit the Econ
508 web site). For R users, the Appendix A
introduces the package "foreign", which is very helpful for the translation
from STATA (and others) into plain ascii format.
According to the authors,
the data set is based on the 1977-78 Australian Health Survey. It contains
5190 observations on the following variables:
Dependent
Variable: |
Meaning: |
NONDOCCO |
Number of consultations
in the past four week with non-doctor health professionals (chemist, optician,
physiotherapist, etc.) |
|
|
Regressors: |
|
SEX |
Gender of patient (female=1) |
AGE |
Age of patient (in years) |
INCOME |
Patient's annual income
(in hundreds of dollars) |
LEVYPLUS |
Dummy for private insurance
coverage (=1) |
FREEPOOR |
Dummy for free government
insurance coverage due to low income (=1) |
FREEREPA |
Dummy for free government
insurance coverage due to old age, disability, or veteran status (=1) |
ILLNESS |
Number of illnesses in
past two weeks |
ACTDAYS |
Number of days of reduced
activity in past two weeks due to illness or injury |
HSCORE |
Health questionnaire score
(high score=bad health) |
CHCOND1 |
Dummy for chronic condition
not limiting activity (=1) |
CHCOND2 |
Dummy for chronic condition
limiting activity (=1) |
According to the authors,
the data is overdispersed (the sample mean of the dependent variable is
0.215 and the sample variance of it is 0.932). This might be an indicator
that the Poisoness property (mean equals variance) may be violated, and
a Negative Binomial Regression might be necessary.
Let's try to reproduce Cameron
and Johansson (1997) main results.
In STATA:
You need to generate the
variable age-squared:
gen
AGE2=AGE^2
Poisson Regression
You can run a Poisson Regression
as follows:
poisson
NONDOCCO SEX AGE AGE2 INCOME LEVYPLUS FREEPOOR FREEREPA ILLNESS ACTDAYS
HSCORE CHCOND1 CHCOND2
Iteration
0: log likelihood = -4205.1938
Iteration
1: log likelihood = -3382.4299
Iteration
2: log likelihood = -3112.8639
Iteration
3: log likelihood = -3109.3816
Iteration
4: log likelihood = -3109.3722
Iteration
5: log likelihood = -3109.3722
Poisson
regression
Number of obs = 5190
LR chi2(12) = 1086.94
Prob > chi2 = 0.0000
Log
likelihood = -3109.3722
Pseudo R2 =
0.1488
------------------------------------------------------------------------------
NONDOCCO
| Coef. Std. Err.
z P>|z| [95%
Conf. Interval]
---------+--------------------------------------------------------------------
SEX | .3316144 .0696475
4.761 0.000 .1951079
.4681209
AGE | -3.307805 1.228064 -2.694
0.007 -5.714766 -.9008437
AGE2 | 4.39034 1.30202
3.372 0.001 1.838428
6.942252
INCOME | -.0352855 .1113467 -0.317
0.751 -.2535211 .1829501
LEVYPLUS
| .3278973 .097685
3.357 0.001 .1364383
.5193564
FREEPOOR
| .0154283 .2110056
0.073 0.942 -.3981352
.4289917
FREEREPA
| .4820755 .1160326
4.155 0.000 .2546559
.7094952
ILLNESS
| .054726 .0215542
2.539 0.011 .0124806
.0969714
ACTDAYS
| .0979188 .0061003 16.052
0.000 .0859625 .1098751
HSCORE | .0447936 .0116531
3.844 0.000 .0219539
.0676332
CHCOND1
| .5186225 .087033
5.959 0.000 .348041
.689204
CHCOND2
| 1.078644 .0983912 10.963
0.000 .8858014 1.271488
_cons | -2.443619 .2401184 -10.177
0.000 -2.914242 -1.972995
------------------------------------------------------------------------------
You can check the "Poisoness"
property by typing:
poisgof
Goodness of fit chi-2 = 5040.935
Prob > chi2(5177) = 0.9103
The null hypothesis (of
Poisoness) can not be rejected in the test above, meaning that a Poisson
Regression is fine for this data. Nevertheless, I will show how to compute
the Negative Binomial Regression anyway.
Negative Binomial Regression
You can run a Negative Binomial
Regression as follows:
nbreg
NONDOCCO SEX AGE AGE2 INCOME LEVYPLUS FREEPOOR FREEREPA ILLNESS ACTDAYS
HSCORE CHCOND1 CHCOND2
Fitting
comparison Poisson model:
Iteration
0: log likelihood = -4205.1938
Iteration
1: log likelihood = -3382.4299
Iteration
2: log likelihood = -3112.8639
Iteration
3: log likelihood = -3109.3816
Iteration
4: log likelihood = -3109.3722
Iteration
5: log likelihood = -3109.3722
Fitting
constant-only model:
Iteration
0: log likelihood = -2940.014 (not concave)
Iteration
1: log likelihood = -2313.0946
Iteration
2: log likelihood = -2312.632
Iteration
3: log likelihood = -2312.632
Fitting
full model:
Iteration
0: log likelihood = -2214.1623
Iteration
1: log likelihood = -2205.8669
Iteration
2: log likelihood = -2161.0568
Iteration
3: log likelihood = -2160.4958
Iteration
4: log likelihood = -2160.4953
Iteration
5: log likelihood = -2160.4953
Negative
binomial regression
Number of obs = 5190
LR chi2(12) = 304.27
Prob > chi2 = 0.0000
Log
likelihood = -2160.4953
Pseudo R2 =
0.0658
------------------------------------------------------------------------------
NONDOCCO
| Coef. Std. Err.
z P>|z| [95%
Conf. Interval]
---------+--------------------------------------------------------------------
SEX | .2308069 .1246663
1.851 0.064 -.0135347
.4751484
AGE | -2.675554 2.430949 -1.101
0.271 -7.440127 2.08902
AGE2 | 3.854298 2.62543
1.468 0.142 -1.291451
9.000047
INCOME | -.0621359 .1906016 -0.326
0.744 -.4357083 .3114364
LEVYPLUS
| .2986752 .1583588
1.886 0.059 -.0117024
.6090528
FREEPOOR
| -.196811 .3515133 -0.560
0.576 -.8857645 .4921425
FREEREPA
| .5877179 .2185872
2.689 0.007 .1592949
1.016141
ILLNESS
| .1443791 .0467823
3.086 0.002 .0526875
.2360708
ACTDAYS
| .1370558 .0170753
8.027 0.000 .1035888
.1705228
HSCORE | .0739655 .0279625
2.645 0.008 .0191601
.1287709
CHCOND1
| .4115285 .142977
2.878 0.004 .1312987
.6917583
CHCOND2
| 1.124148 .1830997
6.140 0.000 .7652796
1.483017
_cons | -2.783845 .4351314 -6.398
0.000 -3.636687 -1.931003
---------+--------------------------------------------------------------------
/lnalpha
| 2.187067 .0758164
2.038469 2.335664
---------+--------------------------------------------------------------------
alpha | 8.90904 .6754515
13.190 0.000 7.678844
10.33632
------------------------------------------------------------------------------
Likelihood
ratio test of alpha=0: chi2(1) = 1897.75
Prob > chi2 = 0.0000
Appendix
A: Count Data Models in R
The first thing you need
to do is to download the package "foreign", such that you can translate
the data set (helth.dta) into ascii format. You can do that as follows:
install.packages("foreign")
library(foreign)
health<-read.dta("C:/health.dta")
summary(health)
Next you need to extract
the variables from the data frame, and generate the variable age-squared:
NONDOCCO<-health$NONDOCCO
SEX<-health$SEX
AGE<-health$AGE
INCOME<-health$INCOME
LEVYPLUS<-health$LEVYPLUS
FREEPOOR<-health$FREEPOOR
FREEREPA<-health$FREEREPA
ILLNESS<-health$ILLNESS
ACTDAYS<-health$ACTDAYS
HSCORE<-health$HSCORE
CHCOND1<-health$CHCOND1
CHCOND2<-health$CHCOND2
AGE2<-AGE^2
Finally you can run the
generalized linear models of your choice. Here I will show how to run Poisson
Regression and Negative Binomial Regression:
Poisson Regression
help(glm)
model1<-glm(NONDOCCO~SEX+AGE+AGE2+INCOME+LEVYPLUS+FREEPOOR+FREEREPA+
ILLNESS+ACTDAYS+HSCORE+CHCOND1+CHCOND2,
family=poisson)
summary(model1)
Call:
glm(formula
= NONDOCCO ~ SEX + AGE + AGE2 + INCOME + LEVYPLUS +
FREEPOOR + FREEREPA + ILLNESS + ACTDAYS + HSCORE + CHCOND1 +
CHCOND2, family = poisson)
Deviance
Residuals:
Min 1Q Median
3Q Max
-2.6134
-0.6432 -0.4722 -0.3720 7.6330
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)
-2.443790 0.238863 -10.231 < 2e-16 ***
SEX
0.331540 0.069182 4.792 1.65e-06 ***
AGE
-3.306751 1.220280 -2.710 0.006732 **
AGE2
4.388976 1.293001 3.394 0.000688 ***
INCOME
-0.035329 0.110904 -0.319 0.750065
LEVYPLUS
0.327930 0.097397 3.367 0.000760 ***
FREEPOOR
0.015368 0.210538 0.073 0.941810
FREEREPA
0.482330 0.115504 4.176 2.97e-05 ***
ILLNESS
0.054750 0.021383 2.560 0.010455 *
ACTDAYS
0.097933 0.006041 16.212 < 2e-16 ***
HSCORE
0.044825 0.011521 3.891 0.000100 ***
CHCOND1
0.518584 0.086754 5.978 2.26e-09 ***
CHCOND2
1.078660 0.097878 11.021 < 2e-16 ***
---
Signif.
codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion
parameter for poisson family taken to be 1)
Null deviance: 6127.9 on 5189 degrees of freedom
Residual
deviance: 5040.9 on 5177 degrees of freedom
AIC:
6244.7
Number
of Fisher Scoring iterations: 5
Negative Binomial Regression
To run a Negative Binomial,
you need to download the library MASS:
library(MASS)
help(glm.nb)
model2<-glm.nb(NONDOCCO~SEX+AGE+AGE2+INCOME+LEVYPLUS+FREEPOOR+FREEREPA+ILLNESS
+ACTDAYS+HSCORE+CHCOND1+CHCOND2)
summary(model2)
Call:
glm.nb(formula
= NONDOCCO ~ SEX + AGE + AGE2 + INCOME + LEVYPLUS +
FREEPOOR + FREEREPA + ILLNESS + ACTDAYS + HSCORE + CHCOND1 +
CHCOND2, init.theta = 0.112257226738908, link = log)
Deviance
Residuals:
Min 1Q Median
3Q Max
-0.9973
-0.4692 -0.3647 -0.2960 3.5625
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)
-2.78263 0.42559 -6.538 6.22e-11 ***
SEX
0.23084 0.12586 1.834 0.06663 .
AGE
-2.68034 2.32877 -1.151 0.24974
AGE2
3.85835 2.55362 1.511 0.13081
INCOME
-0.06209 0.19609 -0.317 0.75153
LEVYPLUS
0.29899 0.15630 1.913 0.05577 .
FREEPOOR
-0.19661 0.34860 -0.564 0.57275
FREEREPA
0.58787 0.21242 2.767 0.00565 **
ILLNESS
0.14428 0.04471 3.227 0.00125 **
ACTDAYS
0.13702 0.01661 8.251 < 2e-16 ***
HSCORE
0.07390 0.02667 2.771 0.00559 **
CHCOND1
0.41161 0.14116 2.916 0.00355 **
CHCOND2
1.12361 0.18578 6.048 1.47e-09 ***
---
Signif.
codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion
parameter for Negative Binomial(0.1123) family taken to be 1)
Null deviance: 1783.3 on 5189 degrees of freedom
Residual
deviance: 1417.4 on 5177 degrees of freedom
AIC:
10414
Number
of Fisher Scoring iterations: 1
Correlation
of Coefficients:
( S AGE AGE2 IN L FREEP FREER IL AC H CHCOND1
SEX
1
AGE
+ 1
AGE2
+ B 1
INCOME
1
LEVYPLUS
1
FREEPOOR
1
FREEREPA
, 1
ILLNESS
1
ACTDAYS
1
HSCORE
1
CHCOND1
1
CHCOND2
.
attr(,"legend")
[1]
0 ` ' 0.3 `.' 0.6 `,' 0.8 `+' 0.9 `*' 0.95 `B' 1
Theta: 0.11226
Std. Err.: 0.00848
2
x log-likelihood: -4320.99100
That's all for today. I
hope it helps.
|