Applied Econometrics at the University of Illinois: e-Tutorial 12: Panel Data I

Applied Econometrics
Econ 508 - Fall 2007

e-Tutorial 12: Panel Data I - Basics

Welcome to the twelfth issue of e-Tutorial. Here I will cover the fundamentals of panel data estimation, from organizing your panel data set to testing fixed effects against random effects. In the example below I will use the theoretical background of Prof. Koenker's Lecture 13 (2004) to reproduce the results of Greene (1997). I insert STATA commands (plus some comments) wherever necessary. I also provide a short introduction to panel data in R. Have fun!

Example:
Greene (1997) provides a small panel data set with information on costs and output of 6 different firms in 4 different periods of time (1955, 1960, 1965, and 1970). Your job is to estimate a cost function using basic panel data techniques.

Stacking your data:
The data are shown below in stacked form, i.e., the first "T" lines (here T=4) refer to firm 1, the next "T" lines refer to firm 2, and so on. The columns are self-explanatory. To facilitate your work, I included a dummy variable for each firm, in columns D1-D6. The data are described below and available here in ASCII format for download.

Year	Firm	Cost	Output	D1	D2	D3	D4	D5	D6
1955	1	3.154	214	1	0	0	0	0	0
1960	1	4.271	419	1	0	0	0	0	0
1965	1	4.584	588	1	0	0	0	0	0
1970	1	5.849	1025	1	0	0	0	0	0
1955	2	3.859	696	0	1	0	0	0	0
1960	2	5.535	811	0	1	0	0	0	0
1965	2	8.127	1640	0	1	0	0	0	0
1970	2	10.966	2506	0	1	0	0	0	0
(...)	(...)	(...)	(...)	(...)	(...)	(...)	(...)	(...)	(...)
1955	6	73.050	11796	0	0	0	0	0	1
1960	6	98.846	15551	0	0	0	0	0	1
1965	6	138.880	27218	0	0	0	0	0	1
1970	6	191.560	30958	0	0	0	0	0	1

Save the data in your preferred path (I will save mine as "C:/econ508/greene14.txt") and open your preferred software.

In R:
The Appendix A contains a panel data session in R with the main results derived in this tutorial.

In STATA:
The first step is to read your data into the software:

infile Year Firm Cost Output D1 D2 D3 D4 D5 D6 using "C:/econ508/greene14.txt"

Drop the first observation, which contains only missing values because infile reads the variable labels in the first line of the text file as non-numeric data (for example, with drop in 1).

The next step is to generate the log values of costs and outputs:
gen lnc=log(Cost)
gen lny=log(Output)

Finally, you declare your data set as a panel:
iis Firm
tis Year
where iis sets the cross-sectional unit identifier and tis sets the time identifier. In Stata 10 and later, the single command xtset Firm Year replaces both.

Theoretical Background:

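Consider a simplified version of equation (1) in Koenker's Lecture 13:

(1) y_it = x_it b + a_i + u_it

where i = 1,...,n indexes the cross-sectional units (firms), t = 1,...,T indexes time periods, a_i is the individual effect, and u_it is the idiosyncratic error.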

a) Pooled OLS:

The most basic estimator for panel data is Pooled OLS (POLS). Johnston & DiNardo (1997) recall that the POLS estimator ignores the panel structure of the data: it treats observations as serially uncorrelated for a given individual, with homoscedastic errors across individuals and time periods:

(2) b^POLS = (X'X)^-1X'y

In STATA, you can obtain the POLS as follows:
regress lnc lny

Source |       SS       df       MS                  Number of obs =      24
---------+------------------------------               F( 1,    22) = 728.51
   Model |   33.617333     1   33.617333               Prob > F      = 0.0000
Residual | 1.01520396    22 .046145635               R-squared     = 0.9707
---------+------------------------------               Adj R-squared = 0.9694
   Total | 34.6325369    23 1.50576248               Root MSE      = .21482

scalar R2OLS=_result(7)
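Equation (2) can also be checked by hand. The following is a minimal R sketch, assuming only the 12 rows printed in the data table above (firms 1, 2 and 6), so its numbers will not match the 24-observation Stata output. The object names (b_pols, b_lm) are mine and purely illustrative.

```r
# Rows printed in the data table above (firms 1, 2 and 6 only).
# The full data set has 24 rows, so results here differ from the Stata output.
cost   <- c(3.154, 4.271, 4.584, 5.849,
            3.859, 5.535, 8.127, 10.966,
            73.050, 98.846, 138.880, 191.560)
output <- c(214, 419, 588, 1025,
            696, 811, 1640, 2506,
            11796, 15551, 27218, 30958)
lnc <- log(cost)
lny <- log(output)

# Equation (2): b = (X'X)^-1 X'y, with a column of ones for the intercept
X <- cbind(1, lny)
b_pols <- solve(t(X) %*% X, t(X) %*% lnc)

# The same coefficients from lm()
b_lm <- coef(lm(lnc ~ lny))
print(cbind(by_hand = as.vector(b_pols), lm = as.vector(b_lm)))
```

The two columns should agree up to rounding, which is all that regress does for the point estimates.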

b) Fixed Effects (Within-Groups) Estimators:

In Koenker's Lecture 13 we examined the effects of applying the matrices P and Q to the data, where D is the nT x n matrix of firm dummies (columns D1-D6 in the data set above):
P = D(D'D)^-1D' : transforms the data into individual means
Q = I-P : transforms the data into deviations from individual means.
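To see what P and Q do, here is a small R sketch on a made-up balanced panel (3 units, 2 periods). The data and object names are hypothetical and only serve to illustrate the two projections.

```r
# Hypothetical small balanced panel: n = 3 units, T = 2 periods, stacked by unit
id <- factor(rep(1:3, each = 2))
y  <- c(1, 3, 2, 6, 10, 20)

D <- model.matrix(~ id - 1)               # nT x n matrix of unit dummies
P <- D %*% solve(t(D) %*% D) %*% t(D)     # projects onto unit means
Q <- diag(length(y)) - P                  # deviations from unit means

Py <- as.vector(P %*% y)   # each observation replaced by its unit mean
Qy <- as.vector(Q %*% y)   # each observation minus its unit mean
print(cbind(y, Py, Qy))
```

Both matrices are symmetric and idempotent (PP = P, QQ = Q) and PQ = 0, which is what makes the within and between estimators below simple OLS regressions on transformed data.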

The within-groups (or fixed effects) estimator is then given by:

(3) b^W= (X'QX)^-1X'Qy

Given that Q is symmetric and idempotent, this is equivalent to regressing Qy on QX, i.e., using data in the form of deviations from individual means. In STATA, you can obtain the within-groups estimators using the built-in function xtreg, fe:

xtreg lnc lny, fe
Fixed-effects (within) regression               Number of obs      =        24
Group variable (i) : Firm                       Number of groups   =         6

R-sq:  within  = 0.8774                         Obs per group: min =         4
       between = 0.9833                                        avg =       4.0
       overall = 0.9707                                        max =         4

                                                F(1,17)            =    121.66
corr(u_i, Xb)  = 0.8495                         Prob > F           =    0.0000

------------------------------------------------------------------------------
     lnc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     lny |   .6742789   .0611307     11.030   0.000       .5453044    .8032534
   _cons | -2.399009    .508593     -4.717   0.000      -3.472046   -1.325972
------------------------------------------------------------------------------
sigma_u | .36730483
sigma_e | .12463167
     rho | .89675322   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(5,17) =     9.67                Prob > F = 0.0002

matrix bW=get(_b)
matrix VW=get(VCE)

Note: The intercept shown above is the average of the individual intercepts. If you are interested in obtaining firm-specific intercepts, go to Appendix B.

Between-Groups Estimators:
Another useful estimator is provided when you use only the group means, i.e., transforming your data by applying the matrix P to equation (1) above:

(4) b^B= [X'PX]^-1X'Py

In STATA, you can obtain the between-groups estimators using the built-in function xtreg, be:

xtreg lnc lny, be
Between regression (regression on group means)  Number of obs      =        24
Group variable (i) : Firm                       Number of groups   =         6

                                                F(1,4)             =    236.23
sd(u_i + avg(e_i.))=  .1838474                  Prob > F           =    0.0001

matrix bB=get(_b)
matrix VB=get(VCE)

c) Random Effects:

Following Koenker's Lecture 13, now treat the a_i's as random and uncorrelated with the regressors. The model is then estimated via GLS:

(5) b^GLS= [X'Omega^-1X]^-1X'Omega^-1y

where Omega = (sigma_u²*I_nT + T*sigma_a²*P)
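It may help to see why this Omega is easy to invert. Since P and Q are orthogonal projections with P + Q = I, Omega can be written as a weighted sum of the two. The derivation below is my own sketch in the notation of this section (sigma_u² for the idiosyncratic error variance, sigma_a² for the variance of the individual effects); the symbol psi is introduced here only for convenience.

```latex
\Omega = \sigma_u^2 I_{nT} + T\sigma_a^2 P
       = \sigma_u^2 Q + \left(\sigma_u^2 + T\sigma_a^2\right) P,
\qquad
\Omega^{-1} = \frac{1}{\sigma_u^2}\left[\, Q + \psi P \,\right],
\qquad
\psi = \frac{\sigma_u^2}{\sigma_u^2 + T\sigma_a^2},

\hat b^{GLS} = \left[ X'(Q + \psi P)X \right]^{-1} X'(Q + \psi P)\,y .
```

With psi = 1 (no individual effects, sigma_a² = 0) this collapses to pooled OLS, and as psi goes to 0 the slope estimates approach the within-groups estimator, so GLS sits between the two.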

You can obtain GLS estimators in STATA by using the built-in function xtreg, re:

xtreg lnc lny, re
Random-effects GLS regression                   Number of obs      =        24
Group variable (i) : Firm                       Number of groups   =         6

Random effects u_i ~ Gaussian                   Wald chi2(1)       =    268.10
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     lnc |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     lny |   .7963203   .0486336     16.374   0.000       .7010002    .8916404
   _cons | -3.413094   .4131166     -8.262   0.000      -4.222788     -2.6034
---------+--------------------------------------------------------------------
sigma_u | .17296414
sigma_e | .12463167
     rho | .65823599   (fraction of variance due to u_i)
------------------------------------------------------------------------------
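A note on notation: in the STATA output, sigma_u is the standard deviation of the individual effect (a_i in equation (1)) and sigma_e is that of the idiosyncratic error (u_it in equation (1)). The R sketch below recomputes rho from the two reported values and, as an extra illustration not shown in the output, the quasi-demeaning weight that these variance components imply for a balanced panel. The name theta is mine.

```r
# Values copied from the xtreg, re output above
sigma_u <- 0.17296414   # s.d. of the individual effect (a_i in equation (1))
sigma_e <- 0.12463167   # s.d. of the idiosyncratic error (u_it in equation (1))
T_periods <- 4          # periods per firm

# Share of the total variance due to the individual effect
rho <- sigma_u^2 / (sigma_u^2 + sigma_e^2)

# Quasi-demeaning weight implied by the variance components (balanced panel):
# GLS is OLS on y_it - theta * ybar_i and x_it - theta * xbar_i
theta <- 1 - sqrt(sigma_e^2 / (sigma_e^2 + T_periods * sigma_u^2))

print(round(c(rho = rho, theta = theta), 4))
```

The value of rho should reproduce the .658 reported by STATA. A theta of 0 would correspond to pooled OLS and a theta of 1 to the within-groups estimator.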

GLS as a Combination of Within- and Between-Groups Estimators:
You can recover the GLS estimator as a matrix-weighted combination of the between and within estimators, as shown in Koenker's Lecture 13:

(5.a) b^GLS= Delta* b^B + (1-Delta)* b^W

where Delta = V^W / (V^W + V^B)
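The division in the definition of Delta should be read as a matrix product with an inverse. As a sketch of where the weight comes from, assume V_W and V_B denote the covariance matrices of the within and between estimators; the GLS estimator is then the precision-weighted average of the two, and the weights simplify as follows:

```latex
\hat b^{GLS}
  = \left(V_W^{-1} + V_B^{-1}\right)^{-1}
    \left(V_W^{-1}\hat b^{W} + V_B^{-1}\hat b^{B}\right),

\left(V_W^{-1} + V_B^{-1}\right)^{-1} V_B^{-1}
  = V_W\left(V_W + V_B\right)^{-1} = \Delta,
\qquad
\left(V_W^{-1} + V_B^{-1}\right)^{-1} V_W^{-1}
  = V_B\left(V_W + V_B\right)^{-1} = I - \Delta .
```

So the more precise the between estimator is relative to the within estimator (small V_B), the closer Delta is to the identity and the more weight b^B receives.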

In STATA, you can recover random effects GLS estimators as follows:

matrix V=VW+VB
matrix Vinv=syminv(V)
matrix D=VW*Vinv
matrix P1=D*bB'
matrix I2=I(2)
matrix RD=I2-D
matrix P2=RD*bW'
matrix bRE=P1+P2
matrix list bRE
bRE[2,1]
              y1
  lny  .79632032
_cons  -3.413094

What should I use: Fixed Effects or Random Effects? A Hausman (1978) Test Approach

Hausman (1978) suggested a test to check whether the individual effects (a_i) are correlated with the regressors (X_it):

- Under the Null Hypothesis: Orthogonality, i.e., no correlation between individual effects and explanatory variables. Both random effects and fixed effects estimators are consistent, but the random effects estimator is efficient, while fixed effects is not.

- Under the Alternative Hypothesis: Individual effects are correlated with the X's. In this case, random effects estimator is inconsistent, while fixed effects estimator is consistent and efficient.

Greene (1997) recalls that, under the null, the two sets of estimates should not differ systematically. Thus, the test is based on the contrast vector (b^GLS - b^W), through the statistic H:

(6) H = [b^GLS- b^W]'[V(b^W)-V(b^GLS)]^-1[b^GLS- b^W] ~ Chi-squared (k)

where k is the number of regressors in X (excluding constant). In STATA, you can obtain that as follows:

xtreg lnc lny, fe
hausman, save
xtreg lnc lny, re
hausman

            ---- Coefficients ----
         |      (b)          (B)            (b-B)   sqrt(diag(V_b-V_B))
         |     Prior       Current       Difference        S.E.
---------+-------------------------------------------------------------
     lny |   .6742789     .7963203        -.1220414     .0370369
---------+-------------------------------------------------------------
            b = less efficient estimates obtained previously from xtreg.
            B = fully efficient estimates obtained from xtreg.

Test: Ho: difference in coefficients not systematic

            chi2( 1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                      =    10.86
            Prob>chi2 =     0.0010

Based on the test above, the test statistic (10.86) is greater than the 5% critical value of a Chi-squared with 1 degree of freedom (3.84). Therefore, we reject the null hypothesis of no correlation between the individual effects and the regressors. Given this result, the preferred model is fixed effects.
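The statistic in (6) can be recomputed from the coefficients and standard errors reported in the two xtreg outputs. With a single regressor the matrices reduce to scalars. A minimal R sketch, with variable names of my own choosing:

```r
# Values copied from the Stata output above
b_fe <- 0.6742789; se_fe <- 0.0611307   # fixed effects (within)
b_re <- 0.7963203; se_re <- 0.0486336   # random effects (GLS)

# Equation (6) with a single regressor (k = 1)
H <- (b_re - b_fe)^2 / (se_fe^2 - se_re^2)
p_value  <- pchisq(H, df = 1, lower.tail = FALSE)
crit_5pc <- qchisq(0.95, df = 1)

print(round(c(H = H, critical = crit_5pc, p = p_value), 4))
```

Up to rounding of the inputs, H should reproduce the 10.86 reported by hausman. Note that the denominator is only guaranteed to be positive when the random effects estimator is the more efficient of the two, as it is here.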

Appendix A: Quick Session in R

The first thing to do is to download the data, save it in your preferred directory (I will save mine as C:/econ508/greene14.txt), and read it into R:

greene14<-read.table("C:/econ508/greene14.txt", header=TRUE)
greene14

Next you need to extract each variable from the data set:

year<-greene14$Year
firm<-greene14$Firm
cost<-greene14$Cost
output<-greene14$Output
d1<-greene14$D1
d2<-greene14$D2
d3<-greene14$D3
d4<-greene14$D4
d5<-greene14$D5
d6<-greene14$D6
summary(greene14)

And transform them into logs (usually you don't need to, but it will facilitate the use of panel functions later).

lnc<-log(cost)
lny<-log(output)

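Finally, you will need the vcov function. In current versions of R it is part of the stats package, which is loaded by default; older versions required the library MASS, which is called here for compatibility.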

library(MASS)
help(vcov)

Pooled OLS
pols<-lm(log(cost)~log(output))
summary(pols)
anova(pols)
bpols<-cbind(coefficients(pols))
vcov(pols)

Fixed Effects:
In order to obtain the fixed effects we need to transform the data into means and deviations from means. The function panmat.R, available at the Econ 508 webpage (Routines, panel.R), does this transformation. You can copy the function below and paste it into the R console. Note that it assumes the data are stacked by firm, as in the table above.

#Start copying here:

#This function computes matrices of means and deviations from means
#used by the panel2 function.
# Input: x = a matrix data indexed by id
# id = a factor variable indexing x
# Output: list containing: xm<-matrix of means
# xdm<-matrix of deviations from means.
"panmat"<-function(x,id)
{
x<-as.matrix(x)
id<-as.factor(id)
xm<- apply(x,2,function(y,z) tapply(y,z, mean), z=id)
xdm<- x-apply(xm, 2, function(y,z) rep(y,table(z)),z=id)
list(xm=xm, xdm=xdm)
}

#Finish copying here.

Next, you will extract the between and the within data:
lncwe<-panmat(lnc,firm)$xdm
lncbe<-panmat(lnc,firm)$xm
lnywe<-panmat(lny,firm)$xdm
lnybe<-panmat(lny,firm)$xm

#Fixed Effects (Within Estimators):
within<-lm(lncwe~lnywe-1)
summary(within)
bwe<-coefficients(within)
vwe<-vcov(within)

#Between Estimator:
between<-lm(lncbe~lnybe)
summary(between)
vbe<-vcov(between)
vbe
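One caution about vwe: lm() applied to demeaned data does not know that n firm means were removed, so it divides the residual sum of squares by nT - k instead of nT - n - k, and the resulting standard errors are too small compared with xtreg, fe. The sketch below uses simulated data (hypothetical, not the Greene data) and the base R function ave() in place of panmat to show that rescaling by the ratio of degrees of freedom reproduces the dummy-variable standard errors.

```r
set.seed(508)
n <- 6; T_periods <- 4
firm <- factor(rep(1:n, each = T_periods))
x <- rnorm(n * T_periods)
y <- 1 + 0.7 * x + rep(rnorm(n), each = T_periods) + rnorm(n * T_periods, sd = 0.2)

# Within transformation: deviations from firm means
xw <- x - ave(x, firm)
yw <- y - ave(y, firm)

within <- lm(yw ~ xw - 1)      # residual df = nT - 1 = 23
lsdv   <- lm(y ~ x + firm)     # residual df = nT - n - 1 = 17

# Rescale the within variance to the correct degrees of freedom
v_naive     <- vcov(within)[1, 1]
v_corrected <- v_naive * df.residual(within) / df.residual(lsdv)
v_lsdv      <- vcov(lsdv)["x", "x"]

print(c(naive = sqrt(v_naive), corrected = sqrt(v_corrected), lsdv = sqrt(v_lsdv)))
```

For the session above, the same adjustment would be vwe*23/17, assuming the 24 observations and single regressor used throughout this tutorial.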

Appendix B: Recovering Alphas from Fixed Effects (Least Squares Dummy Variables)

Suppose you are interested in obtaining a specific regression for firm 3. E.g., many international economists need to find a country-specific equation when they are dealing with country panels. If you are in this situation, don't worry. The fixed effects estimators already take all individual effects into account. The only mysterious thing is that the individual intercepts are not shown in the regression output. In the example above, the intercept shown in the fixed effects output is not specific to any firm. Instead, it is the average of all the firms' intercepts.

You can recover the intercept of each cross-sectional unit after using fixed effects estimators. For the example above, let's estimate the fixed effects model including a dummy variable for each firm, instead of a common intercept (some authors call this Least Squares Dummy Variables, but it is the same fixed effects estimator you saw earlier). In STATA:

regress lnc lny D1 D2 D3 D4 D5 D6, noconst

Source |       SS       df       MS                  Number of obs =      24
---------+------------------------------               F( 7,    17) = 2581.72
   Model | 280.714267     7 40.1020382               Prob > F      = 0.0000
Residual | .264061918    17 .015533054               R-squared     = 0.9991
---------+------------------------------               Adj R-squared = 0.9987
   Total | 280.978329    24 11.7074304               Root MSE      = .12463

------------------------------------------------------------------------------
     lnc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     lny |   .6742789   .0611307     11.030   0.000       .5453044    .8032534
      D1 | -2.693527   .3827874     -7.037   0.000      -3.501137   -1.885916
      D2 | -2.911731   .4395755     -6.624   0.000      -3.839154   -1.984308
      D3 | -2.439957   .5286852     -4.615   0.000      -3.555386   -1.324529
      D4 | -2.134488   .5587981     -3.820   0.001      -3.313449    -.955527
      D5 | -2.310839     .55325     -4.177   0.001      -3.478094   -1.143583
      D6 | -1.903512   .6080806     -3.130   0.006       -3.18645   -.6205737
------------------------------------------------------------------------------

The slope is obviously the same. The only change is the substitution of a common intercept for 6 dummies, each of them representing a cross-sectional unit.

Now suppose you would like to know if the difference in the firms effects is statistically significant. How to do that?

- Re-estimate the fixed effects regression above, this time including both the intercept and the dummies:

regress lnc lny D1 D2 D3 D4 D5 D6
Source |       SS       df       MS                  Number of obs =      24
---------+------------------------------               F( 6,    17) = 368.77
   Model |   34.368475     6 5.72807917               Prob > F      = 0.0000
Residual | .264061918    17 .015533054               R-squared     = 0.9924
---------+------------------------------               Adj R-squared = 0.9897
   Total | 34.6325369    23 1.50576248               Root MSE      = .12463

------------------------------------------------------------------------------
     lnc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     lny |   .6742789   .0611307     11.030   0.000       .5453044    .8032534
      D1 | (dropped)
      D2 | -.2182041   .1052027     -2.074   0.054      -.4401624    .0037542
      D3 |   .2535693   .1716665      1.477   0.158      -.1086153    .6157538
      D4 |   .5590387   .1982915      2.819   0.012       .1406801    .9773973
      D5 |   .3826881   .1933058      1.980   0.064      -.0251516    .7905277
      D6 |   .7900151   .2436915      3.242   0.005        .275871    1.304159
   _cons | -2.693527   .3827874     -7.037   0.000      -3.501137   -1.885916
------------------------------------------------------------------------------

Note that one of the dummies is dropped (due to perfect collinearity with the constant), and each remaining dummy coefficient is now the difference between its value in the previous regression and the constant. (The value of the constant in this second regression equals the coefficient of the dropped dummy in the previous regression. The dropped dummy serves as the benchmark.)
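These relationships between the three sets of intercepts can be verified with simple arithmetic on the reported coefficients. A short R sketch, where the vector name alpha is my own label for the firm intercepts:

```r
# Firm intercepts copied from the no-constant LSDV regression above
alpha <- c(D1 = -2.693527, D2 = -2.911731, D3 = -2.439957,
           D4 = -2.134488, D5 = -2.310839, D6 = -1.903512)

# The constant reported by xtreg, fe is the average firm intercept
print(round(mean(alpha), 6))

# With a constant included, D1 is the benchmark and each remaining
# coefficient is that firm's intercept minus firm 1's intercept
print(round(alpha[-1] - alpha[["D1"]], 6))
```

The first number should match the _cons of the xtreg, fe output, and the second set should match the D2-D6 coefficients of the regression with a constant, up to rounding in the last digit.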

- Obtain the R-squared from the restricted (POLS) and unrestricted (fixed effects with dummies) models:

scalar R2LSDV=_result(7)
scalar list
R2OLS = .97068641
R2LSDV = .99237532

- Perform the traditional F-test, comparing the unrestricted regression with the restricted regression:
(7) F_{(n-1, nT-n-k)} = [ (R_u² - R_p²) / (n-1) ] / [ (1 - R_u²) / (nT - n - k) ]

where the subscript "u" refers to the unrestricted regression (fixed effects with dummies), the subscript "p" to the restricted regression (POLS), n is the number of firms, T the number of periods, and k the number of regressors excluding the constant. Under the null hypothesis of a common intercept, POLS is the efficient estimator.

scalar F=((R2LSDV-R2OLS)/(6-1))/((1-R2LSDV)/(24-6-1))
scalar list F
F = 9.6715307

The result above can be compared with the critical value of F(5,17), which equals 4.34 at the 1% level. Therefore, we reject the null hypothesis of a common intercept for all firms. Note that this is the same statistic that xtreg, fe reports as the "F test that all u_i=0".
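The same calculation, including the critical value and the p-value, can be done in R. This is a minimal sketch using the two R-squared values listed above; the variable names are mine.

```r
# R-squared values copied from the scalar list above
R2_pols <- 0.97068641   # restricted: pooled OLS
R2_lsdv <- 0.99237532   # unrestricted: fixed effects with dummies
n <- 6; T_periods <- 4; k <- 1

df1 <- n - 1
df2 <- n * T_periods - n - k

# Equation (7)
F_stat   <- ((R2_lsdv - R2_pols) / df1) / ((1 - R2_lsdv) / df2)
crit_1pc <- qf(0.99, df1, df2)
p_value  <- pf(F_stat, df1, df2, lower.tail = FALSE)

print(round(c(F = F_stat, critical = crit_1pc, p = p_value), 4))
```

Up to rounding of the R-squared inputs, F_stat should reproduce the 9.67 computed in STATA.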

References:
Greene, William, 1997, Econometric Analysis, Third Edition, NJ: Prentice-Hall.
Hausman, Jerry, 1978, "Specification Tests in Econometrics," Econometrica, 46, pp.1251-1271.
Johnston, Jack, and John DiNardo, 1997, Econometric Methods, Fourth Edition, NY: McGraw-Hill.
Koenker, Roger, 2004, "Panel Data," Lecture 13, mimeo, University of Illinois at Urbana-Champaign.

Last update: November 6, 2007