Applied Econometrics
Econ 508 - Fall 2007

e-Tutorial 12: Panel Data I - Basics

Welcome to the twelfth issue of e-Tutorial. Here I will cover the fundamentals of panel data estimation, from the organization of your panel data set to tests of fixed effects versus random effects. In the example below I will use the theoretical background of Prof. Koenker's Lecture Note 13 (2004) to reproduce the results of Greene (1997). I insert STATA estimation commands (plus some comments) whenever necessary. I also provide a short introduction to panel data in R. Have fun!

Example:
Greene (1997) provides a small panel data set with information on costs and output of 6 different firms in 4 different periods of time (1955, 1960, 1965, and 1970). Your job is to estimate a cost function using basic panel data techniques.

Stacking your data:
The data are shown below in stacked form, i.e., the first "T" lines (here T=4) refer to firm 1, the second "T" lines refer to firm 2, and so on. The columns are self-explanatory. To facilitate your work, I included a firm-specific dummy variable for each firm, represented by columns D1-D6. The data set is described below and available here in ASCII format for download.

Year    Firm      Cost   Output     D1     D2     D3     D4     D5     D6
1955       1     3.154      214      1      0      0      0      0      0
1960       1     4.271      419      1      0      0      0      0      0
1965       1     4.584      588      1      0      0      0      0      0
1970       1     5.849     1025      1      0      0      0      0      0
1955       2     3.859      696      0      1      0      0      0      0
1960       2     5.535      811      0      1      0      0      0      0
1965       2     8.127     1640      0      1      0      0      0      0
1970       2    10.966     2506      0      1      0      0      0      0
(...)  (...)     (...)    (...)  (...)  (...)  (...)  (...)  (...)  (...)
1955       6    73.050    11796      0      0      0      0      0      1
1960       6    98.846    15551      0      0      0      0      0      1
1965       6   138.880    27218      0      0      0      0      0      1
1970       6   191.560    30958      0      0      0      0      0      1

Save the data in your preferred path (I will save mine as "C:/econ508/greene14.txt") and open your preferred software.

In R:
Appendix A contains a panel data session in R with the main results derived in this tutorial.

In STATA:
The first step is to read your data into the software:

infile Year Firm Cost Output D1 D2 D3 D4 D5 D6 using "C:/econ508/greene14.txt"

Drop the first line of observations, which contains only missing values (the first line of the text file holds the variable labels, which cannot be read as numbers).

The next step is to generate the log values of costs and outputs:
gen lnc=log(Cost)
gen lny=log(Output)

Finally, you declare your data set as a panel:
iis Firm
tis Year
where iis identifies the cross-sectional unit and tis identifies the time period. (In Stata 10 and later, xtset Firm Year makes both declarations in a single command.)

Theoretical Background:

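Consider a simplified version of equation (1) in Koenker's Lecture 13:

(1)    $y_{it} = x_{it} b + a_i + u_{it}$

where i = 1,...,n indexes firms, t = 1,...,T indexes time periods, $a_i$ is the individual (firm-specific) effect, and $u_{it}$ is the idiosyncratic error.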

a) Pooled OLS:

The most basic estimator for panel data sets is Pooled OLS (POLS). Johnston & DiNardo (1997) recall that the POLS estimator ignores the panel structure of the data: it treats observations as serially uncorrelated for a given individual, with homoscedastic errors across individuals and time periods:

(2)    $b_{POLS} = (X'X)^{-1}X'y$

In STATA, you can obtain the POLS as follows:
regress lnc lny

  Source |       SS       df       MS                  Number of obs =      24
---------+------------------------------               F(  1,    22) =  728.51
   Model |   33.617333     1   33.617333               Prob > F      =  0.0000
Residual |  1.01520396    22  .046145635               R-squared     =  0.9707
---------+------------------------------               Adj R-squared =  0.9694
   Total |  34.6325369    23  1.50576248               Root MSE      =  .21482

------------------------------------------------------------------------------
     lnc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     lny |   .8879868   .0328996     26.991   0.000       .8197573    .9562164
   _cons |  -4.174783   .2768684    -15.079   0.000      -4.748973   -3.600593
------------------------------------------------------------------------------

scalar R2OLS=_result(7)
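
If you want to see formula (2) at work outside STATA, here is a minimal R sketch. It is only an illustration under stated assumptions: it uses just the 12 observations printed in the table above (firms 1, 2 and 6; the rows for firms 3-5 are elided there), so the numbers will not match the 24-observation STATA output. The object names (greene_sub, b_pols, fit) are mine.

```r
# Subset of the Greene (1997) data: only the rows printed in this tutorial
# (firms 1, 2 and 6). Firms 3-5 are elided, so results differ from STATA's.
greene_sub <- data.frame(
  Firm   = rep(c(1, 2, 6), each = 4),
  Cost   = c(3.154, 4.271, 4.584, 5.849,
             3.859, 5.535, 8.127, 10.966,
             73.050, 98.846, 138.880, 191.560),
  Output = c(214, 419, 588, 1025,
             696, 811, 1640, 2506,
             11796, 15551, 27218, 30958)
)
lnc <- log(greene_sub$Cost)
lny <- log(greene_sub$Output)

# Formula (2): b = (X'X)^{-1} X'y, with a column of ones for the constant
X <- cbind(const = 1, lny = lny)
b_pols <- solve(t(X) %*% X, t(X) %*% lnc)

# Same estimator through lm(), which is what you would use in practice
fit <- lm(lnc ~ lny)

print(cbind(formula = as.numeric(b_pols), lm = as.numeric(coef(fit))))
```

The two columns should agree up to rounding error, which is all that formula (2) claims.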
 

b) Fixed Effects (Within-Groups) Estimators:

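In Koenker's Lecture 13 we examined the effects of applying the matrices P and Q to the data, where D is the matrix of individual dummy variables (columns D1-D6 in our example):
$P = D(D'D)^{-1}D'$ : transforms the data into individual means
$Q = I - P$ : transforms the data into deviations from individual means.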

The within-groups (or fixed effects) estimator is then given by:

(3)  $b_W = (X'QX)^{-1}X'Qy$

Given that Q is idempotent, this is equivalent to regressing Qy on QX, i.e., using data in the form of deviations from individual means. In STATA, you can obtain the within-groups estimators using the built-in command xtreg, fe:

xtreg  lnc lny, fe
Fixed-effects (within) regression               Number of obs      =        24
Group variable (i) : Firm                       Number of groups   =         6

R-sq:  within  = 0.8774                         Obs per group: min =         4
       between = 0.9833                                        avg =       4.0
       overall = 0.9707                                        max =         4

                                                F(1,17)            =    121.66
corr(u_i, Xb)  = 0.8495                         Prob > F           =    0.0000

------------------------------------------------------------------------------
     lnc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     lny |   .6742789   .0611307     11.030   0.000       .5453044    .8032534
   _cons |  -2.399009    .508593     -4.717   0.000      -3.472046   -1.325972
------------------------------------------------------------------------------
 sigma_u |  .36730483
 sigma_e |  .12463167
     rho |  .89675322   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(5,17) =     9.67                Prob > F = 0.0002

matrix bW=get(_b)
matrix VW=get(VCE)

Note: The intercept shown above is an average of the individual intercepts. If you are interested in obtaining firm-specific intercepts, go to Appendix B.
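
To make the role of P and Q concrete, here is a small R sketch that builds both matrices and checks formula (3) against a regression with firm dummies. It is a hedged illustration, not part of the original session: it uses only the 12 observations printed in the data table (firms 1, 2 and 6), so the slope will not equal the .6742789 reported by STATA for all 24 observations, and the object names are mine.

```r
# Subset of the Greene (1997) data: only the rows printed in this tutorial
# (firms 1, 2 and 6). Firms 3-5 are elided, so results differ from STATA's.
greene_sub <- data.frame(
  Firm   = rep(c(1, 2, 6), each = 4),
  Cost   = c(3.154, 4.271, 4.584, 5.849,
             3.859, 5.535, 8.127, 10.966,
             73.050, 98.846, 138.880, 191.560),
  Output = c(214, 419, 588, 1025,
             696, 811, 1640, 2506,
             11796, 15551, 27218, 30958)
)
lnc  <- log(greene_sub$Cost)
lny  <- log(greene_sub$Output)
firm <- factor(greene_sub$Firm)

D <- model.matrix(~ firm - 1)            # nT x n matrix of firm dummies
P <- D %*% solve(t(D) %*% D) %*% t(D)    # P y = firm means of y
Q <- diag(nrow(D)) - P                   # Q y = deviations from firm means

# Q is idempotent: Q Q - Q should be zero up to rounding error
q_gap <- max(abs(Q %*% Q - Q))

# Within estimator, formula (3), with lny as the only regressor
x   <- matrix(lny, ncol = 1)
b_w <- as.numeric(solve(t(x) %*% Q %*% x, t(x) %*% Q %*% lnc))

# Same slope from OLS with one dummy per firm and no common constant
lsdv <- lm(lnc ~ lny + firm - 1)

print(c(q_gap = q_gap, within = b_w, lsdv = coef(lsdv)[["lny"]]))
```

Building the full nT x nT matrices is fine for a toy panel like this one; for larger panels you would demean by group instead of forming Q explicitly.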
 

Between-Groups Estimators:
Another useful estimator is obtained when you use only the group means, i.e., when you transform your data by applying the matrix P to equation (1) above:

(4)  $b_B = [X'PX]^{-1}X'Py$
 

In STATA, you can obtain the between-groups estimators using the built-in function xtreg, be:

xtreg  lnc lny, be
Between regression (regression on group means)  Number of obs      =        24
Group variable (i) : Firm                       Number of groups   =         6

R-sq:  within  = 0.8774                         Obs per group: min =         4
       between = 0.9833                                        avg =       4.0
       overall = 0.9707                                        max =         4

                                                F(1,4)             =    236.23
sd(u_i + avg(e_i.))=  .1838474                  Prob > F           =    0.0001

------------------------------------------------------------------------------
     lnc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     lny |   .9110734   .0592772     15.370   0.000       .7464935    1.075653
   _cons |  -4.366618   .4982409     -8.764   0.001      -5.749957   -2.983279
------------------------------------------------------------------------------

matrix bB=get(_b)
matrix VB=get(VCE)
 

c) Random Effects:

Following Koenker's Lecture 13, consider the $a_i$'s as random. The model is then estimated via GLS:

(5) $b_{GLS} = [X'\Omega^{-1}X]^{-1}X'\Omega^{-1}y$

where $\Omega = \sigma_u^2 I_{nT} + T\sigma_a^2 P$

You can obtain the GLS estimators in STATA by using the built-in command xtreg, re:

xtreg  lnc lny, re
Random-effects GLS regression                   Number of obs      =        24
Group variable (i) : Firm                       Number of groups   =         6

R-sq:  within  = 0.8774                         Obs per group: min =         4
       between = 0.9833                                        avg =       4.0
       overall = 0.9707                                        max =         4

Random effects u_i ~ Gaussian                   Wald chi2(1)       =    268.10
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     lnc |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     lny |   .7963203   .0486336     16.374   0.000       .7010002    .8916404
   _cons |  -3.413094   .4131166     -8.262   0.000      -4.222788     -2.6034
---------+--------------------------------------------------------------------
 sigma_u |  .17296414
 sigma_e |  .12463167
     rho |  .65823599   (fraction of variance due to u_i)
------------------------------------------------------------------------------
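
A word on notation may help when you map this output to the Omega matrix of the random effects model. STATA's sigma_u is the standard deviation of the individual effect (the a_i of equation (1)), while STATA's sigma_e is the standard deviation of the idiosyncratic error (the u_it of equation (1)). The R sketch below, which only uses the two numbers printed in the output above, recomputes rho and builds the 4 x 4 block of Omega for a single firm. The variable names are mine.

```r
# Values copied from the xtreg, re output above
sigma_ind  <- 0.17296414   # STATA's sigma_u: individual effect a_i
sigma_idio <- 0.12463167   # STATA's sigma_e: idiosyncratic error u_it
T_periods  <- 4

# Fraction of the total variance due to the individual effect
rho <- sigma_ind^2 / (sigma_ind^2 + sigma_idio^2)

# Omega for one firm: idiosyncratic variance on the identity,
# plus T times the individual variance on the averaging matrix P
I_T     <- diag(T_periods)
P_T     <- matrix(1 / T_periods, T_periods, T_periods)
Omega_T <- sigma_idio^2 * I_T + T_periods * sigma_ind^2 * P_T

print(rho)
print(Omega_T)
```

The diagonal of Omega_T is the total variance and every off-diagonal entry is the variance of the individual effect, which is exactly the equicorrelated structure that random effects assumes within a firm.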
 

GLS as a Combination of Within- and Between-Groups Estimators:
You can recover the GLS estimator from the combination of the between and within estimators, as shown in Koenker's Lecture 13:

(5.a)    $b_{GLS} = \Delta b_B + (1-\Delta) b_W$

where       $\Delta = V_W (V_W + V_B)^{-1}$

In STATA, you can recover random effects GLS estimators as follows:

matrix V=VW+VB
matrix Vinv=syminv(V)
matrix D=VW*Vinv
matrix P1=D*bB'
matrix I2=I(2)
matrix RD=I2-D
matrix P2=RD*bW'
matrix bRE=P1+P2
matrix list bRE
bRE[2,1]
              y1
  lny  .79632032
_cons  -3.413094
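
For the slope alone, the same combination can be checked by hand. The R sketch below is a scalar analogue of the matrix computation, under the assumption that it is enough to plug in the slope estimates and standard errors printed in the xtreg, fe and xtreg, be outputs above; the variable names are mine.

```r
# Slope estimates and standard errors copied from the STATA outputs above
b_w <- 0.6742789; se_w <- 0.0611307   # within (xtreg, fe)
b_b <- 0.9110734; se_b <- 0.0592772   # between (xtreg, be)

# Scalar version of Delta = VW / (VW + VB): the noisier the within
# estimator, the more weight goes to the between estimator
delta <- se_w^2 / (se_w^2 + se_b^2)

# Weighted average of the between and within slopes
b_gls <- delta * b_b + (1 - delta) * b_w

print(c(delta = delta, b_gls = b_gls))
```

The result should be very close to the random effects slope of .7963203 reported by xtreg, re.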
 

What should I use: Fixed Effects or Random Effects? A Hausman (1978) Test Approach

Hausman (1978) suggested a test to check whether the individual effects ($a_i$) are correlated with the regressors ($x_{it}$):

- Under the Null Hypothesis: Orthogonality, i.e., no correlation between the individual effects and the explanatory variables. Both the random effects and the fixed effects estimators are consistent, but the random effects estimator is efficient, while the fixed effects estimator is not.

- Under the Alternative Hypothesis: The individual effects are correlated with the X's. In this case, the random effects estimator is inconsistent, while the fixed effects estimator is consistent and efficient.

Greene (1997) recalls that, under the null, the two sets of estimates should not differ systematically. Thus, the test is based on the contrast between the two estimators, through the statistic H:

(6)    $H = [b_{GLS} - b_W]'[V(b_W) - V(b_{GLS})]^{-1}[b_{GLS} - b_W] \sim \chi^2(k)$

where k is the number of regressors in X (excluding the constant). In STATA, you can obtain the test as follows:

xtreg  lnc lny, fe
hausman, save
xtreg  lnc lny, re
hausman

            ---- Coefficients ----
         |      (b)          (B)            (b-B)   sqrt(diag(V_b-V_B))
         |     Prior       Current       Difference        S.E.
---------+-------------------------------------------------------------
     lny |   .6742789     .7963203        -.1220414     .0370369
---------+-------------------------------------------------------------
            b = less efficient estimates obtained previously from xtreg.
            B = fully efficient estimates obtained from xtreg.

Test:  Ho:  difference in coefficients not systematic

            chi2(  1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                      =    10.86
            Prob>chi2 =     0.0010

So, based on the test above, we can see that the test statistic (10.86) is greater than the critical value of a Chi-squared (1df, 5%) = 3.84. Therefore, we reject the null hypothesis. Given this result, the preferred model is the fixed effects model.
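
With a single regressor the Hausman statistic reduces to simple arithmetic, so you can reproduce it from the printed estimates. The R sketch below assumes that the rounded slopes and standard errors from the xtreg, fe and xtreg, re outputs are accurate enough for this purpose; the variable names are mine.

```r
# Slope estimates and standard errors copied from the STATA outputs above
b_fe <- 0.6742789; se_fe <- 0.0611307   # fixed effects (within)
b_re <- 0.7963203; se_re <- 0.0486336   # random effects (GLS)

# Scalar Hausman statistic: squared contrast over the variance difference
H <- (b_fe - b_re)^2 / (se_fe^2 - se_re^2)

# Chi-squared(1) p-value and 5% critical value
p_value <- pchisq(H, df = 1, lower.tail = FALSE)
crit_5  <- qchisq(0.95, df = 1)

print(c(H = H, p_value = p_value, crit_5 = crit_5))
```

Up to rounding, H and the p-value should agree with the 10.86 and 0.0010 that STATA prints.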
 

Appendix A: Quick Session in R

The first thing to do is to download the data, save it in your preferred directory (I will save mine as C:/econ508/greene14.txt), and read the data into R:

greene14<-read.table("C:/econ508/greene14.txt", header=T)
greene14

Next you need to extract each variable from the data set:

year<-greene14$Year
firm<-greene14$Firm
cost<-greene14$Cost
output<-greene14$Output
d1<-greene14$D1
d2<-greene14$D2
d3<-greene14$D3
d4<-greene14$D4
d5<-greene14$D5
d6<-greene14$D6
summary(greene14)

Then transform cost and output into logs (usually you don't need to do this in advance, but it will facilitate the use of the panel functions later).

lnc<-log(cost)
lny<-log(output)

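Finally, you will call the library MASS, to use the vcov function. (In current versions of R, vcov is part of the stats package that is loaded by default, so loading MASS is harmless but no longer necessary.)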

library(MASS)
help(vcov)

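Pooled OLS:
pols<-lm(log(cost)~log(output))
summary(pols)
anova(pols)
#Coefficients as a column vector:
bpols<-cbind(coefficients(pols))
#Covariance matrix of the coefficients (use the generic vcov;
#the method vcov.lm cannot be called directly in current versions of R):
vcov(pols)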

Fixed Effects:
In order to obtain the fixed effects we need to transform the data into means and deviations from means. The function panmat, available at the Econ 508 webpage (Routines, panel.R), does this transformation. You can copy the function below and paste it into the R console.

#Start copying here:

#This function computes matrices of means and deviations from means
#used by the panel2 function.
# Input: x = a matrix data indexed by id
#       id = a factor variable indexing x
# Output: list containing: xm<-matrix of means
#                          xdm<-matrix of deviations from means.
"panmat"<-function(x,id)
  {
  x<-as.matrix(x)
  id<-as.factor(id)
  xm<- apply(x,2,function(y,z) tapply(y,z, mean), z=id)
  xdm<- x-apply(xm, 2, function(y,z) rep(y,table(z)),z=id)
  list(xm=xm, xdm=xdm)
  }

#Finish copying here.

Next, you will extract the between and the within data:
lncwe<-panmat(lnc,firm)$xdm
lncbe<-panmat(lnc,firm)$xm
lnywe<-panmat(lny,firm)$xdm
lnybe<-panmat(lny,firm)$xm

#Fixed Effects (Within Estimators):
within<-lm(lncwe~lnywe-1)
summary(within)
bwe<-coefficients(within)
vwe<-vcov(within)

#Between Estimator:
between<-lm(lncbe~lnybe)
summary(between)
vbe<-vcov(between)
vbe
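
One detail deserves care if you plan to use vwe: lm() does not know that the firm means were estimated, so it divides the residual sum of squares by nT-1 instead of nT-n-1. The sketch below shows the degrees-of-freedom correction and checks it against a regression with firm dummies. It is an illustration under stated assumptions: it uses only the 12 observations printed in the data table (firms 1, 2 and 6), it demeans with the base function ave() instead of panmat, and the object names are mine.

```r
# Subset of the Greene (1997) data: only the rows printed in this tutorial
# (firms 1, 2 and 6). Firms 3-5 are elided, so results differ from STATA's.
greene_sub <- data.frame(
  Firm   = rep(c(1, 2, 6), each = 4),
  Cost   = c(3.154, 4.271, 4.584, 5.849,
             3.859, 5.535, 8.127, 10.966,
             73.050, 98.846, 138.880, 191.560),
  Output = c(214, 419, 588, 1025,
             696, 811, 1640, 2506,
             11796, 15551, 27218, 30958)
)
lnc  <- log(greene_sub$Cost)
lny  <- log(greene_sub$Output)
firm <- factor(greene_sub$Firm)

# Deviations from firm means (what panmat(...)$xdm returns)
lnc_w <- lnc - ave(lnc, firm)
lny_w <- lny - ave(lny, firm)
within_fit <- lm(lnc_w ~ lny_w - 1)

# Degrees-of-freedom correction: nT - k becomes nT - n - k
nT <- length(lnc); n <- nlevels(firm); k <- 1
v_naive     <- vcov(within_fit)[1, 1]
v_corrected <- v_naive * (nT - k) / (nT - n - k)

# Benchmark: slope variance from OLS with one dummy per firm
lsdv   <- lm(lnc ~ lny + firm - 1)
v_lsdv <- vcov(lsdv)["lny", "lny"]

print(c(naive = v_naive, corrected = v_corrected, lsdv = v_lsdv))
```

The slope itself is unaffected; only its estimated variance changes, which matters as soon as you combine vwe with vbe or run a Hausman test.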
 

Appendix B: Recovering Alphas from Fixed Effects (Least Squares Dummy Variables)

Suppose you are interested in obtaining a specific regression for firm 3. For example, many international economists need a country-specific equation when they are dealing with country panels. If you are in this situation, don't worry. The fixed effects estimators already take all individual effects into account. The only mysterious thing is that the individual intercepts are not shown in the regression output. In the example above, the intercept shown in the fixed effects output is not specific to any firm. Instead, it is an average of all the firms' intercepts.

You can recover the intercept of your cross-sectional unit after using fixed effects estimators. For the example above, let's calculate the fixed effects model including a dummy variable for each firm, instead of a common intercept (some authors call this Least Squares Dummy Variables, but it is the same fixed effects model you saw earlier). In STATA:

regress lnc lny D1 D2 D3 D4 D5 D6, noconst

  Source |       SS       df       MS                  Number of obs =      24
---------+------------------------------               F(  7,    17) = 2581.72
   Model |  280.714267     7  40.1020382               Prob > F      =  0.0000
Residual |  .264061918    17  .015533054               R-squared     =  0.9991
---------+------------------------------               Adj R-squared =  0.9987
   Total |  280.978329    24  11.7074304               Root MSE      =  .12463

------------------------------------------------------------------------------
     lnc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     lny |   .6742789   .0611307     11.030   0.000       .5453044    .8032534
      D1 |  -2.693527   .3827874     -7.037   0.000      -3.501137   -1.885916
      D2 |  -2.911731   .4395755     -6.624   0.000      -3.839154   -1.984308
      D3 |  -2.439957   .5286852     -4.615   0.000      -3.555386   -1.324529
      D4 |  -2.134488   .5587981     -3.820   0.001      -3.313449    -.955527
      D5 |  -2.310839     .55325     -4.177   0.001      -3.478094   -1.143583
      D6 |  -1.903512   .6080806     -3.130   0.006       -3.18645   -.6205737
------------------------------------------------------------------------------

The slope is obviously the same. The only change is the replacement of the common intercept by 6 dummies, each of them representing a cross-sectional unit.
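
You can check the claim that the xtreg, fe intercept is an average of the firm intercepts with one line of arithmetic. The R sketch below simply averages the six dummy coefficients printed above; the vector name is mine.

```r
# Firm-specific intercepts copied from the regression with D1-D6 above
firm_intercepts <- c(D1 = -2.693527, D2 = -2.911731, D3 = -2.439957,
                     D4 = -2.134488, D5 = -2.310839, D6 = -1.903512)

# Their simple average should reproduce the _cons of xtreg, fe
avg_intercept <- mean(firm_intercepts)
print(avg_intercept)
```

The average matches the _cons of -2.399009 reported in the fixed effects output, up to rounding.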

Now suppose you would like to know whether the differences in the firm effects are statistically significant. How do you do that?

- Re-estimate the fixed effects model above, including both the intercept and the dummies:

regress lnc lny D1 D2 D3 D4 D5 D6
  Source |       SS       df       MS                  Number of obs =      24
---------+------------------------------               F(  6,    17) =  368.77
   Model |   34.368475     6  5.72807917               Prob > F      =  0.0000
Residual |  .264061918    17  .015533054               R-squared     =  0.9924
---------+------------------------------               Adj R-squared =  0.9897
   Total |  34.6325369    23  1.50576248               Root MSE      =  .12463

------------------------------------------------------------------------------
     lnc |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     lny |   .6742789   .0611307     11.030   0.000       .5453044    .8032534
      D1 |  (dropped)
      D2 |  -.2182041   .1052027     -2.074   0.054      -.4401624    .0037542
      D3 |   .2535693   .1716665      1.477   0.158      -.1086153    .6157538
      D4 |   .5590387   .1982915      2.819   0.012       .1406801    .9773973
      D5 |   .3826881   .1933058      1.980   0.064      -.0251516    .7905277
      D6 |   .7900151   .2436915      3.242   0.005        .275871    1.304159
   _cons |  -2.693527   .3827874     -7.037   0.000      -3.501137   -1.885916
------------------------------------------------------------------------------

Note that one of the dummies is dropped (due to perfect collinearity with the constant), and the coefficient of each remaining dummy is now the difference between its value in the previous regression and the constant. (The value of the constant in this second regression equals the coefficient of the dropped dummy in the previous regression. The dropped dummy is the benchmark.)
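
The re-parameterization is easy to verify from the two printed outputs. The R sketch below subtracts the benchmark intercept (firm 1) from the other firm intercepts of the first regression and compares the result with the dummy coefficients of the second regression; the vector names are mine.

```r
# Intercepts from the regression without a constant (first output)
no_const <- c(D1 = -2.693527, D2 = -2.911731, D3 = -2.439957,
              D4 = -2.134488, D5 = -2.310839, D6 = -1.903512)

# Dummy coefficients from the regression with a constant (second output)
with_const <- c(D2 = -0.2182041, D3 = 0.2535693, D4 = 0.5590387,
                D5 = 0.3826881, D6 = 0.7900151)

# Each remaining dummy should equal its old intercept minus firm 1's
implied <- no_const[-1] - no_const[["D1"]]
max_gap <- max(abs(implied - with_const))

print(rbind(implied, with_const))
print(max_gap)
```

The largest gap is of the order of the rounding in the printed coefficients.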

- Obtain the R-squared from the restricted (POLS) and unrestricted (fixed effects with dummies) models:

scalar R2LSDV=_result(7)
scalar list
  R2OLS  =  .97068641
    R2LSDV =  .99237532

- Perform the traditional F-test, comparing the unrestricted regression with the restricted regression:
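(7)    $F(n-1,\ nT-n-k) = \dfrac{(R_u^2 - R_p^2)/(n-1)}{(1 - R_u^2)/(nT-n-k)}$

where the subscript "u" refers to the unrestricted regression (fixed effects with dummies), the subscript "p" refers to the restricted regression (POLS), n is the number of firms, T is the number of periods, and k is the number of regressors excluding the constant. Under the null hypothesis of a common intercept, POLS is the more efficient estimator.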

scalar F=((R2LSDV-R2OLS)/(6-1))/((1-R2LSDV)/(24-6-1))
scalar list F
         F =  9.6715307

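The result above can be compared with the critical value of F(5,17), which equals 4.34 at the 1% level. Since 9.67 exceeds 4.34, we reject the null hypothesis of a common intercept for all firms. (This is the same statistic that xtreg, fe reports in the last line of its output, F(5,17) = 9.67.)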
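
If you prefer to reproduce the test in R, the sketch below recomputes the F statistic from the two R-squared values listed above and looks up the 1% critical value instead of reading it from a table. The variable names are mine, and the R-squared values are the rounded ones printed by scalar list.

```r
# R-squared values copied from the scalar list output above
r2_u <- 0.99237532   # unrestricted: fixed effects with dummies
r2_p <- 0.97068641   # restricted: pooled OLS

n <- 6           # firms
T_periods <- 4   # periods
k <- 1           # regressors excluding the constant

df1 <- n - 1
df2 <- n * T_periods - n - k

# F statistic for the null of a common intercept
F_stat <- ((r2_u - r2_p) / df1) / ((1 - r2_u) / df2)

# 1% critical value and p-value of F(5, 17)
crit_1  <- qf(0.99, df1, df2)
p_value <- pf(F_stat, df1, df2, lower.tail = FALSE)

print(c(F = F_stat, crit_1 = crit_1, p_value = p_value))
```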

References:
Greene, William, 1997, Econometric Analysis, Third Edition, NJ: Prentice-Hall.
Hausman, Jerry, 1978, "Specification Tests in Econometrics," Econometrica, 46, pp.1251-1271.
Johnston, Jack, and John DiNardo, 1997, Econometric Methods, Fourth Edition, NY: McGraw-Hill.
Koenker, Roger, 2004, "Panel Data," Lecture 13, mimeo, University of Illinois at Urbana-Champaign.
 

 Last update: November 6, 2007