# e-TA 10: Panel Data: Basics

Welcome to a new issue of e-Tutorial. Here we will talk about the basic fundamentals of panel data estimation techniques: from the organization of your panel data sets to the tests of fixed effects versus random effects. In the example below we will use the theoretical background of Prof. Koenker’s Lecture Note 17 to reproduce the results of Greene (1997). ^{1}

# Data

The first thing you need is to download Greene’s (1997) panel data set, called greene14.txt from the Econ 508 web site. Save it in your preferred directory. This is a small panel data set with information on costs and output of 6 different firms, in 4 different periods of time (1955, 1960,1965, and 1970). Your job is try to estimate a cost function using basic panel data techniques.

The next step is loading the Data in R. If you saved it in your hard drive you can load it by typing:

` greene14<-read.table("C:/econ508/greene14.txt", header=T, sep="")`

or you can retrieve it from the web

` greene14<-read.table("http://www.econ.uiuc.edu/~econ508/data/greene14.txt", header=T, sep="")`

Next you need to extract each variable from the data set:

```
year<-greene14$Year
firm<-greene14$Firm
cost<-greene14$Cost
output<-greene14$Output
d1<-greene14$D1
d2<-greene14$D2
d3<-greene14$D3
d4<-greene14$D4
d5<-greene14$D5
d6<-greene14$D6
```

And transform them into logs (usually you don’t need to, but it will facilitate the use of panel functions later).

```
lnc<-log(cost)
lny<-log(output)
```

Finally, you will call the library `MASS`

, to use the `vcov`

function.

` library(MASS) `

# Pooled OLS

Consider a simplified version of the equation (1) in Koenker’s Lecture 17:

\[ y_{it} = x_{it} \beta + a_{i} + u_{it} \; \; \; \;(1) \]

The most basic estimator of panel data sets are the Pooled OLS (POLS). Johnston & DiNardo (1997) recall that the POLS estimators ignore the panel structure of the data, treat observations as being serially uncorrelated for a given individual, with homoscedastic errors across individuals and time periods:

\[ \beta_{POLS} = (X'X)^{-1} X'y \; \; \; \;(2) \]

In R you estimate it doing:

```
pols<-lm(log(cost)~log(output))
summary(pols)
```

```
Call:
lm(formula = log(cost) ~ log(output))
Residuals:
Min 1Q Median 3Q Max
-0.38020 -0.10535 -0.00448 0.10237 0.55854
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.1748 0.2769 -15.08 4.42e-13 ***
log(output) 0.8880 0.0329 26.99 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2148 on 22 degrees of freedom
Multiple R-squared: 0.9707, Adjusted R-squared: 0.9694
F-statistic: 728.5 on 1 and 22 DF, p-value: < 2.2e-16
```

# Fixed Effects (Within-Groups) Estimators:

In Prof. Koenker’s Lecture 17 we examined the effects of applying the matrix P and Q to the data, where \[ P = D(D'D)^{-1} D' \] transform data into individual means and

\[ Q = I-P \]

transform data into deviation from individual means.

The within-groups (or fixed effects) estimator is then given by:

\[ \beta_{Whitin} = (X'QX)^{-1} X'Qy \; \; \; \;(3) \]

Given that Q is idempotent, this is equivalent to regressing Qy on QX, i.e., using data in the form of deviations from individuals means. In R you to obtain the fixed effects estimates we need first to transform the data into means and deviations from means. The function PQ.R, available at the Econ 508 webpage (Routines, panel.R), does such transformation.

```
"PQ" <-function(h, id){
if(is.vector(h)) h <- matrix(h, ncol = 1)
Ph <- unique(id)
Ph <- cbind(Ph, table(id))
for(i in 1:ncol(h)) Ph <- cbind(Ph, tapply(h[, i], id, mean))
is <- tapply(id, id)
Ph <- Ph[is, - (1:2)]
Qh <- h - Ph
list(Ph=as.matrix(Ph), Qh=as.matrix(Qh), is=is)
}
```

Next, you will compute the deviations from the mean :

```
lnc_we<-PQ(lnc,firm)$Qh
lny_we<-PQ(lny,firm)$Qh
```

Then the Fixed Effects (Within Estimators), will be:

```
within<-lm(lnc_we~lny_we-1)
summary(within)
```

```
Call:
lm(formula = lnc_we ~ lny_we - 1)
Residuals:
Min 1Q Median 3Q Max
-0.214606 -0.061549 -0.006332 0.068760 0.224034
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lny_we 0.67428 0.05256 12.83 5.75e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1071 on 23 degrees of freedom
Multiple R-squared: 0.8774, Adjusted R-squared: 0.8721
F-statistic: 164.6 on 1 and 23 DF, p-value: 5.754e-12
```

You can obtain the same results by using the plm package:

` library(plm)`

`Loading required package: Formula`

```
within_plm<-plm(log(cost)~log(output), data=greene14, model="within", index = c("Firm","Year"))
summary(within_plm)
```

```
Oneway (individual) effect Within Model
Call:
plm(formula = log(cost) ~ log(output), data = greene14, model = "within",
index = c("Firm", "Year"))
Balanced Panel: n=6, T=4, N=24
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-0.21500 -0.06150 -0.00633 0.06880 0.22400
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
log(output) 0.674279 0.061131 11.03 3.612e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 2.1539
Residual Sum of Squares: 0.26406
R-Squared : 0.8774
Adj. R-Squared : 0.62149
F-statistic: 121.664 on 1 and 17 DF, p-value: 3.6118e-09
```

# Between-Groups Estimators:

Another useful estimator is provided when you use only the group means, i.e., transforming your data by applying the matrix P to equation (1) above:

\[ \beta_{Between} = [X'PX]^{-1} X'Py \; \; \; \;(4) \]

In R

```
lnc_be<-PQ(lnc,firm)$Ph
lny_be<-PQ(lny,firm)$Ph
between<-lm(lnc_be~lny_be)
summary(between)
```

```
Call:
lm(formula = lnc_be ~ lny_be)
Residuals:
Min 1Q Median 3Q Max
-0.230647 -0.106968 0.003721 0.120059 0.210114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.36662 0.21245 -20.55 7.52e-16 ***
lny_be 0.91107 0.02528 36.05 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1568 on 22 degrees of freedom
Multiple R-squared: 0.9833, Adjusted R-squared: 0.9826
F-statistic: 1299 on 1 and 22 DF, p-value: < 2.2e-16
```

with the plm package:

```
between_plm<-plm(log(cost)~log(output), data=greene14, model="between", index = c("Firm","Year"))
summary(between_plm)
```

```
Oneway (individual) effect Between Model
Call:
plm(formula = log(cost) ~ log(output), data = greene14, model = "between",
index = c("Firm", "Year"))
Balanced Panel: n=6, T=4, N=24
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-0.23100 -0.09860 0.00372 0.11000 0.21000
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) -4.366618 0.498241 -8.7641 0.0009344 ***
log(output) 0.911073 0.059277 15.3697 0.0001046 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 8.1197
Residual Sum of Squares: 0.1352
R-Squared : 0.98335
Adj. R-Squared : 0.65557
F-statistic: 236.228 on 1 and 4 DF, p-value: 0.00010455
```

# Random Effects:

Following Prof. Koenker’s Lecture 17, consider \(a_{i}'s\) as random. So, the model can be estimated via GLS:

\[ \beta_{GLS} = [X' \Omega^{-1} X]^{-1}X'\Omega^{-1} y \; \; \; \; (5) \]

where \(\Omega = (\sigma_{u} ^{2}*I_{nT} + T \sigma_{\alpha} ^{2} P)\)

Estimating this with the plm package is straight forward:

```
random_plm<-plm(log(cost)~log(output), data=greene14, model="random", index = c("Firm","Year"))
summary(random_plm)
```

```
Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = log(cost) ~ log(output), data = greene14, model = "random",
index = c("Firm", "Year"))
Balanced Panel: n=6, T=4, N=24
Effects:
var std.dev share
idiosyncratic 0.01553 0.12463 0.342
individual 0.02992 0.17296 0.658
theta: 0.661
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-0.318000 -0.082400 -0.000581 0.073000 0.311000
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
(Intercept) -3.413094 0.413117 -8.2618 3.444e-08 ***
log(output) 0.796320 0.048634 16.3739 8.318e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 5.8853
Residual Sum of Squares: 0.44631
R-Squared : 0.92417
Adj. R-Squared : 0.84715
F-statistic: 268.104 on 1 and 22 DF, p-value: 8.3178e-14
```

An alternative approach would be using Nerlove’s Lemma in Lecture Notes 17 where we transform the model to obtain spherical errors. For example we transform \(y\) by

\[ \sigma_{u} \Omega^{-1/2} y = (\theta P + Q) y = y -(1- \theta) \bar{y} \]

where \(\theta=\frac{\sigma_{u}}{(\sigma_{u} ^2 + T \sigma_{\alpha} ^2)^{1/2}}\) and similarly for the other variables. In R:

```
eu <- within$resid - mean(within$resid)
eb <- between$resid
sig_u2 <- var(eu)
sig_a2 <- var(eb) - (1/4)*var(eu)
theta <- sqrt(sig_u2/(sig_u2 + 4*sig_a2))
lnc_bar<-rep(tapply(lnc,firm,mean),each=4)
lny_bar<-rep(tapply(lny,firm,mean),each=4)
lnc_G <- lnc - (1-theta)*lnc_bar
lny_G <- lny - (1-theta)*lny_bar
gls <- lm(lnc_G ~ lny_G)
summary(gls)
```

```
Call:
lm(formula = lnc_G ~ lny_G)
Residuals:
Min 1Q Median 3Q Max
-0.31871 -0.08382 -0.00044 0.07311 0.31671
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.20288 0.14315 -8.403 2.59e-08 ***
lny_G 0.79990 0.04826 16.573 6.50e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1436 on 22 degrees of freedom
Multiple R-squared: 0.9258, Adjusted R-squared: 0.9225
F-statistic: 274.7 on 1 and 22 DF, p-value: 6.496e-14
```

Note that the results do not exactly coincide with those given by the `plm`

package can you think why?

## GLS as a Combination of Within- and Between-Groups Estimators:

You can also recover GLS estimator from the combination of between and within estimators, as shown in Prof. Koenker’s Lecture 17 page 3:

\[ \beta_{GLS} = \Delta* \beta_{Between} + (1-\Delta)* \beta_{Within} \; \; \; \; (5.a) \]

where \(\Delta = V_{W} (V_{W} + V_{B})^{-1}\)

To recover the \(beta_{GLS}\) as a function of the *Between* and *Within* estimates we can take two approaches:

- The First one is a simple approach given the simple setting where we have only one independent variable. For this approach we use the
`vcov`

function of the`MASS`

package

```
Bb<-coef(summary(between_plm))[2,"Estimate"]
Vb<-vcov(between_plm)[2,2]
Bw<-coef(summary(within_plm))[,"Estimate"]
Vw<-vcov(within_plm)
beta.gls<-((1/Vb)+(1/Vw))^(-1)*(Bb/Vb+Bw/Vw)
beta.gls
```

```
log(output)
log(output) 0.7963203
```

And you compare it with the result given by the `plm`

package.

- A more general approach

The main issue when applying the HT decomposition is getting the dimensions of the variances matrices to coincide. Recall that \(\Delta = V_{W} (V_{W} + V_{B})^{-1}\), in this simple case the dimension of \(V_{W}\) is \(1x1\) and for \(V_{B}\) is \(2x2\). How to reconcile this fact? To do so we can expand \(V_{W}\) to match the dimension of \(V_{B}\) basically doing

\[ V_{W} ^{'}=\left( \begin{array}{cc} M I_{k_{1}} & 0 \\ 0 & V_{W} \end{array} \right) \]

were \(V_{W} ^{'}\) is a block diagonal matrix. \(M\) is a *bulgarian constant* a big number that basically tells that the *within* estimator has no information about the time invariant part. \(I_{k_{1}}\) is the identity matrix with the dimension of the time invariant part. In `R`

we can achive this by using the `Matrix`

package and doing:

```
library(Matrix)
Bb<-coef(summary(between_plm))[,"Estimate"]
Vb<-vcov(between_plm)
Bw<-coef(summary(within_plm))[,"Estimate"]
Vw<-vcov(within_plm)
MI<-999999
Vw_prime<-bdiag(matrix(MI),Vw)
Bw<-rbind(0,Bw)
delta<-Vw_prime%*%solve(Vw_prime+Vb)
beta.ht<-delta%*%Bb+(diag(dim(Vw_prime)[1])-delta)%*%Bw
beta.ht
```

```
2 x 1 Matrix of class "dgeMatrix"
[,1]
[1,] -3.4130937
[2,] 0.7963203
```

Again you can compare it with the results given by the `plm`

package.

# What should I use: Fixed Effects or Random Effects? A Hausman (1978) Test Approach

Hausman (1978) suggested a test to check whether the individual effects (\(a_{i}\)) are correlated with the regressors (\(X_{it}\)):

Under the Null Hypothesis: Orthogonality, i.e., no correlation between individual effects and explanatory variables. Both random effects and fixed effects estimators are consistent, but the random effects estimator is efficient, while fixed effects is not.

Under the Alternative Hypothesis: Individual effects are correlated with the X’s. In this case, random effects estimator is inconsistent, while fixed effects estimator is consistent and efficient.

Greene (1997) recalls that, under the null, the estimates should not differ systematically. Thus, the test will be based on a contrast vector H:

\[ H = [\beta_{GLS} - \beta_{W}]'[V(\beta_{W})-V(\beta_{GLS})]^{-1} [\beta_{GLS} - \beta_{W}] \sim \chi_{k} ^{2} \; \; \; \; (6) \]

where k is the number of regressors in X (excluding constant). In R

```
w1 <- coef(within_plm)-coef(random_plm)[2]
V1 <- vcov(within_plm) - vcov(random_plm)[2,2]
H1 <- (w1^2)/V1
H1
```

```
log(output)
log(output) 10.85787
```

So, based on the test above, we can see that the tests statistic 10.86 is greater than the critical value of a Chi-squared (1df, 5%) = 3.84. Therefore, we reject the null hypothesis. Given such result, the preferred model is the fixed effects.

# Appendix: Recovering Alfas from Fixed Effects (Least Squares Dummy Variables)

Suppose you are interested in to obtain a specific regression for firm 3. E.g., many international economists need to find a country-specific equation when they are dealing with country panels. If you are in this situation, don’t worry. The fixed effects estimators are already taking into account all individual effects. The only mysterious thing happening is that such individual intercepts are not being shown in the regression output.

You can recover the intercept of your cross-sectional unit after using fixed effects estimators. For the example above, let’s calculate the fixed effects model including dummy variables for each firm, instead of a common intercept (some authors call this Lest Squares Dummy Variables, but it is the same fixed effects you saw earlier). In R:

` summary(lm(lnc~lny+d1+d2+d3+d4+d5+d6-1))`

```
Call:
lm(formula = lnc ~ lny + d1 + d2 + d3 + d4 + d5 + d6 - 1)
Residuals:
Min 1Q Median 3Q Max
-0.214606 -0.061549 -0.006332 0.068760 0.224034
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lny 0.67428 0.06113 11.030 3.61e-09 ***
d1 -2.69353 0.38279 -7.037 2.00e-06 ***
d2 -2.91173 0.43958 -6.624 4.30e-06 ***
d3 -2.43996 0.52869 -4.615 0.000247 ***
d4 -2.13449 0.55880 -3.820 0.001371 **
d5 -2.31084 0.55325 -4.177 0.000632 ***
d6 -1.90351 0.60808 -3.130 0.006095 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1246 on 17 degrees of freedom
Multiple R-squared: 0.9991, Adjusted R-squared: 0.9987
F-statistic: 2582 on 7 and 17 DF, p-value: < 2.2e-16
```

The slope is obviously the same. The only change is the substitution of a common intercept for 6 dummies, each of them representing a cross-sectional unit. Now suppose you would like to know if the difference in the firms effects is statistically significant. How to do that?

- Regress the fixed effects estimators above, including the intercept and the dummies:

` summary(lm(lnc~lny+d1+d2+d3+d4+d5+d6))`

```
Call:
lm(formula = lnc ~ lny + d1 + d2 + d3 + d4 + d5 + d6)
Residuals:
Min 1Q Median 3Q Max
-0.214606 -0.061549 -0.006332 0.068760 0.224034
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.90351 0.60808 -3.130 0.006095 **
lny 0.67428 0.06113 11.030 3.61e-09 ***
d1 -0.79001 0.24369 -3.242 0.004795 **
d2 -1.00822 0.19126 -5.272 6.23e-05 ***
d3 -0.53645 0.11894 -4.510 0.000309 ***
d4 -0.23098 0.10111 -2.284 0.035474 *
d5 -0.40733 0.10396 -3.918 0.001107 **
d6 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1246 on 17 degrees of freedom
Multiple R-squared: 0.9924, Adjusted R-squared: 0.9897
F-statistic: 368.8 on 6 and 17 DF, p-value: < 2.2e-16
```

Note that one of the dummies is dropped (due to perfect collinearity of the constant), and all other dummies are represented as the difference between their original value and the constant . (The value of the constant in this second regression equals the value of the dropped dummy in the previous regression. The dropped dummy is seen as the benchmark.)

- Obtain the R-squared from restricted (POLS) and unrestricted (fixed effects with dummies) models

```
R2LSDV<-summary(lm(lnc~lny+d1+d2+d3+d4+d5+d6))$r.squared
R2OLS<-summary(pols)$r.squared
```

- Perform the traditional F-test, comparing the unrestricted regression with the restricted regression:

\[ F_{(n-1, nT-n-K)}=\frac{[ (R_{u} ^2 - R_{p} ^2) / (n-1) ]}{[ (1 - R_{u} ^2) / (nT - n - k) ]} \; \; \; \;(7) \]

where the subscript “u” refers to the unrestricted regression (fixed effects with dummies), and the subscript “p” to the restricted regression (POLS). Under the null hypothesis, POLS are more efficient.

```
F<-((R2LSDV-R2OLS)/(6-1))/((1-R2LSDV)/(24-6-1))
F
```

`[1] 9.671526`

The result above can be compared with the critical value of F(5,17), which equals 4.34 at 1% level. Therefore, we reject the null hypothesis of common intercept for all firms.

# References:

Greene, William, 1997, Econometric Analysis, Third Edition, NJ: Prentice-Hall. Hausman, Jerry, 1978, “Specification Tests in Econometrics,” Econometrica, 46, pp.1251-1271. Johnston, Jack, and John DiNardo, 1997, Econometric Methods, Fourth Edition, NY: McGraw-Hill. Koenker, Roger, 2014, “Panel Data,” Lecture 17, mimeo, University of Illinois at Urbana-Champaign.

Please send comments to bottan2@illinois.edu or srmntbr2@illinois.edu↩