
Applied Econometrics 
Econ 508 - Fall 2014

Professor: Roger Koenker 

TA: Nicolas Bottan 

Welcome to a new issue of e-Tutorial. Here we will apply the Hausman-Taylor (1981) instrumental variables approach to the phuzics data of Problem Set 4. The estimation strategy is explained in Prof. Koenker’s Lecture Note 17.¹

Data

The first thing you need to do is download the phuzics panel data set, called phuzics10.txt, from the Econ 508 web site. Save it in your preferred directory.

The next step is to load the data into R. If you saved it on your hard drive, you can load it by typing:

phuzics<-read.table("C:/econ508/phuzics10.txt", header=T, sep="")

or you can retrieve it from the web:

phuzics<-read.table("http://www.econ.uiuc.edu/~econ508/data/phuzics10.txt", header=T, sep="")

Next you need to extract each variable from the data set:

id<-phuzics$id # id  : person identifier
yr<-phuzics$yr # yr  : current year - 1900
phd<-phuzics$phd # phd : year of PhD - 1900
sex<-phuzics$sex # sex : gender (female==1)
rphd<-phuzics$rphd # rphd: rank of PhD
ru<-phuzics$ru # ru  : dummy for research university(res.pos.==1)
y<-phuzics$y # y   : page equivalent in current year
Y<-phuzics$Y # Y   : discounted cumulative page equivalent
s<-phuzics$s # s   : current annual salary
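
Before going further, it is worth taking a quick look at what was loaded; a minimal check using base R:

str(phuzics)                 # variable types and dimensions
head(phuzics)                # first few rows
length(unique(phuzics$id))   # number of phuzicists in the panel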

The first step towards the panel data estimation is to transform your data into group means and deviations from group means. To do so we use the PQ function, introduced in the previous e-TA and available at the Econ 508 webpage (Routines, panel.R):

"PQ" <-function(h, id){
  if(is.vector(h)) h <- matrix(h, ncol = 1)
  Ph <- unique(id)
  Ph <- cbind(Ph, table(id))
  for(i in 1:ncol(h)) Ph <- cbind(Ph, tapply(h[, i], id, mean))
  is <- tapply(id, id)
  Ph <- Ph[is,  - (1:2)]
  Qh <- h - Ph
  list(Ph=as.matrix(Ph), Qh=as.matrix(Qh), is=is)
}  
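
To see what PQ returns, here is a minimal sketch on made-up data (the toy values are purely illustrative):

id.toy <- c(1, 1, 1, 2, 2)   # two groups, of sizes 3 and 2
h.toy  <- c(2, 4, 6, 1, 3)
out <- PQ(h.toy, id.toy)
t(out$Ph)   # group means, repeated within group: 4 4 4 2 2
t(out$Qh)   # deviations from group means: -2 0 2 -1 1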

You should apply this function to all variables used in your estimations. For example, you will see that the PQ function is used inside the function htiv to compute the Hausman-Taylor instrumental variables estimator.

The second step is to write your own program to compute the HTIV estimator. The Econ 508 webpage (Routines, panel.R) provides a function called htiv for this. You can download the file in the same way as above.

"htiv" <-function(x, y, id, d, z = NULL){
  #input:
  #  x  design matrix partitioned as given in d
  # y  response vector
  # id strata indicator
  # d  list of column numbers indicating partitioning of x
  #     x[,d[[1]]] is x1 -- exogonous time varying vars
  #     x[,d[[2]]] is x2 -- endogonous time varying vars
  #     x[,d[[3]]] is z1 -- exogonous time invariant vars
  #     x[,d[[4]]] is z2 -- endogonous time invariant vars
  # z may contain excluded exogonous variables if there are any
  # NB. intercept is automatically included
  x <- as.matrix(cbind(x, 1))
  Tx <- PQ(x, id)
  d[[3]] <- c(d[[3]], dim(x)[2])
  Z <- cbind(z, Tx$Ph[, d[[1]]], Tx$Qh[, d[[1]]], Tx$Qh[, d[[2]]], 
    x[, d[[ 3]]])
    r <- tsls(x, Z, y, int = F)$resid
    Ti <- table(id)
    Ti.inv <- 1/table(id)
    rdot2 <- tapply(r, id, mean)^2
    v <- lm(rdot2~Ti.inv)
    v <- v$coef
    theta <- as.vector(sqrt(v[2]/(v[2] + v[1] * Ti[Tx$is])))
    x <- x - (1 - theta) * Tx$Ph
    y <- y - (1 - theta) * PQ(y, id)$Ph
    fit <- tsls(x, Z, y, int = F)
    list(fit=fit,v=v)
  }
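
A brief note on the variance-components step inside htiv: since the group-mean residuals satisfy, approximately,

\[ E[\bar{r}_i^2] \approx \sigma^2_\alpha + \sigma^2_u / T_i, \]

the regression of rdot2 on Ti.inv recovers \(\sigma^2_\alpha\) as the intercept (v[1]) and \(\sigma^2_u\) as the slope (v[2]). The line defining theta then computes the usual random-effects weight

\[ \theta_i = \sqrt{\frac{\sigma^2_u}{\sigma^2_u + T_i \sigma^2_\alpha}}, \]

which is used to partially demean x and y before the final 2SLS step.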

Some details must be explained, though:

  1. If you haven’t defined PQ and tsls yet, please do so now; otherwise htiv will not work. (A quick sanity check of tsls follows below.)
"tsls" <-function(x, z, y, int = T){
  # two stage least squares
  if(int){
         x <- cbind(1, x)
         z <- cbind(1, z)
        }
  xhat <- lm(x~z-1)$fitted.values
  R <- lm(y ~ xhat -1)
  R$residuals <- c(y - x %*% R$coef)
  return(R)
}
  2. The most important detail: the user must specify the model, create the new variables, and decide which variables will be included in the regression and/or treated as instruments.

Thus, it is essential to read Koenker’s Lecture 17 and Hausman-Taylor (1981), and to interpret PS4 and the auxiliary papers carefully, in order to understand what the function is doing and how you need to adjust it.
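
As a quick sanity check of tsls, note that when each regressor serves as its own instrument, two stage least squares reduces to OLS. A minimal sketch on made-up data:

set.seed(42)
x.chk <- rnorm(100)
y.chk <- 1 + 2 * x.chk + rnorm(100)
coef(tsls(x.chk, x.chk, y.chk))  # x instruments itself
coef(lm(y.chk ~ x.chk))          # should match the OLS coefficients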

Estimating Phuzicists’ Productivity

In Problem Set 4 you are asked to explore “the phuzical revolution”. We will use this setting to see Hausman and Taylor’s approach at work. The model suggested in the Hints of the problem set is:

\[ \log y_{it} = \sum_{s=1}^{q} \rho_s \log y_{i,t-s} + f(t, t_{0i}, t - t_{0i}, r_i) + u_{it} \]

so a working model may take the form:

\[ \log y_{it} = \beta_0 + \sum_{s=1}^{2} \rho_s \log y_{i,t-s} + \beta_1 e_{it} + \beta_2 e_{it}^2 + \beta_3 \frac{1}{e_{it} \times r_i} + \beta_4 \, gender_i + \beta_5 \, d80_i + \alpha_i + u_{it} \]

To estimate this model in R we must first arrange and “create” the variables needed. The first and most cumbersome part is creating the lagged variables. One way to do it is:

n <- length(phuzics[,1])
h <- as.matrix(phuzics)
h <- cbind(h[3:n,],h[3:n,]-h[2:(n-1),],h[2:(n-1),]-h[1:(n-2),]) #difference
h <- h[h[,10]==0,] #ignore obs whose first diff confounds people
h <- h[h[,19]==0,] #ignore obs whose second diff confounds people
h <- cbind(h[,1:9],h[,7:9]-h[,16:18],h[,7:9]-h[,16:18]-h[,25:27])
h <- cbind(h[,1:4],h[,2]-h[,3],(h[,2]-h[,3])^2,h[,5:15])
colnames(h)<- c("id","yr","phd","sex","exp","exp^2","rphd","ru", "y","Y","s","y_1","Y_1","s_1","y_2","Y_2","s_2")

To understand what this code is doing, we can analyze it line by line. First suppose you have the following data set, sorted by id and, within id, in increasing chronological order:

| id | yr  | y            |
|----|-----|--------------|
| 1  | t-4 | \(y_{1t-4}\) |
| 1  | t-3 | \(y_{1t-3}\) |
| 1  | t-2 | \(y_{1t-2}\) |
| 1  | t-1 | \(y_{1t-1}\) |
| 1  | t   | \(y_{1t}\)   |
| 2  | t-3 | \(y_{2t-3}\) |
| 2  | t-2 | \(y_{2t-2}\) |
| 2  | t-1 | \(y_{2t-1}\) |
| 2  | t   | \(y_{2t}\)   |

First we declare the data frame to be a matrix, to make the calculations easier. The next line:

h <- cbind(h[3:n,],h[3:n,]-h[2:(n-1),],h[2:(n-1),]-h[1:(n-2),])

takes differences: the middle block of columns, h[3:n,]-h[2:(n-1),], holds the first differences of every variable (including id and yr), and the last block, h[2:(n-1),]-h[1:(n-2),], holds the same differences lagged one period. Showing only the id, yr, and y components, the data now looks like:

| id | yr  | y            | \(\Delta id\) | \(\Delta yr\) | \(\Delta y\)          | \(\Delta id_{-1}\) | \(\Delta yr_{-1}\) | \(\Delta y_{-1}\)     |
|----|-----|--------------|---------------|---------------|-----------------------|--------------------|--------------------|-----------------------|
| 1  | t-2 | \(y_{1t-2}\) | 0             | 1             | \(y_{1t-2}-y_{1t-3}\) | 0                  | 1                  | \(y_{1t-3}-y_{1t-4}\) |
| 1  | t-1 | \(y_{1t-1}\) | 0             | 1             | \(y_{1t-1}-y_{1t-2}\) | 0                  | 1                  | \(y_{1t-2}-y_{1t-3}\) |
| 1  | t   | \(y_{1t}\)   | 0             | 1             | \(y_{1t}-y_{1t-1}\)   | 0                  | 1                  | \(y_{1t-1}-y_{1t-2}\) |
| 2  | t-3 | \(y_{2t-3}\) | 1             | -3            | \(y_{2t-3}-y_{1t}\)   | 0                  | 1                  | \(y_{1t}-y_{1t-1}\)   |
| 2  | t-2 | \(y_{2t-2}\) | 0             | 1             | \(y_{2t-2}-y_{2t-3}\) | 1                  | -3                 | \(y_{2t-3}-y_{1t}\)   |
| 2  | t-1 | \(y_{2t-1}\) | 0             | 1             | \(y_{2t-1}-y_{2t-2}\) | 0                  | 1                  | \(y_{2t-2}-y_{2t-3}\) |
| 2  | t   | \(y_{2t}\)   | 0             | 1             | \(y_{2t}-y_{2t-1}\)   | 0                  | 1                  | \(y_{2t-1}-y_{2t-2}\) |

Next we have to get rid of those rows where we are confounding people:

h <- h[h[,10]==0,] #ignore obs whose first diff confounds people
h <- h[h[,19]==0,] #ignore obs whose second diff confounds people

So now the data looks like this; note that we have dropped the rows where the difference in id is not zero:

| id | yr  | y            | \(\Delta id\) | \(\Delta yr\) | \(\Delta y\)          | \(\Delta id_{-1}\) | \(\Delta yr_{-1}\) | \(\Delta y_{-1}\)     |
|----|-----|--------------|---------------|---------------|-----------------------|--------------------|--------------------|-----------------------|
| 1  | t-2 | \(y_{1t-2}\) | 0             | 1             | \(y_{1t-2}-y_{1t-3}\) | 0                  | 1                  | \(y_{1t-3}-y_{1t-4}\) |
| 1  | t-1 | \(y_{1t-1}\) | 0             | 1             | \(y_{1t-1}-y_{1t-2}\) | 0                  | 1                  | \(y_{1t-2}-y_{1t-3}\) |
| 1  | t   | \(y_{1t}\)   | 0             | 1             | \(y_{1t}-y_{1t-1}\)   | 0                  | 1                  | \(y_{1t-1}-y_{1t-2}\) |
| 2  | t-1 | \(y_{2t-1}\) | 0             | 1             | \(y_{2t-1}-y_{2t-2}\) | 0                  | 1                  | \(y_{2t-2}-y_{2t-3}\) |
| 2  | t   | \(y_{2t}\)   | 0             | 1             | \(y_{2t}-y_{2t-1}\)   | 0                  | 1                  | \(y_{2t-1}-y_{2t-2}\) |

The next line of code differences once more, but now between columns, to recover the desired lags:

h <- cbind(h[,1:9],h[,7:9]-h[,16:18],h[,7:9]-h[,16:18]-h[,25:27])

the data now looks like (showing only the y component of each lag block; the same holds for Y and s):

| id | yr  | y            | \(h[,7]-h[,16]\)                          | \(h[,7]-h[,16]-h[,25]\)                                       |
|----|-----|--------------|-------------------------------------------|---------------------------------------------------------------|
| 1  | t-2 | \(y_{1t-2}\) | \(y_{1t-2}-(y_{1t-2}-y_{1t-3})=y_{1t-3}\) | \(y_{1t-2}-(y_{1t-2}-y_{1t-3})-(y_{1t-3}-y_{1t-4})=y_{1t-4}\) |
| 1  | t-1 | \(y_{1t-1}\) | \(y_{1t-2}\)                              | \(y_{1t-3}\)                                                  |
| 1  | t   | \(y_{1t}\)   | \(y_{1t-1}\)                              | \(y_{1t-2}\)                                                  |
| 2  | t-1 | \(y_{2t-1}\) | \(y_{2t-2}\)                              | \(y_{2t-3}\)                                                  |
| 2  | t   | \(y_{2t}\)   | \(y_{2t-1}\)                              | \(y_{2t-2}\)                                                  |

So we have created the lags we wanted. The last two lines are then straightforward: they construct experience (yr minus phd) and its square, and assign the column names.
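
At this point (assuming the block above ran without errors), a quick look at a few columns confirms that the lags line up:

head(h[, c("id", "yr", "y", "y_1", "y_2")])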

A second, somewhat simpler approach is to take advantage of the package DataCombine, which is essentially a wrapper around dplyr, a very powerful data-management package:

require(DataCombine)
h2<-phuzics
h2<-h2[order(id,-yr),] #have to order by group and time
h2<-slide(data=h2,Var="y", GroupVar="id", NewVar="y_1", slideBy=1) #first lag
h2<-h2[!(h2$id==209 | h2$id==212 | h2$id==378),] #get rid of the groups with one observation
h2<-slide(data=h2,Var="y", GroupVar="id", NewVar="y_2", slideBy=2)    
h2<-na.omit(h2) #get rid of NA
h2$exp<-h2$yr-h2$phd
h2$exp.2<-(h2$yr-h2$phd)^2
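
The two constructions should deliver the same sample. Assuming both blocks above have been run, one way to eyeball this is to restore h2 to increasing time order and compare a few rows against the matrix approach:

h2 <- h2[order(h2$id, h2$yr), ]               # back to increasing time order
head(h2[, c("id", "yr", "y", "y_1", "y_2")])  # compare with head(h[, ...]) above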

Next we need to create a couple more variables to be ready to run our estimation. First we take the logarithm of our dependent variable:

  y <- log(h[,"y"])

Next we form our \(X\) matrix.

  x <- h[,c(12,15,5:7,3,4)]
  x[,1:2] <- log(x[,1:2])
  x[,6] <- as.numeric(x[,6]>80) #vintage effect of phd
  x[,5] <- 1/(x[,3]*x[,5]) #interaction experience and rank
  id <- h[,1]
  colnames(x) <- c("y_1","y_2","exp","exp2","ier","phd","sex")
  head(x)
          y_1      y_2 exp exp2        ier phd sex
[1,] 2.708050 1.098612   5   25 0.10000000   1   0
[2,] 2.995732 2.708050   6   36 0.08333333   1   0
[3,] 3.218876 2.995732   7   49 0.07142857   1   0
[4,] 2.833213 3.218876   8   64 0.06250000   1   0
[5,] 2.302585 2.833213   9   81 0.05555556   1   0
[6,] 2.197225 2.302585  10  100 0.05000000   1   0

Now we are ready to apply the HTIV approach. Note that the htiv function takes 5 arguments:

  • x, the design matrix
  • y, the dependent variable
  • id, the group (individual) identifier
  • d, a list indicating how \(X\) is partitioned:
    1. the first element gives the column positions of the exogenous time-varying variables
    2. the second, the positions of the endogenous time-varying variables
    3. the third, the positions of the exogenous time-invariant variables
    4. the fourth, the positions of the endogenous time-invariant variables
  • z, a matrix that may contain excluded exogenous variables; the default is NULL
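
In our application the partition is as follows: exp and exp2 (columns 3 and 4 of x) are exogenous time-varying; the lags y_1, y_2 and the interaction ier (columns 1, 2, and 5) are endogenous time-varying; phd and sex (columns 6 and 7) are exogenous time-invariant; and there are no endogenous time-invariant variables.
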
  vl <- list(3:4,c(1:2,5),6:7,NULL)
  v <- htiv(x,y,id,vl)
  print(v$v)
(Intercept)      Ti.inv 
 0.03745804  0.13370602 
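
Following the variance decomposition sketched after the htiv function, the intercept estimates \(\sigma^2_\alpha \approx 0.037\) and the slope estimates \(\sigma^2_u \approx 0.134\).
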
  print(summary(v$fit))

Call:
lm(formula = y ~ xhat - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.49494 -0.30204 -0.01099  0.29573  1.76926 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
xhaty_1   0.5432496  0.0124239  43.726  < 2e-16 ***
xhaty_2  -0.4260759  0.0118581 -35.931  < 2e-16 ***
xhatexp   0.1549161  0.0058075  26.675  < 2e-16 ***
xhatexp2 -0.0031500  0.0001298 -24.263  < 2e-16 ***
xhatier   2.1740321  0.4390081   4.952 7.56e-07 ***
xhatphd   0.0096239  0.0311961   0.308    0.758    
xhatsex  -0.0134985  0.0403353  -0.335    0.738    
xhat      1.2470580  0.0734263  16.984  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4425 on 5440 degrees of freedom
Multiple R-squared:  0.9158,    Adjusted R-squared:  0.9157 
F-statistic:  7401 on 8 and 5440 DF,  p-value: < 2.2e-16

References:

Hausman, Jerry, 1978, “Specification Tests in Econometrics,” Econometrica, 46, pp. 1251-1271.

Hausman, Jerry, and William Taylor, 1981, “Panel Data and Unobservable Individual Effects,” Econometrica, 49, No. 6, pp. 1377-1398.

Koenker, Roger, 2014, “Panel Data,” Lecture 17, mimeo, University of Illinois at Urbana-Champaign.


  1. Please send comments to bottan2@illinois.edu or srmntbr2@illinois.edu