Applied econometrics: Econ 508

## TA: Nicolas Bottan

Welcome to a new e-Tutorial. The first part of the tutorial is useful for the first part of problem set 3. It will help you to understand details of Granger and Newbold (1974). This time we focus on Monte Carlo Simulation, Nonlinear Regression, and Simultaneous Equations Models. I introduce these two topics in form of examples connected to Econ 508 syllabus. For the Monte Carlo, we use the Granger-Newbold experiment on spurious regression as an example. For the Nonlinear Regression, we give examples of how to correct autocorrelated errors (NLS and Cochrane-Orcutt). Next, we introduce simultaneous dynamic equations and exogeneity (Hausman) tests. I would like to remark that the theoretical background given in class is essential to proceed with the computational exercise below. Thus, I recommend you to consult Prof. Koenker’s Lectures Notes as you go through the tutorial.1

# Monte Carlo Simulation:

In the Granger-Newbold (1974) experiment, the econometrician has a bivariate model as follows:

$y_{t} = b_{0} + b_{1} x_{t} + e_{t}$ where $y_{t} = y_{t-1} + u_{t}$ $x_{t}= x_{t-1} + v_{t}$ $$u_{t}$$ and $$v_{t}$$ are identically independently distributed. Hence, each variable has a unit root, and their differences, $$y_{t}-y_{t-1}=u_{t}$$ and $$x_{t}-x_{t-1}=v_{t}$$, are pure noise. However, as you regress $$y_{t}$$ on $$x_{t}$$, you may obtain significant linear relationship based on conventional OLS and t-statistics, contrary to your theoretical expectations.

Let’s replicate the experiment. The algorithm is as follows: - generate the errors of each series as random normal variables with mean zero, variance 1, and size 100 - build the series - regress the series - plot the series - plot the fitting line and show the t-statistic of its slope

This corresponds to a single loop of the experiment, but it is already enough to show the nuisance. If you wish, you can go ahead and repeat this loop B times and check how many times you get significance.

Below I provide the routine for one loop of the experiment in R, designed by Prof. Koenker, which can also also be accessed at the Econ 508 web page, Routines.

R Code:

#this is a simple version of the Granger-Newbold spurious regression simulation
set.seed(1212)
n=100
u<-rnorm(n)
v<-rnorm(n)
y<-rep(0,n)
x<-rep(0,n)
for(i in 2:n){
x[i] <- x[i-1] + u[i]
y[i] <- y[i-1] + v[i]
}

Since this is a simulation it is very important to set.seed() in order to allow reproducible results. You can also use filter(u,1,"recursive") here instead (somewhat more efficiently). Then you can get and plot the results

par(mfrow=c(1,2))
t <- 1:n
plot(c(t,t),c(x,y),type="n")
lines(t,x,col="red")
lines(t,y,col="blue")
plot(x,y)
fit <- lm(y~x)
abline(fit)
#make a title with the t-statistic of the slope coefficient of the fit
title(paste("t=",format(round(summary(fit)$coef["x","t value"],2)))) Note that the result of this single simulation you get a very significant t-statistic even though there’s no relation between the series. Nevertheless, as many times you run as more convinced you will be that in a large proportion of the cases the t-statistics will be very significant, meaning that you have spurious regression in many cases. Instead of running many times the experiment we can let R do it for you. We set up the following experiment. We are going to repeat the experiment $$R=1000$$ times, the sample size is $$n=100$$, and we run the same simulation as above an record the t-values in the A matrix. The last step we compare with the critical value of a t at $$5\%$$ level. Then we evaluate how many of those are bigger than the critical value set.seed(1212) R <- 1000 n <- 100 A <- array(0,c(1,1000)) for(i in 1:R){ u<-rnorm(n) v<-rnorm(n) y<-rep(0,n) x<-rep(0,n) for(j in 2:n){ x[j] <- x[j-1] + u[j] y[j] <- y[j-1] + v[j] } A[1,i] <- round(summary(lm(y~x))$coef["x","t value"],2)
}

B<-as.matrix(abs(A)>abs(qt(0.05,n-1)))
apply(B,1,sum)
[1] 807
plot(density(A), main="Density of t values")

The results of our simulation illustrate that for 807 out of a $$1000$$ simulations we reject the null hypothesis, which means that we are getting a great amount of significant t-statistic even though there’s no relation between the series by construction.

# Nonlinear Regression:

We explain nonlinear regression through the data sets of the problem set 3. You can download them in ASCII format by clicking in the respective names on the data webpage: cob1.txt and cob2.txt. Save them in your preferred location. Then I suggest you to open the files in Notepad (or another text editor) and type the name of the variable “year” in the first row, first column, i.e. before the variable “w”. Use to separate the names of variables. Save both files in text format in your favorite directory.

The other solution is to read it from the webpage:

names(cob1)[names(cob1) == 'row.names']<-"year"  #Add the name "year" to the first column
p<-ts(cob1$p,start=1960,freq=4) Q<-ts(cob1$Q,start=1960,freq=4)
z<-ts(cob1$z,start=1960,freq=4) w<-ts(cob1$w,start=1960,freq=4)

## The problem:

Suppose you have the following equation:

$Q_{t}= a_{1} + a_{2} p_{t-1}+ a_{3} z_{t} + u_{t}$

and you suspect the errors are autocorrelated. The first you need to explain why $$p_{t}$$ cannot be considered exogenous. Please also explain the consequences of applying OLS estimators in the presence of autocorrelated disturbances.

## Testing for the presence of autocorrelation:

Next you need to test for the presence of autocorrelation. You already saw a large menu of options for that. Here I suggest the use of the Breusch-Godfrey test:

Background: Here you are assuming that $$e_{t}$$ is a white noise process and

$Q_{t}= a_{1} + a_{2} p_{t-1}+ a_{3} z_{t} + u_{t}$ $u_{t}= b_{1} u_{t-1} + b_{2} u_{t-2} + ...+ b_{p} z_{t-p} + e_{t}$

You wish to test whether any of the coefficients of lagged residuals is different than zero. Your null hypothesis is no autocorrelation, i.e., $$b_{1}=b_{2}=...=b_{p}=0$$. For example, to test for the presence of autocorrelation using 1 lag (AR(1)), do as follows:

*Run an OLS in your original equation:

library(dynlm)

Attaching package: 'zoo'

The following objects are masked from 'package:base':

as.Date, as.Date.numeric
S<-dynlm(Q ~ L(p,1) + z)
summary(S)

Time series regression with "ts" data:
Start = 1960(2), End = 2009(4)

Call:
dynlm(formula = Q ~ L(p, 1) + z)

Residuals:
Min     1Q Median     3Q    Max
-6.808 -1.771  0.039  1.746  5.931

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.3872     0.6114   -0.63   0.5273
L(p, 1)       0.9422     0.0441   21.36   <2e-16 ***
z             0.4860     0.1735    2.80   0.0056 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.56 on 196 degrees of freedom
Multiple R-squared:   0.7,  Adjusted R-squared:  0.697
F-statistic:  228 on 2 and 196 DF,  p-value: <2e-16

*Obtain the estimated residuals:

uhat<-resid(S)
• Regress the estimated residuals on the explanatory variables of the original model and lag of residuals:
AuxReg<-dynlm(uhat ~ L(p,1) + z + L(uhat,1))
summary(AuxReg)

Time series regression with "ts" data:
Start = 1960(3), End = 2009(4)

Call:
dynlm(formula = uhat ~ L(p, 1) + z + L(uhat, 1))

Residuals:
Min     1Q Median     3Q    Max
-4.852 -1.251  0.086  1.343  4.405

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.1107     0.4715   -2.36   0.0195 *
L(p, 1)       0.1023     0.0343    2.98   0.0032 **
z             0.0326     0.1310    0.25   0.8040
L(uhat, 1)    0.6769     0.0554   12.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.93 on 194 degrees of freedom
Multiple R-squared:  0.435, Adjusted R-squared:  0.426
F-statistic: 49.7 on 3 and 194 DF,  p-value: <2e-16

*From the auxiliary regression above, obtain the R-squared and multiply it by number of included observations:

nR2<-summary(AuxReg)$r.squared*nrow(cob1) cbind(nR2,qchisq(0.95,1)) nR2 [1,] 86.93 3.841 This nR2 may be compared with a $$\chi^{2}_{1}$$ distribution. If you need to test autocorrelation using $$p$$ lags, compare your statistic with a $$\chi^{2}_{p}$$. Hint: Be parsimonious - don’t push the data too much, and try to use at most 4 lags. However, observe that the first lag is enough to detect autocorrelation here. ## Correcting the Model: After you detect autocorrelation, you need to correct it. Let’s see how it works in our data. Suppose we have obtained the following AR(1) autocorrelation process: $Q_{t}= a_{1} + a_{2} p_{t-1}+ a_{3} z_{t} + u_{t} \; \; \;(1)$ $u_{t}= b_{1} u_{t-1} + e_{t} \; \; \; (2)$ where $$e_{t}$$ is white noise. Now generate the lag of equation (1): $Q_{t-1}= a_{1} + a_{2} p_{t-2}+ a_{3} z_{t-1} + u_{t-1} \; \; \;(3)$ and substitute (3) into the equation (2): $u_{t}= b_{1} (Q_{t-1}- a_{1} - a_{2} p_{t-2}- a_{3} z_{t-3}) + e_{t} \; \; \;(4)$ Substituting equation (4) into the equation 1 will give you an autocorrelation-corrected version of your original equation: $Q_{t}= b_{1} Q_{t-1} + a_{1} (1 - b_{1}) + a_{2} p_{t-1}+ a_{3} z_{t} - b_{1} a_{2} p_{t-2} - b_{1} a_{3} z_{t-1} + e_{t} \; \; \;(5)$ This corrected model contain multiplicative terms in the coefficients, and therefore need to be re-estimated by Non-linear Least Squares (NLS). To implement Non-linear Least Squares you can use the nls function to estimate your model NLS<-nls(Q ~ b1*lag(Q) + a1*(1-b1) + a2*lag(p) + a3*z - b1*a2*lag(p,2) -b1*a3*lag(z), start=list(b1=0.5,a1=0.5,a2=0.5,a3=0.5),trace=T) That is it. The output should give you the requested elements to calculate equation (5) - the model adjusted for autocorrelation. Similar result can be obtained through the Cochrane-Orcutt method. library(orcutt) cochrane.orcutt(dynlm(Q ~ L(p,1) + z))$Cochrane.Orcutt

Call:
lm(formula = YB ~ XB - 1)

Residuals:
Min     1Q Median     3Q    Max
-4.713 -1.245  0.026  1.314  4.255

Coefficients:
Estimate Std. Error t value Pr(>|t|)
XB(Intercept)   -1.458      0.726   -2.01    0.046 *
XBL(p, 1)        1.026      0.025   41.10   <2e-16 ***
XBz              0.590      0.291    2.03    0.044 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.91 on 195 degrees of freedom
Multiple R-squared:  0.922, Adjusted R-squared:  0.921
F-statistic:  769 on 3 and 195 DF,  p-value: <2e-16

$rho [1] 0.6691$number.interaction
[1] 5

This will give you the adjusted coefficients of each explanatory variable. Compare the Cochrane-Orcut results with the NLS results, and see how similar they are. Then compare both NLS and Cochrane-Orcutt with the naive OLS from question 1, and draw your conclusions.

# Second half: Problem set 3

For the first question you need to run the following system by OLS

Suppose you have the following equation:

$Q_{t}= a_{1} + a_{2} p_{t-1}+ a_{3} z_{t} + u_{t} \; \; \; \; \; (Supply)$ $p_{t}= b_{1} + b_{2} Q_{t}+ b_{3} w_{t} + v_{t} \; \; \; \; \; (Demand)$

require(dynlm)
S<-dynlm(Q ~ L(p,1) + z)
summary(S)

Time series regression with "ts" data:
Start = 1960(2), End = 2009(4)

Call:
dynlm(formula = Q ~ L(p, 1) + z)

Residuals:
Min     1Q Median     3Q    Max
-6.808 -1.771  0.039  1.746  5.931

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.3872     0.6114   -0.63   0.5273
L(p, 1)       0.9422     0.0441   21.36   <2e-16 ***
z             0.4860     0.1735    2.80   0.0056 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.56 on 196 degrees of freedom
Multiple R-squared:   0.7,  Adjusted R-squared:  0.697
F-statistic:  228 on 2 and 196 DF,  p-value: <2e-16
D<-dynlm(p ~ Q + w)
summary(D)

Time series regression with "ts" data:
Start = 1960(1), End = 2009(4)

Call:
dynlm(formula = p ~ Q + w)

Residuals:
Min     1Q Median     3Q    Max
-7.134 -2.123 -0.231  2.012 11.142

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.5854     0.6386   13.44   <2e-16 ***
Q            -0.4854     0.0497   -9.76   <2e-16 ***
w             4.1194     0.3521   11.70   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.04 on 197 degrees of freedom
Multiple R-squared:  0.468, Adjusted R-squared:  0.463
F-statistic: 86.6 on 2 and 197 DF,  p-value: <2e-16

and you suspect the errors are autocorrelated. The first you need to explain why $$p_{t}$$ cannot be considered exogenous. Please also explain the consequences of applying OLS estimators in the presence of autocorrelated disturbances.

For the graph, first consider the equations in the steady state. Then use the last observations values z and w. Finally plot those two equations in a single cartesian graph. Don’t forget to find the equilibrium price and quantity. Compare with your graph. Check if there’s convergence in the graph (see Prof. Koenker class notes).

Q1<-seq(1,20,1)
P_s<- (S$coeff[1]/S$coeff[2]) + (1/S$coeff[2])*Q1 -(S$coeff[3]/S$coeff[2])*z[length(z)] P_d<- D$coeff[1] + D$coeff[2]*Q1 + D$coeff[3]*w[length(w)]

plot(P_s,type='l',col='blue',xlab='Quantity',ylab='Price')
lines(P_d,col='red')
legend(2,20, legend=c('Supply','Demand'),col=c('blue','red'),lty=c(1,1),cex=0.8)

# Hausman Test

When testing make sure to explain the null and the alternative hypothesis, and describe how the test is computed. The choice of instruments is crucial: you need to select instruments that are exogenous, orthogonal to errors, but correlated with included variables. Recall from the Lecture Notes the models is

$y = Y \gamma + X_{1} \beta_{1} + X_{2} \beta_{2} + u$

where $$Y$$ are included endogenous, $$X_{1}$$ are the “included exogenous”, and $$X_{2}$$ are the “dubious exogenous”

y<-as.matrix(Q)[2:200,1]
x2<-as.matrix(p)[1:199,1] #pt-1

W1<-cbind(as.matrix(z)[2:200,1],as.matrix(w)[2:200,1],as.matrix(z)[1:199,1],as.matrix(w)[1:199,1])
W0<-cbind(W1,x2)

Then we calculate the deltas

$\delta = argmin \; \hat{u'} P_X \hat{u'}$

which is nothing but two stage least squares

tsls <-function(x, z, y, int = T){
if(int){
x <- cbind(1, x)
z <- cbind(1, z)
}
xhat <- lm(x~z-1)$fitted.values R <- lm(y ~ xhat -1) R$residuals <- c(y - x %*% R$coef) return(R) } d0<-tsls(x2,W0,y) d0 Call: lm(formula = y ~ xhat - 1) Coefficients: xhat xhatx 0.671 0.928 d1<-tsls(x2,W1,y) d1 Call: lm(formula = y ~ xhat - 1) Coefficients: xhat xhatx 0.573 0.937 Since the Hausman Test takes the form $\hat{\Delta} \Omega \hat{\Delta}$ where $$\hat{\Delta}$$ is the difference of the deltas and $$\Omega$$ is the covariance matrix: we need to find this omega. Following Lecture Note 13 one way to do it is. Vd <- vcov(d1) - vcov(d0) Vd xhat xhatx xhat 0.76515 -0.075288 xhatx -0.07529 0.007408 Then the only thing left is to calculate the test statistic. delta <- d0$coef - d1$coef Hausman <- delta %*% solve(Vd) %*% delta Hausman [,1] [1,] 0.01258 After you obtain your test statistic-a scalar-, you should compare it with a Chi-squared (1). The degrees of freedom correspond to the number of dubious variables, i.e., the number of variables included as instruments in the first equation, but not in the second equation. As always you can find a package in R that does this job for you. In this case is the systemfit package. You can use it and compare it’s results with yours. library(systemfit) Loading required package: Matrix Loading required package: car Loading required package: lmtest hausman.systemfit(d1,d0) Hausman specification test for consistency of the 3SLS estimation data: unknown Hausman = 0.0126, df = 2, p-value = 0.9937 If you want to do estimate the $$\Omega$$ in a correct way you can use the following function to calculate it var.H <- function(m,z,w, int = T){ if(int){ w <- cbind(1,w) z <- cbind(1,z) } s <- crossprod(m$resid,m$resid)/length(m$resid)
Px <- w%*%solve(t(w)%*%w)%*%t(w)
vcov <- as.numeric(s)*solve(t(z)%*% Px %*% z)
return(vcov)
}

## References

Granger, C., and P. Newbold (1974): “Spurious Regression in Econometrics,” Journal of Econometrics, 2, 111-120