Econ 508
Econometrics Group
Home | Faculty | Students | Alumni | Courses | Research | Reproducibility | Lab | Seminars | Economics | Statistics | Fame
Applied Econometrics
Econ 508 - Fall 2007

e-Tutorial 10:
Monte Carlo Simulation and Nonlinear Regression

Welcome to the tenth issue of e-Tutorial. The first part of the tutorial is useful for the first part of problem set 3. It will help you to understand details of Granger and Newbold (1974).

This time I focus on Monte Carlo Simulation and Nonlinear Regression. I introduce these two topics in form of examples connected to Econ 508 syllabus. For the Monte Carlo, I use the Granger-Newbold experiment on spurious regression as an example. For the Nonlinear Regression, I give examples of how to correct autoccorelated errors (Cochrane-Orcutt and NLS).

Monte Carlo Simulation:

In the Granger-Newbold (1974) experiment, the econometrician has a bivariate model as follows:

y(t) = b0 + b1 x(t) + e(t)
y(t) = y(t-1) + u(t)
x(t) = x(t-1) + v(t)
u(t) and v(t) are identically indepedently distributed. 

Hence, each variable has a unit root, and their differences, y(t)-y(t-1)=u(t) and x(t)-x(t-1)=v(t), are pure noise. However, as you regress y(t) on x(t), you may obtain significant linear relationship based on conventional OLS and t-statistics, contrary to your theoretical expectations.

Let's replicate the experiment. The algorithm is as follows:
- generate the errors of each series as random normal variables with mean zero, variance 1, and size 100
- build the series
- regress the series
- plot the series
- plot the fitting line and show the t-statistic of its slope

This corresponds to a single loop of the experiment, but it is already enough to show the nuisance. If you wish, you can go ahead and repeat this loop B times and check how many times you get significance. 

Below I provide the routine for one loop of the experiment in R, designed by Prof. Koenker, which can also also be accessed at the Econ 508 web page, Routines.

R Code:

#Start copying here:
#this is a simple version of the Granger-Newbold spurious regression simulation
for(i in 2:n)
x[i] <- x[i-1] + u[i] 
y[i] <- y[i-1] + v[i]
#you can also use filter(u,1,"recursive") here instead (somewhat more efficiently)
t <- 1:n
fit <- lm(y~x)
#make a title with the t-statistic of the slope coefficient of the fit
title(paste("t=",format(round(summary(fit)$coef["x","t value"],2))))
#End copying here.

An the output will be:

Because you are running regressions on random numbers, the experiments provide different results as you run each of them. Nevertheless, as many times you run as more convinced you will be that in a large proportion of the cases the t-statistics will be very significant, meaning that you have spurious regression in many cases.


Below I also provide the respective routine for one loop of the experiment using STATA, which can also also be accessed at the Econ 508 web page, Routines, under the name

. do "C:\Documents and Settings\econ472\WWW\"

. clear all

. set obs 100
obs was 0, now 100

. gen t=_n

. gen u=invnorm(uniform())

. gen y=u if t==1
(99 missing values generated)

. replace y=y[_n-1]+u if t>1
(99 real changes made)

. gen v=invnorm(uniform())

. gen x=v if t==1
(99 missing values generated)

. replace x=x[_n-1]+v if t>1
(99 real changes made)

. regress y x

  Source |       SS       df       MS                  Number of obs =     100
---------+------------------------------               F(  1,    98) =  282.35
   Model |   720.56674     1   720.56674               Prob > F      =  0.0000
Residual |  250.101289    98  2.55205397               R-squared     =  0.7423
---------+------------------------------               Adj R-squared =  0.7397
   Total |  970.668029    99  9.80472756               Root MSE      =  1.5975

       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
       x |   .4040298   .0240448     16.803   0.000       .3563136    .4517459
   _cons |   .8007492   .3042343      2.632   0.010       .1970061    1.404492

. predict fit
(option xb assumed; fitted values)

. graph y x t, c(ll) title("Fig. 1: Y and X Time Series") saving (fig1, replace
> )

. graph y fit x, c(.l) s(o.) title ("Fig. 2: Y fitted on X") saving (fig2, repl
> ace)

. graph using, saving (null, replace)

. graph using fig1 null null fig2, saving (gn, replace)

end of do-file



Nonlinear Regression:

I explain nonlinear regression thorugh the data sets of the problem set 3. You can download them in ASCII format by clicking in the respecitive names: system1.dat and system2.dat. Save them in your preferred location (I'll save mine as "C:/system1.dat" and "C:/system2.dat"). Then I suggest you to open the files in Notepad (or another text editor) and type the name of the variable "year" in the first row, first column, i.e. before the variable "w". Use <Tab> to separate the names of variables. Save both files in text format in your favorite directory  (I will save mine as "C:/system1a.txt" and "C:/system2a.txt", respectively).

The problem:

Suppose you have the following equation:

Qt= a1 + a2pt-1+ a3zt + ut

and you suspect the errors are autocorrelated. The first you need to explain why pt-1 cannot be considered exogenous. Please also explain the consequences of applying OLS estimators in the presence of autocorrelated disturbances.

Testing for the presence of autocorrelation:

Next you need to test for the presence of autocorrelation. You already saw a large menu of options for that. Here I suggest the use of the Breusch-Godfrey test:

Here you are assuming that et is a white noise process and

Qt= a1 + a2pt-1+ a3zt + ut
ut= b1ut-1 + b2ut-2 + ...+ bpzt-p + et

You wish to test whether any of the coefficients of lagged residuals is different than zero. Your null hypothesis is no autocorrelation, i.e., b1=b2=...=bp=0. For example, to test for the presence of autocorrelation using 1 lag (AR(1)), do as follows:

*Run an OLS in your original equation:
regress Q L.p z

*Obtain the estimated residuals:
predict uhat, res

* Regress the estimated residuals on the explanatory variables of the original model and lag of residuals:
regress uhat L.p z L.uhat

*From the auxilliary regression above, obtain the R-squared and multiply it by number of included observations:
scalar nR2=_result(1)*_result(7)

This nR2 may be compared with a Chi-squared(1) distribution. If you need to test autocorrelation using p lags, compare your statistic with a Chi-squared (p). Hint: Be parsimonious - don't push the data too much, and try to use at most 4 lags. However, observe that the first lag is enough to detect autocorrelation here..

Correcting the Model:

After you detect autocorrelation, you need to correct it. Let's see how it works in our data. Suppose we have obtained the following AR(1) autocorrelation process:

(1)    Qt= a1 + a2pt-1+ a3zt + ut
(2)    ut= b1ut-1 + et

where et is white noise. Now generate the lag of equation (1):

(3)   Qt-1= a1 + a2pt-2+ a3zt-1 + ut-1

and substitute (3) into the equation (2):

(4)   ut= b1(Qt-1- a1 - a2pt-2- a3zt-1) + et

Substituting equation (4) into the equation 1 will give you an autocorrelation-corrected version of your original equation:

(5)   Qt= a1 + a2pt-1+ a3zt + ut
        = a1 + a2pt-1+ a3zt + b1(Qt-1- a1 - a2pt-2- a3zt-1) + et
      = b1Qt-1 + a1(1 - b1) + a2pt-1+ a3zt - b1a2pt-2- b1a3zt-1 + et

This corrected model contain multiplicative terms in the coefficients, and therefore need to be re-estimated by Non-linear Least Squares (NLS). (Hint: Here the coefficients are non-linear. So, you need NLS. But if instead of the coefficients, the variables were non-linear, than you could have used linear models like OLS.) 

In STATA you should write the following program to run this model:

program define nlg
    if "`1'"  ==  "?"  {
        global S_1   "rho  a1  a2  a3"
        global rho =.5
        global a1  =.5
        global a2  =.5
        global a3  =.5
    replace `1' = $rho*L.Q + $a1*(1-$rho) + $a2*L.p + $a3*z - $rho*$a2*L2.p  - $rho*$a3*L.z

In the program above:
Line 1:      corresponds to the name of the program (g is the name, nl is the method)
Line 2:      corresponds to the expression to be created, in which you are going to insert your regression equation
Line 3:      corresponds to the vector of parameters to be calibrated. In the equation (5) above, rho corresponds to b1,  while a1, a2, and a3 are themselves.
Lines 4-7: corresponds to the initial value of each parameters for the calibration, The system will start with those values and them, using an iterative process, it will try to converge to a final value value of each parameter that optimizes the whole equation.
Lines 8-9: used to separate different parts within the same program.
Line 10:    corresponds to your equation (5), using the parameters created in S_1
Line 11:    finishes your program.

Ok. After you have created the program g, you need to run it. Attempt to one detail: the program does not work in the presence of missing values. As you have generated lags, you also have lost the first 2 observations. Thus, you need to restrict your optimization problem to the existing observations.

nl g Q if year>1960.25

That is it. The output should give you the requested elements to calculate equation (5) - the model adjusted for autocorrelation. Similar result can be obtained through the Cochrane-Orcutt method. In STATA, you can obtain that as follows:

prais Q  L.p  z, corc

This will give you the adjusted coefficients of each explanatory variable, plus the Durbin-Watson statistic before and after the correction. Compare the Cochrane-Orcut  results with the NLS results, and see how similar they are. Then compare both NLS and Cochrane-Orcutt with the naive OLS from question 1, and draw your conclusions.

Granger, C., and P. Newbold (1974): "Spurious Regression in Econometrics," Journal of Econometrics, 2, 111-120 

 Last update: October 2, 2007