
Applied Econometrics 
Econ 508 - Fall 2014

Professor: Roger Koenker 

TA: Nicolas Bottan 

Welcome to the seventh tutorial of Econ 508. In this issue I summarize some well-known tests for autocorrelation and ARCH processes, drawing on Johnston and DiNardo's (1997) Econometric Methods and Professor Koenker's Lecture 7. We also provide additional support on testing for heteroscedasticity (see the Appendix) and a starting point for those who want to explore further aspects of ARCH and GARCH processes (see Perrelli, 2001).1

Tests for Autocorrelated Errors

Background:

If you run a regression without lagged dependent variables and detect autocorrelation, your OLS estimators are unbiased and consistent, but they are inefficient and the usual standard errors are incorrect. If you include lagged dependent variables among the covariates and still detect autocorrelation, you are in bigger trouble: the OLS estimators are inconsistent.

To test for the presence of autocorrelation, you have a large menu of options. Here we suggest the Breusch-Godfrey test, and we will show how to implement it using the dataset AUTO2.txt, which you can download from the Econ 508 web site (Data). As you will see, this adapted data set contains five variables.

    auto<-read.table("http://www.econ.uiuc.edu/~econ508/data/AUTO2.txt",header=T)
    head(auto)
  quarter    gas price income miles
1    1959 -8.015 4.676 -4.505 2.648
2    1959 -8.011 4.691 -4.493 2.648
3    1959 -8.020 4.689 -4.499 2.648
4    1959 -8.013 4.722 -4.492 2.648
5    1960 -8.017 4.707 -4.490 2.647
6    1960 -7.976 4.699 -4.489 2.647

As we did before, we first need to convert the data into time series objects:

    library(dyn)
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

    gas<-ts(auto$gas,start=1959,frequency=4)
    price<-ts(auto$price,start=1959,frequency=4)
    income<-ts(auto$income,start=1959,frequency=4)
    miles<-ts(auto$miles,start=1959,frequency=4)
    price2<-price^2
    princ<-price*income

Test: Breusch-Godfrey

Background:

Suppose you are running a version of model (2), problem set 2, in which the original data is replaced by AUTO2. Then your model will be:

\[ gas_{t} = \beta_{0} + \beta_{1} income_{t}+ \beta_{2} price_{t} + \beta_{3} (price_{t})^{2} + \beta_{4} (price_{t} \times income_{t}) + u_{t} \]

and you wish to test whether the disturbances are autocorrelated. The steps to do that are as follows:

  1. Run an OLS in your original equation:
model<-lm(gas~income+price+price2+princ) 
  2. Obtain the estimated residuals:
uhat<-model$resid
uhat<- ts(uhat,start=1959,frequency=4)
  3. Regress the estimated residuals (uhat) on the explanatory variables of the original model (income, price, price2, princ, constant) and the lagged residuals (lag(uhat,-1)). Call this the auxiliary regression.
model.adj<-dyn$lm(uhat~lag(uhat,-1)+income+price+price2+princ) 
  4. From the auxiliary regression above, obtain the R-squared and multiply it by the number of included observations:
R2<-summary(model.adj)$r.squared
R2
[1] 0.9091

Constructing R2:

SSR<-sum((model.adj$resid)^2)
SSR
[1] 0.03443
SSE<-sum((model.adj$fitted-mean(uhat))^2)
# Note that R gives you just the fitted values,
# so you need to subtract the mean to obtain the explained sum of squares.

SSE
[1] 0.3443
SST<-SSR+SSE
R2<- SSE/SST
  5. Under the null hypothesis of no autocorrelation, the test statistic NR2 converges asymptotically to a Chi-squared with s degrees of freedom, where s is the number of lags of the residuals included in the auxiliary regression. In the case above, s=1, and we have:
N<-127 #Sample size

# Or N<-(model$df)+length(model$coef) 

N*R2
[1] 115.5

In the example above, NR2 = 115.46 > 3.84 = Chi-squared(1, 5%). Hence, we reject the null hypothesis of no autocorrelation in the disturbances.
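If you prefer to let R make the comparison, the same decision can be read off the 5% critical value and the asymptotic p-value. The sketch below uses the N and R2 objects created above; the commented lines show how the bgtest function from the lmtest package (used again in the Appendix) automates the Breusch-Godfrey test, although its statistic may differ slightly from the hand computation depending on how the first lagged residual is handled.

    qchisq(0.95, df=1)       # 5% critical value of a Chi-squared(1), about 3.84
    1-pchisq(N*R2, df=1)     # asymptotic p-value of the LM statistic
    # Cross-check with the packaged version of the test (1 lag):
    # require(lmtest)
    # bgtest(gas~income+price+price2+princ, order=1, type="Chisq")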

Test for ARCH Errors

To test for ARCH errors, you can use an LM test as follows:

  1. Run an OLS in your original equation:
model2<-lm(gas~income+price+price2+princ) 
  2. Generate the squared residuals:
uhat2<-(model2$resid)^2
uhat2<-ts(uhat2,start=1959,frequency=4)
  3. Regress the squared residuals on the explanatory variables of the original model (income, price, price2, princ, constant) and lagged squared residuals. Call this the auxiliary regression.
f<-dyn$lm(uhat2~lag(uhat2,-1)+lag(uhat2,-2)+ lag(uhat2,-3)+lag(uhat2,-4)+price+ income+price2+princ)
  4. From the auxiliary regression, calculate NR2 and compare it with the 5% critical value of a Chi-squared(q), where q is the number of included lags of the squared residuals:
R2<-summary(f)$r.squared
n<-(model2$df)+length(model2$coef) 
n*R2
[1] 91.93

Under the null hypothesis of no ARCH errors, the test statistic NR2 converges asymptotically to a Chi-squared with q degrees of freedom, where q is the number of lags of the squared residuals included in the auxiliary regression. In the case above, q=4, and NR2=91.93 > 9.49 = Chi-squared(4, 5%). Therefore, we reject the null hypothesis of no ARCH and conclude that the disturbance variance is time-varying.
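As before, the decision can be checked directly in R; a minimal sketch using the n and R2 objects defined above:

    qchisq(0.95, df=4)       # 5% critical value of a Chi-squared(4), about 9.49
    1-pchisq(n*R2, df=4)     # asymptotic p-value of the LM statistic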

Appendix: Tests for Heteroscedasticity

Under heteroscedastic errors, it is well known that OLS estimators are unbiased and consistent, but they are inefficient and the usual standard errors are incorrect. Hence it is very important to detect this anomaly in your regression.

We will illustrate how to test for heteroscedasticity using Current Population Survey (CPS) data consisting of 100 observations on wages, educational level, years of experience, and unionization status of U.S. male workers. The data were borrowed from J&DN's (1997) Econometric Methods and slightly adjusted for the purposes of this tutorial. The variables are defined as follows:

Variable  Description
lnwage    Log of hourly wage in dollars
grade     Highest educational grade completed
exp       Years of experience
union     Dummy variable: 1 if union member, 0 otherwise

You can download the data directly from the Econ 508 website:

    d.d<-read.table("http://www.econ.uiuc.edu/~econ508/data/CPS.txt",header=T)
    head(d.d)
  lnwage grade exp union
1  2.331     8  22     0
2  1.504    14   2     0
3  3.912    16  22     0
4  2.197     8  34     1
5  2.788     9  47     0
6  2.351     9  32     0
    lnwage<-d.d$lnwage
    grade<-d.d$grade
    exp<-d.d$exp
    union<-d.d$union
    exp2<-exp^2

After you download the data, the next step is to run a “traditional” wage equation involving the variables described above. In R, you can do that as follows:

    summary(lm(lnwage~grade+exp+exp2+union))

Call:
lm(formula = lnwage ~ grade + exp + exp2 + union)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0155 -0.2864 -0.0444  0.2938  1.4536 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.595106   0.283485    2.10  0.03845 *  
grade        0.083543   0.020093    4.16    7e-05 ***
exp          0.050274   0.014137    3.56  0.00059 ***
exp2        -0.000562   0.000288   -1.95  0.05395 .  
union        0.165929   0.124454    1.33  0.18564    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.47 on 95 degrees of freedom
Multiple R-squared:  0.372, Adjusted R-squared:  0.345 
F-statistic: 14.1 on 4 and 95 DF,  p-value: 4.79e-09

Test 1: White

Here the strategy is as follows:

  1. Run the OLS regression (as you’ve done above, the results are omitted):
    g<-lm(lnwage~grade+exp+exp2+union)
  2. Get the residuals:
    g.resid<-g$resid
  3. Generate the squared residuals:
    g.resid2<-g.resid^2
  4. Generate new explanatory variables, in the form of the squares and cross-products of the original explanatory variables:
    grade2<-grade^2 
    exp4<-exp2^2 
    gradexp<-grade*exp 
    gradexp2<-grade*exp2 
    gradeuni<-grade*union 
    exp3<-exp*exp2 
    expunion<-exp*union 
    exp2uni<-exp2*union

Because union is a dummy variable, its squared values are equal to the original values, and we don’t need to add the squared dummy in the model. Also the squared experience was already in the original model (in the form of exp2), so we don’t need to add that in this auxiliary regression.

  5. Regress the squared residuals on a constant, the original explanatory variables, and the set of auxiliary explanatory variables (squares and cross-products) you’ve just created:
    g.final<-lm(g.resid2~grade+exp+exp2+union+grade2+exp4+exp3+gradexp+gradexp2+gradeuni+expunion+exp2uni)
  6. Get the sample size (N) and the R-squared (R2), and construct the test statistic N*R2:
N<-(g$df)+length(g$coef)
R2<-summary(g.final)$r.squared 
N*R2
[1] 10.79
  7. Under the null hypothesis of homoscedasticity, NR2 is asymptotically distributed as a Chi-squared with k-1 degrees of freedom, where k is the number of coefficients in the auxiliary regression. In this case, k=13.

We observe that the test statistic NR2 is about 10.79, while the 5% critical value of a Chi-squared(12) is about 21.03, much bigger than the test statistic. Hence, the null hypothesis (homoscedasticity) cannot be rejected.
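Again, the comparison can be done directly in R; a short sketch using the N and R2 objects from the White auxiliary regression:

    qchisq(0.95, df=12)      # 5% critical value of a Chi-squared(12), about 21.03
    1-pchisq(N*R2, df=12)    # asymptotic p-value of the White statistic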

Test 2: Breusch-Pagan-Godfrey

The Lagrange Multiplier test proposed by BPG can be executed as follows:

  1. Run the OLS regression (as you’ve done above, the output is omitted):
    g<-lm(lnwage~grade+exp+exp2+union)
  2. Get the sum of the squared residuals:
    g.resid<-g$resid
    g.ssr<-sum((g$resid)^2)
    g.ssr 
[1] 20.99
  3. Generate a disturbance correction factor in the form of the sum of squared residuals divided by the sample size:
    dcf<-g.ssr/((g$df)+length(g$coef))
  4. Regress the adjusted squared residuals (the original squared residuals divided by the correction factor) on the explanatory variables thought to influence the heteroscedasticity. Following J&DN, we will assume that, from the original dataset, only the variables grade, exp, and union affect the heteroscedasticity. Hence:
    adjerr2<-(g.resid^2)/dcf
    g.bptest<-lm(adjerr2~grade+exp+union)
    summary(g.bptest)

Call:
lm(formula = adjerr2 ~ grade + exp + union)

Residuals:
   Min     1Q Median     3Q    Max 
-1.548 -0.861 -0.451  0.289  8.777 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.32610    0.94920   -0.34     0.73
grade        0.09894    0.06435    1.54     0.13
exp          0.00995    0.01320    0.75     0.45
union       -0.58243    0.39633   -1.47     0.14

Residual standard error: 1.58 on 96 degrees of freedom
Multiple R-squared:  0.0428,    Adjusted R-squared:  0.0129 
F-statistic: 1.43 on 3 and 96 DF,  p-value: 0.239

This auxiliary regression gives you a model sum of squares (ESS):

    ess<-sum((g.bptest$fitted-mean(adjerr2))^2)      
  5. Under the null hypothesis of homoscedasticity, (1/2) ESS asymptotically converges to a Chi-squared with k-1 degrees of freedom, where k is the number of coefficients in the auxiliary regression. In this case, k=4, so we compare (1/2) ESS with the 5% critical value of a Chi-squared with 3 degrees of freedom. Doing so we get (1/2) ESS = 5.35, while the critical value of a Chi-squared(3, 5%) is 7.81. Therefore, the test statistic falls short of the critical value, and the null hypothesis of homoscedasticity cannot be rejected.
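The final comparison can also be made in R; a brief sketch using the ess object created above:

    0.5*ess                  # the BPG test statistic, about 5.35
    qchisq(0.95, df=3)       # 5% critical value of a Chi-squared(3), about 7.81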

Test 3: Goldfeld-Quandt

Suppose now you believe a single explanatory variable is responsible for most of the heteroscedasticity in your model. For example, let’s say that experience (exp) is the “trouble-maker” variable. You can then proceed with the Goldfeld-Quandt test as follows:

  1. Sort your data according to the variable exp. Then divide your data into, say, three parts, drop the observations in the central part, and run separate regressions for the bottom part (Regression 1) and the top part (Regression 2). After each regression, obtain the respective Residual Sum of Squares (RSS).

  2. Then compute the ratio of the Residual Sums of Squares, R = RSS2/RSS1. Under the null hypothesis of homoscedasticity, this ratio R is distributed according to an \(F_{\left(\frac{(n-c-2k)}{2},\frac{(n-c-2k)}{2}\right)}\), where n is the sample size, c is the number of dropped observations, and k is the number of regressors in the model.

This is left for the reader as an exercise; a minimal sketch follows below if you want to check your work. You should find that \(R < F\), so you cannot reject the null hypothesis of homoscedasticity.
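Here is one sketch of the steps above. The 33/34/33 split of the 100 sorted observations (dropping the middle third) is an assumption; other splits are possible and change the degrees of freedom accordingly.

    # Sort observations by the suspect variable (exp) and drop the middle third
    o<-order(d.d$exp)
    low<-d.d[o[1:33],]       # Regression 1: smallest values of exp
    high<-d.d[o[68:100],]    # Regression 2: largest values of exp
    rss1<-sum(resid(lm(lnwage~grade+exp+I(exp^2)+union, data=low))^2)
    rss2<-sum(resid(lm(lnwage~grade+exp+I(exp^2)+union, data=high))^2)
    R<-rss2/rss1             # ratio of residual sums of squares
    R
    qf(0.95, 28, 28)         # here n=100, c=34, k=5, so (n-c-2k)/2 = 28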

A Simpler Approach

The three heteroscedasticity tests presented here are classical ones, so there are likely to be packages that already compute them for you. One such package is lmtest. For example, you can run the Breusch-Pagan-Godfrey test by typing:

    require(lmtest)
Loading required package: lmtest
    bptest(lnwage~grade+exp+exp2+union, studentize=FALSE)

    Breusch-Pagan test

data:  lnwage ~ grade + exp + exp2 + union
BP = 6.116, df = 4, p-value = 0.1906

Note that the results are somewhat different, which has to do with how the bptest function is written. However, the conclusion does not change.

Another thing to note is the option studentize=FALSE; the default for this function is TRUE, in which case Prof. Koenker’s studentized version of the test statistic is used.
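For comparison, you can run the default, studentized version of the test with the call below (output omitted):

    bptest(lnwage~grade+exp+exp2+union)    # studentize=TRUE is the default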

You can also run a Goldfeld-Quandt test and check whether the results you obtained following the steps above coincide with the output of the gqtest function included in the package:

    gqtest(lnwage~grade+exp+exp2+union)

    Goldfeld-Quandt test

data:  lnwage ~ grade + exp + exp2 + union
GQ = 1.492, df1 = 45, df2 = 45, p-value = 0.09161

  1. Please send comments to bottan2@illinois.edu or srmntbr2@illinois.edu