Applied econometrics: Econ 508
logo

Applied Econometrics 
Econ 508 - Fall 2014

Professor: Roger Koenker 

TA: Nicolas Bottan 

Welcome to e-Tutorial, your on-line help to Econ508. This issue provides an introduction to model selection in Econometrics, focusing on Akaike (AIC) and Schwarz (SIC) Information Criteria.1

Data Set

The data set used in this tutorial was borrowed from Johnston and DiNardo’s Econometric Methods (1997, 4th ed), but slightly adjusted for your needs. It is called AUTO2. You can download the data by visiting the Econ 508 web site (Data). As you will see, this adapted data set contains five series.

    auto<-read.table("http://www.econ.uiuc.edu/~econ508/data/AUTO2.txt",header=T)
    head(auto)
  quarter    gas price income miles
1    1959 -8.015 4.676 -4.505 2.648
2    1959 -8.011 4.691 -4.493 2.648
3    1959 -8.020 4.689 -4.499 2.648
4    1959 -8.013 4.722 -4.492 2.648
5    1960 -8.017 4.707 -4.490 2.647
6    1960 -7.976 4.699 -4.489 2.647

As we did before we need to transform the data in “time series” first:

    library(dyn)
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
    gas<-ts(auto$gas,start=1959,frequency=4)
    price<-ts(auto$price,start=1959,frequency=4)
    income<-ts(auto$income,start=1959,frequency=4)
    miles<-ts(auto$miles,start=1959,frequency=4)

Running a Generic Dynamic Models

In the PS2, question 1, for that specific data set (which is different than the one used here) you are asked to run a simple dynamic model in the following autorregressive distributed lag form:

\[gas = a_{0} + a_{1} gas_{t-1} + a_{2} \Delta gas_{t-1} + a_{3} price + a_{4} \Delta price + a_{5} \Delta price_{t-1} + a_{6} income + a_{7} \Delta income + a_{8} \Delta income_{t-1} + \epsilon\]

   f<-dyn$lm(gas~lag(gas,-1)+lag(diff(gas),-1)+price+diff(price)+lag(diff(price),-1)+income+diff(income)+lag(diff(income),-1))
    summary(f)

Call:
lm(formula = dyn(gas ~ lag(gas, -1) + lag(diff(gas), -1) + price + 
    diff(price) + lag(diff(price), -1) + income + diff(income) + 
    lag(diff(income), -1)))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.07405 -0.00755  0.00048  0.00775  0.04586 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -0.09297    0.12155   -0.76    0.446    
lag(gas, -1)           0.97219    0.02540   38.27  < 2e-16 ***
lag(diff(gas), -1)    -0.17881    0.09073   -1.97    0.051 .  
price                 -0.01830    0.00962   -1.90    0.060 .  
diff(price)           -0.23593    0.03734   -6.32  4.9e-09 ***
lag(diff(price), -1)   0.05841    0.04459    1.31    0.193    
income                 0.00828    0.02052    0.40    0.687    
diff(income)           0.27223    0.15497    1.76    0.082 .  
lag(diff(income), -1)  0.04469    0.15529    0.29    0.774    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0156 on 117 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.983, Adjusted R-squared:  0.982 
F-statistic:  863 on 8 and 117 DF,  p-value: <2e-16

The model above is your benchmark. You should now start your model selection process.

Even when there exist commands to calculate the Akaike or the Schwarz criterion, in Econ 508 it is recommended that you compute them by hand, as taught in class, using the formulae given in Prof. Koenker’s Lecture Note 4:

\[AIC=log(\hat{\sigma_j}^2)+\frac{p_i}{n}*2\] \[SIC=log(\hat{\sigma_j}^2)+\frac{p_i}{n}*log(n)\]

In order to calculate the sample size, one could sum the degrees of freedom and the number of estimated parameters, so we have the following:

    sample.size<-f$df + length(f$coeff)

Finally:

    aic<-log(sum(resid(f)^2)/sample.size)+(length(f$coeff)/sample.size)*2
    aic
[1] -8.253
    bic<-log(sum(resid(f)^2)/sample.size)+(length(f$coeff)/sample.size)*log(sample.size)
    bic
[1] -8.051

Next, we just need to compute variations of this model and the respective aic, sic, or both.

To do so it may be best to build your own AIC and BIC function:

    aic<-function(model){
    n<-model$df + length(model$coeff)
    log(sum(resid(model)^2)/n)+(length(model$coeff)/n)*2
    }
    aic(f)
[1] -8.253
    bic<-function(model){
    n<-model$df + length(model$coeff)
    log(sum(resid(model)^2)/n)+(length(model$coeff)/n)*log(n)
    }
    bic(f)
[1] -8.051

Next we specify the models we want to assess

    m1 <- gas ~ lag(gas,-1) 
    m2 <- gas ~ lag(gas,-1) + price
    m3 <- gas ~ lag(gas,-1) + price + diff(gas)
    m4 <- gas ~ lag(gas,-1) + price + diff(gas) + lag(diff(gas),-1)

And then we build a loop that calculates and saves the AIC and BIC results in a matrix

    M <- c(m1,m2,m3,m4)

    A <- array(0,c(4,2))

    for(i in 1:4){
        g <- dyn$lm(M[[i]])
        A[i,1] <- aic(g)
        A[i,2] <- bic(g)
    }

then we check which model minimizes the criterion functions. The which.min() function “determines the location, i.e., index of the (first) minimum or maximum of a numeric (or logical) vector.”(see help for which.min function)

    A
        [,1]    [,2]
[1,]  -7.959  -7.914
[2,]  -7.975  -7.908
[3,] -68.422 -68.332
[4,] -69.066 -68.954
    aic_min<-which.min(A[,1])
    bic_min<-which.min(A[,2])
    
    
    model_aic<-dyn$lm(eval(parse(text=paste("m",aic_min,sep=""))))
    summary(model_aic)
Warning: essentially perfect fit: summary may be unreliable

Call:
lm(formula = dyn(eval(parse(text = paste("m", aic_min, sep = "")))))

Residuals:
      Min        1Q    Median        3Q       Max 
-1.06e-14 -5.50e-17  4.00e-18  2.01e-16  5.91e-16 

Coefficients:
                   Estimate Std. Error  t value Pr(>|t|)    
(Intercept)        1.14e-14   6.32e-15 1.80e+00    0.074 .  
lag(gas, -1)       1.00e+00   7.57e-16 1.32e+15   <2e-16 ***
price              2.27e-16   5.53e-16 4.10e-01    0.683    
diff(gas)          1.00e+00   4.87e-15 2.05e+14   <2e-16 ***
lag(diff(gas), -1) 7.05e-16   4.84e-15 1.50e-01    0.884    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.86e-16 on 121 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:     1,  Adjusted R-squared:     1 
F-statistic: 4.39e+29 on 4 and 121 DF,  p-value: <2e-16
    model_bic<-dyn$lm(eval(parse(text=paste("m",bic_min,sep=""))))    
    summary(model_bic)
Warning: essentially perfect fit: summary may be unreliable

Call:
lm(formula = dyn(eval(parse(text = paste("m", bic_min, sep = "")))))

Residuals:
      Min        1Q    Median        3Q       Max 
-1.06e-14 -5.50e-17  4.00e-18  2.01e-16  5.91e-16 

Coefficients:
                   Estimate Std. Error  t value Pr(>|t|)    
(Intercept)        1.14e-14   6.32e-15 1.80e+00    0.074 .  
lag(gas, -1)       1.00e+00   7.57e-16 1.32e+15   <2e-16 ***
price              2.27e-16   5.53e-16 4.10e-01    0.683    
diff(gas)          1.00e+00   4.87e-15 2.05e+14   <2e-16 ***
lag(diff(gas), -1) 7.05e-16   4.84e-15 1.50e-01    0.884    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.86e-16 on 121 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:     1,  Adjusted R-squared:     1 
F-statistic: 4.39e+29 on 4 and 121 DF,  p-value: <2e-16

The last lines run the model for which the AIC and BIC are minimized.

As we previously mentioned most of functions are already programmed in R, that is the case with AIC and BIC. To obtain them you need to estimate a model and then call the function

    n <- length(fitted(f))
    AIC(f,k=2)
[1] -680.3
    BIC(f)
[1] -652

Note that the number given by the AIC/BIC built in function is different than the programmed by ourselves, to check why you should check how the built in function was created. You can do that by typing ?AIC.


  1. Please send comments to bottan2@illinois.edu or srmntbr2@illinois.edu