Welcome to e-Tutorial, your on-line help to Econ 536. This issue provides an introduction to model selection in Econometrics, focusing on Akaike (AIC) and Schwarz (SIC) Information Criteria.1

Data Set

The data set used in this tutorial was borrowed from Johnston and DiNardo’s Econometric Methods (1997, 4th ed), but slightly adjusted for your needs. It is called AUTO2. You can download the data by visiting the Econ 536 web site (Data). As you will see, this adapted data set contains five series.

    auto<-read.table("http://www.econ.uiuc.edu/~econ536/Data/AUTO2.txt",header=T)
    head(auto)
  quarter       gas    price    income    miles
1  1959.1 -8.015248 4.675750 -4.505240 2.647592
2  1959.2 -8.011060 4.691292 -4.492739 2.647592
3  1959.3 -8.019878 4.689134 -4.498873 2.647592
4  1959.4 -8.012581 4.722338 -4.491904 2.647592
5  1960.1 -8.016769 4.707470 -4.490103 2.647415
6  1960.2 -7.976376 4.699136 -4.489107 2.647238

As we did before we need to transform the data in “time series” first:

    library(dyn)
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
    gas<-ts(auto$gas,start=1959,frequency=4)
    price<-ts(auto$price,start=1959,frequency=4)
    income<-ts(auto$income,start=1959,frequency=4)
    miles<-ts(auto$miles,start=1959,frequency=4)

Running a Generic Dynamic Models

In the PS2, question 1, for that specific data set (which is different than the one used here) you are asked to run a simple dynamic model in the following autorregressive distributed lag form:

\[gas = a_{0} + a_{1} gas_{t-1} + a_{2} \Delta gas_{t-1} + a_{3} price + a_{4} \Delta price + a_{5} \Delta price_{t-1} + a_{6} income + a_{7} \Delta income + a_{8} \Delta income_{t-1} + \epsilon\]

   f<-dyn$lm(gas~lag(gas,-1)+lag(diff(gas),-1)+price+diff(price)+lag(diff(price),-1)+income+diff(income)+lag(diff(income),-1))
    summary(f)

Call:
lm(formula = dyn(gas ~ lag(gas, -1) + lag(diff(gas), -1) + price + 
    diff(price) + lag(diff(price), -1) + income + diff(income) + 
    lag(diff(income), -1)))

Residuals:
      Min        1Q    Median        3Q       Max 
-0.074054 -0.007550  0.000483  0.007749  0.045859 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -0.092968   0.121550  -0.765   0.4459    
lag(gas, -1)           0.972190   0.025402  38.272  < 2e-16 ***
lag(diff(gas), -1)    -0.178806   0.090730  -1.971   0.0511 .  
price                 -0.018300   0.009621  -1.902   0.0596 .  
diff(price)           -0.235935   0.037338  -6.319 4.95e-09 ***
lag(diff(price), -1)   0.058411   0.044592   1.310   0.1928    
income                 0.008281   0.020516   0.404   0.6872    
diff(income)           0.272230   0.154974   1.757   0.0816 .  
lag(diff(income), -1)  0.044692   0.155294   0.288   0.7740    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01559 on 117 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.9833,    Adjusted R-squared:  0.9822 
F-statistic: 863.2 on 8 and 117 DF,  p-value: < 2.2e-16

The model above is your benchmark. You should now start your model selection process.

Even when there exist commands to calculate the Akaike or the Schwarz criterion, in Econ 536 it is recommended that you compute them by hand, as taught in class, using the formulae given in Prof. Koenker’s Lecture Note 4:

\[AIC=log(\hat{\sigma_j}^2)+\frac{p_i}{n}*2\] \[SIC=log(\hat{\sigma_j}^2)+\frac{p_i}{n}*log(n)\]

In order to calculate the sample size, one could sum the degrees of freedom and the number of estimated parameters, so we have the following:

    sample.size<-f$df + length(f$coeff)

Finally:

    aic<-log(sum(resid(f)^2)/sample.size)+(length(f$coeff)/sample.size)*2
    aic
[1] -8.253146
    bic<-log(sum(resid(f)^2)/sample.size)+(length(f$coeff)/sample.size)*log(sample.size)
    bic
[1] -8.050554

Next, we just need to compute variations of this model and the respective aic, sic, or both.

To do so it may be best to build your own AIC and BIC function:

    aic<-function(model){
    n<-model$df + length(model$coeff)
    log(sum(resid(model)^2)/n)+(length(model$coeff)/n)*2
    }
    aic(f)
[1] -8.253146
    bic<-function(model){
    n<-model$df + length(model$coeff)
    log(sum(resid(model)^2)/n)+(length(model$coeff)/n)*log(n)
    }
    bic(f)
[1] -8.050554

Next we specify the models we want to assess

    m1 <- gas ~ lag(gas,-1) 
    m2 <- gas ~ lag(gas,-1) + price
    m3 <- gas ~ lag(gas,-1) + price + diff(gas)
    m4 <- gas ~ lag(gas,-1) + price + diff(gas) + lag(diff(gas),-1)

And then we build a loop that calculates and saves the AIC and BIC results in a matrix

    M <- c(m1,m2,m3,m4)

    A <- array(0,c(4,2))

    for(i in 1:4){
        g <- dyn$lm(M[[i]])
        A[i,1] <- aic(g)
        A[i,2] <- bic(g)
    }

then we check which model minimizes the criterion functions. The which.min() function “determines the location, i.e., index of the (first) minimum or maximum of a numeric (or logical) vector.”(see help for which.min function)

    A
           [,1]       [,2]
[1,]  -7.958711  -7.913921
[2,]  -7.974809  -7.907623
[3,] -68.421946 -68.332365
[4,] -69.066286 -68.953735
    aic_min<-which.min(A[,1])
    bic_min<-which.min(A[,2])
    
    
    model_aic<-dyn$lm(eval(parse(text=paste("m",aic_min,sep=""))))
    summary(model_aic)
Warning in summary.lm(model_aic): essentially perfect fit: summary may be
unreliable

Call:
lm(formula = dyn(eval(parse(text = paste("m", aic_min, sep = "")))))

Residuals:
       Min         1Q     Median         3Q        Max 
-1.056e-14 -5.540e-17  3.900e-18  2.012e-16  5.909e-16 

Coefficients:
                    Estimate Std. Error   t value Pr(>|t|)    
(Intercept)        1.139e-14  6.316e-15 1.804e+00   0.0737 .  
lag(gas, -1)       1.000e+00  7.567e-16 1.322e+15   <2e-16 ***
price              2.266e-16  5.532e-16 4.100e-01   0.6829    
diff(gas)          1.000e+00  4.870e-15 2.053e+14   <2e-16 ***
lag(diff(gas), -1) 7.049e-16  4.839e-15 1.460e-01   0.8844    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.863e-16 on 121 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.388e+29 on 4 and 121 DF,  p-value: < 2.2e-16
    model_bic<-dyn$lm(eval(parse(text=paste("m",bic_min,sep=""))))    
    summary(model_bic)
Warning in summary.lm(model_bic): essentially perfect fit: summary may be
unreliable

Call:
lm(formula = dyn(eval(parse(text = paste("m", bic_min, sep = "")))))

Residuals:
       Min         1Q     Median         3Q        Max 
-1.056e-14 -5.540e-17  3.900e-18  2.012e-16  5.909e-16 

Coefficients:
                    Estimate Std. Error   t value Pr(>|t|)    
(Intercept)        1.139e-14  6.316e-15 1.804e+00   0.0737 .  
lag(gas, -1)       1.000e+00  7.567e-16 1.322e+15   <2e-16 ***
price              2.266e-16  5.532e-16 4.100e-01   0.6829    
diff(gas)          1.000e+00  4.870e-15 2.053e+14   <2e-16 ***
lag(diff(gas), -1) 7.049e-16  4.839e-15 1.460e-01   0.8844    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.863e-16 on 121 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.388e+29 on 4 and 121 DF,  p-value: < 2.2e-16

The last lines run the model for which the AIC and BIC are minimized.

As we previously mentioned most of functions are already programmed in R, that is the case with AIC and BIC. To obtain them you need to estimate a model and then call the function

    n <- length(fitted(f))
    AIC(f,k=2)
[1] -680.3238
    BIC(f)
[1] -651.961

Note that the number given by the AIC/BIC built in function is different than the programmed by ourselves, to check why you should check how the built in function was created. You can do that by typing ?AIC.


  1. Please send comments to hrtdmrt2@illinois.edu or srmntbr2@illinois.edu