Welcome to e-Tutorial, your on-line help to Econ 536. This issue provides an introduction to model selection in Econometrics, focusing on Akaike (AIC) and Schwarz (SIC) Information Criteria.1
The data set used in this tutorial was borrowed from Johnston and DiNardo’s Econometric Methods (1997, 4th ed), but slightly adjusted for your needs. It is called AUTO2. You can download the data by visiting the Econ 536 web site (Data). As you will see, this adapted data set contains five series.
auto<-read.table("http://www.econ.uiuc.edu/~econ536/Data/AUTO2.txt",header=T)
head(auto)
quarter gas price income miles
1 1959.1 -8.015248 4.675750 -4.505240 2.647592
2 1959.2 -8.011060 4.691292 -4.492739 2.647592
3 1959.3 -8.019878 4.689134 -4.498873 2.647592
4 1959.4 -8.012581 4.722338 -4.491904 2.647592
5 1960.1 -8.016769 4.707470 -4.490103 2.647415
6 1960.2 -7.976376 4.699136 -4.489107 2.647238
As we did before we need to transform the data in “time series” first:
library(dyn)
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
gas<-ts(auto$gas,start=1959,frequency=4)
price<-ts(auto$price,start=1959,frequency=4)
income<-ts(auto$income,start=1959,frequency=4)
miles<-ts(auto$miles,start=1959,frequency=4)
In the PS2, question 1, for that specific data set (which is different than the one used here) you are asked to run a simple dynamic model in the following autorregressive distributed lag form:
\[gas = a_{0} + a_{1} gas_{t-1} + a_{2} \Delta gas_{t-1} + a_{3} price + a_{4} \Delta price + a_{5} \Delta price_{t-1} + a_{6} income + a_{7} \Delta income + a_{8} \Delta income_{t-1} + \epsilon\]
f<-dyn$lm(gas~lag(gas,-1)+lag(diff(gas),-1)+price+diff(price)+lag(diff(price),-1)+income+diff(income)+lag(diff(income),-1))
summary(f)
Call:
lm(formula = dyn(gas ~ lag(gas, -1) + lag(diff(gas), -1) + price +
diff(price) + lag(diff(price), -1) + income + diff(income) +
lag(diff(income), -1)))
Residuals:
Min 1Q Median 3Q Max
-0.074054 -0.007550 0.000483 0.007749 0.045859
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.092968 0.121550 -0.765 0.4459
lag(gas, -1) 0.972190 0.025402 38.272 < 2e-16 ***
lag(diff(gas), -1) -0.178806 0.090730 -1.971 0.0511 .
price -0.018300 0.009621 -1.902 0.0596 .
diff(price) -0.235935 0.037338 -6.319 4.95e-09 ***
lag(diff(price), -1) 0.058411 0.044592 1.310 0.1928
income 0.008281 0.020516 0.404 0.6872
diff(income) 0.272230 0.154974 1.757 0.0816 .
lag(diff(income), -1) 0.044692 0.155294 0.288 0.7740
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01559 on 117 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.9833, Adjusted R-squared: 0.9822
F-statistic: 863.2 on 8 and 117 DF, p-value: < 2.2e-16
The model above is your benchmark. You should now start your model selection process.
Even when there exist commands to calculate the Akaike or the Schwarz criterion, in Econ 536 it is recommended that you compute them by hand, as taught in class, using the formulae given in Prof. Koenker’s Lecture Note 4:
\[AIC=log(\hat{\sigma_j}^2)+\frac{p_i}{n}*2\] \[SIC=log(\hat{\sigma_j}^2)+\frac{p_i}{n}*log(n)\]
In order to calculate the sample size, one could sum the degrees of freedom and the number of estimated parameters, so we have the following:
sample.size<-f$df + length(f$coeff)
Finally:
aic<-log(sum(resid(f)^2)/sample.size)+(length(f$coeff)/sample.size)*2
aic
[1] -8.253146
bic<-log(sum(resid(f)^2)/sample.size)+(length(f$coeff)/sample.size)*log(sample.size)
bic
[1] -8.050554
Next, we just need to compute variations of this model and the respective aic, sic, or both.
To do so it may be best to build your own AIC and BIC function:
aic<-function(model){
n<-model$df + length(model$coeff)
log(sum(resid(model)^2)/n)+(length(model$coeff)/n)*2
}
aic(f)
[1] -8.253146
bic<-function(model){
n<-model$df + length(model$coeff)
log(sum(resid(model)^2)/n)+(length(model$coeff)/n)*log(n)
}
bic(f)
[1] -8.050554
Next we specify the models we want to assess
m1 <- gas ~ lag(gas,-1)
m2 <- gas ~ lag(gas,-1) + price
m3 <- gas ~ lag(gas,-1) + price + diff(gas)
m4 <- gas ~ lag(gas,-1) + price + diff(gas) + lag(diff(gas),-1)
And then we build a loop that calculates and saves the AIC and BIC results in a matrix
M <- c(m1,m2,m3,m4)
A <- array(0,c(4,2))
for(i in 1:4){
g <- dyn$lm(M[[i]])
A[i,1] <- aic(g)
A[i,2] <- bic(g)
}
then we check which model minimizes the criterion functions. The which.min()
function “determines the location, i.e., index of the (first) minimum or maximum of a numeric (or logical) vector.”(see help for which.min
function)
A
[,1] [,2]
[1,] -7.958711 -7.913921
[2,] -7.974809 -7.907623
[3,] -68.421946 -68.332365
[4,] -69.066286 -68.953735
aic_min<-which.min(A[,1])
bic_min<-which.min(A[,2])
model_aic<-dyn$lm(eval(parse(text=paste("m",aic_min,sep=""))))
summary(model_aic)
Warning in summary.lm(model_aic): essentially perfect fit: summary may be
unreliable
Call:
lm(formula = dyn(eval(parse(text = paste("m", aic_min, sep = "")))))
Residuals:
Min 1Q Median 3Q Max
-1.056e-14 -5.540e-17 3.900e-18 2.012e-16 5.909e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.139e-14 6.316e-15 1.804e+00 0.0737 .
lag(gas, -1) 1.000e+00 7.567e-16 1.322e+15 <2e-16 ***
price 2.266e-16 5.532e-16 4.100e-01 0.6829
diff(gas) 1.000e+00 4.870e-15 2.053e+14 <2e-16 ***
lag(diff(gas), -1) 7.049e-16 4.839e-15 1.460e-01 0.8844
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.863e-16 on 121 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 4.388e+29 on 4 and 121 DF, p-value: < 2.2e-16
model_bic<-dyn$lm(eval(parse(text=paste("m",bic_min,sep=""))))
summary(model_bic)
Warning in summary.lm(model_bic): essentially perfect fit: summary may be
unreliable
Call:
lm(formula = dyn(eval(parse(text = paste("m", bic_min, sep = "")))))
Residuals:
Min 1Q Median 3Q Max
-1.056e-14 -5.540e-17 3.900e-18 2.012e-16 5.909e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.139e-14 6.316e-15 1.804e+00 0.0737 .
lag(gas, -1) 1.000e+00 7.567e-16 1.322e+15 <2e-16 ***
price 2.266e-16 5.532e-16 4.100e-01 0.6829
diff(gas) 1.000e+00 4.870e-15 2.053e+14 <2e-16 ***
lag(diff(gas), -1) 7.049e-16 4.839e-15 1.460e-01 0.8844
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.863e-16 on 121 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 4.388e+29 on 4 and 121 DF, p-value: < 2.2e-16
The last lines run the model for which the AIC and BIC are minimized.
As we previously mentioned most of functions are already programmed in R, that is the case with AIC and BIC. To obtain them you need to estimate a model and then call the function
n <- length(fitted(f))
AIC(f,k=2)
[1] -680.3238
BIC(f)
[1] -651.961
Note that the number given by the AIC/BIC built in function is different than the programmed by ourselves, to check why you should check how the built in function was created. You can do that by typing ?AIC
.
Please send comments to hrtdmrt2@illinois.edu or srmntbr2@illinois.edu↩