Welcome
to the fifth issue of e-Tutorial, the on-line help to Econ 508. This issue provides
an introduction to model selection in Econometrics, focusing on Akaike (AIC)
and Schwarz (SIC) Information Criteria.
Data Set:
The data set used in this
tutorial was borrowed from Johnston and DiNardo's Econometric Methods
(1997, 4th ed), but slightly adjusted for your needs. It is called AUTO2 and
it is already in STATA format. You can download the data by clicking here or simply visiting the Econ
508 web site (Data). As you will see, this
adapted data set contains five series, namely:
quarter
Quarter of the observation (from 1959.1 to 1990.4)
gas
Log of per capita real expenditure on gasoline and oil
price
Log of the real price of gasoline and oil
income
Log of per capita real disposable income
miles
Log of miles per gallon
Transforming the
Data in Time Series Format:
The next step is to create
the variables you will need to run dynamic models. As mentioned before, you
need to generate a variable corresponding to the time period of each
observation (which can not be ?quarter? because it contains non-integer
values):
gen
t=_n
label
variable t "Integer time period"
Next you
"tsset" your data using the created variable:
tsset
t
time variable: t, 1 to 128
Running a Generic
Dynamic Model:
In the PS2, question 1, for
that specific data set (which is different than the one used here) you are
asked to run a simple dynamic model in the following autorregressive
distributed lag form:
gas
= a0 + a1 L.gas + a2 LD.gas + a3 price
+ a4 D.price + a5 DL.price +
a6
income + a7 D.income
+ a8 DL.income + error
In STATA, you can run
this model as follows:
regress
gas L.gas LD.gas price D.price LD.price
income D.income LD.income
And the output for the
data set used here (auto2.dta) will be:
Source
| SS
df
MS
Number of obs = 126
---------+------------------------------
F( 8, 117) = 863.16
Model | 1.67892182 8
.209865228
Prob > F = 0.0000
Residual
| .028446793 117
.000243135
R-squared = 0.9833
---------+------------------------------
Adj R-squared = 0.9822
Total | 1.70736862 125 .013658949
Root MSE = .01559
------------------------------------------------------------------------------
gas
| Coef. Std.
Err. t
P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
gas
|
L1 | .9721906 .0254024
38.272 0.000
.9218825 1.022499
LD | -.1788088 .0907299
-1.971 0.051
-.3584947 .0008771
price
|
-- | -.0183001 .0096211
-1.902 0.060
-.0373542 .000754
D1 | -.2359339 .0373382
-6.319 0.000 -.3098801
-.1619876
LD | .0584094 .0445919
1.310 0.193
-.0299024 .1467213
income
|
-- | .0082806 .0205162
0.404 0.687
-.0323507 .0489119
D1 | .2722332 .1549735
1.757 0.082
-.0346836 .5791501
LD | .0446936 .1552938
0.288 0.774
-.2628576 .3522449
_cons
| -.0929674 .12155
-0.765 0.446
-.3336909 .1477561
------------------------------------------------------------------------------
The model above is your benchmark.
You should now start your model selection process.
Even when there exist
commands to calculate the Akaike or the Schwarz criterion, in Econ 508 it is
recommended that you compute them by hand, as taught in class, using the
formulae given in Prof. Koenker's Lecture Note 4, page 3, or the ones
provided by Johnston and DiNardo (1997, p. 74).
In STATA, you can
calculate various information criteria and other important statistics using
functions to extract matrices and scalars generated by the regression
operation:
Sample
size: after
regress, type the command scalar A=_result(1)
Model
SS:
after regress, type the command scalar
B=_result(2)
Model
df:
after regress, type the command scalar
C=_result(3)
Residual
SS: after
regress, type the command scalar D=_result(4)
Residual
df: after
regress, type the command scalar E=_result(5)
F-statistic:
after regress, type the command scalar
F=_result(6)
R-squared:
after regress, type the command scalar
G=_result(7)
Adjusted
R-sq: after regress, type the
command scalar H=_result(8)
Root
MSE:
after regress, type the command scalar
I=_result(9)
Coefficients:
after regress, type the command matrix b=get(_b)
# of
Parameters: after getting the matrix b,
type scalar k=colsof(b)
Covariance
matrix: after regress, type the
command matrix v=get(VCE)
You can see what you
generated simply as follows:
scalar
list A B C D E F G H I k
A = 126
B = 1.6789218
C = 8
D = .02844679
E = 117
F = 863.16345
G = .98333881
H = .98219958
I = .01559279
k = 9
matrix
list b
b[1,9]
L.
LD.
D. LD.
gas
gas
price
price price
income
y1
.9721906 -.17880883 -.01830014 -.23593387
.05840944 .0082806
D. LD.
income income
_cons
y1
.27223322 .04469364 -.09296738
matrix
list v
symmetric
v[9,9]
L.
LD.
D. LD.
gas
gas
price price
price
L.gas .00064528
LD.gas -.00044148 .00823192
price .0000944 7.738e-06 .00009257
D.price -.00013958 .00013756 -.00003085
.00139414
LD.price
-.00013149 .00182434 -.00005647
-.00046606 .00198843
income -.00045971 .00041162 -.00008305
.00008961 .00015336
D.income
.00048852 -.00068779 .00018875
.00096361 .00057653
LD.income
.00047985 -.00230927 .00018465 -.00049299
.00089038
_cons .00263343 -.00174625 -.00005413
-.0005658 -.00012155
D. LD.
income income
income _cons
income .00042091
D.income
-.00023978 .02401678
LD.income
-.00024082 .00139978 .02411617
_cons -.00141118 .00179191
.00174108 .01477441
Akaike Information
Criterion:
You can get the Akaike Information Criterion as
follows:
scalar
AIC=log(_result(4)/_result(1))+(colsof(b)/_result(1))*2
scalar
list AIC
AIC = -8.2531446
Schwarz Information
Criterion:
You can get the Schwarz Information Criterion as
follows:
scalar
SIC=log(_result(4)/_result(1))+(colsof(b)/_result(1))*log(_result(1))
scalar
list SIC
SIC = -8.0505531
Programming in STATA,
I: How to Obtain Information Criteria
To help you on the model
selection, I wrote a small routine that computes AIC and SIC after each round
of regressions. This is called AICSIC.do. To download the program simply
follow the steps below:
a) Click here to save the file.
b) Open the program in the
STATA Do-file editor (little envelope icon in the toolbar), in Notepad or in
any other text editor.
c) Start editing the file.
The first thing to notice is that lines preceded by * are only descriptive -
they do not interfere in your routine.
d) To capture the correct
data set you will use, you can either replace the path where you've saved the
data, or you can first open the respective data in STATA, and next run the
Do-file.
e) An example is as follows:
if you have saved the problem set 2 data in a floppy disk, replace the first
line by use "A:\gasnew.dat", and add a line to infile the
data in STATA, with the respective variables' names. You can look at
e-Tutorial 1, if you forgot how to do that... Even simpler is to save gasnew.dat
in STATA format, and open it before running the Do-file. In this case, just
add * before the line with the path to the data set.
f) Adjust the variables'
names in your Do-file according to the respective data set.
g) Add or reduce steps on
the model selection process, or provide different combinations you believe
are reasonable. But don't rely too much on a large number of combinations;
after all, economic theory will guide you as well on the model selection
process.
h) Save your own version of
Do-file (to avoid excessive havoc, use a different file name on it).
g) In Stata, go to File,
Do..., and open your new Do-file. It runs automatically and the results will
be shown on your screen. Because we are using the dual version of AIC and
SIC, the best model among those included presents the lowest AIC, SIC.
**Once again, recall that
in this tutorial I have used a different data set with different variables.
So, you need to make your own adjustments and generate your own version of a
Do-file.**
See below how the code
looks like:
**********************************************************************************************
* A
small do-file to calculate AIC and SIC in STATA 6.0
* use
"C:\Econ472\auto2.dta"
* gen
t=_n
*
label variable t "Integer time period"
*
tsset t
*
*
Model 1.1: Full Model
regress
gas L.gas LD.gas price D.price LD.price
income D.income LD.income
matrix
b1=get(_b)
scalar
AIC1=log(_result(4)/_result(1))+(colsof(b1)/_result(1))*2
scalar
SIC1=log(_result(4)/_result(1))+(colsof(b1)/_result(1))*log(_result(1))
*
*
Model 1.2: Drop LD.income
regress
gas L.gas LD.gas price D.price LD.price
income D.income
matrix
b2=get(_b)
scalar
AIC2=log(_result(4)/_result(1))+(colsof(b2)/_result(1))*2
scalar
SIC2=log(_result(4)/_result(1))+(colsof(b2)/_result(1))*log(_result(1))
*
*
Model 1.3: Drop LD.price
regress
gas L.gas LD.gas price
D.price
income D.income LD.income
matrix
b3=get(_b)
scalar
AIC3=log(_result(4)/_result(1))+(colsof(b3)/_result(1))*2
scalar
SIC3=log(_result(4)/_result(1))+(colsof(b3)/_result(1))*log(_result(1))
*
*
Model 1.4: Drop LD.gas
regress
gas L.gas
price D.price LD.price income D.income
LD.income
matrix
b4=get(_b)
scalar
AIC4=log(_result(4)/_result(1))+(colsof(b4)/_result(1))*2
scalar
SIC4=log(_result(4)/_result(1))+(colsof(b4)/_result(1))*log(_result(1))
*
*
Model 1.5: Drop LD.price, LD.income
regress
gas L.gas LD.gas price
D.price
income D.income
matrix
b5=get(_b)
scalar
AIC5=log(_result(4)/_result(1))+(colsof(b5)/_result(1))*2
scalar
SIC5=log(_result(4)/_result(1))+(colsof(b5)/_result(1))*log(_result(1))
*
*
Model 1.6: Drop LD.gas, LD.income
regress
gas L.gas price
D.price LD.price income D.income
matrix
b6=get(_b)
scalar
AIC6=log(_result(4)/_result(1))+(colsof(b6)/_result(1))*2
scalar
SIC6=log(_result(4)/_result(1))+(colsof(b6)/_result(1))*log(_result(1))
*
*
Model 1.7: Drop LD.gas, LD.price
regress
gas L.gas
price
D.price
income D.income LD.income
matrix
b7=get(_b)
scalar
AIC7=log(_result(4)/_result(1))+(colsof(b7)/_result(1))*2
scalar
SIC7=log(_result(4)/_result(1))+(colsof(b7)/_result(1))*log(_result(1))
*
* Model
1.8: Drop LD.gas, LD.price, LD.income
regress
gas L.gas
price
D.price
income D.income
matrix
b8=get(_b)
scalar
AIC8=log(_result(4)/_result(1))+(colsof(b8)/_result(1))*2
scalar
SIC8=log(_result(4)/_result(1))+(colsof(b8)/_result(1))*log(_result(1))
*
*
Model 1.9: Drop LD.gas, LD.price, D.income, LD.income
regress
gas L.gas
price
D.price
income
matrix
b9=get(_b)
scalar
AIC9=log(_result(4)/_result(1))+(colsof(b9)/_result(1))*2
scalar
SIC9=log(_result(4)/_result(1))+(colsof(b9)/_result(1))*log(_result(1))
*
*
Model 1.10: Drop LD.gas, D.price, LD.price, LD.income
regress
gas L.gas
price
income D.income
matrix
b10=get(_b)
scalar
AIC10=log(_result(4)/_result(1))+(colsof(b10)/_result(1))*2
scalar
SIC10=log(_result(4)/_result(1))+(colsof(b10)/_result(1))*log(_result(1))
*
*
Model 1.11: Drop LD.gas, D.price, LD.price, D.income, LD.income
regress
gas L.gas
price
income
matrix
b11=get(_b)
scalar
AIC11=log(_result(4)/_result(1))+(colsof(b11)/_result(1))*2
scalar
SIC11=log(_result(4)/_result(1))+(colsof(b11)/_result(1))*log(_result(1))
*
*
Model 1.12: Drop all lags and differences
regress
gas
price
income
matrix
b12=get(_b)
scalar
AIC12=log(_result(4)/_result(1))+(colsof(b12)/_result(1))*2
scalar
SIC12=log(_result(4)/_result(1))+(colsof(b12)/_result(1))*log(_result(1))
*
* List
all calculated AICs and SICs
scalar
list
clear
***********************************************************************************************
For the data set
auto2.dta, the results you will obtain are:
. *
List all calculated AICs and SICs
.
scalar list
SIC12 = -5.6360039
AIC12 = -5.7028484
SIC11 = -7.8889369
AIC11 = -7.9785176
SIC10 = -7.8959571
AIC10 = -8.007933
SIC9 = -8.113345
AIC9 = -8.2253208
SIC8 = -8.090258
AIC8 = -8.2246291
SIC7 = -8.0456909
AIC7 = -8.2032621
SIC6 = -8.0946516
AIC6 = -8.2522228
SIC5 = -8.1126393
AIC5 = -8.2702105
SIC4 = -8.056279
AIC4 = -8.2363604
SIC3 = -8.0743782
AIC3 = -8.2544596
SIC2 = -8.0882286
AIC2 = -8.2683099
SIC1 = -8.0505531
AIC1 = -8.2531446
That's it. In class I'll
also discuss the relation of AIC, SIC with the Mean Squared Error, along with
some insights on what to do when the loss function is flat (i.e., the
information criteria for different specifications are very close of each
other).
USING R:
The procedure to
calculate AIC and SIC in R is almost the same as in Stata. We need to
estimate the model and then compute them. Start by loading the libraries dyn
and stats and transforming the data:
library(dyn)
library(stats)
d.gas<-read.table("http://www.econ.uiuc.edu/~econ472/AUTO2.txt",header=T)
attach(d.gas)
gas<-ts(gas,start=1959,frequency=4)
price<-ts(price,start=1959,frequency=4)
income<-ts(income,start=1959,frequency=4)
miles<-ts(miles,start=1959,frequency=4)
Here we reproduce the
same example as given above. So, compute
f<-dyn$lm(gas~lag(gas,-1)+lag(diff(gas),-1)+price+diff(price)+lag(diff(price),-1)+
income+diff(income)+lag(diff(income),-1))
summary(f)
Call:
lm(formula =
dyn(gas ~ lag(gas, -1) + lag(diff(gas), -1) + price +
diff(price) +
lag(diff(price), -1) + income + diff(income) +
lag(diff(income), -1)))
Residuals:
Min
1Q Median
3Q
Max
-0.074054
-0.007550 0.000483 0.007749 0.045859
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
-0.092968
0.121550 -0.765 0.4459
lag(gas,
-1)
0.972190
0.025402 38.272 < 2e-16 ***
lag(diff(gas),
-1) -0.178806 0.090730 -1.971 0.0511 .
price
-0.018300
0.009621 -1.902 0.0596 .
diff(price)
-0.235935
0.037338 -6.319 4.95e-09
***
lag(diff(price),
-1) 0.058411 0.044592 1.310 0.1928
income
0.008281 0.020516 0.404 0.6872
diff(income)
0.272230
0.154974 1.757 0.0816 .
lag(diff(income),
-1) 0.044692 0.155294 0.288 0.7740
---
Signif.
codes: 0 '***' 0.001 '**' 0.01
'*' 0.05 '.' 0.1 ' ' 1
Residual
standard error: 0.01559 on 117 degrees of freedom
Multiple
R-Squared: 0.9833,
Adjusted R-squared: 0.9822
F-statistic:
863.2 on 8 and 117 DF, p-value:
< 2.2e-16
In order to calculate the
sample size, one could sum the degrees of freedom and the number of estimated
parameters, so we have the following:
sample.size<-f$df
+ length(f$coeff)
Finally:
aic<-log(sum(resid(f)^2)/sample.size)+(length(f$coeff)/sample.size)*2
aic
[1]
-8.253146
sic<-log(sum(resid(f)^2)/sample.size)+(length(f$coeff)/sample.size)*
log(sample.size)
sic
[1]
-8.050554
Now, as
described in the Stata procedure above, we just need to compute variations of
this model and the respective aic, sic, or both.
|