Welcome to e-Tutorial, your online help for Econ 536. The introductory material presented below is designed to enhance your understanding of the topics and your performance on the homework. This issue covers the basic operations of R needed for Box-Cox transformations and partial residual plots.1

Introduction

In problem set 1, question 1, you are asked to estimate two demand equations for bread using the data set available here (or, if you prefer, visit the data set collection at the Econ 536 web page, under the name “giffen”). You can save the data using the techniques suggested in e-Tutorial 1. As a general guideline, I suggest you script your work, and if you are bold enough I encourage you to use Sweave or knitr.

Parts (i)-(iii) of the problem set involve simple linear regression and hypothesis testing, and should be straightforward to solve once you are familiar with R. For every hypothesis test, make clear what the null and alternative hypotheses are. Please also provide a simple table with the main estimation results. I recommend that you ALWAYS include standard errors for ALL parameters you estimate. Remember the first rule of empirical paper writing: “All good estimates deserve a standard error”. Another useful piece of advice is to summarize your main conclusions. You are strongly encouraged to structure your problem set as a paper. Finally, graphs are very welcome as long as you label them and refer to them in your comments. Don't include any remaining material (e.g., raw software output or preliminary regressions) in your report.

Partial Residual Plot

Question 1, part (iv) requires you to compare the plots of the Engel curves for bread in the “short” and “long” versions of the model, using a partial residual plot for the latter. As mentioned in Professor Koenker's Lecture 2, “the partial residual plot is a device for representing the final step of a multivariate regression result as a bivariate scatterplot.” Here is how you do that:

Theorem (Gauss-Frisch-Waugh)

Recall the results of the Gauss-Frisch-Waugh theorem in Professor Koenker's Lecture Note 2 (pages 8-9). Here you will see that you can obtain the same coefficient (and, up to a small degrees-of-freedom adjustment, the same standard error) for a given covariate by using a partial residual regression. I will show the result using the gasoline demand data available here. In this data set, Y is the log of per capita gas consumption, P is the log gas price, and Z is the log of per capita income. The variables are already in logarithmic form, so we are actually estimating log-linear models.
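In matrix notation (a compact restatement of the result, with the intercept absorbed into the \(Z\) block), write the full model as \(y = P\gamma + Z\delta + u\) and let \(M_Z = I - Z(Z'Z)^{-1}Z'\) denote the projection onto the orthogonal complement of the column space of \(Z\). The theorem states that the full-model coefficient on \(P\) satisfies

\[ \hat{\gamma} = (P' M_Z P)^{-1} P' M_Z y , \]

that is, it equals the slope from regressing the residuals \(M_Z y\) on the residuals \(M_Z P\), and the residuals of the two regressions coincide. This is exactly the calculation carried out step by step below.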

First, run the full model and note the coefficient and standard error on P.

    gas<-read.table("http://www.econ.uiuc.edu/~econ536/Data/gasnew.txt",header=T)
    head(gas)
     year     ln.p.    ln.z.     ln.y.
1 1947.00 -1.868643 1.772658 -1.494745
2 1947.25 -1.824390 1.744589 -1.499083
3 1947.50 -1.794202 1.751633 -1.492583
4 1947.75 -1.752391 1.738420 -1.517380
5 1948.00 -1.694881 1.749078 -1.537850
6 1948.25 -1.679898 1.754818 -1.531699
    y<-gas$ln.y.
    p<-gas$ln.p.
    z<-gas$ln.z.

     
    #Full Model
    FM<-lm(y~p+z)
    summary(FM)

Call:
lm(formula = y ~ p + z)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.217743 -0.042519  0.006042  0.061758  0.170490 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.26650    0.11217  -46.95   <2e-16 ***
p           -0.40753    0.01964  -20.75   <2e-16 ***
z            1.75935    0.04194   41.95   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07936 on 198 degrees of freedom
Multiple R-squared:  0.9442,    Adjusted R-squared:  0.9436 
F-statistic:  1674 on 2 and 198 DF,  p-value: < 2.2e-16

Then you run a shorter version of the model, excluding P (call it My). After that you run another short model, this time regressing the omitted variable P on the same covariates as in the previous model (call it Mx).

    #My
    My<-lm(y~z)
    summary(My)

Call:
lm(formula = y ~ z)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43145 -0.03959  0.02499  0.06497  0.28415 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.10639    0.07415  -41.89   <2e-16 ***
z            0.97359    0.03203   30.40   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.141 on 199 degrees of freedom
Multiple R-squared:  0.8228,    Adjusted R-squared:  0.8219 
F-statistic:   924 on 1 and 199 DF,  p-value: < 2.2e-16

    #Mx
    Mx<-lm(p~z)
    summary(Mx)

Call:
lm(formula = p ~ z)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.62496 -0.17625  0.06038  0.16331  0.61074 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.30048    0.15058  -35.20   <2e-16 ***
z            1.92810    0.06504   29.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2864 on 199 degrees of freedom
Multiple R-squared:  0.8154,    Adjusted R-squared:  0.8144 
F-statistic: 878.7 on 1 and 199 DF,  p-value: < 2.2e-16

Finally, you regress the residuals of the model My on the residuals of the model Mx:

    pr<-lm(resid(My)~resid(Mx))
    summary(pr)

Call:
lm(formula = resid(My) ~ resid(Mx))

Residuals:
      Min        1Q    Median        3Q       Max 
-0.217743 -0.042519  0.006042  0.061758  0.170490 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.546e-17  5.583e-03     0.0        1    
resid(Mx)   -4.075e-01  1.959e-02   -20.8   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07916 on 199 degrees of freedom
Multiple R-squared:  0.6849,    Adjusted R-squared:  0.6833 
F-statistic: 432.6 on 1 and 199 DF,  p-value: < 2.2e-16
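As a quick check of the theorem, note that the slope on resid(Mx) reproduces the coefficient on p from the full model exactly, while the reported standard error differs only through a degrees-of-freedom adjustment (the auxiliary regression does not account for the degrees of freedom already used by z). A minimal sketch, assuming the FM and pr objects fitted above:

    # slope from the partial residual regression vs. coefficient on p in the full model
    c(partial = unname(coef(pr)[2]), full = unname(coef(FM)["p"]))
    # rescaling the full-model standard error by the df ratio recovers the one reported above
    sqrt(df.residual(FM)/df.residual(pr)) * coef(summary(FM))["p", "Std. Error"]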

Next we can plot those residuals and insert a fitting line:

    plot(resid(Mx),resid(My),main="Partial Residuals",xlab="Gasoline Price",ylab="Per Capita Gas Consumption")
    abline(pr)
    abline(v=0,lty=2)
    abline(h=0,lty=2)
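If you have the car package installed, its avPlot function produces this residual-on-residual (“added-variable”) plot in a single call. This is an optional shortcut, not something the problem set requires:

    # added-variable plot for p via the car package (assumes car is installed)
    library(car)
    avPlot(FM, variable = "p")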

Box-Cox Transformation

For question 2, parts (a)-(d) are also straightforward. You are expected to calculate the estimates in both linear and log-linear form. In addition, you are expected to run a Box-Cox version of the model and interpret it. Here I will give you some help, using the same gasoline demand data as above.

Just for a minute, suppose somebody told you that a nice gasoline demand equation should also include two additional covariates: the squared log price of gas, and the product of log price and log income. You can obtain those variables as follows:

    p2<-p^2
    pz<-p*z

Next, run this extended model in the traditional log-linear form (remember that all variables are already in logs). The easiest way to do that is as follows:

    summary(lm(y~p+z+p2+pz))

Call:
lm(formula = y ~ p + z + p2 + pz)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.18545 -0.03945  0.00715  0.03976  0.13905 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.39147    0.20321 -36.373  < 2e-16 ***
p           -2.57368    0.18581 -13.851  < 2e-16 ***
z            2.57118    0.07764  33.115  < 2e-16 ***
p2          -0.27127    0.03754  -7.227 1.08e-11 ***
pz           0.71702    0.05931  12.090  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06005 on 196 degrees of freedom
Multiple R-squared:  0.9684,    Adjusted R-squared:  0.9677 
F-statistic:  1499 on 4 and 196 DF,  p-value: < 2.2e-16

The log-linear form seems to be a reasonable specification for gasoline demand.
Next, suppose you are so confident in this model that you write a paper on the topic and send it to a journal. Two weeks later you receive a letter from a referee saying she is suspicious of your log-linear equation. She asks you to re-estimate the same model but with the dependent variable entering linearly (i.e., without the log), keeping the rest of the equation as before, and to revise and resubmit the paper with your new findings.

In search of evidence supporting your original model, you carry out the following experiment:

  1. Run the model suggested by the referee, using a Box-Cox transformation to find the MLE of \(\lambda\).
  2. Plot the concentrated log-likelihood function.
  3. Reestimate the model conditional on the MLE of \(\lambda\).

    library(MASS)
    y<-exp(y)   # undo the log: the referee wants the dependent variable in levels
    g<-boxcox(y~p+z+p2+pz, lambda=c(-1,-0.5,-0.25,0,0.25,0.5,1))

Note first that we loaded the MASS package, which contains the boxcox function for Box-Cox transformations in linear models. The boxcox function plots values of \(\lambda\) against the log-likelihood of the resulting model. Our objective is to maximize the log-likelihood, and the function marks the optimum with a vertical dotted line. As you can see, the MLE of \(\lambda\) is very close to zero. The curve is drawn with spline interpolation to help you visualize the log-likelihood function and the MLE of \(\lambda\), and the horizontal line marks the cutoff that defines a 95% confidence interval for \(\lambda\).
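You can also read the approximate 95% confidence interval off the object returned by boxcox, using the usual likelihood-ratio cutoff of \(\chi^2_1(0.95)/2\). A small sketch, assuming the g object created above:

    # lambda values whose log-likelihood lies within the 95% LR cutoff of the maximum
    range(g$x[g$y > max(g$y) - qchisq(0.95, 1)/2])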

To find the value of \(\lambda\) that maximizes the log-likelihood, we use the which.max function:

    lambda <- g$x[which.max(g$y)]
    lambda
[1] -0.05050505

As you can see it is indeed very close to zero. Next we can apply the power transform to y and then fit the revised model:

    y <- ((y^lambda)-1)/lambda   # Box-Cox transform of y at the estimated lambda
    summary(lm(y~p+z+p2+pz))

Call:
lm(formula = y ~ p + z + p2 + pz)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.193147 -0.042213  0.008169  0.041133  0.147185 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.63514    0.21227 -35.969  < 2e-16 ***
p           -2.62521    0.19409 -13.526  < 2e-16 ***
z            2.65928    0.08110  32.788  < 2e-16 ***
p2          -0.28021    0.03921  -7.146 1.72e-11 ***
pz           0.72540    0.06195  11.709  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06272 on 196 degrees of freedom
Multiple R-squared:  0.9685,    Adjusted R-squared:  0.9679 
F-statistic:  1507 on 4 and 196 DF,  p-value: < 2.2e-16

The latter regression (conditional on the MLE of \(\lambda\)) gives results close to those of your log-linear specification. Now you have reasonable support to write back to the referee and defend your original model.

Box-Cox Transformation for Dependent and Independent Variables

Let us now illustrate the Box-Cox transformation of both sides of the equation. For simplicity of exposition, assume that we want to estimate the model

\[ y = x\beta +u \]

where \(u\) is a Gaussian error, and \(x\) and \(y\) are column vectors. If \(bc(\cdot,\lambda)\) denotes the Box-Cox transformation, we would like to estimate the \(\lambda\) that maximizes the likelihood, conditional on \(x\), under the assumption that the transformation re-establishes the classical assumptions on the error term of the model

\[ bc(y,\lambda) = bc(x,\lambda) \alpha +v; \]

That is, we assume that the transformation ensures \(v \sim N (0,\sigma^2I_n)\).

Let \(\tilde{y}\) denote the geometric mean of the elements of the vector \(y\). If we further rescale the left-hand side as \(\check{y}=y/\tilde{y}\) (the “kill the Jacobian” step in the code below), the log-likelihood we need is, up to a constant, just the Gaussian log-likelihood from regressing \(bc(\check{y},\lambda)\) on \(bc(x,\lambda)\). For given \(x\) and \(y\), this log-likelihood depends only on \(\lambda\).
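To see why the rescaling works, recall the concentrated Box-Cox log-likelihood (a sketch of the standard derivation in the notation above, with \(n\) the sample size and \(\hat{\sigma}^2(\lambda)\) the ML estimate of the error variance from the transformed regression):

\[ \ell(\lambda) = -\frac{n}{2}\log \hat{\sigma}^2(\lambda) + (\lambda - 1)\sum_{i=1}^{n}\log y_i + \text{const}. \]

The second term is the Jacobian of the transformation of \(y\). Replacing \(y\) by \(\check{y}=y/\tilde{y}\) makes \(\sum_i \log \check{y}_i = 0\), so the Jacobian term drops out and \(\ell(\lambda)\) agrees, up to a constant not involving \(\lambda\), with the Gaussian log-likelihood of the regression of \(bc(\check{y},\lambda)\) on \(bc(x,\lambda)\) (provided the regression includes an intercept, as lm does by default). That is exactly what logLik(lm(...)) computes in the logL function below.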

  set.seed(101) #to replicate the same results at home
  x <- exp(rnorm(50))
  y <- exp(1+.5*log(x) +rnorm(50)/3)
  y <- y/exp(mean(log(y))) # Kill the Jacobian

  # Box-Cox transformation: reverts to log(x) in the limit lambda -> 0
  bc <- function(lam, x, eps = 0.0001)
    if(abs(lam) < eps) log(x) else (x^lam - 1)/lam

  # concentrated log-likelihood as a function of lambda
  # (vectorized over lam so that curve() and optimize() can use it)
  logL <- function(lam){
    z <- lam
    for(i in 1:length(lam)){
      z[i] <- logLik(lm(bc(lam[i], y) ~ bc(lam[i], x)))
    }
    z
  }

Given \(x\) and \(y\), the log-likelihood is a function of \(\lambda\) only. We can plot the function as follows:

  # plot likelihood
  curve(logL, -2, 2,xlab = "lambda", ylab = "LogL(lambda)")

Finally, we can find the optimal \(\lambda\):

  # optimize
  f <- optimize(logL, c(-2,2), maximum = TRUE)
  (lambda<-f$maximum)
[1] -0.3686444
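If you want to see where this maximizer sits on the log-likelihood curve you just drew, you can mark it with a vertical line (a small optional addition, assuming the plot from curve() is still the active device):

  # mark the MLE of lambda on the log-likelihood plot
  abline(v = f$maximum, lty = 2)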

Andrews Test

Finally, for the Econ 536 problem set 1, question 2, you are also required to perform the David Andrews (1971) test. As an example, I use the same gasoline data as above and follow the routine in Professor Koenker's Lecture Note 2:

  1. Run the linear model and get the predicted values of y (call this variable \(\hat y\)):
    gas<-read.table("http://www.econ.uiuc.edu/~econ536/Data/gasnew.txt",header=T)
    # convert the logged variables back to levels for the linear specification
    y<-exp(gas$ln.y.)
    p<-exp(gas$ln.p.)
    z<-exp(gas$ln.z.)
    h<-lm(y~p+z)
    summary(h)

Call:
lm(formula = y ~ p + z)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.101017 -0.025600  0.004442  0.024954  0.082228 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.153673   0.011228  -13.69   <2e-16 ***
p           -0.337591   0.014498  -23.29   <2e-16 ***
z            0.074107   0.001629   45.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03393 on 198 degrees of freedom
Multiple R-squared:  0.9421,    Adjusted R-squared:  0.9415 
F-statistic:  1610 on 2 and 198 DF,  p-value: < 2.2e-16
    yhat<-fitted(h)

  2. Reestimate the augmented model and test \(\gamma = 0\):
    Ly<-log(yhat)
    yLy<-yhat*Ly       # augmenting regressor: yhat*log(yhat)
    v<-lm(y~p+z+yLy)   # gamma is the coefficient on yLy
    summary(v)

Call:
lm(formula = y ~ p + z + yLy)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.102448 -0.018379  0.005652  0.017794  0.063700 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.284515   0.046057   6.177 3.66e-09 ***
p           -0.295844   0.012702 -23.291  < 2e-16 ***
z            0.065162   0.001629  40.012  < 2e-16 ***
yLy          1.082626   0.111471   9.712  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02797 on 197 degrees of freedom
Multiple R-squared:  0.9608,    Adjusted R-squared:  0.9602 
F-statistic:  1611 on 3 and 197 DF,  p-value: < 2.2e-16
    anova(h,v)
Analysis of Variance Table

Model 1: y ~ p + z
Model 2: y ~ p + z + yLy
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    198 0.22795                                  
2    197 0.15415  1  0.073807 94.326 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the test above we can reject the null hypothesis that \(\gamma = 0\). Can you interpret what this means?
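Since only one restriction is being tested, the F statistic in the anova table is simply the square of the t statistic on yLy in the augmented regression, so either output delivers the same test. A quick consistency check, assuming the h and v objects above:

    # F statistic for the single restriction equals the squared t statistic on yLy
    coef(summary(v))["yLy", "t value"]^2
    anova(h, v)$F[2]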


  1. Please send comments to hrtdmrt2@illinois.edu or srmntbr2@illinois.edu