Applied econometrics: Econ 508

## TA: Nicolas Bottan

Welcome to a new issue of e-Tutorial. This e-TA will focus on Cubic B-Splines and Quantile Regression.1

# Data

You can download the data set, called weco14.txt from the Econ 508 web site. Save it in your preferred directory.

Then you can load it in R. If you saved it in your hard drive in a folder called econ508 you can load it by typing:

weco<-read.table("C:/econ508/weco14.txt", header=T, sep="")

or you can retrieve it from the web

weco<-read.table("http://www.econ.uiuc.edu/~econ508/data/weco14.txt", header=T, sep="")
head(weco)
         y sex dex lex  kwit job_tenure status
1 13.72579   0  38  10 FALSE        277   TRUE
2 17.15373   1  55  11  TRUE        173   TRUE
3 13.62992   1  45  12 FALSE        410   TRUE
4 13.03998   1  41  11 FALSE        247   TRUE
5 13.19510   1  42  10 FALSE        340   TRUE
6 16.43010   0  38  12 FALSE        517   TRUE

# Cubic B-Splines

First we begin by estimating the model proposed in question 1 of PS5

$y = \alpha_{0} + \alpha_{1} sex + \alpha_{2} dex + \alpha_{3} lex + \alpha_{4} lex^2 + u$

To estimate this model first we need to create $$lex^2$$

weco$lex2<-weco$lex^2

and then we are ready to estimate the model.

lm <- lm(y~sex+dex+lex+lex2,data=weco)
summary(lm)

Call:
lm(formula = y ~ sex + dex + lex + lex2, data = weco)

Residuals:
Min      1Q  Median      3Q     Max
-3.9094 -0.7227  0.0357  0.7647  3.1867

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.527936   2.031914   2.721  0.00668 **
sex         -0.900641   0.087489 -10.294  < 2e-16 ***
dex          0.112088   0.006003  18.671  < 2e-16 ***
lex          0.821391   0.321307   2.556  0.01079 *
lex2        -0.036032   0.012808  -2.813  0.00505 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.124 on 678 degrees of freedom
Multiple R-squared:  0.3883,    Adjusted R-squared:  0.3847
F-statistic: 107.6 on 4 and 678 DF,  p-value: < 2.2e-16

Next we estimate a “(more) nonparametric version” using Cubic B-Splines. To do so we need first to load the splines package

library(splines)

Then we are ready to estimate a model of the form

$y = \alpha_{0} + \alpha_{1} sex + \alpha_{2} dex + g(lex, \alpha) + u$

where $$g(.)$$ is a spline. In R we write

spline <- lm(y ~ sex+dex+bs(lex), data=weco)
summary(spline)

Call:
lm(formula = y ~ sex + dex + bs(lex), data = weco)

Residuals:
Min      1Q  Median      3Q     Max
-3.9038 -0.7319  0.0386  0.7451  3.2079

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.938776   0.450386  22.067   <2e-16 ***
sex         -0.897106   0.087714 -10.228   <2e-16 ***
dex          0.112310   0.006017  18.667   <2e-16 ***
bs(lex)1     0.367654   0.961526   0.382    0.702
bs(lex)2     0.740328   0.767352   0.965    0.335
bs(lex)3    -2.289159   1.123632  -2.037    0.042 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.124 on 677 degrees of freedom
Multiple R-squared:  0.3886,    Adjusted R-squared:  0.3841
F-statistic: 86.06 on 5 and 677 DF,  p-value: < 2.2e-16

Note that we have not specified neither the knot number, the location or the degree. However this can be easily done:

number.knots <- 4
sp <- seq(from=min(weco$lex),to=max(weco$lex),length=number.knots+2)[2:(number.knots+1)]
spline2 <- lm(y ~ sex+dex+ bs(lex, knots=sp, degree = 2), data=weco)
summary(spline2)

Call:
lm(formula = y ~ sex + dex + bs(lex, knots = sp, degree = 2),
data = weco)

Residuals:
Min      1Q  Median      3Q     Max
-3.9089 -0.7290  0.0341  0.7624  3.2233

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)                      10.196385   0.582351  17.509   <2e-16 ***
sex                              -0.894068   0.088057 -10.153   <2e-16 ***
dex                               0.112072   0.006036  18.568   <2e-16 ***
bs(lex, knots = sp, degree = 2)1 -0.296163   0.671473  -0.441    0.659
bs(lex, knots = sp, degree = 2)2  0.050044   0.528230   0.095    0.925
bs(lex, knots = sp, degree = 2)3 -0.082151   0.573339  -0.143    0.886
bs(lex, knots = sp, degree = 2)4 -0.594566   0.700694  -0.849    0.396
bs(lex, knots = sp, degree = 2)5 -1.789575   1.836043  -0.975    0.330
bs(lex, knots = sp, degree = 2)6 -2.244919   1.245665  -1.802    0.072 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.126 on 674 degrees of freedom
Multiple R-squared:  0.3891,    Adjusted R-squared:  0.3818
F-statistic: 53.65 on 8 and 674 DF,  p-value: < 2.2e-16

to learn more about the bs() function read carefully the help file ?bs.

You can also plot the data with the regression spline overlain:

lex_val <- seq(from=min(weco$lex), to=max(weco$lex), length=1000)
weco_m<-data.frame(lex=lex_val, sex=1, dex=mean(weco$dex)) yhat_m<-predict(spline, newdata=weco_m) weco_w<-data.frame(lex=lex_val, sex=0, dex=mean(weco$dex))
yhat_w<-predict(spline, newdata=weco_w)

plot(weco$lex, weco$y)
lines(weco_m$lex, yhat_m, col="blue") lines(weco_w$lex, yhat_w, col="red")

Note that we have defined new data where we are going to evaluate our estimates and used those to plot.

## Smoothing Splines

Previously we used cubic B- splines to check the quadratic specification of the problem. When we did so we defined a set of knots, which produced a sequence of basis functions, which in turn we used least squares to estimate the spline coefficients. In this subsection we look at a different approach. What we are doing basically is to fit a smooth curve. So what we really want is to find a function, say $$g(x)$$, that fits the observed data. A way to do so is to use smoothing splines which basically finds the function g that minimizes

$\sum_{i=1} ^n =(y_i − g(x_i))^{2} + λ \int g′′(t)^2 dt$

where λ is a nonnegative tuning parameter. The function g that minimizes this equation is known as a smoothing spline. (You can read more about smoothing splines on “An Introduction to Statistical Learning” (2013) by James, Witten, Hastie and Tibshirani.)

In R the smooth.spline function is readily available. We can estimate the smoothing spline and plot it with the following line:

plot(weco$lex,weco$y)
lines(smooth.spline(weco$lex,weco$y, tol=0.0001), col= "red")

Note that the we added the tol option. The function smooth.spline uses as a default the IQR times a small number as tolerance. However, in our case the lex variable has IQR zero, which in turns if we do not specify the tol option we would get an error message.

# Quantile Regression

In Question 2 of PS5 we are asked to consider a quantile regression model that relates productivity, sex, dex and lex. For example we can think on a model of the form

$Q_{yi}(\tau|sex,dex,lex) = \alpha_0(\tau) + \alpha_1(\tau)sex_i +\alpha_2(\tau)+\alpha_3(\tau)lex_i+\alpha_4(\tau)lex_i ^2$

where $$Q_{yi}(\tau|sex,dex,lex)$$ is the $$\tau$$th conditional quantile. To estimate this model we need first to call the quantreg package:

require(quantreg)
Loading required package: quantreg

Attaching package: 'SparseM'

The following object is masked from 'package:base':

backsolve

And then simply invoke the rq function

quantiles<- rq(y~sex+dex+lex+lex2,data=weco, tau=1:9/10)
Warning in rq.fit.br(x, y, tau = tau, ...): Solution may be nonunique
summary(quantiles)
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.1

Coefficients:
coefficients lower bd upper bd
(Intercept)  5.62075     -4.07301 11.97694
sex         -0.64677     -0.94562 -0.25659
dex          0.09684      0.08817  0.11135
lex          0.55009     -0.35755  2.17286
lex2        -0.02078     -0.08892  0.01371

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.2

Coefficients:
coefficients lower bd upper bd
(Intercept)  6.34850      0.03011  8.68471
sex         -0.77493     -0.98150 -0.53937
dex          0.09881      0.08479  0.11955
lex          0.57540      0.23176  1.47909
lex2        -0.02409     -0.06233 -0.01049

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.3

Coefficients:
coefficients lower bd upper bd
(Intercept)  4.97274      3.06279  8.39897
sex         -0.76118     -0.99908 -0.57830
dex          0.10680      0.09359  0.12035
lex          0.79339      0.26424  1.02846
lex2        -0.03305     -0.04247 -0.01898

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.4

Coefficients:
coefficients lower bd upper bd
(Intercept)  4.92704      0.36657  7.17517
sex         -0.82228     -1.07392 -0.64663
dex          0.10970      0.09634  0.12242
lex          0.86859      0.51090  1.68904
lex2        -0.03717     -0.06287 -0.02413

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.5

Coefficients:
coefficients lower bd upper bd
(Intercept)  4.92366      0.73782  9.04266
sex         -0.89662     -1.17195 -0.75291
dex          0.11193      0.10428  0.12261
lex          0.92807      0.37136  1.52974
lex2        -0.04043     -0.06286 -0.02524

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.6

Coefficients:
coefficients lower bd upper bd
(Intercept)  4.97463      0.52162  8.32821
sex         -1.06129     -1.24953 -0.90259
dex          0.11337      0.10259  0.12758
lex          0.99550      0.49204  1.63279
lex2        -0.04390     -0.06810 -0.02558

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.7

Coefficients:
coefficients lower bd upper bd
(Intercept)  5.93979      3.18574 10.40087
sex         -1.01875     -1.19764 -0.92246
dex          0.11995      0.10213  0.13310
lex          0.88328      0.17983  1.28561
lex2        -0.04162     -0.05794 -0.00579

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.8

Coefficients:
coefficients lower bd upper bd
(Intercept)  4.93541      3.81895  7.93074
sex         -0.98152     -1.15848 -0.86910
dex          0.11849      0.10980  0.13014
lex          1.11375      0.59723  1.28604
lex2        -0.05104     -0.05793 -0.02883

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.9

Coefficients:
coefficients lower bd upper bd
(Intercept)  4.24133      2.36447  9.97680
sex         -1.19893     -1.44073 -0.93744
dex          0.13548      0.10668  0.14862
lex          1.22332     -0.02308  1.43841
lex2        -0.05696     -0.06712  0.00823

Note that we have run the regression for $$\tau=0.1, 0.2, ..., 0.9$$ by simply adding the option tau=1:9/10. You can obtain the usual plot by simply calling the plot function and adjusting the margins.

plot(summary(quantiles), mar=c(0.5,2,2,2))
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique

You may have noticed some Warning messages in the output, you have nothing to worry about. Prof. Koenker explains this in a very clear way: “When computing the median from a sample with an even number of distinct values there is inherently some ambiguity about its value: any value between the middle order statistics is”a" median. Similarly, in regression settings the optimization problem solved by the “br” version of the simplex algorithm, modified to do general quantile regression identifies cases where there may be non uniqueness of this type. When there are “continuous” covariates this is quite rare, when covariates are discrete then it is relatively common, at least when tau is chosen from the rationals. For univariate quantiles R provides several methods of resolving this sort of ambiguity by interpolation, “br” doesn’t try to do this, instead returning the first vertex solution that it comes to. Should we worry about this? My answer would be no…"