Welcome to a new issue of e-Tutorial. This e-TA will focus on Cubic B-Splines and Quantile Regression.1

Data

You can download the data set, called weco14.txt from the Econ 356 web site. Save it in your preferred directory.

Then you can load it in R. If you saved it in your hard drive in a folder called econ356 you can load it by typing:

weco<-read.table("C:/econ356/weco14.txt", header=T, sep="")

or you can retrieve it from the web

weco<-read.table("http://www.econ.uiuc.edu/~econ536/Data/weco14.txt", header=T, sep="")
head(weco)
      y sex dex lex  kwit job_tenure status treatment ypost
1 13.73   0  38  10 FALSE        277   TRUE      TRUE 14.35
2 17.15   1  55  11  TRUE        173   TRUE        NA    NA
3 13.63   1  45  12 FALSE        410   TRUE      TRUE 15.75
4 13.04   1  41  11 FALSE        247   TRUE     FALSE 18.33
5 13.20   1  42  10 FALSE        340   TRUE     FALSE 13.96
6 16.43   0  38  12 FALSE        517   TRUE      TRUE 18.54

Cubic B-Splines

First we begin by estimating the model proposed in question 1 of PS5

\[ y = \alpha_{0} + \alpha_{1} sex + \alpha_{2} dex + \alpha_{3} lex + \alpha_{4} lex^2 + u \]

To estimate this model first we need to create \(lex^2\)

weco$lex2<-weco$lex^2

and then we are ready to estimate the model.

lm <- lm(y~sex+dex+lex+lex2,data=weco)
summary(lm)

Call:
lm(formula = y ~ sex + dex + lex + lex2, data = weco)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9127 -0.7222  0.0338  0.7635  3.1867 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.524386   2.032106   2.719  0.00672 ** 
sex         -0.900361   0.087498 -10.290  < 2e-16 ***
dex          0.112070   0.006004  18.666  < 2e-16 ***
lex          0.821953   0.321337   2.558  0.01075 *  
lex2        -0.036049   0.012809  -2.814  0.00503 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.124 on 678 degrees of freedom
Multiple R-squared:  0.3881,    Adjusted R-squared:  0.3845 
F-statistic: 107.5 on 4 and 678 DF,  p-value: < 2.2e-16

Next we estimate a “(more) nonparametric version” using Cubic B-Splines. To do so we need first to load the splines package

library(splines)

Then we are ready to estimate a model of the form

\[ y = \alpha_{0} + \alpha_{1} sex + \alpha_{2} dex + g(lex, \alpha) + u \]

where \(g(.)\) is a spline. In R we write

spline <- lm(y ~ sex+dex+bs(lex), data=weco)
summary(spline)

Call:
lm(formula = y ~ sex + dex + bs(lex), data = weco)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9070 -0.7323  0.0376  0.7446  3.2079 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.939092   0.450427  22.066   <2e-16 ***
sex         -0.896815   0.087722 -10.223   <2e-16 ***
dex          0.112292   0.006017  18.662   <2e-16 ***
bs(lex)1     0.367001   0.961615   0.382   0.7028    
bs(lex)2     0.743030   0.767423   0.968   0.3333    
bs(lex)3    -2.290066   1.123736  -2.038   0.0419 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.124 on 677 degrees of freedom
Multiple R-squared:  0.3885,    Adjusted R-squared:  0.384 
F-statistic: 86.01 on 5 and 677 DF,  p-value: < 2.2e-16

Note that we have not specified neither the knot number, the location or the degree. However this can be easily done:

number.knots <- 4
sp <- seq(from=min(weco$lex),to=max(weco$lex),length=number.knots+2)[2:(number.knots+1)]
spline2 <- lm(y ~ sex+dex+ bs(lex, knots=sp, degree = 2), data=weco)
summary(spline2)

Call:
lm(formula = y ~ sex + dex + bs(lex, knots = sp, degree = 2), 
    data = weco)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9122 -0.7294  0.0340  0.7605  3.2231 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      10.195420   0.582408  17.506   <2e-16 ***
sex                              -0.893801   0.088065 -10.149   <2e-16 ***
dex                               0.112056   0.006036  18.564   <2e-16 ***
bs(lex, knots = sp, degree = 2)1 -0.294565   0.671539  -0.439   0.6611    
bs(lex, knots = sp, degree = 2)2  0.051583   0.528281   0.098   0.9222    
bs(lex, knots = sp, degree = 2)3 -0.080548   0.573395  -0.140   0.8883    
bs(lex, knots = sp, degree = 2)4 -0.591614   0.700762  -0.844   0.3988    
bs(lex, knots = sp, degree = 2)5 -1.787994   1.836222  -0.974   0.3305    
bs(lex, knots = sp, degree = 2)6 -2.246885   1.245786  -1.804   0.0717 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.126 on 674 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3817 
F-statistic: 53.62 on 8 and 674 DF,  p-value: < 2.2e-16

to learn more about the bs() function read carefully the help file ?bs.

You can also plot the data with the regression spline overlain:

lex_val <- seq(from=min(weco$lex), to=max(weco$lex), length=1000)
weco_m<-data.frame(lex=lex_val, sex=1, dex=mean(weco$dex))
yhat_m<-predict(spline, newdata=weco_m)
weco_w<-data.frame(lex=lex_val, sex=0, dex=mean(weco$dex))
yhat_w<-predict(spline, newdata=weco_w)

plot(weco$lex, weco$y)
lines(weco_m$lex, yhat_m, col="blue")
lines(weco_w$lex, yhat_w, col="red")

Note that we have defined new data where we are going to evaluate our estimates and used those to plot.

Smoothing Splines

Previously we used cubic B- splines to check the quadratic specification of the problem. When we did so we defined a set of knots, which produced a sequence of basis functions, which in turn we used least squares to estimate the spline coefficients. In this subsection we look at a different approach. What we are doing basically is to fit a smooth curve. So what we really want is to find a function, say \(g(x)\), that fits the observed data. A way to do so is to use smoothing splines which basically finds the function g that minimizes

\[ \sum_{i=1} ^n =(y_i − g(x_i))^{2} + λ \int g′′(t)^2 dt \]

where λ is a nonnegative tuning parameter. The function g that minimizes this equation is known as a smoothing spline. (You can read more about smoothing splines on “An Introduction to Statistical Learning” (2013) by James, Witten, Hastie and Tibshirani.)

In R the smooth.spline function is readily available. We can estimate the smoothing spline and plot it with the following line:

plot(weco$lex,weco$y)
lines(smooth.spline(weco$lex,weco$y, tol=0.0001), col= "red")

Note that the we added the tol option. The function smooth.spline uses as a default the IQR times a small number as tolerance. However, in our case the lex variable has IQR zero, which in turns if we do not specify the tol option we would get an error message.

Quantile Regression

In Question 2 of PS5 we are asked to consider a quantile regression model that relates productivity, sex, dex and lex. For example we can think on a model of the form

\[ Q_{yi}(\tau|sex,dex,lex) = \alpha_0(\tau) + \alpha_1(\tau)sex_i +\alpha_2(\tau)+\alpha_3(\tau)lex_i+\alpha_4(\tau)lex_i ^2\]

where \(Q_{yi}(\tau|sex,dex,lex)\) is the \(\tau\)th conditional quantile. To estimate this model we need first to call the quantreg package:

require(quantreg)
Loading required package: quantreg
Loading required package: SparseM

Attaching package: 'SparseM'
The following object is masked from 'package:base':

    backsolve

And then simply invoke the rq function

quantiles<- rq(y~sex+dex+lex+lex2,data=weco, tau=1:9/10)
Warning in rq.fit.br(x, y, tau = tau, ...): Solution may be nonunique
summary(quantiles)
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.1

Coefficients:
            coefficients lower bd upper bd
(Intercept)  5.65705     -4.08409 12.15291
sex         -0.65105     -0.94260 -0.25418
dex          0.09684      0.08816  0.11162
lex          0.54608     -0.38571  2.16320
lex2        -0.02067     -0.08881  0.01385

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.2

Coefficients:
            coefficients lower bd upper bd
(Intercept)  6.35258      0.13776  8.61430
sex         -0.77700     -0.98079 -0.54479
dex          0.09900      0.08457  0.12150
lex          0.57450      0.24132  1.48874
lex2        -0.02408     -0.06370 -0.01087

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.3

Coefficients:
            coefficients lower bd upper bd
(Intercept)  4.97679      3.24204  8.39947
sex         -0.74184     -1.00890 -0.57752
dex          0.10600      0.09338  0.11903
lex          0.79649      0.27750  1.04206
lex2        -0.03316     -0.04294 -0.01776

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.4

Coefficients:
            coefficients lower bd upper bd
(Intercept)  5.16129     -0.27310  7.12476
sex         -0.82495     -1.07156 -0.64884
dex          0.10935      0.09673  0.12194
lex          0.83855      0.50075  1.67757
lex2        -0.03618     -0.06308 -0.02367

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.5

Coefficients:
            coefficients lower bd upper bd
(Intercept)  5.06257      0.61262  8.88741
sex         -0.89429     -1.17147 -0.75319
dex          0.11188      0.10420  0.12419
lex          0.91001      0.36295  1.52729
lex2        -0.03988     -0.06218 -0.02524

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.6

Coefficients:
            coefficients lower bd upper bd
(Intercept)  4.83589      0.53003  8.36925
sex         -1.05875     -1.24862 -0.89690
dex          0.11375      0.10271  0.12760
lex          1.01271      0.50247  1.63053
lex2        -0.04449     -0.06798 -0.02593

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.7

Coefficients:
            coefficients lower bd upper bd
(Intercept)  5.91141      3.16652 10.38617
sex         -1.01438     -1.19695 -0.91425
dex          0.12034      0.10281  0.13338
lex          0.88468      0.17698  1.34199
lex2        -0.04166     -0.05807 -0.00579

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.8

Coefficients:
            coefficients lower bd upper bd
(Intercept)  4.98595      3.86154  8.01011
sex         -0.98375     -1.15823 -0.86810
dex          0.11875      0.10977  0.13000
lex          1.10537      0.62139  1.27378
lex2        -0.05078     -0.05748 -0.02891

Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)

tau: [1] 0.9

Coefficients:
            coefficients lower bd upper bd
(Intercept)  4.20009      2.40515  9.94748
sex         -1.19355     -1.44150 -0.93352
dex          0.13557      0.10601  0.14836
lex          1.22844     -0.03065  1.44578
lex2        -0.05716     -0.06662  0.00908

Note that we have run the regression for \(\tau=0.1, 0.2, ..., 0.9\) by simply adding the option tau=1:9/10. You can obtain the usual plot by simply calling the plot function and adjusting the margins.

plot(summary(quantiles), mar=c(0.5,2,2,2))
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique

Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique

You may have noticed some Warning messages in the output, you have nothing to worry about. Prof. Koenker explains this in a very clear way: “When computing the median from a sample with an even number of distinct values there is inherently some ambiguity about its value: any value between the middle order statistics is”a" median. Similarly, in regression settings the optimization problem solved by the “br” version of the simplex algorithm, modified to do general quantile regression identifies cases where there may be non uniqueness of this type. When there are “continuous” covariates this is quite rare, when covariates are discrete then it is relatively common, at least when tau is chosen from the rationals. For univariate quantiles R provides several methods of resolving this sort of ambiguity by interpolation, “br” doesn’t try to do this, instead returning the first vertex solution that it comes to. Should we worry about this? My answer would be no…"


  1. Please send comments to hrtdmrt2@illinois.edu or srmntbr2@illinois.edu