Welcome to a new issue of e-Tutorial. This e-TA will focus on Cubic B-Splines and Quantile Regression.1
You can download the data set, called weco14.txt from the Econ 356 web site. Save it in your preferred directory.
Then you can load it in R. If you saved it in your hard drive in a folder called econ356 you can load it by typing:
weco<-read.table("C:/econ356/weco14.txt", header=T, sep="")
or you can retrieve it from the web
weco<-read.table("http://www.econ.uiuc.edu/~econ536/Data/weco14.txt", header=T, sep="")
head(weco)
y sex dex lex kwit job_tenure status treatment ypost
1 13.73 0 38 10 FALSE 277 TRUE TRUE 14.35
2 17.15 1 55 11 TRUE 173 TRUE NA NA
3 13.63 1 45 12 FALSE 410 TRUE TRUE 15.75
4 13.04 1 41 11 FALSE 247 TRUE FALSE 18.33
5 13.20 1 42 10 FALSE 340 TRUE FALSE 13.96
6 16.43 0 38 12 FALSE 517 TRUE TRUE 18.54
First we begin by estimating the model proposed in question 1 of PS5
\[ y = \alpha_{0} + \alpha_{1} sex + \alpha_{2} dex + \alpha_{3} lex + \alpha_{4} lex^2 + u \]
To estimate this model first we need to create \(lex^2\)
weco$lex2<-weco$lex^2
and then we are ready to estimate the model.
lm <- lm(y~sex+dex+lex+lex2,data=weco)
summary(lm)
Call:
lm(formula = y ~ sex + dex + lex + lex2, data = weco)
Residuals:
Min 1Q Median 3Q Max
-3.9127 -0.7222 0.0338 0.7635 3.1867
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.524386 2.032106 2.719 0.00672 **
sex -0.900361 0.087498 -10.290 < 2e-16 ***
dex 0.112070 0.006004 18.666 < 2e-16 ***
lex 0.821953 0.321337 2.558 0.01075 *
lex2 -0.036049 0.012809 -2.814 0.00503 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.124 on 678 degrees of freedom
Multiple R-squared: 0.3881, Adjusted R-squared: 0.3845
F-statistic: 107.5 on 4 and 678 DF, p-value: < 2.2e-16
Next we estimate a “(more) nonparametric version” using Cubic B-Splines. To do so we need first to load the splines
package
library(splines)
Then we are ready to estimate a model of the form
\[ y = \alpha_{0} + \alpha_{1} sex + \alpha_{2} dex + g(lex, \alpha) + u \]
where \(g(.)\) is a spline. In R
we write
spline <- lm(y ~ sex+dex+bs(lex), data=weco)
summary(spline)
Call:
lm(formula = y ~ sex + dex + bs(lex), data = weco)
Residuals:
Min 1Q Median 3Q Max
-3.9070 -0.7323 0.0376 0.7446 3.2079
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.939092 0.450427 22.066 <2e-16 ***
sex -0.896815 0.087722 -10.223 <2e-16 ***
dex 0.112292 0.006017 18.662 <2e-16 ***
bs(lex)1 0.367001 0.961615 0.382 0.7028
bs(lex)2 0.743030 0.767423 0.968 0.3333
bs(lex)3 -2.290066 1.123736 -2.038 0.0419 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.124 on 677 degrees of freedom
Multiple R-squared: 0.3885, Adjusted R-squared: 0.384
F-statistic: 86.01 on 5 and 677 DF, p-value: < 2.2e-16
Note that we have not specified neither the knot number, the location or the degree. However this can be easily done:
number.knots <- 4
sp <- seq(from=min(weco$lex),to=max(weco$lex),length=number.knots+2)[2:(number.knots+1)]
spline2 <- lm(y ~ sex+dex+ bs(lex, knots=sp, degree = 2), data=weco)
summary(spline2)
Call:
lm(formula = y ~ sex + dex + bs(lex, knots = sp, degree = 2),
data = weco)
Residuals:
Min 1Q Median 3Q Max
-3.9122 -0.7294 0.0340 0.7605 3.2231
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.195420 0.582408 17.506 <2e-16 ***
sex -0.893801 0.088065 -10.149 <2e-16 ***
dex 0.112056 0.006036 18.564 <2e-16 ***
bs(lex, knots = sp, degree = 2)1 -0.294565 0.671539 -0.439 0.6611
bs(lex, knots = sp, degree = 2)2 0.051583 0.528281 0.098 0.9222
bs(lex, knots = sp, degree = 2)3 -0.080548 0.573395 -0.140 0.8883
bs(lex, knots = sp, degree = 2)4 -0.591614 0.700762 -0.844 0.3988
bs(lex, knots = sp, degree = 2)5 -1.787994 1.836222 -0.974 0.3305
bs(lex, knots = sp, degree = 2)6 -2.246885 1.245786 -1.804 0.0717 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.126 on 674 degrees of freedom
Multiple R-squared: 0.3889, Adjusted R-squared: 0.3817
F-statistic: 53.62 on 8 and 674 DF, p-value: < 2.2e-16
to learn more about the bs()
function read carefully the help file ?bs
.
You can also plot the data with the regression spline overlain:
lex_val <- seq(from=min(weco$lex), to=max(weco$lex), length=1000)
weco_m<-data.frame(lex=lex_val, sex=1, dex=mean(weco$dex))
yhat_m<-predict(spline, newdata=weco_m)
weco_w<-data.frame(lex=lex_val, sex=0, dex=mean(weco$dex))
yhat_w<-predict(spline, newdata=weco_w)
plot(weco$lex, weco$y)
lines(weco_m$lex, yhat_m, col="blue")
lines(weco_w$lex, yhat_w, col="red")
Note that we have defined new data where we are going to evaluate our estimates and used those to plot.
Previously we used cubic B- splines to check the quadratic specification of the problem. When we did so we defined a set of knots, which produced a sequence of basis functions, which in turn we used least squares to estimate the spline coefficients. In this subsection we look at a different approach. What we are doing basically is to fit a smooth curve. So what we really want is to find a function, say \(g(x)\), that fits the observed data. A way to do so is to use smoothing splines which basically finds the function g that minimizes
\[ \sum_{i=1} ^n =(y_i â g(x_i))^{2} + λ \int gâ²â²(t)^2 dt \]
where λ is a nonnegative tuning parameter. The function g that minimizes this equation is known as a smoothing spline. (You can read more about smoothing splines on “An Introduction to Statistical Learning” (2013) by James, Witten, Hastie and Tibshirani.)
In R the smooth.spline
function is readily available. We can estimate the smoothing spline and plot it with the following line:
plot(weco$lex,weco$y)
lines(smooth.spline(weco$lex,weco$y, tol=0.0001), col= "red")
Note that the we added the tol
option. The function smooth.spline
uses as a default the IQR
times a small number as tolerance. However, in our case the lex variable has IQR
zero, which in turns if we do not specify the tol
option we would get an error message.
In Question 2 of PS5 we are asked to consider a quantile regression model that relates productivity, sex, dex and lex. For example we can think on a model of the form
\[ Q_{yi}(\tau|sex,dex,lex) = \alpha_0(\tau) + \alpha_1(\tau)sex_i +\alpha_2(\tau)+\alpha_3(\tau)lex_i+\alpha_4(\tau)lex_i ^2\]
where \(Q_{yi}(\tau|sex,dex,lex)\) is the \(\tau\)th conditional quantile. To estimate this model we need first to call the quantreg
package:
require(quantreg)
Loading required package: quantreg
Loading required package: SparseM
Attaching package: 'SparseM'
The following object is masked from 'package:base':
backsolve
And then simply invoke the rq
function
quantiles<- rq(y~sex+dex+lex+lex2,data=weco, tau=1:9/10)
Warning in rq.fit.br(x, y, tau = tau, ...): Solution may be nonunique
summary(quantiles)
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.1
Coefficients:
coefficients lower bd upper bd
(Intercept) 5.65705 -4.08409 12.15291
sex -0.65105 -0.94260 -0.25418
dex 0.09684 0.08816 0.11162
lex 0.54608 -0.38571 2.16320
lex2 -0.02067 -0.08881 0.01385
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.2
Coefficients:
coefficients lower bd upper bd
(Intercept) 6.35258 0.13776 8.61430
sex -0.77700 -0.98079 -0.54479
dex 0.09900 0.08457 0.12150
lex 0.57450 0.24132 1.48874
lex2 -0.02408 -0.06370 -0.01087
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.3
Coefficients:
coefficients lower bd upper bd
(Intercept) 4.97679 3.24204 8.39947
sex -0.74184 -1.00890 -0.57752
dex 0.10600 0.09338 0.11903
lex 0.79649 0.27750 1.04206
lex2 -0.03316 -0.04294 -0.01776
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.4
Coefficients:
coefficients lower bd upper bd
(Intercept) 5.16129 -0.27310 7.12476
sex -0.82495 -1.07156 -0.64884
dex 0.10935 0.09673 0.12194
lex 0.83855 0.50075 1.67757
lex2 -0.03618 -0.06308 -0.02367
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.5
Coefficients:
coefficients lower bd upper bd
(Intercept) 5.06257 0.61262 8.88741
sex -0.89429 -1.17147 -0.75319
dex 0.11188 0.10420 0.12419
lex 0.91001 0.36295 1.52729
lex2 -0.03988 -0.06218 -0.02524
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.6
Coefficients:
coefficients lower bd upper bd
(Intercept) 4.83589 0.53003 8.36925
sex -1.05875 -1.24862 -0.89690
dex 0.11375 0.10271 0.12760
lex 1.01271 0.50247 1.63053
lex2 -0.04449 -0.06798 -0.02593
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.7
Coefficients:
coefficients lower bd upper bd
(Intercept) 5.91141 3.16652 10.38617
sex -1.01438 -1.19695 -0.91425
dex 0.12034 0.10281 0.13338
lex 0.88468 0.17698 1.34199
lex2 -0.04166 -0.05807 -0.00579
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.8
Coefficients:
coefficients lower bd upper bd
(Intercept) 4.98595 3.86154 8.01011
sex -0.98375 -1.15823 -0.86810
dex 0.11875 0.10977 0.13000
lex 1.10537 0.62139 1.27378
lex2 -0.05078 -0.05748 -0.02891
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.9
Coefficients:
coefficients lower bd upper bd
(Intercept) 4.20009 2.40515 9.94748
sex -1.19355 -1.44150 -0.93352
dex 0.13557 0.10601 0.14836
lex 1.22844 -0.03065 1.44578
lex2 -0.05716 -0.06662 0.00908
Note that we have run the regression for \(\tau=0.1, 0.2, ..., 0.9\) by simply adding the option tau=1:9/10
. You can obtain the usual plot by simply calling the plot
function and adjusting the margins.
plot(summary(quantiles), mar=c(0.5,2,2,2))
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique
You may have noticed some Warning messages in the output, you have nothing to worry about. Prof. Koenker explains this in a very clear way: “When computing the median from a sample with an even number of distinct values there is inherently some ambiguity about its value: any value between the middle order statistics is”a" median. Similarly, in regression settings the optimization problem solved by the “br” version of the simplex algorithm, modified to do general quantile regression identifies cases where there may be non uniqueness of this type. When there are “continuous” covariates this is quite rare, when covariates are discrete then it is relatively common, at least when tau is chosen from the rationals. For univariate quantiles R provides several methods of resolving this sort of ambiguity by interpolation, “br” doesn’t try to do this, instead returning the first vertex solution that it comes to. Should we worry about this? My answer would be no…"
Please send comments to hrtdmrt2@illinois.edu or srmntbr2@illinois.edu↩