\documentclass[a4paper]{amsart}
\usepackage{fancyvrb}
\usepackage{harvard}
\usepackage{url}
%% markup commands for code/software
\let\code=\texttt
\let\pkg=\textbf
\let\proglang=\textsf
\SweaveOpts{keep.source=TRUE}
\title{Some Exercises on Quantile Regression}
\thanks{Version: \today . These exercises were originally developed for a
short course given under the auspices of CEMMAP at UCL, 20-22 February, 2003.
My thanks to Andrew Chesher and the Department of Economics at UCL for their
hospitality on that occasion, and as always to the NSF for continuing research
support. The exercises have been expanded somewhat for new short courses under
the auspices of CREATES in Aarhus, 21-23 June, 2010, and at the LSE, 16-17 May,
2011.}
\author{Roger Koenker}
\begin{document}
\bibliographystyle{econometrica}
\maketitle
\pagestyle{myheadings}
\markboth{\sc Exercises in Quantile Regression}{\sc Roger Koenker}

\section*{Introduction}

These exercises are intended to provide an introduction to quantile
regression computing and to illustrate some econometric applications of
quantile regression methods. For purposes of the course my intention would be
to encourage all students to do the first exercise, which gives an overview of
the quantile regression software in R in the context of an elementary
bivariate Engel curve example. The remaining exercises are more open ended.
I would like students to choose one of these exercises according to their own
special interests. Given the brief duration of the course, it is obviously
unrealistic to expect answers to these questions at the end of the course, but
I would be happy to get responses via email should you choose to continue
working on them after the course is finished.

<<echo=FALSE>>=
options(continue=" ")
options(width=60)
@

\subsection*{A Word on Software}

There is now some quantile regression functionality in most statistical
software systems. Not surprisingly, I have a strong preference for the
implementation provided by the \pkg{quantreg} package of R, since I've devoted
a considerable amount of effort to writing it. R is a dialect of John
Chambers's S language and provides a very general, very elegant environment
for data analysis and statistical research. It is fair to say that R is now
the vehicle of choice within the statistical computing community. It remains
to be seen whether it can make serious inroads into econometrics, but I firmly
believe that it is a worthwhile investment for the younger cohorts of
econometricians. R is public domain software and can be freely downloaded from
the CRAN website. There is also extensive documentation available from CRAN
under the heading ``Manuals''. For unix-based systems it is usual to download
R in source form, but it is also available in binary form for most common
operating systems. There are several excellent introductions to R available in
published form, in addition to the {\it Introduction to R} available in pdf
from the CRAN website. I would particularly recommend \citeasnoun{Dalgaard} and
\citeasnoun{VR.mass}. On the CRAN website there are also, under the heading
``Contributed'', introductions to R in Danish, French, German, Spanish,
Italian, and a variety of other languages, all of which can be freely
downloaded in pdf. I've prepared a brief R FAQ that I will distribute with the
course materials for the LSE short course. For purposes of this course a
minimal knowledge of R will suffice. R can be freely downloaded, and I hope
that most students will bring a laptop so that they have access to R during
the course sessions.
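You will also need the \pkg{quantreg} package, which is used throughout these
exercises; once R is running (see below) it can be installed from CRAN and
then loaded in each session. A minimal sketch:

<<eval=FALSE>>=
install.packages("quantreg")  # fetch and install from CRAN (needed only once)
library(quantreg)             # attach the package in each new session
@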
Clicking the R icon should produce a window in which R will be running. To
quit R, you just type \code{q()}; you will be prompted to answer whether you
want to save the objects that were created during the session: responding
``yes'' will save the session objects into a file called \code{.RData},
responding ``no'' will simply quit without saving. Online help is provided in
two modes: if you know what you are looking for, you can type, for example,
\code{?rq} and you will get a description of the \code{rq} command;
alternatively you can type \code{help.start()} and a browser help window
should pop up in which you can type more general key words or phrases to
search for functionality. R is intended to be a convenient interactive
language and you can do many things on the fly by just typing commands into
the R console, or even by pointing and clicking at one of the GUI interfaces,
but I find that it is often preferable to save R commands into a file and
execute the whole group of commands together -- this encourages a more
reproducible style of research -- which can be done easily using the
\code{source("commands.R")} command. Saving output is a bit more complicated
since there are many forms of output: graphics are usually saved in either
postscript or pdf form, and tables can be saved in latex format for subsequent
inclusion in documents. Together with Achim Zeileis, U. of Innsbruck, I've
written a paper in {\it J. of Applied Econometrics} on reproducible research
strategies that describes some of these things in more detail. The paper and
some other ranting and raving about reproducibility are also available from my
homepage by clicking first on ``papers'' and then on ``Reproducible
Econometric Research.''

An aspect of reproducibility that is rarely considered in econometrics is the
notion of ``literate programming.'' The idea of literate programming was first
broached by Donald Knuth in 1984; Knuth essentially advocated merging code and
documentation for code in such a way that the code was self-documenting and
the exposition was reproducible, since the code that generated the reported
computations was embedded in the document itself. In the R language this
viewpoint has been implemented in Leisch's Sweave, which can be considered a
dialect of latex that allows the user to embed R chunks that are executed
prior to latex compilation. The document that you are now reading was written
in Sweave and can be viewed in its original form in the course problems
directory as the file \code{ex.Rnw}.

\section*{Problem 1: A Family of Engel Curves}

This is a simple bivariate linear quantile regression exercise designed to
explore some basic features of the \pkg{quantreg} software in R. The data
consists of observations on household food expenditure and household income of
235 working-class Belgian families taken from the well-known study of Ernst
Engel (1857).

1. Read the data. The data can be downloaded from the website specified in
class. You will see that it has a conventional ascii format with a header line
indicating the variable names, and 235 lines of data, one per household. This
can be read in R by the commands

<<>>=
url <- "http://www.econ.uiuc.edu/~roger/courses/LSE/data/engel.data"
d <- read.table(file = url, header=TRUE) # data is now in the data frame d
attach(d) # attaching makes the variables accessible by name
@

2. Plot the data. After the attach command the data is available using the
names in the header, so we can plot the scatter diagram as:

\begin{center}
<<fig=TRUE>>=
plot(x,y)
@
\end{center}
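If you want to keep a copy of a plot like this one, recall the earlier remark
about saving graphics in pdf form. As a minimal sketch (the file name here is
just a hypothetical choice):

<<eval=FALSE>>=
pdf("engel-scatter.pdf", width = 6, height = 5) # open a pdf graphics device
plot(x, y, xlab = "Household Income", ylab = "Food Expenditure")
dev.off()                                       # close the device and write the file
@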
3. Replot with some better axis labels and superimpose some quantile
regression lines on the scatter plot.

\begin{center}
<<fig=TRUE>>=
require(quantreg)
plot(x,y,cex=.25,type="n",xlab="Household Income", ylab="Food Expenditure")
points(x,y,cex=.5,col="blue")
abline(rq(y~x,tau=.5),col="blue")
abline(lm(y~x),lty=2,col="red") #the dreaded ols line
taus <- c(.05,.1,.25,.75,.90,.95)
f <- rq(y ~ x, tau = taus)
for( i in 1:length(taus)){
        abline(coef(f)[,i],col="gray")
}
@
\end{center}

Note that you have to load the \pkg{quantreg} package before invoking the
\code{rq()} command. Careful inspection of the plot reveals that the ols fit
is severely biased at low incomes due to a few outliers. The plot command has
a lot of options to fine tune the plot. There is a convenient looping
structure, but beware that it can be slow in some applications. In \code{rq()}
there are also many options: the first argument is a ``formula'' that
specifies the model that is desired. In this case we want to fit the simple
bivariate linear model, so it is just \code{y\~{}x}; if we had two covariates
we could say, e.g., \code{y\~{}x+z}.

4. If we wanted to see all the distinct quantile regression solutions for this
example we could specify a \code{tau} outside the range $[0,1]$, e.g.

<<>>=
z <- rq(y~x,tau=-1)
@

Now if you look at the components of the structure \code{z} that are returned
by the command, you can see for example the primal solution in \code{z\$sol},
and the dual solution in \code{z\$dsol}. In interactive mode just typing the
name of some R object causes the program to print the object in some more or
less easily intelligible manner. Now, if you want to estimate the conditional
quantile function of \code{y} at a specific value of \code{x} and plot it you
can do something like this:

\begin{center}
<<fig=TRUE>>=
# Poor is defined as at the .1 quantile of the sample distn
x.poor <- quantile(x,.1)
# Rich is defined as at the .9 quantile of the sample distn
x.rich <- quantile(x,.9)
ps <- z$sol[1,]
qs.poor <- c(c(1,x.poor)%*%z$sol[4:5,])
qs.rich <- c(c(1,x.rich)%*%z$sol[4:5,])
# now plot the two quantile functions to compare
plot(c(ps,ps),c(qs.poor,qs.rich),type="n",
        xlab=expression(tau),ylab="quantile")
plot(stepfun(ps,c(qs.poor[1],qs.poor)),do.points=FALSE,add=TRUE)
plot(stepfun(ps,c(qs.rich[1],qs.rich)),do.points=FALSE,add=TRUE)
# for conditional densities you could use akj()...
@
\end{center}

A nice feature of R is that the documentation of functions usually includes
some examples of their usage. These examples can be ``run'' by simply typing
\code{example(SomeFunctionName)}, so for example when you type
\code{example(rq)} you get a plot somewhat like the one you have just done
``by hand.'' In a second plot you get a pair of coefficient plots that depict
the estimated intercept and slope coefficients as a function of $\tau$ and
provide a confidence band. More on this later. If you look carefully at the
code being executed in these examples you will see that you didn't need to
download the data from the url specified; the Engel data is available directly
from the \pkg{quantreg} package using the statement \code{data(engel)}. But it
is often handy to be able to download data from the web. There are quite a lot
of tools for handling web data sources, but this is another story entirely. If
you look carefully at the plots of the two estimated quantile functions that
you made you will see minor violations of the expected monotonicity of these
functions. This may or may not be regarded as a mortal sin, depending on your
religious convictions.
One way to deal with this, recently suggested by \citeasnoun{CFVG}, is to
``rearrange'' the estimated functions. See the documentation for this function
using the usual strategy: \code{?rearrange}. To see an example of how this
works try typing: \code{example(rearrange)}.

5. Now let's consider some formal testing. For starters suppose we just
estimate two quartile fits and look at the default output:

<<>>=
fit.25 <- rq(y~x,tau=.25)
summary(fit.25)
fit.75 <- rq(y~x,tau=.75)
summary(fit.75)
@

By default the confidence intervals that are produced use the rank inversion
method. This is fine for judging whether covariates are significant at
particular quantiles, but suppose that we wanted to test that the slopes were
the same at the two quantiles? This is done with the \code{anova} command as
follows:

<<>>=
anova(fit.25,fit.75)
@

This is an example of a general class of tests proposed in
\citeasnoun{KB.82b}. It is instructive to look at the code for the command
\code{anova.rq} to see how this test is carried out. The Wald approach is used
and the asymptotic covariance matrix is estimated using the approach of
\citeasnoun{HK.92}. It also illustrates a general syntax for testing in R
adapted to the QR situation. If you have two models that are nested, with fits
say \code{f0} and \code{f1}, then \code{anova(f0,f1)} should test whether the
restricted model is correct. One needs to be careful, however, to check that
the hypothesis that is intended is really the one that the \code{anova}
command understands; see \code{?anova.rq} for further details on the QR
version of this. If you have more than two quantiles and want to do a joint
test that all the slope coefficients are the same at all the quantiles you can
use \code{anova(ft1,ft2,ft3,ft4)}.

In very large problems the rank inversion approach to confidence intervals is
quite slow, and it is better to use another method. There are several choices.
By default the computational method employs a variant of the Barrodale and
Roberts (simplex-like) algorithm; for problems with sample size greater than
about 5000 it is preferable to use interior point methods by specifying the
\code{method="fn"} flag in the call to \code{rq}. When this ``Frisch-Newton''
version of the algorithm is used, rank test confidence intervals are not
provided by \code{summary}; instead a form of the Wald approach is used.
Various options can be specified to produce various estimates of the standard
errors as described below. These Wald forms of estimating standard errors are
also available with the default \code{method="br"} setting by adding, for
example, the flag \code{se="nid"}. Details of the algorithms are provided in
\citeasnoun{KO.87} and \citeasnoun{KO.93} for the ``BR'' method, and
\citeasnoun{PK.1997} for the ``Frisch-Newton'' method.

Standard inference results are obtained by calling summary, e.g.

<<>>=
fit <- rq(y~x,tau=.27,method="fn")
summary(fit)
@

By default summary produces estimates of the asymptotic covariance matrix
based on the approach described in \citeasnoun{HK.92}; an alternative approach
suggested by \citeasnoun{Pow.89} can be obtained by specifying
\code{se="ker"}. There are further details and options regarding bandwidth and
controlling the nature of what is returned by the summary command; see
\code{?summary.rq} for these details. At this point it would be useful to
compare and contrast the various estimation and inference options that are
available.
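For instance, here is a minimal sketch of such a comparison, reusing the
attached Engel variables \code{x} and \code{y}; the particular \code{se}
choices shown are merely illustrative:

<<eval=FALSE>>=
f.br <- rq(y ~ x, tau = .5, method = "br") # simplex-like (Barrodale-Roberts)
f.fn <- rq(y ~ x, tau = .5, method = "fn") # Frisch-Newton interior point
summary(f.br)              # default: rank inversion confidence intervals
summary(f.br, se = "nid")  # Wald form, sandwich estimate (cf. HK.92 above)
summary(f.fn, se = "ker")  # Wald form, Powell kernel estimate
@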
Try estimating the simple model used above with both the \code{method = "br"}
and \code{method = "fn"} choices (as in the sketch above), and then compare
some of the \code{se} options in \code{summary.rq}.

6. The magic of logarithms. Thus far we have considered Engel functions that
are linear in form, and the scatter as well as the QR testing has revealed a
strong tendency for the dispersion of food expenditure to increase with
household income. This is a particularly common form of heteroscedasticity.
However, if one looks more carefully at the fitting, one sees interesting
departures from symmetry that would be unlikely to be revealed by the typical
textbook testing for heteroscedasticity. One common remedy for symptoms like
this would be to reformulate the model in log linear terms. It is interesting
to compare what happens after the log transformation with what we have already
seen. Consider the following plot:

\begin{center}
<<fig=TRUE>>=
plot(x,y,log="xy",xlab="Household Income", ylab="Food Expenditure")
taus <- c(.05,.1,.25,.75,.90,.95)
abline(rq(log10(y)~log10(x),tau=.5),col="blue")
for( i in 1:length(taus)){
        abline(rq(log10(y)~log10(x),tau=taus[i]),col="gray")
}
@
\end{center}

Note that the flag \code{log="xy"} produces a plot with log-log axes, and for
convenience of axis labeling these logarithms are base 10, so the subsequent
fitting is also specified as base 10 logs for plotting purposes, even though
base 10 logarithms are {\it unnatural} and would never be used in reporting
numerical results. This looks much more like a classical iid error regression
model, although again some departure from symmetry is visible. An interesting
exercise is to conduct some formal testing for departures from the iid
assumption of the type already considered above. This is left as an exercise
for the reader.

\section*{Problem 2: Nonparametric Quantile Regression}

Nonparametric quantile regression is most easily considered within a locally
polynomial framework. Locally linear fitting can be carried out by the
following function, provided in the \pkg{quantreg} package:

<<>>=
"lprq" <- function(x, y, h, m = 50, tau = .5)
{
        xx <- seq(min(x), max(x), length = m)
        fv <- xx
        der <- xx
        for(i in 1:length(xx)) {
                z <- x - xx[i]
                wx <- dnorm(z/h)
                r <- rq(y ~ z, weights = wx, tau = tau)
                fv[i] <- r$coef[1]
                der[i] <- r$coef[2]
        }
        list(xx = xx, fv = fv, der = der)
}
@

If you read through the function carefully you will see that it is just a
matter of computing a quantile regression fit at each of $m$ equally spaced
$x$-values over the support of the observed $x$ points. The function value
estimates are returned as \code{fv} and the first derivative estimates at the
$m$ points are returned as \code{der}. As usual you can specify $\tau$, but
now you also need to specify a bandwidth \code{h}.

1. Begin by exploring the effect of the \code{h} and \code{tau} arguments for
fitting the motorcycle data (a starting sketch is given after item 2 below).
Note that fitting derivatives requires a larger bandwidth and a larger sample
size to achieve the same precision obtainable by function fitting. You are
encouraged to substitute a more economic data set for the ubiquitous
motorcycle data; its only advantage in the current context is that you can
easily find examples to compare in the nonparametric regression literature.

2. Adapt \code{lprq} so that it does locally quadratic rather than linear
fitting and compare performance.
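As a starting point for item 1, here is a minimal sketch; it assumes the
course motorcycle data file (the same file used in item 3 below) with columns
\code{x} and \code{y}, and the bandwidths tried are arbitrary:

<<eval=FALSE>>=
url <- "http://www.econ.uiuc.edu/~roger/courses/LSE/data/motorcycle.data"
mcyc <- read.table(file = url, header = TRUE)
plot(mcyc$x, mcyc$y, cex = .5, xlab = "x", ylab = "y")
for(h in c(1, 2, 4)) {  # a few arbitrary bandwidth choices
        fit <- lprq(mcyc$x, mcyc$y, h = h, tau = .5)
        lines(fit$xx, fit$fv)
}
@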
3. Another general strategy for nonparametric quantile regression that is
relatively simple to adapt to R uses regression splines. The function
\code{bs()} in the package \pkg{splines} gives a very flexible way to
construct B-spline basis expansions. For example you can fit a model like
this:

<<>>=
require(splines)
url <- "http://www.econ.uiuc.edu/~roger/courses/LSE/data/motorcycle.data"
d <- read.table(file = url, header = TRUE)
fit <- rq(y ~ bs(x, df = 5), tau = .33, data = d)
@

which fits a piecewise cubic polynomial with knots (breakpoints in the third
derivative) at quintiles of the $x$'s. You can also explicitly specify the
knot sequence and the order of the spline. One advantage of this approach is
that it is very easy to add a partially linear model component. So if there is
another covariate, say \code{z}, we can add a parametric component like this:

<<eval=FALSE>>=
fit <- rq(y ~ bs(x, df = 5) + z, tau = .33)
@

This avoids complications of backfitting when using kernel methods for
partially linear models. Compare some fitting using the spline approach with
that obtained with the local polynomial kernel approach.

4. Yet another even more appealing approach to univariate nonparametric
smoothing involves penalty methods as described for example in
\citeasnoun{KNP.94}. In recent work, \citeasnoun{KM.02}, this approach has
been extended to bivariate nonparametric regression, and more recently to a
general class of additive models. Again, partially linear models are easily
adapted, and there are easy ways to impose monotonicity and convexity on the
fitted functions. In large problems it is essential to take advantage of the
sparsity of the linear algebra. This is now feasible using special versions of
the interior point algorithm for quantile regression and the \pkg{SparseM}
package, \citeasnoun{KN.03}. The paper \citeasnoun{K.10} describes some recent
developments of inference apparatus for these models. Further development of
these methods would be aided by some additional experience with real data. An
important feature of these additive models is that it is possible to impose
monotonicity and/or convexity/concavity on the individual components. There
are also relatively new methods for doing inference and prediction as well as
plotting. As usual you can experiment with these methods by trying the
\code{example()} function on methods like \code{summary.rqss},
\code{plot.rqss}, and \code{predict.rqss}. But more interesting would be to
try new examples based on real data.
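To get started with these additive models, here is a minimal sketch of a
univariate \code{rqss} fit for the motorcycle data read above; the smoothing
parameter \code{lambda} is an arbitrary choice that you should vary:

<<eval=FALSE>>=
fit <- rqss(y ~ qss(x, lambda = 1), tau = .5, data = d)  # penalized median fit
plot(fit)      # plot the fitted smooth component
summary(fit)   # inference for the fitted components
@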
\section*{Problem 3: Quantile Regression Survival Analysis}

Quantile regression has proven to be a particularly attractive approach for
univariate survival analysis (aka duration modeling). The classical
accelerated failure time model
\[
\log (T_i) = x_i^\top \beta + u_i
\]
with iid errors $u_i$, can be easily extended to consider,
\begin{equation} \label{eq.qraft}
Q_{\log (T_i)} (\tau | x_i) = x_i^\top \beta (\tau),
\end{equation}
yielding a flexible, yet parametrically parsimonious, approach. In this
problem you are asked to explore such models in the context of the
Pennsylvania reemployment bonus experiment conducted in 1988-89. In this
period new claimants for unemployment insurance were randomized into one of
several treatment groups or a control group. Control participants abided by
the usual rules of the unemployment insurance system; treatment participants
were offered a cash bonus to be awarded if the claimant was certifiably
reemployed within a specified qualification period. For simplicity we will
focus on only one of the treatment groups, those offered a bonus of 6 times
their weekly benefit provided reemployment was established within 12 weeks.
For this group the bonus averaged about \$1000 for those collecting it.

The data will be available in the form of an R data set called
\code{Penn46.data} in the same directory as we have indicated for the prior
datasets. This can be read into R using the same procedure as was used for the
Engel data. For a more detailed analysis incorporating the other treatments,
see \citeasnoun{KB.01}. See \citeasnoun{KX.02} for further details on
approaches to inference for these models.

In this application interest naturally focuses on the effect of the binary,
randomized treatment. How does the bonus influence the distribution of the
duration of unemployment? The Lehmann quantile treatment effect (QTE) is a
natural object of empirical attention.

1. Explore some specifications of the QR model (\ref{eq.qraft}) and compare to
fitting the Cox proportional hazard specification. See the \pkg{survival}
package (\code{require(survival)}) for functions to estimate the corresponding
Cox models. Note that covariate effects in the Cox models are necessarily
scalar in nature, so for example the treatment effect must either increase or
decrease unemployment durations over the whole range of the distribution; it
cannot decrease durations in the lower tail and increase them in the upper
tail, unless the model is specified with distinct baseline hazard functions
for the two groups. See \citeasnoun{KG.01} for some further details on the
relationship between the QR survival model and the Cox model.

2. Explore some formal inference options to try to narrow the field of
interesting specifications. See for example the discussion in
\citeasnoun{KX.02} on tests based on the whole QR process.

\section*{Problem 4: Quantile Autoregression}

Consider a simple linear QAR model,
\[
y_t = \alpha_1 (u_t) y_{t-1} + \alpha_0 (u_t) \quad t = 0, 1, \ldots , T
\]
where $u_t$ is iid $U[0,1]$. Suppose that $\alpha_1 (u) = 0.85 + 0.25 u$ and
$\alpha_0 (u) = \Phi^{-1} (u)$ with $\Phi$ denoting the standard normal
distribution function. Simulate a realization of this process with $T = 1000$
and estimate and plot the QAR coefficients, comparing them to the usual OLS
estimates. Verify whether or not the process is stationary. In your
realization of the process check to see whether $y_{t-1}$ stays in the region
for which the conditional quantile function of $y_t$ is monotone. What is the
usual OLS estimator of the AR(1) model estimating in this case? Check the
residuals from the OLS fit to see if they exhibit any suspicious features that
would reveal what is unusual here.
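Here is a minimal sketch of one way to carry out the simulation and estimation
steps; the seed, the initial condition $y_1 = 0$, and the grid of $\tau$'s are
all arbitrary choices:

<<eval=FALSE>>=
n <- 1000
set.seed(1066)                      # arbitrary seed, for reproducibility
u <- runif(n)
y <- rep(0, n)                      # y[1] = 0 taken as the initial condition
for(t in 2:n) y[t] <- (0.85 + 0.25 * u[t]) * y[t-1] + qnorm(u[t])
taus <- 1:19/20
f <- rq(y[-1] ~ y[-n], tau = taus)  # QAR(1) fits over a grid of taus
plot(taus, coef(f)[2,], type = "b", xlab = expression(tau), ylab = "AR(1) slope")
abline(h = coef(lm(y[-1] ~ y[-n]))[2], lty = 2)  # OLS slope for comparison
@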
\section*{Problem 5: Portfolio Choice}

This problem deals with the ``pessimistic portfolio allocation'' proposed in
\citeasnoun{BKK.03}. The paper employs a highly artificial example. Your task,
should you decide to accept it, is to produce a more realistic example using
real data. Software implementing the methods of the paper is available as an R
package called \code{qrisk}. This is {\it not} a CRAN package, but it is
available from the url
\url{http://www.econ.uiuc.edu/~roger/research/risk/risk.html}. The R function
\code{qrisk} in this package computes optimal portfolio weights based on a
matrix of observed, or simulated, asset returns using a specified form of
pessimistic Choquet preferences.

\section*{Problem 6: Inequality Decomposition}

The extensive literature on the measurement of inequality has devoted
considerable attention to the question of how to decompose changes in
measurements of inequality. If we observe increases in the Gini coefficient in
a particular region over some sample period, can we attribute these changes in
some way to underlying changes in covariates, or to changes in the effects of
these covariates? QR offers a convenient general approach to this question.
Suppose that we have estimated a garden variety wage equation model in QR
form,
\begin{equation} \label{eq.wage}
Q_{\log y} (\tau |x) = x^\top \beta (\tau),
\end{equation}
and we would like to compute a conditional Gini coefficient. Recall that the
Lorenz function of a univariate distribution with quantile function, $Q$, is
given by,
\[
\lambda (t) = \mu^{-1} \int_0^t Q(s)ds
\]
where $\mu = \int_0^1 Q(s)ds$ is the mean of the distribution. The Gini
coefficient is simply twice the area between $\lambda (t)$ and the 45 degree
line,
\[
\gamma = 1 - 2 \int_0^1 \lambda (t)dt.
\]

1. Given the linear decomposition of the conditional quantile function in
(\ref{eq.wage}) and the fact that the Gini coefficient is a linear functional
of the quantile function, formulate a conditional Gini decomposition for log
wages, and interpret it.

2. Over time we may wish to ``explain'' changes in {\it the} Gini coefficient
by considering changes in the wage structure -- which we can interpret as
$\beta (\tau)$ in (\ref{eq.wage}) -- and changes in the characteristics of the
population -- which are captured by the evolution of the distribution of $x$.
This way of thinking enables us to consider thought experiments such as, ``How
would Gini have evolved if the wage structure were fixed at some initial
condition, but population characteristics changed according to some specified
pattern, historical or otherwise?'' Or alternatively, suppose that we fix
population characteristics and consider the evolution of the conditional
components of Gini as $\beta_t (\tau)$ changes over time. Decompositions of
this type have been considered in recent work of \citeasnoun{MM.01}. The Gini
decomposition has also been recently considered by \citeasnoun{ABD}. I would
love to see further applications along these lines.

\bibliography{abbrev,references}
\end{document}