The R Language and MySQL

Statistical Analysis of 10n x p Datasets

Least Squares
Quantile Regression

MySQL

In order to use our package(s), you need to have access to a MySQL server, either locally or remotely.

MySQL is a Relational DataBase Management System (RDBMS) available at http://www.mysql.com, free of charge under the conditions described in the site.

If you need to setup your own MySQL server, please refer to the instructions provided by MySQL. The manual is quite thorough and we encourage you to read at least the security tips.

devshed.com has several references. An introduction to MySQL Administration is given here and first principles can be found here.

If you have privilegies to create a table under a particular database, then you may find useful the next example. We created a table named jepnat with the following fields (variables): ID, birmon, weight, boy, momage. The fields of statistical interest are birmon (month of birth), weight (in grams), boy (indicator) and momage (mom's age). ID is an identification field to which we assign a PRIMARY KEY. (The text in bold red represents the MySQL prompt; our input is in black).

mysql>
CREATE TABLE jepnat(
   -> ID INT NOT NULL AUTO_INCREMENT,
   -> birmon smallint unsigned not null,
   -> weight float notnull,
   -> boy tinyint unsigned not null,
   -> momage float notnull,
   -> PRIMARY KEY (ID));
Query OK, 0 rows affected (0.00 sec)

Although we defined the fields (e.g. birmon) and their characteristics (e.g., smallint, float, tinyint, not null), our table is still empty, we need to load the (ascii) data into the server:

mysql> LOAD DATA LOCAL INFILE "~/projects/sql-r/births.txt"
   -> INTO TABLE jepnat FIELDS TERMINATED BY ' '
   -> (birmon, weight,boy, momage);
Query OK, 2376497 rows affected (3 min 41.31 sec)
Records: 2376497  Deleted: 0  Skipped: 0  Warnings: 0

Notice how we omitted the field ID from our LOAD statement; ID is automatically created (and incremented).


Least Squares

A dataset with 2,000,000 observations per 18 variables loaded into R requires over 275MB of RAM. Furthermore, any statistically procedure, for example, least squares will further increase the memory requirements to values not available to the vast majority of R users (128MB is standard).

With that in mind, we developed a package, LM, to compute least squares with extremely large datasets, while maintaining memory requirements at low (common) levels. The principles of our package are:

1) Keep the data in the MySQL server;
2) While (n < N)
{     
fetch a % of the N rows, say, nper rows.
compute x'x and x'y
X'X = X'X + x'x
X'Y = X'Y + x'y
n = n + nper         
}
3) Upon fetching all N observations estimate the coefficients and statistics of relevance.
Here's a synopsis of the example included in the documentation of LM:

> library(LM)
> m <-  MySQL()
> mycon <- dbConnect(m, user='%', dbname="LM",
+              host="galton.econ.uiuc.edu")
> rs <- dbExecStatement( mycon, "SELECT * FROM LMData WHERE
+              ID < 501" )

> summary(LM( weight ~ momage + boy, data=rs, conn=mycon,
+              NumStdErr=TRUE ))


Call:
LM(formula = weight ~ momage + boy, data = rs, NumStdErr = TRUE,
   conn = mycon)

Residuals:
        Min         1Q   Median       3Q       Max
-2766.41  -314.82      45.06   350.15  1265.74

Coefficients:
                 Estimate  Std. Error  t value   Pr(>|t|)
(Intercept) 2947.916    123.967  23.780  <2e-16
momage       11.121       4.411    2.521    0.0120
boy           119.983      50.519    2.375    0.0179

Residual standard error: 563.2 on 497 degreesof freedom
Multiple R-Squared: 0.0248,     Adjusted R-squared: 0.02088
F-statistic: 6.321 on 2 and 497 degrees of freedom,     p-value: 0.001947



Download Area:

LM - Version 0.1-0
And the also the required RMySQL package.



Quantile Regression

This section is still under alpha development. Please check back periodically.




Page created by Alvaro A. Novo, Dec. 25, 2000
Updated by __ , __