Statistical Analysis of Large Datasets

An Exploration of R - MySQL Interface

    
Roger Koenker
University of Illinois

University of Illinois


Least Squares

Quantile Regression


LM: Least Squares on Large Datasets

We illustrate the approach with a model for infant birth weight based on the 1999 U.S. Detailed Natality Survey. The dataset contains roughly 2.4 million observations on 18 variables when loaded into R it requires about 275 MB of RAM. Furthermore, any nontrivial statistical procedure like least squares will further increase the memory requirements to values beyond the capabilities of the vast majority of R users.

With that in mind, we have been experimenting with techniques that would allow users to compute least squares solutions with extremely large datasets, while maintaining memory requirements at low (common) levels. The principles of ourpackage are:

0) Given N observations on a MySQL server;
1) Determine variables needed for X,y,w, etc
2) While (n < N)
{     
fetch a block of n rows.
accumulate X'X and X'y
        
}
3) Compute least squares solution by Cholesky.
Here's a synopsis of the example included in the documentation of LM, which we (naturally) limit to 500 observations on 3 variables. Later, we will report on an example with 2,376,497 observations on 18 variables.

We run the example over the network using a 56k V.90 modem connection with an Athlon 750 MHz with 384 MB of RAM powered by Linux 2.2.17 SuSE 6.4. The MySQL server is version 3.22.32. R-1.2.0 takes about 12 seconds to source and execute the code below:
                                                                               
> library(LM)
> m <-  MySQL()
> mycon <- dbConnect(m, user='%', dbname="LM",
+            host="galton.econ.uiuc.edu")
>  rs <- dbExecStatement( mycon, "SELECT * FROM LMData
+             WHERE ID < 501" )
                 
> summary(LM( weight ~ momage + boy, data=rs, conn=mycon,     NumStdErr=TRUE ))

Call:
LM(formula = weight ~ momage + boy, data = rs, NumStdErr = TRUE, conn = mycon)

Residuals:
Min
1Q
Median
3Q
Max
-2766.41
-314.82
45.06
350.15
1265.74
         

Coefficients:        

Estimate
Std. Error
t value
Pr(>|t|)
(Intercept)
2947.916
123.967
23.780
<2e-16
momage      
11.121
4.411
2.521
0.0120
boy          
119.983
50.519
2.375
0.0179

           
Residual standard error:563.2 on 497 degrees of freedom
Multiple R-Squared: 0.0248,  Adjusted R-squared: 0.02088
F-statistic: 6.321 on 2 and 497 degrees of freedom,    p-value: 0.001947



In the next example, we run a model with 16 left-hand side variableson 2,376,497 observations. We run the example on a 100Mbps network with a dual Pentium III at 650 MHz with 512 MB of RAM powered by Linux 2.2.16-SMP SuSE 6.4.

It takes only about 10.5 minutes to complete. And, total memory usage never exceeds 115 MB.

These numbers may not seem too impressive, but if we consider that in order to load the ascii dataset to R with 'scan' it used all 512 MB of RAM plus some 300 MB of swap space, there are certainly substantial memory gains. Furthermore, just re-opening R with the dataset in .RData takes over 2.5 minutes and requires 272 MB! We were unable to use `read.table'; it drained all memory resources (1GB = RAM + swap)! We also used the package 'stataread' which turned out to be the fastest and less memory intensive. Finally, MySQL also seems quite efficient reading ascii data, it took only 3 min. 41.31 sec.

It's possible to reduce memory requirements by fetching fewer observations at a time; in our example we fetched 100,000 each time. This will not necessarily result in longer waiting times, but it's certainly more feasible. It appears that most of the time is occupied loading each block into a dataframe. In a separate exercise, fetching data from a MySQL server took approximately the same time as reading the data from text files with 'read.table ' (scan is faster). So, it would be interesting to investigate whether significant efficiency gains are possible in this phase of the process. We would be very interested in comments about this aspect of LM.



Download Area:

LM - Version 0.1-0
And the also the required RMySQL package.




Quantile Regression

The LM development project can be considered to be a successful feasibility study for a comparable functionality for quantile regression. We hope to have an alpha version of the quantile regression software in the near future. Stay tuned for further developments.



MySQL

In order to use our package(s), you need to have access to a MySQL server, either locally or remotely. Therefore, we will briefly describe the installation, administration and client use of MySQL.

You can skip this section altogether if you have experience with MySQL.

MySQL is a Relational DataBase Management System (RDBMS) available at http://www.mysql.com, free of charge under the conditions described in the site.

Installation

You will need at least a MySQL client (to talk with a MySQL server). MySQL is available for several OS (Unix, Linux, FreeBSD and Windows). If you don't have access to a server, please consider installing your own server. Under SuSE 6.4, we installed the following rpm packages: mysqlclnt (the client), mysql (the server), mysqllib (libraries) and mysqldev (development files and libraries).


For additional instructions please refer to the manual  provided by MySQL, which is quite thorough. We encourage you to read at least the security tips.

Administration


This is intended to be a brief introduction to administering a server and it is fully based on our experience with SuSE 6.4.

Once you have installed MySQL server, you need to:

0) Start the mysql daemon (e.g. rcmysql start);
1) Assign a password to the user root (the administrator):
root@galton$ mysqladmin -u root password yourpasswd
2) Add a user with certain access privileges to databases. For illustrative purposes, we will consider the user "dummy" and the database "dbdummy" restricting access from the localhost and "world.dummy.org".

                                      
anovo@galton$ mysql -u root -p mysql

a) Update the 'user' table

mysql> insert into user (host,user,password)
   ->   values('localhost','dummy',password('NOTeasy'));
mysql> insert into user (host,user,password)
   ->   values('world.dummy.org','dummy',password('NOTeasy'));

b) Update the 'db' table

mysql> insert into db
   -> (host,db,user,Select_priv,Insert_priv,Update_priv,Delete_priv,
   ->   Create_priv,Drop_priv)
   -> values ('%','dbdummy','dummy','Y','Y','Y','Y','Y','Y');

c) Create the necessary database (dbdummy). First, exit mysql. Then,

anovo@galton$ mysqladmin -u root -p create dbdummy
enter password: *******

Database "dbdummy" created

d) Reload the server

anovo@galton$ mysqladmin -u root -p reload;
enter password: *******

3) Create and load the data into tables.
Notice that we gave permission to 'dummy' ( see 2b) ).

dummy@galton$ mysql -p dbdummy
enter password: *******

Welcome to the MySQL monitor. 
Commands end with ; or \g.
Your MySQL connection id is 5602
to server version: 3.22.32

Type 'help' for help.

mysql> CREATE TABLE dummynat(
   -> ID INT NOT NULL AUTO_INCREMENT,
   -> birmon smallint unsigned not null,
   -> weight float not null,
   -> boy tinyint unsigned not null,
   -> momage float not null,
   -> PRIMARY KEY (ID));
Query OK, 0 rows affected (0.00sec)

4) Although we defined the fields (e.g. birmon) and their characteristics (e.g., smallint,float, tinyint, not null), our table is still empty, we need to load the (ascii) data into the server:

mysql> LOAD DATA LOCAL INFILE"~/projects/sql-r/births.txt"
   -> INTO TABLE dummynat FIELDS TERMINATED BY ' '
   -> (birmon, weight, boy, momage);
Query OK, 2376497 rows affected(3 min 41.31sec)
Records: 2376497  Deleted: 0  Skipped: 0  Warnings: 0

We created a table named 'dummynat' with the following fields (variables): ID, birmon, weight, boy, momage. The fields of statistical interest are birmon (month of birth), weight (in grams), boy (indicator) and momage (mom's age). ID is an identification field to which we assign a PRIMARY KEY. (The text in red represents the MySQL prompt; our input is in black). Notice how we omitted the field ID from our LOAD statement; ID is automatically created (and incremented).

Client Use

This section illustrates basic SQL statements to retrieve specific pieces of information contained in 'dummynat' table.

mysql> SHOW tables;
+--------------+
| Tables in LM |
+--------------+
| dummynat     |
+--------------+
1 row in set (0.16 sec)

mysql> SELECT * FROM dummynat LIMIT 2;
+----+--------+---------+-----+--------+
| ID | birmon | weight  | boy | momage |
+----+--------+---------+-----+--------+
|   1|      1 | 4054.00 |   0 |  36.00 |
|   2|      2 | 3487.00 |   0 |  20.00 |
+----+--------+---------+-----+--------+
2 rows inset (0.22 sec)

mysql> SELECT weight FROM dummynat WHERE boy=1 AND momage>40;
+--------+
| weight |
+--------+
| 3799.00|
| 3459.00|
| 4014.00|
| 2268.00|
| 2580.00|
| 3146.00|
| 3125.00|
| 3525.00|
| 2268.00|
| 2965.00|
| 3771.00|
+--------+
11 rows in set (0.24 sec)



devshed.com has several references. An introduction to MySQL Administration is given here and first principles can be found here.



Page created by Álvaro A. Novo , Dec. 25, 2000
Updated by __ , __