LM: Least Squares on Large Datasets
We illustrate the approach with a model for infant birth weight based on the 1999 U.S. Detailed Natality Survey. The dataset contains roughly 2.4 million observations on 18 variables; when loaded into R, it requires about 275 MB of RAM. Furthermore, any nontrivial statistical procedure, such as least squares, further increases the memory requirements beyond the capabilities of the vast majority of R users' machines.
With that in mind, we have been experimenting with techniques that allow users to compute least squares solutions on extremely large datasets while keeping memory requirements at low (commonly available) levels. The guiding principle of our package is to keep the data on a database server and bring it into R only in small blocks.
Here's a synopsis of the example included in the documentation of LM, which we (naturally) limit to 500 observations on 3 variables. Later, we will report on an example with 2,376,497 observations on 18 variables.
We run the example over the network, using a 56k V.90 modem connection, on an Athlon 750 MHz with 384 MB of RAM running Linux 2.2.17 (SuSE 6.4). The MySQL server is version 3.22.32. R-1.2.0 takes about 12 seconds to source and execute the code below:
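A computation of this kind can be sketched with RMySQL by accumulating the cross-products X'X and X'y block by block; the table and column names below are hypothetical, and this sketch is our own illustration, not the LM code itself:

```r
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "dbdummy")
res <- dbSendQuery(con, "SELECT weight, boy, momage FROM dummynat")

XtX <- 0; Xty <- 0
repeat {
  block <- fetch(res, n = 100000)          # one block of rows as a data frame
  if (nrow(block) == 0) break
  X <- cbind(1, block$boy, block$momage)   # design matrix with intercept
  y <- block$weight
  XtX <- XtX + crossprod(X)                # accumulate X'X
  Xty <- Xty + crossprod(X, y)             # accumulate X'y
  if (dbHasCompleted(res)) break
}
dbClearResult(res)
dbDisconnect(con)

beta <- solve(XtX, Xty)                    # least squares coefficients
```

Because only the p-by-p cross-product matrices are retained between blocks, memory use is bounded by the block size rather than the full dataset.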
In the next example, we run a model with 16 right-hand-side variables on 2,376,497 observations. We run the example on a 100 Mbps network with a dual Pentium III at 650 MHz with 512 MB of RAM running Linux 2.2.16-SMP (SuSE 6.4).
It takes only about 10.5 minutes to complete, and total memory usage never exceeds 115 MB.
These numbers may not seem too impressive, but consider the alternatives: loading the ASCII dataset into R with 'scan' used all 512 MB of RAM plus some 300 MB of swap space, so the memory gains are certainly substantial. Furthermore, just re-opening R with the dataset in .RData takes over 2.5 minutes and requires 272 MB! We were unable to use 'read.table' at all; it drained all memory resources (1 GB = RAM + swap). We also tried the package 'stataread', which turned out to be the fastest and least memory-intensive of the R alternatives. Finally, MySQL also seems quite efficient at reading ASCII data: it took only 3 min 41.31 sec.
It is possible to reduce memory requirements further by fetching fewer observations at a time; in our example we fetched 100,000 per block. This will not necessarily result in longer waiting times, and it is certainly more feasible on smaller machines. Most of the time appears to be spent loading each block into a data frame. In a separate exercise, fetching data from a MySQL server took approximately the same time as reading the data from text files with 'read.table' ('scan' is faster), so it would be interesting to investigate whether significant efficiency gains are possible in this phase of the process. We would be very interested in comments on this aspect of LM.
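One simple way to check where the time goes, assuming an open RMySQL result set `res` (the name is hypothetical), is to time the fetch-and-convert step for a single block directly:

```r
library(RMySQL)
# 'res' is an open result set produced earlier by dbSendQuery()
system.time(block <- fetch(res, n = 100000))  # time to load one block into a data frame
```

Comparing this against the time spent in the numerical accumulation would show which phase dominates.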
LM - Version 0.1-0
You will also need the required RMySQL package.
The LM development project can be considered a successful feasibility study for comparable functionality for quantile regression. We hope to have an alpha version of the quantile regression software in the near future. Stay tuned for further developments.
In order to use our package(s), you need to have access to a MySQL server, either locally or remotely. Therefore, we will briefly describe the installation, administration and client use of MySQL.
You can skip this section altogether if you have experience with MySQL.
MySQL is a Relational Database Management System (RDBMS) available at http://www.mysql.com, free of charge under the conditions described on the site.
You will need at least a MySQL client (to talk to a MySQL server). MySQL is available for several operating systems (Unix, Linux, FreeBSD and Windows). If you don't have access to a server, please consider installing your own. Under SuSE 6.4, we installed the following rpm packages: mysqlclnt (the client), mysql (the server), mysqllib (libraries) and mysqldev (development files and libraries).
For additional instructions please refer to the manual provided by MySQL, which is quite thorough. We encourage you to read at least the security tips.
This is intended to be a brief introduction to administering a server and it is fully based on our experience with SuSE 6.4.
Once you have installed MySQL server, you need to:
0) Start the mysql daemon (e.g. rcmysql start);
1) Assign a password to the user root (the administrator):
root@galton$ mysqladmin -u root password yourpasswd

2) Add a user with certain access privileges to databases. For illustrative purposes, we will consider the user "dummy" and the database "dbdummy", restricting access to connections from the localhost and "world.dummy.org".
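The GRANT statements for step 2 might look like the following; the particular privileges shown here are purely illustrative, not a recommendation:

```sql
-- 2a) allow 'dummy' to connect from the local machine
GRANT SELECT, INSERT, CREATE ON dbdummy.* TO dummy@localhost
  IDENTIFIED BY 'yourpasswd';

-- 2b) allow 'dummy' to connect from world.dummy.org
GRANT SELECT, INSERT, CREATE ON dbdummy.* TO dummy@'world.dummy.org'
  IDENTIFIED BY 'yourpasswd';
```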
3) Create and load the data into tables.
Notice that we gave permission to 'dummy' (see 2b).
4) Although we defined the fields (e.g., birmon) and their characteristics (e.g., smallint, float, tinyint, not null), our table is still empty; we need to load the (ASCII) data into the server:
We created a table named 'dummynat' with the following fields (variables): ID, birmon, weight, boy, momage. The fields of statistical interest are birmon (month of birth), weight (in grams), boy (indicator) and momage (mom's age). ID is an identification field to which we assign a PRIMARY KEY. (The text in red represents the MySQL prompt; our input is in black.) Notice how we omitted the field ID from our LOAD statement; ID is created (and incremented) automatically.
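Based on the description above, the table definition and LOAD statement might look as follows; the assignment of types to fields and the input file name are our assumptions, not the original listing:

```sql
CREATE TABLE dummynat (
  ID     INT NOT NULL AUTO_INCREMENT,  -- identification field
  birmon TINYINT NOT NULL,             -- month of birth
  weight FLOAT NOT NULL,               -- birth weight in grams
  boy    TINYINT NOT NULL,             -- indicator
  momage SMALLINT NOT NULL,            -- mom's age
  PRIMARY KEY (ID)
);

-- ID is omitted from the field list; it is filled in automatically
LOAD DATA LOCAL INFILE 'dummynat.txt'
  INTO TABLE dummynat (birmon, weight, boy, momage);
```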
This section illustrates basic SQL statements to retrieve specific pieces of information contained in 'dummynat' table.
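For instance, queries of the following kind (illustrative only) retrieve specific summaries from the table:

```sql
-- average birth weight of boys
SELECT AVG(weight) FROM dummynat WHERE boy = 1;

-- number of births per month
SELECT birmon, COUNT(*) FROM dummynat GROUP BY birmon;

-- first ten records for teenage mothers
SELECT * FROM dummynat WHERE momage < 18 LIMIT 10;
```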
devshed.com has several references, including an introduction to MySQL administration and a tutorial covering first principles.