Reading large tables into R

Reading large tables from text files into R is possible but knowing a few tricks will make your life a lot easier and make R run a lot faster.

First, read the help page for ' read.table'. It contains many hints for how to read in large tables. Of course, help pages tend to be a little confusing so I'll try to distill the relevant details here.

The following options to 'read.table()' can affect R's ability to read large tables:

colClasses

This option takes a vector whose length is equal to the number of columns in year table. Specifying this option instead of using the default can make 'read.table' run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are "numeric", for example, then you can just set 'colClasses = "numeric"'. If the columns are all different classes, or perhaps you just don't know, then you can have R do some of the work for you.

You can read in just a few rows of the table and then create a vector of classes from just the few rows. For example, if I have a file called "datatable.txt", I can read in the first 5 rows and determine the column classes from that:

tab5rows <- read.table("datatable.txt", header = TRUE, nrows = 5)
classes <- sapply(tab5rows, class)
tabAll <- read.table("datatable.txt", header = TRUE, colClasses = classes)

Always try to use 'colClasses', it will make a big difference.

nrows

Specifying the 'nrows' argument doesn't necessary make things go faster but it can help a lot with memory usage. R doesn't know how many rows it's going to read in so it first makes a guess, and then when it runs out of room it allocates more memory. The constant allocations can take a lot of time, and if R overestimates the amount of memory it needs, your computer might run out of memory.

Of course, you may not know how many rows your table has. The easiest way to find this out is to use the 'wc' command in Unix. So if you run 'wc datafile.txt' in Unix, then it will report to you the number of lines in the file (the first number). You can then pass this number to the 'nrows' argument of 'read.table()'.

If you can't use 'wc' for some reason, but you know that there are definitely less than, say, N rows, then you can specify 'nrows = N' and things will still be okay. A mild overestimate for 'nrows' is better than none at all.

comment.char

If your file has no comments in it (e.g. lines starting with '#'), then setting 'comment.char = ""' will sometimes make 'read.table()' run faster.