Importing Multiple Files Quickly and Efficiently

Tags: data wrangling, tutorial, purrr

Save countless hours with just one line of code.

Kyle Cuilla
01-30-2021

Need for Speed

If you’re a data professional like me, odds are that at some point in your career you’ve needed to combine multiple files together.

For example, let’s say that your boss asks you to analyze the trend of the sales data for your company for the past three months. The sales data are stored in separate .csv files for each month and are named using a “month_year” format.

You know how to code in R, so rather than copying and pasting everything together in Excel, you run this simple script and save the data for the past three months into a single dataframe called “sales”:

file1 <- read.csv("october_2020.csv")
file2 <- read.csv("november_2020.csv")
file3 <- read.csv("december_2020.csv")

files <- rbind(file1, file2, file3)

Then you analyze the data and send it over to your boss. Your boss thanks you but tells you that you now need to analyze the sales trend over the past 10 years, and it needs to be done ASAP. Suddenly, the method you used above doesn't seem so efficient now that you have 120 files to combine. Not only will it take time to type out the name of each file, but there's also a decent chance you'll make a typo, especially given that you're crunched for time.

There has to be a faster and easier way of doing this, right?

Fast Functions

If you follow me on Twitter, you may already know that the answer to that question is yes! Jared Lander, an adjunct professor at Columbia University, even weighed in on my proposed solution when I first tweeted it.

Here's the solution I proposed so it's easier to follow along:

library(purrr)       # provides map_df() and the %>% pipe
library(data.table)  # provides fread()

files <- 
  list.files(pattern = "*.csv") %>%
  map_df(~fread(.))

To explain how this works: list.files() returns a character vector of the names of all the .csv files (as specified by pattern = "*.csv") located in your current working directory.

Note: depending on where the files are stored on your computer, you may need to change your working directory to the folder that contains the files. Also, pattern takes a regular expression, so "\\.csv$" is the strictly correct way to match files ending in .csv (though "*.csv" usually works in practice). And if there are no other file types in your folder, you can drop the pattern argument entirely and just use list.files().
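If the files live somewhere other than your working directory, one option (a minimal sketch, assuming a hypothetical "data" subfolder) is to pass the path argument along with full.names = TRUE, so that fread() receives paths it can open directly:

files <-
  list.files(path = "data", pattern = "\\.csv$", full.names = TRUE) %>%
  map_df(~fread(.))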

The vector of file names is then piped into the map_df() function, which comes from the purrr package:

library(tidyverse)

map_df
function (.x, .f, ..., .id = NULL) 
{
    if (!is_installed("dplyr")) {
        abort("`map_df()` requires dplyr")
    }
    .f <- as_mapper(.f, ...)
    res <- map(.x, .f, ...)
    dplyr::bind_rows(res, .id = .id)
}
<bytecode: 0x0000000020456068>
<environment: namespace:purrr>

You can see that map_df() essentially applies the function you provide (.f) to each file in the list and then uses bind_rows() to combine the results into a single dataframe. In this case, the function we are using is fread(), which comes from the data.table package. fread() does essentially the same job as read.csv() but is significantly faster.
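In other words, the one-liner is shorthand for the two steps below. Here's a sketch of what happens under the hood, using only the functions that appear in the printed source above:

library(tidyverse)   # map() and bind_rows()
library(data.table)  # fread()

csv_files <- list.files(pattern = "\\.csv$")
res <- map(csv_files, fread)   # read each file into one element of a list
files <- bind_rows(res)        # stack the list elements into one dataframe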

Note: if you want to use any of the arguments within fread() such as stringsAsFactors, select, fill, etc., then you need to include the ~ before calling fread() along with your argument options. If you don't need any arguments, you can simply use map_df(fread).
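For example, both calls below are minimal sketches of the two forms (the select columns are borrowed from the Spotify example later in this post):

# no extra arguments: pass fread directly
files <- list.files(pattern = "\\.csv$") %>% map_df(fread)

# extra arguments: use the ~ shorthand, where . stands for the current file name
files <- list.files(pattern = "\\.csv$") %>%
  map_df(~fread(., select = c("Artist Name", "Track Name")))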

Importing and Combining 500 .csv Files in ~1 sec

To demonstrate just how fast and easy it is to use map_df() to import and combine files, I created a folder containing 500 .csv files.

The files, which were generated using Exportify, consist of different playlists found on Spotify. Below is an example of some of the playlists in the folder:

[1] "'90s_pop_rock_essentials.csv" "00s_ballads.csv"             
[3] "00s_indie_dance_party.csv"    "00s_latino.csv"              
[5] "00s_rock_anthems.csv"         "10s_rock_anthems.csv"        

To confirm there are 500 files in the folder, we can run length(list.files()) to count the total number of files:

length(list.files())
[1] 500

Now that we confirmed that all 500 files are present, let’s import and combine them all into one dataset that just consists of the artist, track, and album name of each song:

library(data.table)

files <-
  list.files() %>%
  map_df(~fread(., select = c("Artist Name", "Track Name", "Album Name")))

str(files)
Classes 'data.table' and 'data.frame':  32751 obs. of  3 variables:
 $ Artist Name: chr  "Third Eye Blind" "Counting Crows" "Spin Doctors" "Semisonic" ...
 $ Track Name : chr  "Semi-Charmed Life" "Mr. Jones" "Two Princes" "Closing Time" ...
 $ Album Name : chr  "Third Eye Blind" "August And Everything After" "Pocket Full Of Kryptonite" "Feeling Strangely Fine" ...
 - attr(*, ".internal.selfref")=<externalptr> 

The output above shows that there are a total of 32,751 rows in our combined dataset (and three columns which we selected).

So now the question is how long did it take to import and combine 500 .csv files containing over 32k rows in total?

library(tictoc)

tic()
files <-
  list.files() %>%
  map_df(~fread(., select = c("Artist Name", "Track Name", "Album Name")))
toc()
0.72 sec elapsed

The answer: just under one second!

Bonus: Including File Names

Currently, our dataset contains the artist, track, and album name for each song, as shown below:

head(files)
        Artist Name        Track Name                   Album Name
1:  Third Eye Blind Semi-Charmed Life              Third Eye Blind
2:   Counting Crows         Mr. Jones  August And Everything After
3:     Spin Doctors       Two Princes    Pocket Full Of Kryptonite
4:        Semisonic      Closing Time       Feeling Strangely Fine
5: Eagle-Eye Cherry      Save Tonight                   Desireless
6:  Matchbox Twenty              Push Yourself or Someone Like You

If you would like to add a column that identifies which playlist each song came from, we can create a function that inserts the file name as a column called 'Playlist Name' using mutate():

file_names <- function(x) {
  fread(x, select = c("Artist Name", "Track Name", "Album Name")) %>% 
    # add the source file name (minus its '.csv' extension) as a new column
    mutate("Playlist Name" = str_extract(x, '.*(?=\\.csv)'))
}

Note: I'm using str_extract above with a lookahead pattern to drop the '.csv' extension from each file name, since the extension would otherwise appear in the playlist names.
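Applied to one of the file names from the folder above, the pattern grabs everything that comes before '.csv':

str_extract("'90s_pop_rock_essentials.csv", '.*(?=\\.csv)')
[1] "'90s_pop_rock_essentials"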

And then all we need to do is call that function within map_df() and now we have our dataset with the names of the playlists included:

files_with_names <-
  list.files() %>%
  map_df(file_names)

head(files_with_names)
        Artist Name        Track Name                   Album Name
1:  Third Eye Blind Semi-Charmed Life              Third Eye Blind
2:   Counting Crows         Mr. Jones  August And Everything After
3:     Spin Doctors       Two Princes    Pocket Full Of Kryptonite
4:        Semisonic      Closing Time       Feeling Strangely Fine
5: Eagle-Eye Cherry      Save Tonight                   Desireless
6:  Matchbox Twenty              Push Yourself or Someone Like You
              Playlist Name
1: '90s_pop_rock_essentials
2: '90s_pop_rock_essentials
3: '90s_pop_rock_essentials
4: '90s_pop_rock_essentials
5: '90s_pop_rock_essentials
6: '90s_pop_rock_essentials
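As an aside, the printed source of map_df() earlier in this post shows that it forwards an .id argument to bind_rows(). That suggests an alternative sketch: name the vector of files with purrr's set_names() and let .id create the column for you (the str_extract() cleanup is the same as above):

files_with_names <-
  list.files() %>%
  set_names() %>%   # each file name becomes its own name, picked up by .id
  map_df(~fread(., select = c("Artist Name", "Track Name", "Album Name")),
         .id = "Playlist Name") %>%
  mutate(`Playlist Name` = str_extract(`Playlist Name`, '.*(?=\\.csv)'))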

Other Fast Functions

In addition to map_df(), there are a couple of other functions you can use to import and combine files quickly and efficiently.

The first one uses rbindlist() from the data.table package:

tic()
files <- 
  rbindlist(lapply(list.files(), fread, select = c("Artist Name", "Track Name", "Album Name")))
toc()
0.64 sec elapsed

This is actually just a tick faster than map_df() and only requires one package (data.table) versus two (data.table and purrr), so it is a solid alternative.
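If you also want the playlist-name column from the bonus section with this approach, rbindlist() has an idcol argument that plays the same role as .id: when given a named list, it stores each element's name in that column. A sketch, reusing the str_extract() pattern from above:

library(data.table)
library(stringr)

# read each file into a named list, then let idcol turn the names into a column
file_list <- lapply(list.files(), fread,
                    select = c("Artist Name", "Track Name", "Album Name"))
names(file_list) <- str_extract(list.files(), '.*(?=\\.csv)')
files <- rbindlist(file_list, idcol = "Playlist Name")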

The other function you can use is vroom() from the vroom package. One minor difference from the other two methods is that vroom() returns the results as a tibble rather than a dataframe:

library(vroom)

tic()
files <- 
  vroom(list.files(), col_select = c("Artist Name", "Track Name", "Album Name"))
toc()
1.36 sec elapsed

For this example, vroom() is a tiny bit slower than both map_df() and rbindlist() but is still another great alternative.

Speed Test Summary

Here is a summary of the three functions we’ve used and how each of them performed when importing and combining 500 .csv files using the microbenchmark package:

library(microbenchmark)

speed_test <- microbenchmark::microbenchmark(
  list.files() %>% map_df(~fread(., select = c("Artist Name", "Track Name", "Album Name"))),
  rbindlist(lapply(list.files(), fread, select = c("Artist Name", "Track Name", "Album Name"))),
  vroom(list.files(), col_select = c("Artist Name", "Track Name", "Album Name")),
  times = 10,
  unit = "s"
)

speed_test
Unit: seconds
                                                                                               expr
     list.files() %>% map_df(~fread(., select = c("Artist Name", "Track Name",      "Album Name")))
 rbindlist(lapply(list.files(), fread, select = c("Artist Name",      "Track Name", "Album Name")))
                vroom(list.files(), col_select = c("Artist Name", "Track Name",      "Album Name"))
       min        lq      mean    median        uq       max neval
 0.6942158 0.7135742 0.7247162 0.7206925 0.7397134 0.7610076    10
 0.5521009 0.5650346 0.5706224 0.5705313 0.5773364 0.5867329    10
 1.0408613 1.0548972 1.1123975 1.0894975 1.1900331 1.2344037    10

As you can see, all three functions we used (map_df(), rbindlist(), and vroom()) are incredibly fast at importing and combining files in R. For our scenario of combining 500 .csv files containing >32k rows in total, rbindlist() was the fastest, followed by map_df() and then vroom(). However, depending on the number and size of the files you're combining, the speeds will vary, and the finishing order may not always be the same.

The bottom line is that it doesn't really matter which of the three methods you use to import and combine files, because they are all incredibly fast. What matters is that if you're still using read.csv() and rbind() for this task, hopefully you're now aware of much easier, faster, and less error-prone ways to accomplish it.

Citation

For attribution, please cite this work as

Cuilla (2021, Jan. 30). UNCHARTED DATA: Importing Multiple Files Quickly and Efficiently. Retrieved from https://uncharteddata.netlify.app/posts/2021-03-10-importing-multiple-files-quickly-and-efficiently/

BibTeX citation

@misc{cuilla2021importing,
  author = {Cuilla, Kyle},
  title = {UNCHARTED DATA: Importing Multiple Files Quickly and Efficiently},
  url = {https://uncharteddata.netlify.app/posts/2021-03-10-importing-multiple-files-quickly-and-efficiently/},
  year = {2021}
}