Save countless hours with just one line of code.
If you’re a data professional like me, odds are that at some point in your career you’ve needed to combine multiple files together.
For example, let’s say that your boss asks you to analyze the trend of the sales data for your company for the past three months. The sales data are stored in separate .csv files for each month and are named using a “month_year” format.
You know how to code in R, so rather than copying and pasting everything together in Excel, you run this simple script and save the data for the past three months into a single dataframe called “sales”:
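That script isn’t reproduced here, but a minimal sketch of the copy-each-file approach, using read.csv() and rbind() with hypothetical “month_year” file names, might look something like this:

# Read each month's file individually (file names are hypothetical examples)
oct <- read.csv("october_2020.csv")
nov <- read.csv("november_2020.csv")
dec <- read.csv("december_2020.csv")

# Stack the three months into a single dataframe
sales <- rbind(oct, nov, dec)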
Then you analyze the data and send it over to your boss. Your boss thanks you but tells you that you now need to analyze the sales trend over the past 10 years, and it needs to be done ASAP. Suddenly, the method you used above doesn’t seem so efficient anymore now that you have 120 files to combine. Not only is it going to take time for you to type out the names of each file, but there’s also a decent chance that you’ll make a typo, especially given that you’re crunched for time.
There has to be a faster and easier way of doing this, right?
If you follow me on Twitter, you may already know the answer to that question is yes!
Here is what Jared Lander, an adjunct professor at Columbia University, said about my proposed solution in the “How it’s going” image on the right:
I have saved clients hundreds of hours with this one line. https://t.co/YNl0oCqZXn
— Jared Lander (@jaredlander) January 2, 2021
Here’s the solution I proposed in the “How it’s going” image above, typed out so it’s easier to follow along:
library(data.table)  # for fread()
library(tidyverse)   # for the pipe (%>%) and map_df()

files <-
  list.files(pattern = "*.csv") %>%
  map_df(~fread(.))
To explain how this works, the function list.files() simply generates a list of all of the .csv files (as specified by pattern = "*.csv") that are located in your current working directory.
Note: depending on where the files are stored on your computer, you may need to change your working directory to the folder that contains the files. Also, if there are no other file types in your folder, then you don’t need the pattern = "*.csv" argument and can just use list.files() instead.
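As a quick illustration, list.files() can also point directly at another folder instead of requiring a change of working directory (the folder path below is hypothetical, and the stricter regex "\\.csv$" matches only files that end in .csv):

# List the .csv files in the current working directory
list.files(pattern = "*.csv")

# Or point at a specific folder and return full file paths
list.files(path = "data/monthly_sales", pattern = "\\.csv$", full.names = TRUE)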
The list of files is then piped into the map_df() function, which comes from the purrr package:
function (.x, .f, ..., .id = NULL)
{
    if (!is_installed("dplyr")) {
        abort("`map_df()` requires dplyr")
    }
    .f <- as_mapper(.f, ...)
    res <- map(.x, .f, ...)
    dplyr::bind_rows(res, .id = .id)
}
<bytecode: 0x0000000020456068>
<environment: namespace:purrr>
You can see that map_df() essentially takes the list of files, passes each one through the function (.f) that you provide, and uses bind_rows() to combine the results into a single dataframe. In this case, the function we are using is fread(), which comes from the data.table package. fread() is essentially the same as read.csv(), but it is significantly faster.
Note: if you want to use any of the arguments within fread(), such as stringsAsFactors, select, fill, etc., then you need to include the ~ before calling fread() followed by your argument options. If you don’t need to add any arguments, then you can simply use map_df(fread).
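For example, here is a quick sketch of both forms (the select and fill values below are purely illustrative):

# No extra arguments: pass fread directly
files <- list.files(pattern = "*.csv") %>%
  map_df(fread)

# With arguments: the ~ lets you pass options along to fread()
files <- list.files(pattern = "*.csv") %>%
  map_df(~fread(., select = c("Artist Name", "Track Name"), fill = TRUE))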
To demonstrate just how fast and easy it is to use map_df() to import and combine files, I created a folder containing 500 .csv files.
The files, which were generated using Exportify, consist of different playlists found on Spotify. Below is an example of some of the playlists in the folder:
head(list.files())
[1] "'90s_pop_rock_essentials.csv" "00s_ballads.csv"
[3] "00s_indie_dance_party.csv" "00s_latino.csv"
[5] "00s_rock_anthems.csv" "10s_rock_anthems.csv"
To confirm there are 500 files in the folder, we can run length(list.files()) to count the total number of files:
length(list.files())
[1] 500
Now that we confirmed that all 500 files are present, let’s import and combine them all into one dataset that just consists of the artist, track, and album name of each song:
library(data.table)

files <-
  list.files() %>%
  map_df(~fread(., select = c("Artist Name", "Track Name", "Album Name")))
str(files)
Classes 'data.table' and 'data.frame': 32751 obs. of 3 variables:
$ Artist Name: chr "Third Eye Blind" "Counting Crows" "Spin Doctors" "Semisonic" ...
$ Track Name : chr "Semi-Charmed Life" "Mr. Jones" "Two Princes" "Closing Time" ...
$ Album Name : chr "Third Eye Blind" "August And Everything After" "Pocket Full Of Kryptonite" "Feeling Strangely Fine" ...
- attr(*, ".internal.selfref")=<externalptr>
The output above shows that there are a total of 32,751 rows in our combined dataset (and three columns which we selected).
So now the question is how long did it take to import and combine 500 .csv files containing over 32k rows in total?
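The timing code itself isn’t shown in the post; here is a minimal sketch of how it could be measured, assuming the tictoc package (which prints its results in the “X sec elapsed” format seen below):

library(tictoc)  # assumed here for timing; prints "X sec elapsed"

tic()
files <-
  list.files() %>%
  map_df(~fread(., select = c("Artist Name", "Track Name", "Album Name")))
toc()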
0.72 sec elapsed
The answer: less than one second!
Currently, our combined dataset contains the artist, track, and album name for each song, as shown below:
head(files)
        Artist Name        Track Name                   Album Name
1:  Third Eye Blind Semi-Charmed Life              Third Eye Blind
2:   Counting Crows         Mr. Jones  August And Everything After
3:     Spin Doctors       Two Princes    Pocket Full Of Kryptonite
4:        Semisonic      Closing Time       Feeling Strangely Fine
5: Eagle-Eye Cherry      Save Tonight                   Desireless
6:  Matchbox Twenty              Push Yourself or Someone Like You
If you would like to add a column that says what playlist each song came from, we can create a function that inserts the file name as a column called ‘Playlist Name’ using mutate:
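The function definition itself isn’t shown in this version of the post, but based on the description and the output below, a sketch of it (assuming dplyr and stringr are loaded for mutate() and str_extract()) could look like this:

library(dplyr)    # mutate()
library(stringr)  # str_extract()

# Hypothetical reconstruction of the file_names() helper described in the text
file_names <- function(file) {
  fread(file, select = c("Artist Name", "Track Name", "Album Name")) %>%
    mutate(`Playlist Name` = str_extract(file, "[^.]+"))  # keep everything before the '.csv'
}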
Note: I’m using str_extract above to remove the ‘.csv’ from each of the playlist names, since it is present in the file names.
And then all we need to do is call that function within map_df(), and now we have our dataset with the names of the playlists included:
files_with_names <-
  list.files() %>%
  map_df(file_names)
head(files_with_names)
        Artist Name        Track Name                   Album Name            Playlist Name
1:  Third Eye Blind Semi-Charmed Life              Third Eye Blind '90s_pop_rock_essentials
2:   Counting Crows         Mr. Jones  August And Everything After '90s_pop_rock_essentials
3:     Spin Doctors       Two Princes    Pocket Full Of Kryptonite '90s_pop_rock_essentials
4:        Semisonic      Closing Time       Feeling Strangely Fine '90s_pop_rock_essentials
5: Eagle-Eye Cherry      Save Tonight                   Desireless '90s_pop_rock_essentials
6:  Matchbox Twenty              Push Yourself or Someone Like You '90s_pop_rock_essentials
In addition to map_df(), there are a couple other functions that you can use to import and combine files quickly and efficiently.
The first one uses rbindlist() from the data.table package:
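The call isn’t shown here, but based on the benchmark later in the post, the rbindlist() version looks like this:

# lapply() reads each file with fread(), then rbindlist() stacks the results into one data.table
files <-
  rbindlist(lapply(list.files(), fread, select = c("Artist Name", "Track Name", "Album Name")))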
0.64 sec elapsed
This is actually just a tick faster than map_df() and only requires one package (data.table) vs two (data.table and purrr), so it is a solid alternative.
The other function you can use is vroom() from the vroom package. One minor difference between vroom() and the other two methods is that the results are stored in a tibble rather than a dataframe:
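Again, the call isn’t shown here, but from the benchmark below, the vroom() version is:

library(vroom)

# vroom() accepts a vector of file paths and combines them into a single tibble
files <-
  vroom(list.files(), col_select = c("Artist Name", "Track Name", "Album Name"))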
1.36 sec elapsed
For this example, vroom() is a tiny bit slower than both map_df() and rbindlist(), but it is still another great alternative.
Here is a summary of the three functions we’ve used and how each of them performed when importing and combining 500 .csv files using the microbenchmark package:
library(microbenchmark)

speed_test <- microbenchmark::microbenchmark(
  list.files() %>% map_df(~fread(., select = c("Artist Name", "Track Name", "Album Name"))),
  rbindlist(lapply(list.files(), fread, select = c("Artist Name", "Track Name", "Album Name"))),
  vroom(list.files(), col_select = c("Artist Name", "Track Name", "Album Name")),
  times = 10,
  unit = "s"
)

speed_test
Unit: seconds
expr
list.files() %>% map_df(~fread(., select = c("Artist Name", "Track Name", "Album Name")))
rbindlist(lapply(list.files(), fread, select = c("Artist Name", "Track Name", "Album Name")))
vroom(list.files(), col_select = c("Artist Name", "Track Name", "Album Name"))
       min        lq      mean    median        uq       max neval
 0.6942158 0.7135742 0.7247162 0.7206925 0.7397134 0.7610076    10
 0.5521009 0.5650346 0.5706224 0.5705313 0.5773364 0.5867329    10
 1.0408613 1.0548972 1.1123975 1.0894975 1.1900331 1.2344037    10
As you can see, all three functions we used (map_df(), rbindlist(), and vroom()) are incredibly fast at importing and combining files in R. For our scenario of combining 500 .csv files containing over 32k rows in total, rbindlist() was the fastest, followed closely by map_df() and vroom(). However, depending on the number and size of the files you’re combining, the speeds for each method will vary, so this may not always be the order in which they finish. The bottom line is that it doesn’t really matter which of the three methods above you use to import and combine files, because they are all incredibly fast. What matters is that if you’re still using read.csv() and rbind() to import and combine your files, hopefully you’re now aware that there are much easier, faster, and more reliable ways to accomplish this task.
For attribution, please cite this work as
Cuilla (2021, Jan. 30). UNCHARTED DATA: Importing Multiple Files Quickly and Efficiently. Retrieved from https://uncharteddata.netlify.app/posts/2021-03-10-importing-multiple-files-quickly-and-efficiently/
BibTeX citation
@misc{cuilla2021importing,
  author = {Cuilla, Kyle},
  title = {UNCHARTED DATA: Importing Multiple Files Quickly and Efficiently},
  url = {https://uncharteddata.netlify.app/posts/2021-03-10-importing-multiple-files-quickly-and-efficiently/},
  year = {2021}
}