Those scripts did their job, but they were ugly, and there were some more things I wanted to do. Because of my recent Ubuntu upgrade I’m running R version 3.0.2 now, which means I can use the new dplyr package by R wizard Hadley Wickham and others. (It doesn’t work on 3.0.1.) The vignette for dplyr has lots of examples, I’ve been seeing great posts about it, and I was eager to try it. So I’m going back to the old work, refreshing it, and figuring out how to do what I wanted to do in 2012, or couldn’t because we only had one year of data; now that we have four, year-to-year comparisons are interesting.
This first post is about how I used to do things in an ugly and slow way, and how to do them faster and better.
I begin with a CSV file containing a slightly munged and cleaned dump of all the information from LibStats.
I read the CSV file into a data frame, then fix a couple of things. The date is a string and needs to be turned into a Date, and I use a nice function from lubridate to find the floor of the date, which aggregates everything to the month it’s in.
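Roughly like this (the file name and the timestamp format string here are assumptions, not the originals):

    library(lubridate)

    # Read the dump, parse the timestamp strings into Dates, then floor each
    # date to the first day of its month so everything aggregates by month.
    l <- read.csv("libstats.csv", stringsAsFactors = FALSE)
    l$date  <- as.Date(l$timestamp, format = "%m/%d/%Y %I:%M:%S %p")
    l$month <- floor_date(l$date, "month")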
The columns are:
timestamp: timestamp (not in a standard format)
question.type: one of five categories of question (1 = directional, 5 = specialized)
question.format: how or where the question was asked (in person, phone, chat)
time.spent: time spent giving help
library.name: the library name
location.name: where in the library (ref desk, office, info desk)
initials: initials of the person (or people) who helped
Now the data frame has all the fields I will use.
But I’m going to take just a sample of all of this data, because this is for illustrative purposes, not real analysis. Let’s grab 10,000 random entries from the data frame and put them into l.sample.
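Something like:

    # 10,000 random rows are plenty for illustration.
    set.seed(1)  # not in the original; just makes the example repeatable
    l.sample <- l[sample(nrow(l), 10000), ]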
An easy thing to ask first is: How many questions are asked each month in each library?
Here’s how I did it before. I’ll run the command and show the resulting data frame. I used the plyr package, which is (was) great, and its ddply function, which splits a data frame into pieces, applies a function to each piece, and gives a data frame back. Here I have it collapse the data frame l along the two columns specified (month and library.name) and use nrow to count how many rows result. Then I check how long that operation takes on the entire data set.
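A sketch of that approach, using the names set up above:

    library(plyr)

    # Split l by month and library.name, run nrow on each piece,
    # and get a data frame of counts back.
    head(ddply(l, .(month, library.name), nrow))

    # How long does that take on the whole data frame?
    system.time(ddply(l, .(month, library.name), nrow))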
The system.time line there shows how long the previous command takes to run on the entire data frame: almost 3.5 seconds! That is slow. Do a few of those, chopping and slicing the data in various ways, and it really adds up.
This is a bad way of doing it. It works! But it’s slow, and I wasn’t thinking about the problem the right way. Using ddply and nrow was wrong: I should have been using count (also from plyr), which I wrote up a while back, with some examples. That’s a much faster and more sensible way of counting up the rows in each group.
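For example, something like:

    # count() does the grouping and counting in one call.
    head(count(l, c("month", "library.name")))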
But now that I can use dplyr, I can approach the problem in a whole new way.
First, I’ll clear plyr out of the way, then load dplyr. Doing it this way means no function names collide.
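Like so:

    # Unload plyr so its functions don't mask dplyr's, then load dplyr.
    detach("package:plyr", unload = TRUE)
    library(dplyr)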
See how nicely you can construct and chain operations with dplyr:
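Something like this, using the %.% chaining operator dplyr had at the time and the l data frame from above:

    l %.%
      group_by(month, library.name) %.%
      summarise(count = n())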
The %.% operator lets you chain together different operations, and for clarity of reading I like to arrange things so I specify the data frame on its own first and then walk through the things I do to it. First, group_by breaks the data frame down by the given columns and does some magic. Then summarise collapses each resulting chunk of data into one line, and I use count=n() to make a new column, count, which holds the number of rows in each chunk, calculated with the n() function. In English I’m saying, “take the l data frame, group it by month and library.name, and count how many rows are in each grouping.” (Also, notice I didn’t need to use the head command to stop it running off the screen; the output is nicely readable on its own.)
It’s easier to think about, it’s easier to read, it’s easier to play with … and it’s much faster. How long would this take to run on the entire data set?
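Wrapping the same chain in system.time answers that:

    # Time the dplyr version on the full data frame.
    system.time(l %.% group_by(month, library.name) %.% summarise(count = n()))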
0.03 seconds elapsed time! That is 0.9% of the 3.35 seconds the old way took.
Graphing it is easy, using Hadley Wickham’s marvellous ggplot2 package.
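A sketch of the kind of plot, assuming the summarised counts are saved in a data frame I’ll call monthly.counts:

    library(ggplot2)

    monthly.counts <- l %.%
      group_by(month, library.name) %.%
      summarise(count = n())

    # Questions per month, one panel per library.
    ggplot(monthly.counts, aes(x = month, y = count)) +
      geom_line() +
      facet_wrap(~ library.name)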
You can see the ebb and flow of the academic year: September, October and November are very busy, then things quiet down in December, then January, February and March are busy again, then it cools off in April and through the summer. (Students don’t ask a lot of questions close to and during exam time: they’re studying, and their assignments are finished.)
What about comparing year to year? Here’s a nice way of doing that.
First, pick out the numbers of the months and years. The format command knows all about how to handle dates and times. See the man page for strptime or your favourite language’s date manipulation commands for all the options possible. Here I use %m to find the month number and %Y to find the four-digit year. Two examples, then the commands:
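Something like this (the new column names are my own, and I assume the parsed Date column is called date):

    # Two examples of what format() gives back:
    format(as.Date("2011-02-01"), "%m")   # "02"
    format(as.Date("2011-02-01"), "%Y")   # "2011"

    # Add the month number and the year as columns.
    l$month.number <- format(l$date, "%m")
    l$year         <- format(l$date, "%Y")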
This plot changes the x-axis to the year, and facets along two variables, breaking the chart up vertically by library and horizontally by month. It’s easy now to see how months compare to each other across years.
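Roughly, reusing the names from the sketches above:

    yearly.counts <- l %.%
      group_by(year, month.number, library.name) %.%
      summarise(count = n())

    # Year on the x-axis; libraries as facet rows, months as facet columns.
    ggplot(yearly.counts, aes(x = year, y = count)) +
      geom_point() +
      facet_grid(library.name ~ month.number)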
With a little more work we can rotate the x-axis labels so they’re readable, and put month names along the top. The month function from lubridate makes this easy.
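For example (again a sketch, reusing the names above):

    library(lubridate)

    # Turn the month numbers into abbreviated month names for the facet labels.
    yearly.counts$month.name <- month(as.numeric(yearly.counts$month.number),
                                      label = TRUE)

    ggplot(yearly.counts, aes(x = year, y = count)) +
      geom_point() +
      facet_grid(library.name ~ month.name) +
      theme(axis.text.x = element_text(angle = 90))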