Miskatonic University Press

Ref desk 2: Questions asked per week at a branch


So we have a nice long file that records the details of 87,464 reference desk interactions since February 2011.

$ wc -l libstats.csv
87464 libstats.csv
$ head -5 libstats.csv
4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,CC,02/01/2011 09:20:11 AM
4. Strategy-Based,In-person,10-20 minutes,Scott,Drop-in Desk,CC,02/01/2011 09:43:09 AM
4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,CC,02/01/2011 10:00:56 AM
3. Skill-Based: Non-Technical,Phone,5-10 minutes,Scott,Drop-in Desk,CC,02/01/2011 10:05:05 AM

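Let's look at it in R. First load in three libraries we're going to need: lattice for graphics, Hadley Wickham's plyr for data manipulation, and chron to help us with dates. Then load the CSV file into a data frame called libstats. (Since R 4.0.0, read.csv leaves text columns as plain strings unless told otherwise, so stringsAsFactors = TRUE is spelled out here; the levels call further down needs factors.)

> library(lattice)
> library(plyr)
> library(chron)

> libstats <- read.csv("libstats.csv", stringsAsFactors = TRUE)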

Each line in the file represents a single reference desk interaction. I want to analyze them by the week, so I add a column that specifies which week the interaction happened in. This seems to be a pretty ugly way of doing it, but it works. The week column gets filled with YYYY-MM-DD dates that are the Mondays of the weeks in question.
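To see what the week calculation does before running it on the whole file, here is a minimal sketch on three made-up timestamps (the stamps vector is invented for illustration; only the format matches the real data). It assumes an English locale, so that AM and PM are recognized.

```r
# Three made-up timestamps in the same format as the timestamp column.
stamps <- c("02/01/2011 09:20:11 AM",   # a Tuesday
            "02/06/2011 04:15:00 PM",   # the Sunday of the same week
            "02/07/2011 10:00:56 AM")   # the following Monday

# as.Date() parses the whole string but keeps only the date part.
days <- as.Date(stamps, format = "%m/%d/%Y %r")

# cut() puts each date into a week-long bucket and labels the bucket
# with the first day of that week (a Monday here).
weeks <- as.Date(cut(days, "week", start.on.monday = TRUE))
print(weeks)
```

The first two timestamps land in the week of Monday 2011-01-31 and the third in the week of 2011-02-07.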

> libstats$week <- as.Date(cut(as.Date(libstats$timestamp, format="%m/%d/%Y %r"), "week", start.on.monday=TRUE))
> head(libstats,4)
                  question.type question.format    time.spent library.name
1             4. Strategy-Based       In-person  5-10 minutes        Scott
2             4. Strategy-Based       In-person 10-20 minutes        Scott
3             4. Strategy-Based       In-person  5-10 minutes        Scott
4 3. Skill-Based: Non-Technical           Phone  5-10 minutes        Scott
  location.name initials              timestamp       week
1  Drop-in Desk       CC 02/01/2011 09:20:11 AM 2011-01-31
2  Drop-in Desk       CC 02/01/2011 09:43:09 AM 2011-01-31
3  Drop-in Desk       CC 02/01/2011 10:00:56 AM 2011-01-31
4  Drop-in Desk       CC 02/01/2011 10:05:05 AM 2011-01-31
> up.to.week <- tail(levels(as.factor(libstats$week)), 1)

up.to.week is the most recent week date, and I'll use it for labelling charts. levels returns the distinct values that a factor can take. The names of the library branches are a great example: there are eight different values for library.name throughout the 87,464 entries, one for each of our libraries plus one for an information desk that doesn't do research help. (The Osgoode Hall Law School Library doesn't record its reference statistics in this system, so it's not here.)

> branches <- levels(libstats$library.name)
> branches
[1] "ASC"               "Bronfman"          "Frost"             "Maps"              "Scott"
[6] "Scott Information" "SMIL"              "Steacie"
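A small sketch of how levels behaves, using a made-up factor rather than the real library.name column:

```r
# A made-up factor: five observations but only three distinct values.
branch <- factor(c("Scott", "Frost", "Scott", "Maps", "Scott"))

print(levels(branch))   # the distinct values
print(nlevels(branch))  # how many distinct values there are
print(table(branch))    # how often each value occurs

# levels() returns NULL for a plain character vector;
# unique() works whether or not the column is a factor.
print(levels(c("Scott", "Frost", "Scott")))
print(sort(unique(c("Scott", "Frost", "Scott"))))
```

If levels ever comes back NULL on a real column, that is the sign the column was read in as strings rather than as a factor.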

Let's look at the statistics for the Bronfman library. Turns out there are 10,754 encounters recorded there.

> bronfman <- subset(libstats, library.name == "Bronfman")
> nrow(bronfman)
[1] 10754
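For comparison, here is a sketch on a made-up data frame (toy is hypothetical, standing in for libstats) showing that subset selects the same rows as logical indexing, and that table gives the row count for every branch in one go:

```r
# Made-up data frame standing in for libstats.
toy <- data.frame(library.name    = c("Bronfman", "Scott", "Bronfman", "Frost"),
                  question.format = c("In-person", "Phone", "Email", "In-person"))

# subset() and logical indexing select the same rows.
a <- subset(toy, library.name == "Bronfman")
b <- toy[toy$library.name == "Bronfman", ]
print(nrow(a))
print(nrow(b))

# table() counts the rows for every branch at once.
print(table(toy$library.name))
```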

We have a problem to solve before we can make a chart. Each line in the bronfman data frame records one desk encounter. We want to analyze things by the week. How do we aggregate a week's worth of data into one number? We'll use ddply, whose help file describes what it does as: "For each subset of a data frame, apply function then combine results into a data frame."

A short example will help explain it. Make a data frame about some coloured clothing. ddply(tmp, .(colour), nrow) means "take the data frame called tmp, split it into groups of rows that share the same value in the colour column, and run the function nrow on each group to find out how many rows it has." Using nrow here is a nice way of counting up how many of something there are, but if you were doing real statistics you might use mean or some other function.

> tmp <- data.frame(colour = c("red", "red", "green", "red", "blue"),
                    item = c("shirt", "socks", "shirt", "socks", "socks"))
> tmp
  colour  item
1    red shirt
2    red socks
3  green shirt
4    red socks
5   blue socks
> ddply(tmp, .(colour), nrow)
  colour V1
1   blue  1
2  green  1
3    red  3
> ddply(tmp, .(item), nrow)
   item V1
1 shirt  2
2 socks  3
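To make the "split, apply, combine" idea concrete, here is the colour count again done by hand in base R. This is only a sketch of the idea, not how plyr is implemented; the names pieces, counts, and result are made up for the illustration.

```r
tmp <- data.frame(colour = c("red", "red", "green", "red", "blue"),
                  item   = c("shirt", "socks", "shirt", "socks", "socks"))

# Split: one small data frame per distinct colour.
pieces <- split(tmp, tmp$colour)

# Apply: run nrow on each piece.
counts <- sapply(pieces, nrow)

# Combine: put the results back into a single data frame.
result <- data.frame(colour = names(counts), V1 = as.vector(counts))
print(result)
```

The result has one row per colour with the same counts as the ddply call: one blue, one green, three red.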

Back to our bronfman data frame. For each week I want to know how many questions of each question.type were asked:

> questions <- ddply(bronfman, .(question.type, week), nrow)
> head(questions)
    question.type       week V1
1 1. Non-Resource 2011-01-31 44
2 1. Non-Resource 2011-02-07 58
3 1. Non-Resource 2011-02-14 43
4 1. Non-Resource 2011-02-21 20
5 1. Non-Resource 2011-02-28 49
6 1. Non-Resource 2011-03-07 37
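Grouping by two columns gives one row for every combination that actually occurs in the data. As a rough base R stand-in (aggregate, not ddply, and the helper column n is made up for the example), the clothing data frame grouped by both colour and item looks like this:

```r
tmp <- data.frame(colour = c("red", "red", "green", "red", "blue"),
                  item   = c("shirt", "socks", "shirt", "socks", "socks"))

# Give every row a weight of 1 so that summing the weights counts rows.
tmp$n <- 1

# One output row per (colour, item) combination present in the data.
by.both <- aggregate(n ~ colour + item, data = tmp, FUN = sum)
print(by.both)
```

Combinations that never occur, such as blue shirts, get no row at all rather than a row with a zero.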

This new data frame has a count of how many 1s were asked each week, then how many 2s, and so on up to how many 5s were asked each week. xyplot does the trick for making a chart of this:

> xyplot(V1 ~ as.Date(week) | question.type,
         data = questions,
         type = "h",
         main = "Questions asked at Bronfman",
         sub = paste("Feb 2011 to", up.to.week),
         ylab = "Number of questions",
         xlab = "Week")

Figure: Questions by week at a branch

Notice how nicely R figured out how to label the x and y axes. Because it knows that the week column consists of dates, it was able to divide up the x-axis into three-month chunks. Beautiful.
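The formula in the xyplot call reads as "y ~ x | g": plot y against x, with one panel for each value of g. Here is a minimal sketch on made-up weekly counts (the toy data frame and its numbers are invented for illustration):

```r
library(lattice)

# Made-up weekly counts for two question types.
toy <- data.frame(
  week          = rep(as.Date("2011-01-31") + 7 * 0:3, times = 2),
  question.type = rep(c("1. Non-Resource", "4. Strategy-Based"), each = 4),
  V1            = c(40, 55, 45, 20, 5, 9, 7, 3))

# One panel per question.type; type = "h" draws a vertical line
# from the x-axis up to each point.
p <- xyplot(V1 ~ week | question.type,
            data = toy,
            type = "h",
            ylab = "Number of questions",
            xlab = "Week")
print(p)
```

xyplot returns a trellis object, and printing it is what draws the chart. At the interactive prompt that happens automatically, but inside a script or a function the explicit print is needed.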