Better ways of using R on LibStats (3): Fifteen minutes of research support

← Open letter to the Toronto Public Library about Hoopla and privacy The Great Eastern →

Better ways of using R on LibStats (3): Fifteen minutes of research support

30 April 2014 r librarystats

Another update about using R to analyze the reference desk statistics that we keep at York University Libraries with LibStats (see also Better ways of using R on LibStats (1) and Better ways of using R on LibStats (2): durations, which is about how long we spend helping people at the desk). This is more about time spent, and it rehauls what I wrote up in May 2011 in Ref desk 5: Fifteen minutes for under one per cent. There I said:

Put those two charts together and it shows that during term time we spend on average about fifteen minutes a week giving research help to each of under one per cent of our students.

That’s still true, but now I can calculate it faster and make better charts to show it. And I figure it monthly: weekly showed some nice variations, but monthly is very easy to deal with, and nicely handles the three busy months per term.

As before, I get the data ready.

> library(ggplot2)
> library(dplyr)
> library(lubridate)
> library(scales) # To use better date formatting in axis labels
> l <- read.csv("~/R/yul/libstats/libstats.csv")
> l$day <- as.Date(l$timestamp, format="%m/%d/%Y %r")
> l$month <- floor_date(l$day, "month")

Thanks to a suggestion from Hadley Wickham (whose dplyr and ggplot2 packages I use all the time), I calculate the estimated durations of each encounter by using inner_join from dplyr, which does the same thing as merge but faster and more tidily.

The data frame is too big to look at nicely, so I’ll pick out three columns.

> head(l[c("day", "question.type", "time.spent")])
         day                 question.type    time.spent
1 2011-02-01             4. Strategy-Based  5-10 minutes
2 2011-02-01             4. Strategy-Based 10-20 minutes
3 2011-02-01             4. Strategy-Based  5-10 minutes
4 2011-02-01 3. Skill-Based: Non-Technical  5-10 minutes
5 2011-02-01             4. Strategy-Based  5-10 minutes
6 2011-02-01             4. Strategy-Based  5-10 minutes

I want to add a new column, est.duration, that turns the time.spent column into an estimate of how many minutes each interaction took. 5-10 minutes becomes 8, 10-20 minutes becomes 15, etc. So I make a durations data frame and do an inner_join with l that adds a new column and puts the right value in every row.

> durations <- data.frame(
  time.spent    = c("0-1 minute", "1-5 minutes", "5-10 minutes", "10-20 minutes", "20-30 minutes", "30-60 minutes", "60+ minutes"),
  est.duration  = c( 1, 4, 8, 15, 25, 40, 65)
  )
> durations
     time.spent est.duration
1    0-1 minute            1
2   1-5 minutes            4
3  5-10 minutes            8
4 10-20 minutes           15
5 20-30 minutes           25
6 30-60 minutes           40
7   60+ minutes           65
)
> l <- inner_join(l, durations, by="time.spent")
> head(l[c("day", "question.type", "time.spent", "est.duration")])
         day                 question.type time.spent est.duration
1 2011-02-03               1. Non-Resource 0-1 minute            1
2 2011-02-06 3. Skill-Based: Non-Technical 0-1 minute            1
3 2011-02-11               1. Non-Resource 0-1 minute            1
4 2011-02-16               1. Non-Resource 0-1 minute            1
5 2011-02-17               1. Non-Resource 0-1 minute            1
6 2011-02-17               1. Non-Resource 0-1 minute            1

The rows got reordered, but that doesn’t matter. The est.duration column does actually have the right numbers in it, it’s not just all 1s.

I’ll skip over some of the rest of the preparation, which I explained last time, and get on to figuring out about research questions. First, I use dplyr to make a new data frame that collapses just the research questions into monthly summaries of how many were asked and how long they took.

> research.pm <- l %.% filter(question.type %in% research.questions) %.% group_by(library.name, month) %.% summarise(research.minutes = sum(est.duration, na.rm =TRUE), research.count=n())
> research.pm$year <- format(as.Date(research.pm$month), "%Y")
> research.pm$month.name <- month(research.pm$month, label = TRUE)
> research.pm
Source: local data frame [150 x 6]
Groups: library.name

   library.name      month research.minutes research.count year month.name
1      Bronfman 2011-02-01             3670            240 2011        Feb
2      Bronfman 2011-03-01             4035            294 2011        Mar
3      Bronfman 2011-04-01             1292             92 2011        Apr
4      Bronfman 2011-05-01              725             59 2011        May
5      Bronfman 2011-06-01              903             77 2011        Jun
6      Bronfman 2011-07-01              554             46 2011        Jul
7      Bronfman 2011-08-01              296             27 2011        Aug
8      Bronfman 2011-09-01             1057             87 2011        Sep
9      Bronfman 2011-10-01             2163            182 2011        Oct
10     Bronfman 2011-11-01             4037            281 2011        Nov
..          ...        ...              ...            ...  ...        ...
> ggplot(research.pm, aes(x=year, y=research.count)) + geom_bar(stat="identity") + facet_grid(library.name ~ month.name) + labs(x="", y="", title="Research questions per month (year to year)") + theme(axis.text.x = element_text(angle = 90))

Slight decline in the numbers of research questions asked each month, year to year. Why? I don’t know. We need to investigate.

(The images look a bit grotty because I shrank them down, but each is a link to a full-size version.)

> ggplot(research.pm, aes(x=year, y=research.minutes/60)) + geom_bar(stat="identity") + facet_grid(library.name ~ month.name) + labs(x="", y="Hours", title="Hours of research support (year to year)") + theme(axis.text.x = element_text(angle = 90))

This shows that sometimes fewer quesions doesn’t mean less time spent helping people with research. But knowing how many questions were asked and how long they all took, it’s trivial to divide to find the average.

> ggplot(research.pm, aes(x=year, y=research.minutes/research.count)) + geom_bar(stat="identity") + facet_grid(library.name ~ month.name) + labs(x="", y="Minutes", title="Average length of research help sessions (year to year)") + theme(axis.text.x = element_text(angle = 90))

Some variation there among the smaller branches, but Scott, which has the most students and is by far the busiest, stays very consistent at 15–20 minutes per research help session.

I explained in another post about “home users,” the number of students that (in theory) use a given library—this should be especially true for research questions—and here I set up a data frame with the branch names and numbers for each year.

> years=c(2011:2014)
> home.users <- rbind(
   data.frame(library.name="Bronfman", academic.year=years, users=c( 6007,  5991,  6050,  6295)),
   data.frame(library.name="Frost",    academic.year=years, users=c( 2655,  2672,  2677,  2671)),
   data.frame(library.name="Scott",    academic.year=years, users=c(33577, 34321, 34388, 33852)),
   data.frame(library.name="Steacie",  academic.year=years, users=c( 9576,  9858, 10018, 10394))
   )
> home.users
   library.name academic.year users
1      Bronfman          2011  6007
2      Bronfman          2012  5991
3      Bronfman          2013  6050
4      Bronfman          2014  6295
5         Frost          2011  2655
6         Frost          2012  2672
7         Frost          2013  2677
8         Frost          2014  2671
9         Scott          2011 33577
10        Scott          2012 34321
11        Scott          2013 34388
12        Scott          2014 33852
13      Steacie          2011  9576
14      Steacie          2012  9858
15      Steacie          2013 10018
16      Steacie          2014 10394

Before we match things up we need to align things by academic year. Academic 2013–2014, which I’d label 2014, started last May (2013-05-01) and ends today (2014-04-30). An easy way to calculate the academic year of a given date is to push it ahead by as many days separate 1 May from 1 January and then use the year of the resulting date. Anything from January to April stays in the same year, but everything later gets knocked ahead a year.

> as.Date("2014-01-01") - as.Date("2013-05-01")
Time difference of 245 days
> as.Date("2014-04-30") + 245
[1] "2014-12-31"
> as.Date("2014-05-01") + 245
[1] "2015-01-01"
> research.pm$academic.year <- as.integer(format(research.pm$month+245, "%Y"))
> Source: local data frame [150 x 7]
Groups: library.name

   library.name      month research.minutes research.count year month.name academic.year
1      Bronfman 2011-02-01             3670            240 2011        Feb          2011
2      Bronfman 2011-03-01             4035            294 2011        Mar          2011
3      Bronfman 2011-04-01             1292             92 2011        Apr          2011
4      Bronfman 2011-05-01              725             59 2011        May          2012
5      Bronfman 2011-06-01              903             77 2011        Jun          2012
6      Bronfman 2011-07-01              554             46 2011        Jul          2012
7      Bronfman 2011-08-01              296             27 2011        Aug          2012
8      Bronfman 2011-09-01             1057             87 2011        Sep          2012
9      Bronfman 2011-10-01             2163            182 2011        Oct          2012
10     Bronfman 2011-11-01             4037            281 2011        Nov          2012
..          ...        ...              ...            ...  ...        ...           ...
> research.pm <- inner_join(research.pm, home.users, by=c("library.name", "academic.year"))
> research.pm
Source: local data frame [150 x 8]
Groups: library.name

   library.name      month research.minutes research.count year month.name academic.year users
1      Bronfman 2011-02-01             3670            240 2011        Feb          2011  6007
2      Bronfman 2011-03-01             4035            294 2011        Mar          2011  6007
3      Bronfman 2011-04-01             1292             92 2011        Apr          2011  6007
4      Bronfman 2011-05-01              725             59 2011        May          2012  5991
5      Bronfman 2011-06-01              903             77 2011        Jun          2012  5991
6      Bronfman 2011-07-01              554             46 2011        Jul          2012  5991
7      Bronfman 2011-08-01              296             27 2011        Aug          2012  5991
8      Bronfman 2011-09-01             1057             87 2011        Sep          2012  5991
9      Bronfman 2011-10-01             2163            182 2011        Oct          2012  5991
10     Bronfman 2011-11-01             4037            281 2011        Nov          2012  5991
..          ...        ...              ...            ...  ...        ...           ...   ...

> ggplot(research.pm, aes(x=year, y=100*research.count/users)) + geom_bar(stat="identity") + facet_grid(library.name ~ month.name) + labs(x="", y="%", title="Home users seen about research") + theme(axis.text.x = element_text(angle = 90))

Assuming each research question is asked by a different person, this shows that each month we see 2-3% of home users about research. Of course, if some people ask more than one question, that ratio is lower. (There is research underway to find out who these users are—the hypothesis is that it isn’t 3% of all students that get research help, but a much higher proportion of a much smaller set of students: who are they, what kind of help do they need, and how can we change what we offer to be of more use?) I’m curious about the situation at other libraries and what their numbers show.

Two years ago I said:

Put those two charts together and it shows that during term time we spend on average about fifteen minutes a week giving research help to each of under one per cent of our students.

Now I say: “During term time each month we give about fifteen minutes of research support to two or three per cent of our students.”

> ggplot(research.pm, aes(x=year, y=100*research.count/users)) + geom_bar(stat="identity") + facet_grid(library.name ~ month.name) + labs(x="", y="%", title="Home users seen about research") + theme(axis.text.x = element_text(angle = 90))

On average, each home user gets under one minute of research support each month. Again, this isn’t how it actually works—a smaller percentage of users get more help, though we don’t know the details yet—but again I’m curious to know how this compares to other libraries.

That’s it for the R and library stats posts for now, I think. Try dplyr, it’s great.

← Open letter to the Toronto Public Library about Hoopla and privacy The Great Eastern →