Miskatonic University Press

COUNTER data made tidy

code4lib libraries r

At work I’m analysing usage of ebooks, as reported by vendors in COUNTER reports. The Excel spreadsheet versions are ugly but a little bit of R can bring them into the tidyverse and give you nice, clean, usable data that meets the three rules of tidy data:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

There are two kinds of COUNTER reports for books: BR1 (“Number of Successful Title Requests by Month and Title”) counts how many times people looked at a book and BR2 (“Number of Successful Section Requests by Month and Title”) counts how many times they look at a part (like a chapter) of a book. The reports are formatted in the same human-readable way, so this code works for both, but be careful to handle them separately.

Fragment of COUNTER report
Fragment of COUNTER report

They start with seven lines of metadata about the report, and then you get the actual data. There are a few required columns, one of which is the title of the book, but that column doesn’t have a heading! It’s blank! Further to the right are columns for each month of the reporting period. Rows are for books or sections, but there is also a “Total for all titles” row that sums them all up.

This formatting is human-readable but terrible for machines. Happily, that’s easy to fix.

First, in R, load in some packages:

  • the basic set of tidyverse packages;
  • readxl, to read Excel spreadsheets;
  • lubridate, to manipulate dates; and
  • yulr, my own package of some little helper functions. If you want to use it you’ll need to install it specially, as explained in its documentation.
library(tidyverse)
library(readxl)
library(lubridate)
library(yulr)

As it happens the COUNTER reports are all in one Excel spreadsheet, organized by sheets. Brill’s 2014 report is in the sheet named “Brill 2014,” so I need to pick it out and work on it. The flow is:

  • load in the sheet, skipping the first seven lines (including the one that tells you if it’s BR1 or BR2)
  • cut out columns I don’t want with a minus select
  • use gather to reshape the table by moving the month columns to rows, where the month name ends up in a column named “month;” the other fields that are minus selected are carried along unchanged
  • rename two columns
  • reformat the month name into a proper date, and rename the unnamed title column (which ended up being called X__1) while truncating it to 50 characters
  • filter out the row that adds up all the numbers
  • reorder the columns for human viewing
brill_2014 <- read_xlsx("eBook Usage.xlsx", sheet = "Brill 2014", skip = 7)
%>% select(-ISSN, -`Book DOI`, -`Proprietary Identifier`, -`Reporting Period Total`)
%>% gather(month, usage, -X__1, -ISBN, -Publisher, -Platform)
%>% rename(platform = Platform, publisher = Publisher)
%>% mutate(month = floor_date(as.Date(as.numeric(month), origin = "1900-01-01"), "month"), title = substr(X__1, 1, 50))
%>% filter(! title == "Total for all titles") %>% select(month, usage, ISBN, platform, publisher, title)

Looking at this I think that date mutation business may not always be needed, but some of the date formatting I had was wonky, and this made it all work.

That line above just works for one year. I had four years of Brill data, and didn’t want to repeat the long line for each, because if I ever need to make a change I’d have to make it four times and if I missed one there’d be a problem. This is the time to create a function. Now my code looks like this:

counter_parse_brill <- function (x) {
  x %>% select(-ISSN, -`Book DOI`, -`Proprietary Identifier`, -`Reporting Period Total`) %>% gather(month, usage, -X__1, -ISBN, -Publisher, -Platform) %>% rename(platform = Platform, publisher = Publisher) %>% mutate(month = floor_date(as.Date(as.numeric(month), origin = "1900-01-01"), "month"), title = substr(X__1, 1, 50)) %>% filter(! title == "Total for all titles") %>% select(month, usage, ISBN, platform, publisher, title)
}

brill_2014 <- read_xlsx("eBook Usage.xlsx", sheet = "Brill 2014", skip = 7) %>% counter_parse_brill()
brill_2015 <- read_xlsx("eBook Usage.xlsx", sheet = "Brill 2015", skip = 7) %>% counter_parse_brill()
brill_2016 <- read_xlsx("eBook Usage.xlsx", sheet = "Brill 2016", skip = 7) %>% counter_parse_brill()
brill_2017 <- read_xlsx("eBook Usage.xlsx", sheet = "Brill 2017", skip = 7) %>% counter_parse_brill()
brill <- rbind(brill_2014, brill_2015, brill_2016, brill_2017)

That looks much nicer in Emacs (in Org, of course):

R in Org
R in Org

I have similar functions for other vendors. They are all very similar, but sometimes a (mandatory) Book DOI field or something else is missing, so a little fiddling is needed. Each vendor’s complete data goes into its own tibble, which I then glue together. Then I delete all the rows where no month is defined (which, come to think of it, I should investigate to make sure these aren’t being introduced by some mistake I made in reshaping the data), I add the ayear column so I can group things by academic year, and where usage of a book in a given month is 0, I make it 0 instead of NA.

ebook_usage <- rbind(brill, ebl, ebook_central, iet, scholars_portal, spie)

ebook_usage <- ebook_usage %>% filter(! is.na(month))
ebook_usage <- ebook_usage %>% mutate(ayear = academic_year(month))
ebook_usage$usage[is.na(ebook_usage$usage)] <- 0

The data now looks like this (truncating the title even more for display here):

month usage ISBN platform publisher title ayear
2014-01-01 0 9789004216921 BOPI Brill A Comme 2013
2014-01-01 0 9789047427018 BOPI Brill A Wande 2013
2014-01-01 0 9789004222656 BOPI Brill A World 2013
> str(ebook_usage)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	1343899 obs. of  7 variables:
 $ month    : Date, format: "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01" ...
 $ usage    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ISBN     : chr  "9789004216921" "9789047427018" "9789004222656" "9789004214149" ...
 $ platform : chr  "BOPI" "BOPI" "BOPI" "BOPI" ...
 $ publisher: chr  "Brill" "Brill" "Brill" "Brill" ...
 $ title    : chr  "A Commentary on the United Nations Convention on t" "A Wandering Galilean: Essays in Honour of Seán Fre" "A World of Beasts: A Thirteenth-Century Illustrate" "American Diplomacy" ...
 $ ayear    : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...

The data is now ready for analysis.