Miskatonic University Press

Jekyll CO₂ updates

climate.change code jekyll

I updated Jekyll CO₂, my plugin for the web site generator Jekyll, that shows current atmospheric CO₂ data. You can see it in the right-hand bar on this site, down towards the bottom. It looks like this:

Screenshot of Jekyll CO₂ box

It’s showing May data because as I write that is the latest month for which data is available from the NOAA’s Earth System Research Laboratory.

It shows the current CO₂ concentration, the value fifty years ago (year range is configurable), the absolute increase and the percentage change.
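The two derived numbers are simple arithmetic. Here is a minimal shell sketch of the calculation, not the plugin's actual code; `co2_change` is a hypothetical helper name and the readings in the usage line are made-up round numbers, not NOAA data.

```shell
# Print the absolute increase and the percentage change between two readings.
# Usage: co2_change OLD_PPM NEW_PPM
co2_change() {
    awk -v old="$1" -v new="$2" 'BEGIN {
        diff = new - old
        printf "%.2f ppm increase, %.1f%% change\n", diff, diff / old * 100
    }'
}

# Made-up example values, for illustration only.
co2_change 300 400    # prints: 100.00 ppm increase, 33.3% change
```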

I made this plugin in June 2014, at the end of my previous sabbatical, and it was text only. In October that year I updated it to use little sparklines, but over time I realized that just isn’t an effective way of visualizing this data in this context. All it ever showed was a little staircase going up bit by bit. You could hover over a step to see the number, but who’d do that?

Now it tells you the numbers right up front. Comparing to 50 years ago isn’t the best possible, but it works with the data available, and it’s a start.

The raw data is stored in Jekyll’s _data directory, where Jekyll sees it and groks it automatically. I don’t use data files yet, but now it’s possible, though I’m not sure how. It would be possible to indicate on every page how much the CO₂ has changed since the page was created, but you’d need to regenerate the site at least once a month to make that reasonable. (That’s the “problem” with static site generators as opposed to WordPress or Drupal, at least without using JavaScript to do live content on a page.)

(There’s one improvement needed: fail gracefully if the NOAA’s web site is unavailable. This happens when there’s a government shutdown in the States because they haven’t passed a budget. Mind-boggling.)
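One way to fail gracefully is to keep the last good download and fall back to it when the fetch fails. The following is only a shell sketch of that idea, not the plugin's code: `fetch_or_cached` and the `FETCH_CMD` variable are hypothetical names, and the URL and cache path are whatever the caller supplies.

```shell
# Assumption: the fetch command is configurable so it can be swapped out.
FETCH_CMD=${FETCH_CMD:-"curl -fsS --max-time 20"}

# Fetch URL into CACHE_FILE; if the download fails or is empty,
# leave the existing cache alone and use it instead.
# Usage: fetch_or_cached URL CACHE_FILE
fetch_or_cached() {
    url=$1
    cache=$2
    tmp="$cache.tmp"
    if $FETCH_CMD "$url" > "$tmp" 2>/dev/null && [ -s "$tmp" ]; then
        mv "$tmp" "$cache"
    else
        rm -f "$tmp"
        echo "warning: could not fetch $url; using cached copy" >&2
    fi
    # Succeed only if there is some data to show.
    [ -s "$cache" ] && cat "$cache"
}
```

The function returns non-zero when there is neither fresh nor cached data, so a caller can decide to show nothing at all.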

See also GHG.EARTH, which is right up to date.

Screenshot of GHG.EARTH

A decade of CPPCAPLA

solidarity york

When I started working at York University Libraries in 2007 I was amazed at how slowly everything went. I came from the private sector: for over a decade I’d worked at internet access providers and a couple of tech startups. Things moved fast there, of course. At York, in the libraries, sometimes it would take months just to talk to someone. A lot was handled with committees, and if someone said they’d have something ready for the committee meeting next month, and then didn’t, there wasn’t much anyone else could do, so we’d wait until the next month. And if there was a plan to do something over the summer to have it ready for September, but there was a problem along the way, the default would be to delay for an entire year to the next September.

Sometimes the slowness is good, certainly. The long time scale universities have helps make them important institutions in society. Small units can move quickly, and in emergencies like the Covid-19 pandemic things get done very fast. But …

How about a decade? We’re now getting near the end of ten years of work on something I am very confident will soon end with nothing changing.

Divider image of peeling paint on an old door

It’s a document: Criteria and Procedures for Promotion and Continuing Appointments of Professional Librarians and Archivists, which is the tenure and promotion policy for librarians and archivists in the York University Faculty Association, our union.

CPPCAPLA is part of our collective agreement. It was first agreed when librarians joined the union in 1978, and left alone until 2009 when some very minor procedural changes were made. (Meanwhile, the faculty T&P policy was updated and refined in regular bargaining every three years.)

With those changes made there was some appetite for more, because the document was over thirty years old.

Peeling paint

This is what’s happened:

2010: First try starts
A working group is formed to draft some changes to one part of CPPCAPLA, about promotion to Senior Librarian. One colleague says the whole thing should be revised and modelled more along the faculty process, but this is ignored. (She was right, though I didn’t see it at the time; this is what we ended up doing.)

2011: First try ends
YUFA rejected our proposed changes and would not take them forward. I think an experienced YUFA staffer agreed we should renegotiate the whole thing, and there was no point in small changes. My notes from back then aren’t nearly as detailed as I keep now, but I did write, “Back it comes to the library and the [union] chapter … cripes, this will take a couple of years by the time it’s done.”

2013: Discussion
Some discussion about getting back to the criteria.

2014: Discussion
More discussion in the librarian and archivist union chapter. YUFA advises us that one way to handle this would be to get a memorandum of agreement (MOA) in bargaining to set up a side table to negotiate new language. Something this detailed and complex can’t be handled in regular bargaining.

2015: Preparation
We get CPPCAPLA revisions into the YUFA primary negotiating positions. In the summer, the chapter sets up Working Group 1 to look at what areas of the document should be changed. In the fall the chapter votes on its recommendations and creates Working Group 2 to draft a new document.

2016: Bargaining
The new YUFA collective agreement is ratified. New article 7.10 says: “Within three months of the ratification of this Agreement, the Parties shall name an equal number of representatives to sit on a joint committee to revise the existing Criteria and Procedures for Promotion and Continuing Appointments of Librarians and Archivists. The Joint Committee shall report to the JCOAA every six (6) months or on request from either party and will submit its proposed revisions to the Employer and the Association for approval or ratification.” (Thus setting up a side table through collective agreement language directly, not an MOA.) The bargaining teams form later in the year: I was one of three librarians working with an expert union staffer on the YUFA side of the table.

2017: Bargaining
In January the chapter unanimously passes Working Group 2’s draft language. Bargaining begins in March. We have six bargaining sessions that year. The Employer cancels ten others.

2018: Bargaining
Bargaining continues (we met off campus during the longest post-secondary strike in Canadian history) with sixteen sessions. The Employer cancels four, we cancel one, and there’s one mutual.

2019: New language voted down
Bargaining session #23 in January is the last. We sign a memorandum of settlement with the new language. There is much internal chapter debate about whether three clauses (on criteria for research considered for promotion) are good or not. In March there is a general membership meeting and a ratification vote: vote yes or vote no. The MOS is voted down. In May the chapter meets with some YUFA executive members to discuss next steps. In June, YUFA writes to the Employer setting out three options and asks if any are of interest. We hope there is some way of salvaging something.

2020: Discussion
In February the Employer responds to one option with a tentative suggestion of a possible way forward. In September the union chapter will meet with some Exec members to discuss the suggestion. (That’s right: in September 2020 we discuss a response to a proposal sent in June 2019.)

Peeling paint

There are three potential outcomes of that meeting. The first (everyone likes the Employer’s suggestion) won’t happen. The second will halt everything at that meeting and the third will be rejected by the Employer.

Ten years of work—I can’t imagine how many people put in how many hours on this—that I am regretfully confident will end up achieving nothing. And it will be another decade before there’s any appetite for going back at CPPCAPLA.

(These are my opinions only and I am not speaking for anyone I’ve worked with on anything about CPPCAPLA.)

Peeling paint

less is more with a better .lessfilter

code unix

I have that complex feeling of being satisfied that I solved a problem but mildly dissatisfied because the problem was very small and I probably spent too long on it, but nevertheless satisfied that a very small part of my life (one hopes for many years to come) will be a tiny bit better.

Was it worth it? The likelihood is greater if I share what I did. So it goes with customizing one’s Unix environment, and life overall.

It all started when I realized that source-highlight doesn’t do syntax highlighting on Markdown files. That was a surprise, because Markdown is so popular. Pygments, on the other hand, does. And it does many more file formats, including JSON.

After a bunch of work, I got it so that if pygmentize is available on the system, it’s used for highlighting files when viewed with less, or if source-highlight is there then it falls back to that, or if neither is available, just plain old less on its own is used. If lessfile is on the system (it ships with less on Debian-based Linux systems, but it’s not on a FreeBSD server I use) then that is mixed in for powerful magic.

In my .bashrc this sets things up:

# lessfile is a nice incantation that lets less open up tar and gz files
# and so on and show you what's inside.
if command -v lessfile > /dev/null 2>&1; then
    eval "$(lessfile)"
    # This sets LESSOPEN and will pick up on ~/.lessfilter.
else
    # Fall back to do the best we can.
    export LESSOPEN="| ~/.lessfilter %s"
fi

# If any syntax highlighters are available, use them.
# pygmentize does more, but source-highlight is still good.
if command -v pygmentize > /dev/null 2>&1; then
    export LESSCOLOURIZER="pygmentize -f terminal"
elif command -v source-highlight > /dev/null 2>&1; then
    export LESSCOLOURIZER="source-highlight --failsafe --infer-lang -f esc --style-file=esc.style -i"
fi

# Pass through raw ANSI colour escape sequences.  In other words, make colourizing work.
export LESS=' -R '

My .lessfilter has this:

#!/bin/bash

if [ -v LESSCOLOURIZER ]; then
    case "$1" in
        # Could have .bashrc in here, but mine seems to mess it up, perhaps because of escape sequences.
        # pygmentize can handle many more file types; this is just what I want.
        # (Illustrative pattern only: list whichever extensions you want highlighted.)
        *.json|*.md)
            $LESSCOLOURIZER "$1" ;;
        *.pdf)
            if command -v pdftotext > /dev/null 2>&1 ; then pdftotext -layout "$1" -
            else echo "No pdftotext available; try installing poppler-utils"; fi ;;
        *)
            # Pass through to lessfile
            exit 1
    esac
    # The file was handled here, so tell lessfile to use this output.
    exit 0
fi

# Hand whatever is left over to lessfile
exit 1

Now I can use less on binaries like tarballs and zip files (where it shows me a file listing) and PDFs (where it shows me the text), and on text files wherever possible it adds syntax highlighting. On any system where one part of the tool set isn’t available, it will fall back to something simpler, in the end arriving at just plain old unadorned less.
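The fallback chain boils down to "use the first tool that exists." As a sketch of that pattern on its own (the `first_available` name is made up for illustration and is not part of my dotfiles):

```shell
# Print the first command from the argument list that exists on this system.
# Returns non-zero if none of them is available.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" > /dev/null 2>&1; then
            echo "$cmd"
            return 0
        fi
    done
    return 1
}

# Report which highlighter would be picked, or that plain less is all there is.
first_available pygmentize source-highlight || echo "none: plain less"
```

Checking with `command -v` keeps the test cheap and avoids running the tool just to see whether it is installed.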

This is part of Conforguration. It will probably work on any Unix-like system (perhaps including macOS, but I don’t know).

Neither source-highlight nor Pygments handles Org files, which surprises me. Maybe I can figure out a recipe to fix that.

(I should add that there were a bunch of Stack Overflow answers and dotfile snippets on GitHub that helped along the way, but I closed all the tabs and don’t have the energy to track them down again. I hope anyone who stumbles across them in the future will also find this.)

One last thing. I also have this in my .bashrc:

alias more='less'

I never run less, I always run more. The habit was set decades ago, and it’s easier to alias than switch.

UPDATE (25 June 2020): Fixed shell redirection error in .lessfilter.

No more tracking

code4lib privacy

Today I upgraded to the latest version of Matomo (moving up from an older version from when it was called Piwik): that’s the open, non-proprietary, self-controlled, more private equivalent of Google Analytics. The upgrade had been on my to-do list for over a year. It didn’t take long, even with the renaming, which meant I needed to change some URLs in JavaScript footers that put a tracker on every page.

I got it all working and looked at the fresh Matomo interface. It tells me: not many people look at my web site; the three most popular pages are an out of date post from 2012 (Counting and aggregating in R), Twists, Slugs and Roscoes: A Glossary of Hardboiled Slang and this list of definitions and principles from Ranganathan’s Prolegomena to Library Classification; and Freedom of information request for York University eresource costs completed has had over 400 views since posted two weeks ago, which is very nice to see.

Screenshot of Matomo report on this site

I hadn’t looked at the stats in over a year. I don’t use them. I don’t need them. Why am I tracking users on my site anyway? There is no reason. Becky Yoose and other experts would ask me: Why are you recording personal information you’re not using?

So I turned it off. I went even further: I disabled logging on the web server.

I added a privacy statement to the sidebar: “Zero logging: As of 23 June 2020, no tracking is done on this web site and no logs are kept. I know absolutely nothing about how the site is used.” I also turned off logging on Listening to Art (which I didn’t even know I’d set up: I thought it was like GHG.EARTH and STAPLR, where there’s no tracking).

Matomo is an excellent application! It’s under the GPL, the code is on GitHub, it’s easy to install and use … I like everything about it. I just don’t need it. (And now I don’t have to ever upgrade it again.)

Zero logging is punk.

Freedom of information request for York University eresource costs completed

code4lib emacs fippa libraries york


The data I requested in March 2018 through provincial freedom of information legislation was supplied last month, and the costs paid by York University Libraries for electronic resources in fiscal years 2017 and 2018 are now public: York University Libraries eresource costs (DOI: 10.5683/SP2/K1XCLU). There are three files: the data (extracted by me; available as CSV or in other formats; it is not complete, there are some redactions in what I was given), an R Markdown file with a basic R script to do some simple analysis, and the PDF released to me by York that is the responsive record.

Librarian Bill prepared the data that was released to Civilian Bill, who turned it into a more usable form and gave it back to Librarian Bill to post in York University’s official data repository. Both of us are pleased that this can be added to the list of York University librarian and archivist research outputs, and that it stands as an example of York University Libraries’ commitment to open data.


I first wrote about this on 22 August 2018, in Freedom of information request for York University eresource costs denied.

I’m a librarian at York University Libraries in Toronto. Let’s call me Librarian Bill when I’m there. At home I’m Civilian Bill, and last month Civilian Bill put in a freedom of information request to York University for the amounts the Libraries spent on electronic resources in fiscal years 2017 and 2018. Civilian Bill knew the information exists because Librarian Bill prepared a spreadsheet with precisely those costs.

York has refused to release the data. Their response is “withhold in full.”

I made this request under Ontario’s Freedom of Information and Protection of Privacy Act (FIPPA) because I was inspired by Jane Schmidt’s talk Innovate this! Bullshit in Academic Libraries and What We Can Do About It. She said:

My challenge to all of you here today is to go back to your libraries and start shining a light into the deep recesses of the databases you use…. Do you know how much your library spends on the products you use every day? Are you able to speak confidently on how those prices have fluctuated over time and why they have? If something doesn’t work the way that we think it should or as it is advertised, why is an increase in price—no matter how modest—a given? These are all questions that we need to start asking more consistently. Also, thank you, Simon Fraser University and University of Alberta, for taking the lead on sharing your expenditure data.

In July Civilian Bill filed my request. It was denied. Civilian Bill appealed and eventually won. As reported on 14 March 2019 in Freedom of information appeal for York University eresource costs successful:

Seven months later Civilian Bill and Librarian Bill am very happy to report the data will be released.

York University said in their response:

As a result of mediation with [the mediator] at the Information and Privacy Commission York University would like to suggest a possible resolution to Appeal PA-18-403. York University is committing the resources necessary to schedule the release of this information with a goal of April 30, 2020 for the completion of this project. It is hoped that this will resolve the appeal.

I marked 30 April 2020 in my calendar.

The deadline approaches

Summer and fall of 2019 came and went … the days grew short … winter began … then the days began to lengthen. By February the change was really noticeable. My mood brightened. Spring would be here soon. Finally! And with spring would come the FIPPA response. I waited quietly. Would it happen? It? The final absurd irony?

On 04 March 2020 I received an official email from Patti Ryan, director of Content Development and Analysis (my department, called CDA), working through channels. The email said in part:

I am writing to request your help with doing a final check of the eResources cost data for F2017 and F2018 in order to prepare for their release to the privacy office. Recall that this has been requested from the DLO [Dean of Libraries’ Office] in connection with a freedom of information request, but is also part of CDA’s workplan.

Yes! It happened! Librarian Bill was being asked to prepare the data for release to Civilian Bill! We was overjoyed. If you ever meet one of the Bills in person, let him tell you about this, because I love talking about it.

I responded immediately to confirm that of course I would work on this. This is provincial legislation we’re talking about! And open data! I was pleased to see the work fitted with Goal 1 of the Libraries’ 2016–2020 strategic plan: “Advance the University Community’s Evolving Engagement with Open Scholarship.” It’s great when something you really believe in is part of your institution’s strategic plan.

By early April Librarian Bill finished up a new spreadsheet containing all the data. I was directed to compare the data to the University of Alberta Libraries cost release to double check anything redacted there but not in what I had prepared. The eresources librarian, Aaron Lupton, checked any final missing non-disclosure details with vendors. With all that done, I handed it back through channels and waited.

The deadline passes

The end of April arrived. May began. May continued. I waited. Nothing.

On 11 May Civilian Bill emailed York’s Information and Privacy Office to ask about the status of the release. That email never arrived, but Librarian Bill followed up on 19 May and got a quick response saying the release had been posted earlier in the month, but mail delivery is slow and we could have a PDF by email. I waited over the weekend to see if the envelope would arrive, but it didn’t. On 26 May the response was supplied as a PDF.

The envelope has still not arrived in the post.

The responsive record

This is the PDF I got: Cost_release_data_F2017_F2018.pdf. Here is the first page.

Page 1 of the PDF

Civilian Bill was very pleased! To Librarian Bill this was nothing new, of course.

Having this PDF made my work a lot easier, because it’s a live PDF with structured data inside it, not just a static image. Whoever got the spreadsheet I had prepared had turned it into a PDF, and all the columns and rows and cells were still in it. A printed version of this PDF on paper would have required a lot of tedious work scanning and OCRing and cleaning.

Of course, you might ask, Why didn’t they send the spreadsheet? Well, they have their processes in the Information and Privacy Office, and if they deal with PDFs, fair enough. The real question is: Why didn’t York University Libraries release this data back in 2017?

I had a PDF containing easily extracted data, which was going to save me a lot of time, and I would work with it.

Starting to extract the data

I thought that pdf2txt would be the easiest way to get the data out. I’d used it before (so I thought) and it had worked well (so I thought). It’s part of the PDFMiner project, but after a bunch of fiddling I couldn’t get beyond it dumping all the data out in one mixed-up column, which was no good.

Doing it manually seemed to be the only way. I hoped I could copy and paste column by column from the PDF into a file, but that got messed up on most pages because there were some cells with line breaks that made the selection veer over into the next column to the right. For example, here I’m selecting the F2017 column (second from right) from the bottom up. All fine so far.

Selecting text in a column

But when I get to the line where the title is on two lines inside its cell, the F2018 column (far right) starts getting picked up instead.

Selection moves into the wrong column

Every time this happened I had to treat the row specially. On some of the pages this meant a fair bit of fiddly work. I got six pages done one day then put it aside. (I was doing all this in Emacs and Org, which made the work quick, but wait until you see what happened next.)

The day after next I woke up in the middle of the night and thought, “I should use pdf2txt to pull the data out.” Then I remembered I’d tried it and it hadn’t worked. But something wasn’t right. I knew I’d extracted data from PDFs where the page structure was maintained. Aha! That was with pdftotext, an entirely different program, that is part of Poppler! Yes, it is confusing. I hope no one writes pdf2text or pdftotxt.

pdftotext comes with a -layout option:

Maintain (as best as possible) the original physical layout of the text. The default is to ‘undo’ physical layout (columns, hyphenation, etc.) and output the text in reading order.

Here’s what it looks like. Skip past the header and notice in the first attempt there’s just one column of output, while in the second there is structure. (I cleaned up spacing to make it more readable.)

$ pdftotext Cost_release_data_F2017_F2018.pdf
$ head Cost_release_data_F2017_F2018.txt
These costs include all e-resources purchased by and licensed to York University Libraries (YUL) for the fiscal periods (May to April) for the
years indicated. Costs indicated are in Canadian dollars paid at time invoice was processed by YUL. Costs are exclusive of taxes.
Where cost information is indicated as “Redacted” for a product, this indicates that a non-disclosure clause prohibits release of cost
information. Where cost information is indicated as “NA”, no costs were incurred for the fiscal year period.
Adam Matthew Digital
Adam Matthew Digital
Adam Matthew Digital
$ pdftotext -layout Cost_release_data_F2017_F2018.pdf
$ head Cost_release_data_F2017_F2018.txt
These costs include all e-resources purchased by and licensed to York University Libraries (YUL) for the fiscal periods (May to April) for the
years indicated. Costs indicated are in Canadian dollars paid at time invoice was processed by YUL. Costs are exclusive of taxes.

Where cost information is indicated as “Redacted” for a product, this indicates that a non-disclosure clause prohibits release of cost
information. Where cost information is indicated as “NA”, no costs were incurred for the fiscal year period.

  vendor              title                      2016-2017   2017-2018
(miscellaneous)       Open Access                       NA       43187
ACM                   ACM Digital Library             5780        6017
Adam Matthew Digital  American Consumer Culture   REDACTED    REDACTED

Now I had a text file with ragged but more or less even columns of data.
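The ragged columns can also be split mechanically instead of by hand. This is a sketch only, and not how the data in this post was processed: it assumes columns are separated by at least two spaces and that no cell contains a double space, and `layout_to_org_rows` is a hypothetical name.

```shell
# Read `pdftotext -layout` output on stdin and print each multi-column line
# as a pipe-delimited row. Lines without a run of 2+ spaces (such as
# ordinary sentences) are skipped.
layout_to_org_rows() {
    sed -e 's/^ *//' -e 's/ *$//' | awk -F '  +' 'NF > 1 {
        row = "|"
        for (i = 1; i <= NF; i++) row = row " " $i " |"
        print row
    }'
}

printf 'ACM                   ACM Digital Library             5780        6017\n' |
    layout_to_org_rows
# prints: | ACM | ACM Digital Library | 5780 | 6017 |
```

A title that wraps onto two lines inside its cell would still need fixing by hand, as would any prose line that happens to contain a double space.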

Emacs and Org make it easy

I’ve often written about how much I like the text editor Emacs and within it Org mode. (My Emacs configuration files are available if you want to see the details.)

Whenever I’m dealing with text, I use Emacs. If that text (including numbers) is structured as a table, I use Org. Its table editor looks confusing in the documentation, but simple use is a lot easier than it looks, and it’s very powerful and really helpful.

In this case, the best thing about Org’s tables (think: spreadsheets) is that they mark the columns with the pipe symbol (“|”) and if you enter them ragged Org will align them to fit. If you start with

| col_one | col_two |
| 101 | 202 |
| 808 | 1000309 |

and then hit TAB or Ctrl-c Ctrl-c, it’ll instantly make it look like this:

| col_one | col_two |
|     101 |     202 |
|     808 | 1000309 |

With the output from pdftotext, I had one text file with fourteen sections (one per original page) of somewhat ragged columns of data. I used the Emacs rectangle commands to add columns of pipe symbols into the raw text, then copied the block of ragged text into an Org table, where it would be nicely formatted automatically.

Here’s what it looks like to start.

Emacs screenshot 1: raw text

Here I’ve added four columns of pipes (using C-x <SPC> to go into rectangle mark mode, which is super cool). They don’t all line up, but that’s OK.

Emacs screenshot 1: raw text

Here I paste all that into an Org file. There’s a blank line between this and the nice-looking table above.

Emacs screenshot 1: raw text

I remove the blank line, hit TAB, and it all aligns.

Emacs screenshot 1: raw text

Beautiful! Then I use M-x org-table-export to write all that to a CSV file.
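Outside Emacs, a pipe-delimited table like this can be turned into CSV with a few lines of awk. This is only a rough stand-in for the Org export, under the assumption that no cell contains a pipe character; `org_table_to_csv` is a made-up name, and it quotes every field to stay safe with titles that contain commas.

```shell
# Read an Org-style table on stdin and print it as CSV, one quoted field per cell.
org_table_to_csv() {
    awk -F '|' '
        /^[ \t]*[|]-/ { next }                  # skip horizontal rule rows
        /^[ \t]*[|]/ {
            line = ""
            # Fields 2..NF-1 are the cells; 1 and NF lie outside the outer pipes.
            for (i = 2; i < NF; i++) {
                cell = $i
                gsub(/^[ \t]+|[ \t]+$/, "", cell)   # trim the alignment padding
                gsub(/"/, "\"\"", cell)              # double any embedded quotes
                line = line (i > 2 ? "," : "") "\"" cell "\""
            }
            print line
        }'
}
```

For example, piping the small `col_one`/`col_two` table through it gives `"col_one","col_two"` followed by one quoted line per data row.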

This is more Emacs information than most people need, but I want to show how powerful it is, and that a multipurpose tool like this can make life easier.


Now that the data was extracted, where should it go? Somewhere reliable … somewhere the data would be available forever, or close enough … somewhere not commercial … somewhere affiliated with York. The answer: the Scholars Portal Dataverse. Depositing your data explains how York researchers can use Dataverse. As it happens, the librarian in charge of York’s Dataverse has her office across from me in the library (back when we were in our offices). Minglu Wang asked a couple of questions and then set me up and sent me a long list of great resources about good data practices. She’s an expert on research data management and I strongly recommend anyone at York with data to preserve get in touch with her.

Librarian Bill now have my own dataverse and within it is the “Eresource costs” dataverse at the nice URL https://dataverse.scholarsportal.info/dataverse/eres.

Some analysis

There’s an R Markdown file in the Dataverse that you can load into RStudio or the like, or you can just copy and paste the lines into an R session. (I do my R sessions inside Emacs with ESS and Org, which you probably predicted.) Here’s some of what it has.

First, load in the tidyverse (install it if it’s not already there).

## install.packages("tidyverse")
library(tidyverse)

Now get the data right out of Dataverse (skipping the step where you have to click to agree to abide by the CC BY license, because I haven’t found out how to turn it off):

## Get the data from the CSV.
raw_costs <- read_csv("https://dataverse.scholarsportal.info/api/access/datafile/105969?format=original")

## Turn it into a better (longer) data structure.
## Replace all NAs with 0s while we're at it.
costs <- raw_costs %>%
  pivot_longer(c("F2017", "F2018"), names_to = "year", values_to = "cost") %>%
  ## cost is still a character column here (because of "REDACTED"), so use "0".
  replace_na(list(cost = "0"))

## Pick out all the products where the cost is known.
costs_known <- costs %>%
  filter(! cost == "REDACTED") %>%
  mutate(cost = as.numeric(cost))

This takes the “wide” format of the original data and makes it “long” and tidy. Notice how instead of “F2017” and “F2018” columns there’s one “year” column that has the value of either F2017 or F2018.

> costs
# A tibble: 1,722 x 4
   vendor               title                                  year  cost
   <chr>                <chr>                                  <chr> <chr>
 1 (miscellaneous)      Open Access                            F2017 0
 2 (miscellaneous)      Open Access                            F2018 43187
 3 ACM                  ACM Digital Library                    F2017 5780
 4 ACM                  ACM Digital Library                    F2018 6017
 5 Adam Matthew Digital American Consumer Culture              F2017 REDACTED
 6 Adam Matthew Digital American Consumer Culture              F2018 REDACTED
 7 Adam Matthew Digital American History I                     F2017 114
 8 Adam Matthew Digital American History I                     F2018 122
 9 Adam Matthew Digital American Indian Histories and Cultures F2017 98
10 Adam Matthew Digital American Indian Histories and Cultures F2018 105

With this in hand, we can make a chart showing the vendors paid over $100,000 (Canadian).

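## Total known costs per vendor per year.
## (Assumed definition: it matches the vendor, year and total columns used below.)
total_vendor_costs <- costs_known %>%
  group_by(vendor, year) %>%
  summarise(total = sum(cost)) %>%
  ungroup()

## Short list of vendors where YUL spent the most.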
major_amount <- 100000
major_amount_pretty <- format(major_amount, big.mark = ",", scientific = FALSE)

major_vendors <- total_vendor_costs %>% filter(total > major_amount) %>% pull(vendor) %>% unique()

major_vendor_costs <- total_vendor_costs %>% filter(vendor %in% major_vendors)

## The reorder function sorts the vendor list by total costs, which
## makes the chart much more readable.
## coord_flip() helps make this kind of chart more readable.

major_vendor_costs %>%
    ggplot(aes(x = reorder(vendor, total), y = total / 1000, fill = year)) +
    geom_col(position = "dodge") +
    geom_label(aes(label = round(total / 1000, -1)),
               position = position_dodge(0.9),
               show.legend = FALSE) +
    coord_flip() +
    labs(title = paste0("York University Libraries eresource costs: vendors paid over $", major_amount_pretty, " total"),
         subtitle = "Does not include all costs because some were redacted",
         x = "",
         y = "$000 (rounded)",
         fill = "",
         caption = "William Denton <wdenton@yorku.ca>, CC BY")
Chart showing total amount spent on major vendors

In F2018 we know the Libraries paid Elsevier about $1.57 million. And that’s not including the sixteen products where the prices were redacted! The most expensive product was ScienceDirect—no surprise—at about $1.4 million. The Elsevier F2018 annual report says it had an “adjusted operating profit margin” of 31.3% that year—yes, 31.3%—so of that $1.57 million that we know, $491,000 was pure profit for the company. The Libraries’ collections budget is (in this fiscal year) on the order of $13 million. That means close to 4% of the collections budget goes straight to Elsevier profit.
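The arithmetic behind those last two figures can be checked in a couple of lines. A small sketch using only the numbers quoted above; `elsevier_profit_share` is just an illustrative name.

```shell
# Usage: elsevier_profit_share AMOUNT_PAID PROFIT_MARGIN COLLECTIONS_BUDGET
elsevier_profit_share() {
    awk -v paid="$1" -v margin="$2" -v budget="$3" 'BEGIN {
        profit = paid * margin
        printf "profit: $%.0f\n", profit
        printf "share of collections budget: %.2f%%\n", profit / budget * 100
    }'
}

# $1.57 million paid, 31.3% margin, roughly $13 million collections budget.
elsevier_profit_share 1570000 0.313 13000000
# prints:
#   profit: $491410
#   share of collections budget: 3.78%
```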

This is an example of a major issue in scholarly publishing. See SPARC’s Big Deal Cancellation Tracking for more about all this.

Here’s a chart counting redactions by vendor:

costs_redacted <- costs %>% filter(cost == "REDACTED")

costs_redacted %>%
    count(vendor, year) %>%
    ggplot(aes(x = reorder(vendor, n), y = n, fill = year)) +
    geom_col(position = "dodge") +
    coord_flip() +
    labs(title = "York University Libraries eresource costs: vendors with redacted costs",
         subtitle = "Count of products where costs were redacted because of vendor license restrictions",
         x = "",
         y = "Number of products",
         fill = "",
         caption = "William Denton <wdenton@yorku.ca>, CC BY")
Chart counting redactions per vendor

Why the redactions?

Because I thought it would make things go faster to ask for costs where there was no non-disclosure agreement. It didn’t. Along the way I learned that FIPPA doesn’t care about non-disclosure agreements in contracts. But my original request was for costs that didn’t have an NDA, and I let it ride.

What next?

I’m going to file for the eresource costs for F2019 and F2020, of course. With no redactions.


code4lib libraries

There’s that great old quote from Jamie Zawinski (though there’s more behind it):

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

I paraphrase:

Some librarians, when confronted with a problem, think “I know, I’ll make a LibGuide.” Now they have two problems.

Combining multiple GPX tracks into one file

code geo

As I mentioned, I use OsmAnd when I’m out on my walks. I record my “tracks”—the path I take as I walk—and the data is stored in GPX files, one per day. I like to see them all at once on the map, so I can see everywhere I’ve walked, but now I have 53 GPX files and turning them all on or off is a pain. I needed some way to combine them all into one.

It turns out GPSBabel does the trick. I wrote this shell script to do the merging. It depends on all the GPX files being in a gpx/ directory.


# Build one line of " -f FILE" arguments, one per GPX file.
(for FILE in gpx/*
do
     echo -n " -f $FILE"
done
) > files-to-combine.txt

gpsbabel -i gpx -b files-to-combine.txt -o gpx -F covid-walks.gpx

rm files-to-combine.txt

sed -i "s#<name>.*</name>#<name>Covid-19 Walks</name>#" covid-walks.gpx

The -b is for batch processing and makes it easy to specify all the input files. The sed line gives the new file a nice name, instead of being a concatenation of all of the old ones. There may be some way to do this in GPSBabel (it has a very long list of possible filters) but a GPX file is just XML, which is just text, and good old sed will always do you right in such a case.
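Here is a small sketch of the two text-handling steps on their own, without GPSBabel. It assumes a scratch directory with two empty stand-in files (the dates are made up), builds the batch argument line with printf instead of a loop, and runs the same sed substitution on a sample name line.

```shell
# Hypothetical scratch directory with two empty stand-in GPX files.
workdir=$(mktemp -d)
mkdir "$workdir/gpx"
touch "$workdir/gpx/2020-04-01.gpx" "$workdir/gpx/2020-04-02.gpx"
cd "$workdir"

# printf repeats its format once per argument, so no loop is needed.
printf ' -f %s' gpx/* > files-to-combine.txt
batch_args=$(cat files-to-combine.txt)
echo "$batch_args"

# The same sed substitution as in the script, on a made-up <name> line.
sample='<name>2020-04-01 2020-04-02</name>'
renamed=$(printf '%s\n' "$sample" | sed "s#<name>.*</name>#<name>Covid-19 Walks</name>#")
echo "$renamed"
```

Two things to keep in mind: file names containing spaces would break the argument line either way, and the sed pattern rewrites every line that has a name element on it, not only the first.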

Copying covid-walks.gpx into /Android/data/net.osmand.plus/files/tracks/rec on my phone (here’s how I mount it) makes it visible to OsmAnd, so I just have the one file that needs visibility turned on or off.

A Visitor for Bear read by Donnie Yen


Donnie Yen read A Visitor for Bear, written by Bonny Becker and illustrated by my mother Kady MacDonald Denton, for Save the Children’s Save with Stories campaign: watch it here. It’s a great reading, and wonderful to hear it in a different language (Cantonese, I think). I just wish they’d showed the illustrations a couple more times!

Screenshot from the video, with Donnie Yen

Emacs refactoring


I spent a while updating my Emacs configuration, and this time nothing went wrong! I’m pleased with the refreshed and refactored setup. Everything looks the same as it did before, except for some colours in the dark Solarized theme, because I’m using a different package for that. Behind the scenes it’s all a lot tidier.

The main change is that everything now depends on John Wiegley’s use-package.

;; If it's not installed, get it.
(unless (package-installed-p 'use-package)
  (package-install 'use-package))

(eval-when-compile
  (require 'use-package))

;; Make sure that if I want a package, it gets installed automatically.
(setq use-package-always-ensure t)
Screenshot of Emacs while I'm editing this post

With that in place, for every package I want to use, I use use-package to magically get it, install it and configure it. An average example is this, for the Emacs mode to use RuboCop, the Ruby syntax checker that tells you, as you’re writing, when you’ve made a mistake in your Ruby script.

;; Rubocop for pointing out errors (https://github.com/bbatsov/rubocop)
(use-package rubocop
  :diminish rubocop-mode
  :config
  (add-hook 'ruby-mode-hook 'rubocop-mode))

The diminish line keeps away some cruft that says “RuboCop mode is on,” which I don’t need to be reminded of; and the add-hook makes it so that whenever the editor is in Ruby mode (i.e., editing a Ruby script) rubocop-mode is turned on automatically.

I’m now caught up to where a lot of other people were years ago. Next: moving it all into one big Org file!

One thing worth noting is that I fixed some problems I was having where M-x package-autoremove kept removing packages I actually wanted. It turned out the problem was with the variable package-selected-packages, which was introduced in 25.1 and is defined in custom.el. It had a huge list of packages, many of which I don’t want any more, but some I do want weren’t in it. Just one of those things.

I fixed it by brute force. I deleted the line from custom.el, quit Emacs, restarted, and ran M-x package-autoremove. Emacs said, Do you really want to remove all the 57 packages you have installed? I said yes. They got wiped out. I quit again, restarted, and this time use-package installed everything I wanted, updated package-selected-packages, and now everything is working correctly.

Backups and encrypting disks in Linux


I did some maintenance on my backup system this week, and for posterity I’ll document how it works, including how to set up an encrypted hard drive in Linux (in my case, Ubuntu).

Hard drive docks

Backup drive in a dock

I do backups to hard drives sitting in a dock, which I attach when needed with USB. The dock I have right now is made by Vantec and takes two hard drives. I got it, and the drives, at Canada Computers (which these days, by the way, has a great curbside pickup service).

These docks make it easy to have lots of cheap storage when needed. You can leave drives sitting in them or take them out or swap them around (carefully) as needed and then you just plug it in and there are your terabytes of disk. For backups, where speed doesn’t matter, you can buy slower and cheaper drives. The larger 3.5” drives are cheaper than 2.5” drives that go inside laptops, or, of course, solid state drives.

Backup scripts

There’s a primary backup drive, to which I copy everything, and a mirror, which is an exact copy of the primary. Every week I run my backup scripts to refresh everything on the main drive, and then I refresh the mirror. The primary right now is CRYPT_THREE, and backup scripts look like this one, which backs up my web site from its host:

#!/bin/sh -x



rsync -avz --rsh="ssh -q" --delete --times --progress pair:public_html/miskatonic.org/ "$PAIRBACKUPS/miskatonic.org/"

(Now that I look at it, those rsync options need tidying. --times is redundant because --archive already implies it, and I like to use GNU-style long option names, so it should be --archive --verbose --compress --delete --progress --rsh="ssh -q". I don’t think I need the rsh option, though. I’ll fiddle with it. But it works.)

Backing up my laptop looks like this:

#!/bin/sh -x


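# Primary backup drive, as named in the text.
BACKUP_DRIVE=CRYPT_THREE

# Save a list of all installed packages.
dpkg --get-selections > ~/backups/marcus-packages.txt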


rdiff-backup --verbosity 5         \
             --include /home/wtd/             \
             --include /var/www/              \
             --include /etc/apache2/          \
             --include /etc/hosts             \
             --exclude '**'                   \
             / /media/wtd/${BACKUP_DRIVE}/backups/marcus/

The dpkg command makes a list of all of the packages currently installed, and dump-library.sh does a dump of the database behind my personal library catalogue, Mrs. Abbott.

rdiff-backup does differential backups, taking snapshots of my files at that moment. I can go back and look at things as they were last month or a couple of years ago. All the other backups are just mirroring what’s there on remote machines, but for the laptop where I do everything this means if I need something I deleted years ago I can go back and find it. (Every now and then I wipe out a year’s worth with something like rdiff-backup --remove-older-than 2014-01-01 /media/wtd/CRYPT_THREE/backups/marcus/.)

And now that I look at this I see there’s more in /etc/ I should be backing up, so I’ll tweak that.

All those scripts get everything onto my primary backup drive, currently CRYPT_THREE. This mirrors it to CRYPT_TWO (and again --times is redundant):



BACKUP_DRIVE=CRYPT_THREE
MIRROR_DRIVE=CRYPT_TWO

rsync --archive --verbose --delete --times --exclude "/lost+found/" /media/wtd/${BACKUP_DRIVE}/ /media/wtd/${MIRROR_DRIVE}/
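To see what --delete and the lost+found exclusion do together, here is a sketch that runs the same kind of mirror between two scratch directories. The directory and file names are made up, and it assumes rsync is installed.

```shell
# Scratch directories standing in for the two drives (hypothetical paths).
scratch=$(mktemp -d)
mkdir -p "$scratch/primary/lost+found" "$scratch/mirror/lost+found"
echo "current" > "$scratch/primary/notes.txt"
echo "stale"   > "$scratch/mirror/deleted-last-week.txt"
echo "primary" > "$scratch/primary/lost+found/p"
echo "mirror"  > "$scratch/mirror/lost+found/m"

# Same options as the mirror script, minus --verbose and the redundant --times.
rsync --archive --delete --exclude "/lost+found/" \
      "$scratch/primary/" "$scratch/mirror/"

# List what is left on the mirror side.
mirror_listing=$(cd "$scratch/mirror" && find . -type f | sort)
echo "$mirror_listing"
```

Afterwards the mirror holds notes.txt and its own lost+found/m. The stale file is gone because of --delete, and nothing in either lost+found directory is copied or removed, because an excluded path is skipped on the sending side and protected on the receiving side.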

Making an encrypted drive

Making those encrypted drives takes some special commands. Here’s how I did it when I bought a new 4 TB drive because my 2 TB backups were running out of space. (Don’t ask me how dm-crypt actually works.)

First, I took out the other drives from the hard drive dock. I put the new drive in and turned it on. It’s unformatted, so it didn’t get mounted, but the computer knew it was there and gparted saw it. It was identified as /dev/sdb. I set the partition table type to GPT (GUID Partition Table).

Always make sure you’re formatting and encrypting the new hard drive and not the drive in your laptop with all your files on it! That way madness lies. Scrutinize the /dev/sdb stuff very carefully. My main drive is /dev/sda. A, B, A, B … I always check multiple times before I do anything serious.
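That check could be partly automated. The following is only a sketch: is_safe_target is a hypothetical helper, not part of cryptsetup, and it just compares the target against the device that holds the root file system, using the device names from this post.

```shell
# is_safe_target TARGET ROOT_SOURCE
# Succeeds only if ROOT_SOURCE (the device holding /) is neither TARGET
# itself nor one of its partitions. Hypothetical helper for illustration.
is_safe_target() {
    target=$1
    root_source=$2
    case "$root_source" in
        "$target"*) return 1 ;;   # e.g. /dev/sda2 lives on /dev/sda
        *)          return 0 ;;
    esac
}

# Example with the names from the text: the system disk is /dev/sda.
if is_safe_target /dev/sdb /dev/sda2; then
    verdict="ok to format /dev/sdb"
else
    verdict="refusing: /dev/sdb holds the running system"
fi
echo "$verdict"
```

On a real system the second argument could come from findmnt -n -o SOURCE /. An encrypted or LVM root shows up as /dev/mapper/something, which this prefix test cannot trace back to a physical disk, so it is no substitute for checking by eye.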


sudo cryptsetup luksFormat /dev/sdb

Here I confirmed I know what I’m doing and then entered the passphrase for the disk. I always type the passphrase in by hand (twice) to make sure nothing funny happens from copying and pasting, like a weird end-of-line character or a space sneaking in accidentally.

Then I put a file system on it and named it, in this case CRYPT_THREE.

sudo cryptsetup open /dev/sdb CRYPT_THREE
sudo mkfs.ext4 /dev/mapper/CRYPT_THREE
sudo cryptsetup close CRYPT_THREE
sudo cryptsetup --type luks open /dev/sdb CRYPT_THREE
sudo cryptsetup close CRYPT_THREE

I don’t know where I got that incantation, but it’s copied from somewhere. I don’t know if the close followed directly by an open could be collapsed, but it works and I’m not going to touch it. (On the other hand, there’s nothing to lose when you’re setting up a new hard drive because it’s empty and you can just reformat it, so maybe next time I’ll look into it.)

Now it was ready. I safely unmounted the drive and turned the dock off, then turned it on and checked what happened. The system saw the drive and asked for the passphrase, then mounted it. On Ubuntu, it shows up as /media/wtd/CRYPT_THREE. It works!

Moving drives around

Before I got the new 4 TB drive, I had two 2 TB drives. Call them A and B: A was the primary, B was the mirror. I wanted to make the new drive the primary, and keep the mirror. The steps were:

  • Format the new drive, C.
  • Mirror A to C.
  • Change my scripts so that they back up to C.
  • Run all my scripts to do fresh backups to C.
  • Mirror C to B.
  • Confirm everything works. (If it doesn’t, A is still good to be the primary.)

Now C is my primary backup and B still the mirror. That left me with A, which became my offsite backup. Normally, next I’d take B offsite and bring back A, then mirror C to A and make A my local mirror, and keep switching A and B regularly. However, by then I will have moved everything to 4 TB drives, so I’ll have D and E. I’ll set them up and mirror C to both of them, then use D as my local mirror and take E offsite. When I bring back A that will leave me with A and B as superfluous 2 TB drives. I have some ideas for how to use them, and will write them up if things work out.

Here is my offsite backup tucked into a special protective case before I took it somewhere (obeying all isolation directives) to sit on a quiet shelf.

Backup drive in a protective case
