Early this afternoon I was walking south down a major street in Toronto near where I live. A red car, southbound, appears from behind me, pulls to the side, and parks a short ways ahead of me. A grey-haired woman, about 60, short and squat, gets out from the passenger side. She is wearing a long wool coat and a longer dress or skirt, and has a long scarf around her neck. She comes towards me on the sidewalk, walking in the opposite direction to her previous travel. I assume she is going to one of the apartment buildings along the street.
When she is very near, she looks at me and says, “Do you speak Polish?”
I tell her I don’t.
She nods. “Thank you.” Then she walks back to the small red car and it drives off.
I have been unable to come up with any reasonable explanation for this. Do I look, from behind, like I speak Polish? What if I had said yes? Perhaps some things are best left unknown.
In “Org clocktables I: The daily structure” I explained how I track my time working at an academic library, clocking in to projects that are categorized as PPK (“professional performance and knowledge,” our term for “librarianship”), PCS (“professional contribution and standing,” which covers research, professional development and the like) or Service. I do this by checking in and out of tasks with the magic of Org.
I’ll add a day to the example I used before, to make it more interesting. This is what the raw text looks like:
All raw Org text looks ugly, especially all those LOGBOOK and PROPERTIES drawers. Don’t let that put you off. This is what it looks like on my screen with my customizations (see my .emacs for details):
At the bottom of the month I use Org’s clock table to summarize all this.
I just put in the BEGIN/END lines and then hit C-c C-c and Org creates that table. Whenever I add some more time, I can position the pointer on the BEGIN line and hit C-c C-c and it updates everything.
Now, there are lots of commands I could use to customize this, but this is pretty vanilla and it suits me. It makes it clear how much time I have down for each day and how much time I spent in each of the three pillars. It’s easy to read at a glance. I fiddled with various options but decided to stay with this.
It looks like this on my screen:
That’s a start, but the data is not in a format I can use as is. The times are split across different columns, there are multiple levels of indents, there’s a heading and a summation row, etc. But! The data is in a table in Org, which means I can easily ingest it and process it in any language I choose, in the same Org file. That’s part of the power of Org: it turns raw data into structured data, which I can process with a script into a better structure, all in the same file, mixing text, data and output.
Which language, though? A real Emacs hacker would use Lisp, but that’s beyond me. I can get by in two languages: Ruby and R. I started doing this in Ruby and got things mostly working, then realized what the right steps were and switched to R.
Here’s the plan:
ignore “Headline” and “Total time” and “2017-12 December” … in fact, ignore everything that doesn’t start with “\_”
clean up the remaining lines by removing “\_”
the first line will be a date stamp, with the total day’s time in the first column, so grab it
after that, every line will either be a PPK/PCS/Service line, in which case grab that time
or it will be a new date stamp, in which case capture that information and write out the previous day’s information
continue on through all the lines
until the end, at which point a day is finished but not written out, so write it out
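The steps above can be sketched in base R. This is a minimal illustration, not the function this post describes: the `clock` data frame, its two-column layout, the times in it, and the names `to_minutes` and `parse_clock` are all assumptions made for the example, and durations are kept as plain minutes rather than hms values.

```r
# Hypothetical, simplified clock table: one headline column and one time column.
# A real Org clocktable spreads the times across several columns.
clock <- data.frame(
  headline = c("Headline", "*Total time*", "2017-12 December",
               "\\_ [2017-12-01 Fri]", "\\_ PPK", "\\_ PCS", "\\_ Service",
               "\\_ [2017-12-04 Mon]", "\\_ PPK", "\\_ Service"),
  time = c("Time", "*14:30*", "14:30",
           "7:30", "3:00", "1:00", "2:45",
           "7:00", "5:15", "1:00"),
  stringsAsFactors = FALSE
)

# Convert an "H:MM" string to a number of minutes.
to_minutes <- function(hm) {
  parts <- as.integer(strsplit(hm, ":", fixed = TRUE)[[1]])
  parts[1] * 60 + parts[2]
}

parse_clock <- function(tbl) {
  # Ignore everything that doesn't start with "\_", then strip the marker.
  tbl <- tbl[startsWith(tbl$headline, "\\_"), ]
  tbl$headline <- trimws(sub("^\\\\_", "", tbl$headline))

  days <- list()
  current <- NULL
  for (i in seq_len(nrow(tbl))) {
    h <- tbl$headline[i]
    mins <- to_minutes(tbl$time[i])
    if (grepl("^\\[[0-9]{4}-[0-9]{2}-[0-9]{2}", h)) {
      # New date stamp: write out the previous day, then start a new one.
      if (!is.null(current)) days[[length(days) + 1]] <- current
      current <- data.frame(date = substr(h, 2, 11), total = mins,
                            PPK = 0, PCS = 0, Service = 0,
                            stringsAsFactors = FALSE)
    } else if (h %in% c("PPK", "PCS", "Service")) {
      current[[h]] <- mins
    }
  }
  # The last day is finished but not yet written out.
  if (!is.null(current)) days[[length(days) + 1]] <- current

  out <- do.call(rbind, days)
  # Time in the day that is not assigned to any of the three pillars.
  out$lost <- out$total - out$PPK - out$PCS - out$Service
  out
}

daily <- parse_clock(clock)
```

With this made-up input, `daily` has one row per day, with the day's total, the minutes under each pillar, and whatever is left over.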
I did this in R, using three packages to make things easier. For managing the time intervals I’m using hms, which seems like a useful tool. It needs to be a very recent version to make use of some time-parsing functions, so it needs to be installed from GitHub. Here’s the R:
All of that is in a SRC block like below, but I separated the two in case it makes the syntax highlighting clearer. I don’t think it does, but such is life. Imagine the above code pasted into this block:
Running C-c C-c on that will produce no output, but it does create an R session and set up the function. (Of course, all of this will fail if you don’t have R (and those three packages) installed.)
With that ready, now I can parse that monthly clocktable by running C-c C-c on this next source block, which reads in the raw clock table (note the var setting, which matches the #+NAME above), parses it with that function, and outputs cleaner data. I have this right below the December clock table.
That’s what I wanted. The code I wrote to generate it could be better, but it works, and that’s good enough.
Notice all of the same dates and time durations are there, but they’re organized much more nicely—and I’ve added “lost.” The “lost” count is how much time in the day was unaccounted for. This includes lunch (maybe I’ll end up classifying that differently), short breaks, ploughing through email first thing in the morning, catching up with colleagues, tidying up my desk, falling into Wikipedia, and all those other blocks of time that can’t be directly assigned to some project.
My aim is to keep track of the “lost” time and to minimize it, by a) not wasting time and b) properly classifying work. Talking to colleagues and tidying my desk is work, after all. It’s not immortally important work that people will talk about centuries from now, but it’s work. Not everything I do on the job can be classified against projects. (Not the way I think of projects—maybe lawyers and doctors and the self-employed think of them differently.)
The one technical problem with this is that when I restart Emacs I need to rerun the source block with the R function in it, to set up the R session and the function, before I can rerun the simple “update the monthly clocktable” block. However, because I don’t restart Emacs very often, that’s not a big problem.
The next stage of this is showing how I summarize the cleaned data to understand, each month, how much of my time I spent on PPK, PCS and Service. I’ll cover that in another post.
At work I’m analysing usage of ebooks, as reported by vendors in COUNTER reports. The Excel spreadsheet versions are ugly but a little bit of R can bring them into the tidyverse and give you nice, clean, usable data that meets the three rules of tidy data:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
There are two kinds of COUNTER reports for books: BR1 (“Number of Successful Title Requests by Month and Title”) counts how many times people looked at a book and BR2 (“Number of Successful Section Requests by Month and Title”) counts how many times they looked at a part (like a chapter) of a book. The reports are formatted in the same human-readable way, so this code works for both, but be careful to handle them separately.
They start with seven lines of metadata about the report, and then you get the actual data. There are a few required columns, one of which is the title of the book, but that column doesn’t have a heading! It’s blank! Further to the right are columns for each month of the reporting period. Rows are for books or sections, but there is also a “Total for all titles” row that sums them all up.
This formatting is human-readable but terrible for machines. Happily, that’s easy to fix.
I use yulr, my own package of little helper functions. If you want to use it you’ll need to install it specially, as explained in its documentation.
As it happens the COUNTER reports are all in one Excel spreadsheet, organized by sheets. Brill’s 2014 report is in the sheet named “Brill 2014,” so I need to pick it out and work on it. The flow is:
load in the sheet, skipping the first seven lines (including the one that tells you if it’s BR1 or BR2)
cut out columns I don’t want with a minus select
use gather to reshape the table by moving the month columns to rows, where the month name ends up in a column named “month;” the other fields that are minus selected are carried along unchanged
rename two columns
reformat the month name into a proper date, and rename the unnamed title column (which ended up being called X__1) while truncating it to 50 characters
filter out the row that adds up all the numbers
reorder the columns for human viewing
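The same flow can be sketched in base R on a tiny stand-in for one sheet. This is an illustration under stated assumptions, not the tidyverse pipeline described here: the `raw` data frame, its column names (including the placeholder `X1` for the unnamed title column and the `Jan-2014` style of month heading) and the usage numbers are all made up for the example.

```r
# Hypothetical miniature of a BR1 sheet, as if the metadata lines were already skipped.
raw <- data.frame(
  X1 = c("Total for all titles", "American Diplomacy", "A World of Beasts"),
  Publisher = "Brill",
  Platform = "BOPI",
  ISBN = c(NA, "9789004214149", "9789004222656"),
  `Reporting Period Total` = c(9, 4, 5),
  `Jan-2014` = c(3, 1, 2),
  `Feb-2014` = c(6, 3, 3),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Cut out a column that isn't wanted (the "minus select").
raw <- raw[, setdiff(names(raw), "Reporting Period Total")]

# Move the month columns to rows: one row per title per month.
month_cols <- grep("^[A-Za-z]{3}-[0-9]{4}$", names(raw), value = TRUE)
long <- do.call(rbind, lapply(month_cols, function(m) {
  data.frame(month = m,
             usage = raw[[m]],
             ISBN = raw$ISBN,
             platform = raw$Platform,
             publisher = raw$Publisher,
             title = substr(raw$X1, 1, 50),  # truncate to 50 characters
             stringsAsFactors = FALSE)
}))

# Turn "Jan-2014" into a proper date, the first day of that month.
long$month <- as.Date(sprintf("%s-%02d-01",
                              substr(long$month, 5, 8),
                              match(substr(long$month, 1, 3), month.abb)))

# Filter out the row that adds up all the numbers.
long <- long[long$title != "Total for all titles", ]
```

The result has one observation per row (a title in a month), one variable per column, and one value per cell.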
Looking at this I think that date mutation business may not always be needed, but some of the date formatting I had was wonky, and this made it all work.
That line above just works for one year. I had four years of Brill data, and didn’t want to repeat the long line for each, because if I ever need to make a change I’d have to make it four times and if I missed one there’d be a problem. This is the time to create a function. Now my code looks like this:
That looks much nicer in Emacs (in Org, of course):
I have similar functions for other vendors. They are all very similar, but sometimes a (mandatory) Book DOI field or something else is missing, so a little fiddling is needed. Each vendor’s complete data goes into its own tibble, which I then glue together. Then I delete all the rows where no month is defined (which, come to think of it, I should investigate to make sure these aren’t being introduced by some mistake I made in reshaping the data), I add the ayear column so I can group things by academic year, and where a book has no recorded usage in a given month, I set the usage to 0 instead of NA.
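Those clean-up steps might look something like this in base R. It is only a sketch: the two small data frames and their numbers are invented, `academic_year` is a hypothetical helper, and the September start of the academic year is an assumption (it is at least consistent with January 2014 falling in ayear 2013).

```r
# Two hypothetical vendor tables with made-up usage numbers.
brill <- data.frame(month = as.Date(c("2014-01-01", "2014-09-01", NA)),
                    usage = c(NA, 3, 1),
                    publisher = "Brill",
                    stringsAsFactors = FALSE)
other <- data.frame(month = as.Date(c("2015-08-01", "2015-09-01")),
                    usage = c(2, NA),
                    publisher = "Other vendor",
                    stringsAsFactors = FALSE)

# Glue the vendor tables together.
ebooks <- rbind(brill, other)

# Drop rows where no month is defined.
ebooks <- ebooks[!is.na(ebooks$month), ]

# Missing usage becomes 0 instead of NA.
ebooks$usage[is.na(ebooks$usage)] <- 0

# Academic year: months before start_month belong to the previous year.
# start_month = 9 (September) is an assumption.
academic_year <- function(d, start_month = 9) {
  y <- as.integer(format(d, "%Y"))
  m <- as.integer(format(d, "%m"))
  ifelse(m >= start_month, y, y - 1L)
}
ebooks$ayear <- academic_year(ebooks$month)
```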
The data now looks like this (truncating the title even more for display here):
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1343899 obs. of 7 variables:
$ month : Date, format: "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01" ...
$ usage : num 0 0 0 0 0 0 0 0 0 0 ...
$ ISBN : chr "9789004216921" "9789047427018" "9789004222656" "9789004214149" ...
$ platform : chr "BOPI" "BOPI" "BOPI" "BOPI" ...
$ publisher: chr "Brill" "Brill" "Brill" "Brill" ...
$ title : chr "A Commentary on the United Nations Convention on t" "A Wandering Galilean: Essays in Honour of Seán Fre" "A World of Beasts: A Thirteenth-Century Illustrate" "American Diplomacy" ...
$ ayear : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
The other day I saw a link from NPR that includes a video the band posted that trims the 70-odd minute long performance to a delightful 7.5 minutes, with the band waking up and moving through a day.
If you have access to a streaming music service, there will probably be several performances of “In C” available. If you like one, try them all. It’s not to everyone’s taste, but I think it’s an incredible composition and every performance I’ve heard has had at minimum a good dose of magic in it.
I do not recommend this book to anyone or for any collection. I don’t normally post negative reviews of books, but I saw so many errors in this one that I feel people need to know.
I was concerned as soon as I started reading chapter 1, but “1.7 Open Source Software” was where I got really worried:
The term open source often refers to something that can be modified because its gate is publicly accessible. In the context of software, open source means that the software code can be modified or enhanced by anyone. The open source movement began in the late 1970s, when two separate organizations promoted the idea of software that is available for anyone to use or modify. The first organization that aimed to create a free operating system was General Public License (GPL). The leading person behind this movement was Richard Stallman. The second organization was Open Source Initiative (OSI), under the leadership of Bruce Perens and Eric S. Raymond.
“Its gate”? Further: the GPL is a license, not an organization. RMS has been working on and for free software (there’s a difference) since the seventies, but the Free Software Foundation wasn’t created until 1985. The Open Source Initiative began in 1998.
The next paragraph confuses R with its predecessor, S. The third paragraph begins:
R is similar to other programming languages, such as C, Java, and Perl, in that is helps people perform a wide variety of computing tasks by giving them access to various commands.
I don’t understand how that sentence could be written by anyone who actually programs.
Skipping ahead past the very introductory statistics stuff, which is confusing, let’s look at “4.3 Introduction of Basic Functionality in R.” It will mislead any reader.
For example on page 53, “4.3.2 Writing Functions” begins:
When you write an R function there are two things you should keep in mind: the arguments and the return value.
Certainly true! True in any language. This is not the time to introduce functions, however. It’s too early.
The book then gives this example:
In reality this will look like:
There are many, many code snippets in the book where the output is wrong and the formatting bad.
The section concludes:
We will encounter functions throughout the book. For example, there is a function named “+” that does simple addition. However, there are perhaps two thousand or so functions built in to R, many of which never get called by the user directly but serve to help out other functions.
This is the reader’s first introduction to functions!
“4.3.4 The Return Value” says:
In R, the return value is a function that has exactly one return value. If you need to return more than one thing, you’ll need to make a list (or possibly a matrix, data.frame, or table display). We will discuss matrix, data.frame, and table display in chapter 17.
That first sentence is incorrect, and the section is utterly unhelpful.
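For comparison, here is a minimal sketch of how return values behave in R: a function returns a single object (the value of its last evaluated expression, or whatever is passed to `return()`), and several results can be bundled into a list. The function name `describe` is made up for this example.

```r
describe <- function(v) {
  # Bundle two results into one list; the list is the single object returned.
  list(mean = mean(v), range = range(v))
}

result <- describe(c(2, 4, 9))
result$mean   # 5
result$range  # 2 9
```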
Moving to the next section, let’s look at a few examples from “4.4 Introduction to Variables in R.”
On page 54, it says, “In any programming language, we often encounter seven types of data,” namely numeric (“decimal values, also known as numeric values”), integers, strings, characters, factors, fractions (“represents a part of a whole or any number with equal parts”) and “logical.” I offer this as an essay question in first-year computer science exams: “‘In any programming language, we often encounter seven types of data.’ Discuss.”
On page 55 there’s discussion of assigning variables. It says, “The most common operator assignment is <- and ==, the first assignment being the preferred way.” == is the equality test! This should be =!
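The difference is easy to check at the console. A minimal sketch, with variable names chosen only for illustration:

```r
x <- 5   # assignment, the preferred form
y = 5    # also assignment
same <- x == y   # equality test: compares x and y, changes neither

same   # TRUE
```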
Then assign() is introduced, though surely no beginning R user needs to know about it, and it’s introduced incorrectly. Here’s the example:
What is that trailing 4 doing there? I don’t know. Also, j is being made a string, but it’s shown here as an integer. The example should look like:
Next it says variable names can contain underscores, which is correct and certainly useful to know, but the example won’t work. This is what it shows:
This is what happens:
That’s because 38_a is not a valid variable name. A bit of looking around turns up the documentation in ?make.names, which says, “A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.” So a_38 is valid, but not 38_a.
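One way to test names against that rule, as a small sketch: `make.names()` returns a name unchanged only if it is already syntactically valid. The candidate names here are just examples.

```r
candidates <- c("a_38", "38_a", ".2way", ".way", "if")

# A name is syntactically valid if make.names() leaves it unchanged.
valid <- candidates == make.names(candidates)

data.frame(name = candidates, fixed = make.names(candidates), valid = valid)
```

Only `a_38` and `.way` come back unchanged; the invalid names get an `X` prepended, and the reserved word `if` gets a dot appended.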
Even if the example used a_38 it still wouldn’t work, because that variable hasn’t been defined yet:
Moving on to page 57, about characters, the book says, “A character is used to represent a single element such as a letter of the alphabet or punctuation. In order to convert objects into character values in R, we need to declare this value as character with the as.character() function.” The example is:
That code works and is correct, but why change to using = for assignment instead of the usual <-? Also, “0.14” is not a “single element”! Also, there’s no point in using as.character there, because it’s unnecessary; the function could be introduced later when one needs to convert some other data type to a string.
Furthermore, strings aren’t actually a different class:
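A minimal sketch of one way to check this: a single letter and a longer string report the same class, and wrapping a value that is already a string in `as.character()` changes nothing. The variable names are made up for the example.

```r
one_letter <- class("a")
longer     <- class("0.14")
converted  <- class(as.character("0.14"))

one_letter     # "character"
longer         # "character"
converted      # "character"
nchar("0.14")  # 4, so hardly a "single element"
```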
In the section on fractions we see that fractions aren’t actually built into R, because they require a special package to use them. The instructions on how to install that package are incorrect and will cause an error. Somebody must have noticed that because fractions have disappeared in the version on the web site. “‘In any programming language, we often encounter six types of data.’ Discuss.”
The section on logical variables says, “The logical value makes a comparison between variables and provides additional functionality by adding/subtracting/etc.” What?
This is the example (which should use <- for assignment):
What? That makes no sense. This is what happens when you run it:
That runs and is correct, but if this was the original intent, how on earth did it get mangled into what’s in the book? Why not say 1 > 2 and see what happens?
All of that is on pages 53–59 alone, where the book is introducing the most basic aspects of a programming language. I didn’t go further.
What little I read of basic statistics was confusing and unhelpful. I didn’t bother to go further on that either.
The web site has different code on it, but I don’t see any notices about errata or corrections.
Some of the many faults of the book could have been fixed by using methods of reproducible research. It’s possible to write a book mixing text and code in R and Markdown. Hadley Wickham and Garrett Grolemund did this in R for Data Science, an excellent book, and all of the source code is openly available.
Anyone in LIS looking to learn statistics and R is advised to look elsewhere. I will post recommendations as I find better books.
Recently I started tracking my time at work using Org’s clocking feature, and it’s working out very well. The actual act of tracking my time makes me much more focused, and I’m wasting less time and working more on important, planned, relevant things. It’s also helping me understand how much time I spend on each of the three main pillars of my work (librarianship + research and professional development + service). In order to understand all this well I wrote some code to turn Org’s clocktable into something more usable. This is the first of two or three posts showing what I have.
In York University Libraries, where I work, librarians and archivists have academic status. We are not faculty (that’s the professors), but we’re very similar. We’re in the same union. We have academic freedom (important). We get “continuing appointment,” not tenure, but the process is much the same.
University professors have three pillars to their work: teaching, research and service. Service is basically work that contributes to the running of the university: serving on committees (universities have lots of committees, and they do important work, like vetting new courses and programs, allocating research funds, or deciding who gets tenure), being on academic governance bodies such as faculty councils and Senate, having a position in the union, etc. Usually there’s a 40/40/20 ratio on these three areas: people spend about 40% of their time on teaching, 40% on research and 20% on service. This fluctuates term to term and year to year—and person to person—but that’s the general rule in most North American universities, as I understand it.
For librarians and archivists the situation can be different. Instead of teaching, let’s say we do “librarianship” as a catch-all term. (Or “archivy,” which the archivists assure me is a real word, but I still think it looks funny.) Then we also do professional development/research and service. In some places, like Laurentian, librarians have full parity with professors, and they have the 40/40/20 ratio. That is ideal. A regrettable example is Western, where librarians and archivists have to spend 75% of their time on professional work. That severely limits the contributions they can make both to the university and to librarianship and scholarship in general.
At York there is no defined ratio. For professors it’s understood to be the 40/40/20, but for librarians and archivists I think it’s understood that is not our ratio, but nothing is set out instead. (This, and that profs have a 2.5 annual course teaching load but we do not have an equivalent “librarianship load,” leads to problems.)
I have an idea of what the ratio should be, but I’m not going to say it here because this may become a bargaining issue. I didn’t know if my work matched that ratio because I don’t have exact details about how I spend my time. I’ve been doing a lot of service, but how much? How much of my time is spent on research?
This question didn’t come to me on my own. A colleague started tracking her time a couple of months ago, jotting things down each day. She said she hadn’t realized just how much damned time it takes to schedule things. I was inspired by her to start clocking my own time.
This is where I got to apply an aspect of Org I’d read about but never used. Org is amazing!
I keep a file, work-diary.org, where I put notes on everything I do. I changed how I use subheadings and now I give every day this structure:
“PPK” is “professional performance and knowledge,” which is our official term for “librarianship” or “archivy.” “PCS” is “professional contribution and standing,” which is the umbrella term for research and more for faculty. Right now for us that pillar is called “professional development,” but that’s forty-year-old terminology we’re trying to change, so I use the term faculty use. (Check the T&P criteria for a full explanation.)
First thing in the morning, I create that structure, then under the date heading I run C-u C-u C-c C-x C-i (where C-c means Ctrl-c). Now, I realize that’s a completely ridiculous key combination to exist, but when you start using Emacs heavily, you get used to such incantations and they become second nature. C-c C-x C-i is the command org-clock-in. As the docs say, “With two C-u C-u prefixes, clock into the task at point and mark it as the default task; the default task will then always be available with letter d when selecting a clocking task.” That will make more sense in a minute.
When I run that command, Org adds a little block under the heading:
The clock is running, and a little timer shows up in my mode line that tells me how long I’ve been working on the current thing.
I’ll spend a while deleting email and checking some web sites, then let’s say I decide to respond to an email about reference desk statistics, because I can get it done before I have to head over to a 10:30 meeting. I make a new subheading under PPK, because this is librarianship work, and clock into it with C-c C-x C-i. The currently open task gets closed, the duration is noted, and a new clock starts.
(Remember this doesn’t look ugly the way I see it in Emacs. There’s another screenshot below.)
I work on that until 10:15, then I make a new task (under Service) and check into it (again with C-c C-x C-i). I’m going to a monthly meeting of the union’s stewards’ council, and walking to the meeting and back counts as part of the time spent. (York’s campus is pretty big.)
The meeting ends at 1, and I head back to my office. Lunch was provided during the meeting (probably pizza or extremely bready sandwiches, but always union-made), so I don’t take a break for that. In my office I’m not ready to immediately settle into a task, so I hit C-u C-c C-x C-i (just the one prefix), which lets me “select the task from a list of recently clocked tasks.” This is where the d mentioned above comes in: a little list of recent tasks pops up, and I can just hit d to clock into the [2017-12-01 Fri] task.
Now I might get a cup of tea if I didn’t pick one up on the way, or check email or chat with someone about something. My time for the day is accruing, but not against any specific task. Then, let’s say it’s a focused day, and I settle in and work until 4:30 on a project about ebook usage. I clock in to that, then when I’m ready to leave I clock out of it with C-c C-x C-o.
In Emacs, this looks much more appealing.
That’s one day of time clocked. In my next post I’ll add another day and a clocktable, and then I’ll show the code I use to summarize it all into tidy data.
I’m doing all this for my own use, to help me be as effective and efficient and aware of my work habits as I can be. I want to spend as much of my time as I can working on the most important work. Sometimes that’s writing code, sometimes that’s doing union work, sometimes that’s chatting with a colleague about something that’s a minor thing to me but that takes an hour because it’s important to them, and sometimes that’s watching a student cry in my office and waiting for the moment when I can tell them that as stressful as things are right now it’s going to get better. (Women librarians get much, much more of this than men do, but I still get some. It’s a damned tough thing, doing a university degree.) I recommend my colleague Lisa Sloniowski’s award-winning article Affective Labor, Resistance, and the Academic Librarian (Library Trends Vol. 64, No. 4, 2016) for a serious look at all this.
In S02E04 of Riverdale, “The Town That Dreaded Sundown,” the serial killer is on the loose and Jughead goes to the public library to get some books on the subject. The library is a traditional TV show library, and the librarian is the stereotypical prim grey-haired woman. However, there’s a twist: her name is Mrs. Paroo. She’s named after Marian Paroo, “Marian the Librarian,” from The Music Man, which was set in River City.
Later Betty goes to the library to get some books on ciphers and cryptograms, but we don’t see Mrs. Paroo, perhaps because the library seems to be closed. How they got in, I don’t know, but the kids seem to be able to get into any building they want.
My theory about Riverdale is that it’s actually a soap opera set inside the Archie universe. There are a number of stories in the original comics where actors come to Riverdale and play characters based on Archie, Veronica and Betty. (This always leads to hilarity and confusion.) Riverdale is one of them—it exists inside the Archie universe—and I think Archie and the others are watching the show in confused amazement.