Counting and aggregating in R

← In Fear We Trust 1 When will Firefox overtake Emacs? →

Counting and aggregating in R

24 September 2012 r

A short post about counting and aggregating in R, because I learned a couple of things while improving the work I did earlier in the year about analyzing reference desk statistics. I’ll post about that soon.

I often want to count things in data frames. For example, let’s say my antimatter equivalent Llib and I have been drinking some repetitive yet oddly priced beverages:

> bevs <- data.frame(cbind(name = c("Bill", "Llib"), drink = c("coffee", "tea", "cocoa", "water"), cost = seq(1:8)))
> bevs$cost <- as.integer(bevs$cost)
> bevs
  name  drink cost
1 Bill coffee    1
2 Llib    tea    2
3 Bill  cocoa    3
4 Llib  water    4
5 Bill coffee    5
6 Llib    tea    6
7 Bill  cocoa    7
8 Llib  water    8

(Note how I specified two names and four drinks, but they repeated themselves to fill up the eight lines to equal the size of the cost sequence I specified. R does that automatically and it’s very useful.)

How many times does each name occur? That’s just basic counting, which is easy with the count function from Hadley Wickham’s excellent plyr package. Now, like a lot of R functions, the count help page is a bit intimidating.

> library(plyr)
> ?count

The help page says:

count                   package:plyr                   R Documentation

Count the number of occurences.

Description:

     Equivalent to ‘as.data.frame(table(x))’, but does not include
     combinations with zero counts.

Usage:

       count(df, vars = NULL, wt_var = NULL)

...

Hmm! But there are examples at the bottom, and like all R documentation, you can just run then and look at what happens, which will probably explain everything. And here are more. How many times does each name occur?

> count(bevs, "name")
  name freq
1 Bill    4
2 Llib    4

How many times did each person drink each drink? To say you want to tally things up by more than one column use the c function to combine things into a vector:

> count(bevs, c("name", "drink"))
  name  drink freq
1 Bill  cocoa    2
2 Bill coffee    2
3 Llib    tea    2
4 Llib  water    2

It’s all pretty easy. Just tell count which data frame you’re using, then which columns you want to tally by, and it does the counting very quickly and efficiently.

How much did I spend in total? How much did I spend on each drink? aggregate does the job for this kind of figuring.

aggregate                package:stats                 R Documentation

Compute Summary Statistics of Data Subsets

Description:

     Splits the data into subsets, computes summary statistics for
     each, and returns the result in a convenient form.

Again, there are good examples at the end of the help file. But for my example, first let’s see how much Llib and I have spent on each kind of drink:

> aggregate(cost ~ name + drink, data = bevs, sum)
  name  drink cost
1 Bill  cocoa   10
2 Bill coffee    6
3 Llib    tea    8
4 Llib  water   12

That command says “I want to apply the sum function to the cost column while aggregating rows based on unique values in the name and drink columns.”

How much did we each spend total? Forgot about aggregating by drink, and just aggregate by name:

> aggregate(cost ~ name, data = bevs, sum)
  name cost
1 Bill   16
2 Llib   20

What was the mean price we paid? Change sum to mean:

> aggregate(cost ~ name, data = bevs, mean)
  name cost
1 Bill    4
2 Llib    5

← In Fear We Trust 1 When will Firefox overtake Emacs? →