A short post about counting and aggregating in R, because I learned a couple of things while improving the work I did earlier in the year about analyzing reference desk statistics. I’ll post about that soon.
I often want to count things in data frames. For example, let’s say my antimatter equivalent Llib and I have been drinking some repetitive yet oddly priced beverages:
> bevs <- data.frame(cbind(name = c("Bill", "Llib"), drink = c("coffee", "tea", "cocoa", "water"), cost = seq(1:8)))
> bevs$cost <- as.integer(bevs$cost)
> bevs
name drink cost
1 Bill coffee 1
2 Llib tea 2
3 Bill cocoa 3
4 Llib water 4
5 Bill coffee 5
6 Llib tea 6
7 Bill cocoa 7
8 Llib water 8(Note how I specified two names and four drinks, but they repeated themselves to fill up the eight lines to equal the size of the cost sequence I specified. R does that automatically and it’s very useful.)
How many times does each name occur? That’s just basic counting, which is easy with the count function from Hadley Wickham’s excellent plyr package. Now, like a lot of R functions, the count help page is a bit intimidating.
> library(plyr)
> ?countThe help page says:
count package:plyr R Documentation
Count the number of occurences.
Description:
Equivalent to ‘as.data.frame(table(x))’, but does not include
combinations with zero counts.
Usage:
count(df, vars = NULL, wt_var = NULL)
...
Hmm! But there are examples at the bottom, and like all R documentation, you can just run then and look at what happens, which will probably explain everything. And here are more. How many times does each name occur?
> count(bevs, "name")
name freq
1 Bill 4
2 Llib 4How many times did each person drink each drink? To say you want to tally things up by more than one column use the c function to combine things into a vector:
> count(bevs, c("name", "drink"))
name drink freq
1 Bill cocoa 2
2 Bill coffee 2
3 Llib tea 2
4 Llib water 2It’s all pretty easy. Just tell count which data frame you’re using, then which columns you want to tally by, and it does the counting very quickly and efficiently.
How much did I spend in total? How much did I spend on each drink? aggregate does the job for this kind of figuring.
aggregate package:stats R Documentation
Compute Summary Statistics of Data Subsets
Description:
Splits the data into subsets, computes summary statistics for
each, and returns the result in a convenient form.
Again, there are good examples at the end of the help file. But for my example, first let’s see how much Llib and I have spent on each kind of drink:
> aggregate(cost ~ name + drink, data = bevs, sum)
name drink cost
1 Bill cocoa 10
2 Bill coffee 6
3 Llib tea 8
4 Llib water 12That command says “I want to apply the sum function to the cost column while aggregating rows based on unique values in the name and drink columns.”
How much did we each spend total? Forgot about aggregating by drink, and just aggregate by name:
> aggregate(cost ~ name, data = bevs, sum)
name cost
1 Bill 16
2 Llib 20What was the mean price we paid? Change sum to mean:
> aggregate(cost ~ name, data = bevs, mean)
name cost
1 Bill 4
2 Llib 5
Miskatonic University Press