A short post about counting and aggregating in R, because I learned a couple of things while improving the work I did earlier in the year about analyzing reference desk statistics. I’ll post about that soon.
I often want to count things in data frames. For example, let’s say my antimatter equivalent Llib and I have been drinking some repetitive yet oddly priced beverages:
> bevs <- data.frame(cbind(name = c("Bill", "Llib"), drink = c("coffee", "tea", "cocoa", "water"), cost = seq(1:8)))
> bevs$cost <- as.integer(bevs$cost)
> bevs
name drink cost
1 Bill coffee 1
2 Llib tea 2
3 Bill cocoa 3
4 Llib water 4
5 Bill coffee 5
6 Llib tea 6
7 Bill cocoa 7
8 Llib water 8
(Note how I specified two names and four drinks, but they repeated themselves to fill up the eight lines to equal the size of the cost sequence I specified. R does that automatically and it’s very useful.)
How many times does each name occur? That’s just basic counting, which is easy with the count
function from Hadley Wickham’s excellent plyr
package. Now, like a lot of R functions, the count
help page is a bit intimidating.
> library(plyr)
> ?count
The help page says:
count package:plyr R Documentation
Count the number of occurences.
Description:
Equivalent to ‘as.data.frame(table(x))’, but does not include
combinations with zero counts.
Usage:
count(df, vars = NULL, wt_var = NULL)
...
Hmm! But there are examples at the bottom, and like all R documentation, you can just run then and look at what happens, which will probably explain everything. And here are more. How many times does each name occur?
> count(bevs, "name")
name freq
1 Bill 4
2 Llib 4
How many times did each person drink each drink? To say you want to tally things up by more than one column use the c
function to combine things into a vector:
> count(bevs, c("name", "drink"))
name drink freq
1 Bill cocoa 2
2 Bill coffee 2
3 Llib tea 2
4 Llib water 2
It’s all pretty easy. Just tell count
which data frame you’re using, then which columns you want to tally by, and it does the counting very quickly and efficiently.
How much did I spend in total? How much did I spend on each drink? aggregate
does the job for this kind of figuring.
aggregate package:stats R Documentation
Compute Summary Statistics of Data Subsets
Description:
Splits the data into subsets, computes summary statistics for
each, and returns the result in a convenient form.
Again, there are good examples at the end of the help file. But for my example, first let’s see how much Llib and I have spent on each kind of drink:
> aggregate(cost ~ name + drink, data = bevs, sum)
name drink cost
1 Bill cocoa 10
2 Bill coffee 6
3 Llib tea 8
4 Llib water 12
That command says “I want to apply the sum function to the cost column while aggregating rows based on unique values in the name and drink columns.”
How much did we each spend total? Forgot about aggregating by drink, and just aggregate by name:
> aggregate(cost ~ name, data = bevs, sum)
name cost
1 Bill 16
2 Llib 20
What was the mean price we paid? Change sum
to mean
:
> aggregate(cost ~ name, data = bevs, mean)
name cost
1 Bill 4
2 Llib 5