More #accessyul tweet chart hacks

21 October 2012 r

I got curious about what else I might pull out of that set of tweets I had with a bit of basic R hacking, just for fun, while I’m on a train watching the glories of autumnal Ontario pass by.

What does the data frame look like? These are the column names:

> colnames(y)
 [1] "text"         "favorited"    "replyToSN"    "created"      "truncated"
 [6] "replyToSID"   "id"           "replyToUID"   "statusSource" "screenName"

The rows are too wide to show, but it's easy to pick out the screenName values, which are the Twitter usernames:

> head(y$screenName)
[1] djfiander    MarkMcDayter shlew        jordanheit   ENOUSERID
[6] AccessLibCon
197 Levels: mjgiarlo anarchivist adr elibtronic weelibrarian ... MarkMcDayter

197 different usernames among the 1500 tweets. Who tweeted the most? Is there some kind of power law going on, where a few people tweeted a lot and a lot of people tweeted very little?

> tweeters.count <- count(y, "screenName")
> head(tweeters.count)
    screenName freq
1     mjgiarlo    3
2  anarchivist    2
3          adr   88
4   elibtronic   11
5 weelibrarian   53
6  TheRealArty   57
> ggplot(tweeters.count) + aes(x = screenName, y = freq) + geom_bar(stat="identity")

Unsorted tweet counts

That's not very nice: it's not sorting by number of tweets. I send this chart back to the nether regions whence it sprang!

The chart looks that way because the bars in ggplot2 are ordered by the ordering of the levels in the factor.

> head(factor(tweeters.count$screenName))
[1] mjgiarlo     anarchivist  adr          elibtronic   weelibrarian
[6] TheRealArty
197 Levels: mjgiarlo anarchivist adr elibtronic weelibrarian ... MarkMcDayter

We want to get things sorted by freq. This confusing line will do it:

> tweeters.count$screenName <- factor(tweeters.count$screenName, levels = arrange(tweeters.count, desc(freq))$screenName)

This rearranges the ordering of the rows in the tweeters.count data frame according to the ordering in that arrange statement at the end. Here's what it does, using head to show only the first few lines:

> head(arrange(tweeters.count, desc(freq)))
    screenName freq
1    verolynne   96
2          adr   88
3       mkny13   61
4  TheRealArty   57
5   SarahStang   55
6 weelibrarian   53

That's the ordering we want. So, with tweeters.count sorted properly, we can make the chart again, and this time pretty it up:

> ggplot(tweeters.count) + aes(x = screenName, y = freq) + geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, size = 4)) +
  ylab("Number of tweets") +
  xlab("Username") +
  labs(title="Who tweeted with #accessyul how much?")

Sorted tweet counts

I also wondered: how were these aggregated by time? When were the most tweets happening? The created column holds the clue to that minor mystery.

> head(y$created)
[1] "2012-10-21 10:37:59 EDT" "2012-10-21 10:36:44 EDT"
[3] "2012-10-21 10:36:40 EDT" "2012-10-21 10:36:38 EDT"
[5] "2012-10-21 10:34:55 EDT" "2012-10-21 10:34:49 EDT"

Those timestamps are to the minute. Let's collapse them down to the hour, aggregate them, and plot them:

> y$tohour <-  format(y$created, "%Y-%m-%d %H")
> head(count(y, "tohour"))
         tohour freq
1 2012-10-18 09   12
2 2012-10-18 10   16
3 2012-10-18 11   15
4 2012-10-18 12    5
5 2012-10-18 13    7
6 2012-10-18 14    8
> ggplot(count(y, "tohour"), aes(x=tohour, y=freq)) + geom_bar(stat="identity") +
xlab("Hour") + ylab("Number of tweets") +
labs(title="#accessyul tweets by hours") +
theme(axis.text.x = element_text(angle = 90, size = 4))

Tweets by hour

The busy days there are Friday and Saturday, and Friday goes nine-ten-ELEVEN and then nothing over the noon hour when we were at lunch. You can see the rest of the patterns. It would be interesting to correlate this to who was speaking, and to try to figure out what that means.

← I uses this #accessyul tweets chart →