Gathering, processing, and counting all the tweets for #U2Request was no easy task, especially since there was a lot of room for error.
This is how I approached this project.
DISCLAIMER: Some of the details below may be too technical, and therefore boring,
but I wanted to be 100% clear on how this was all achieved, and to leave no room for misunderstanding or disputes over the results.
Gathering the tweets
There are many different ways to gather a stream of tweets.
You can use Twitter Search, a third-party tool, or write a client that talks to the REST API or the Streaming API.
Talking to the Twitter API is obviously the most efficient way to get our stats, as we can then customize the data to fit our needs.
I used the Streaming API for real time analysis of the data and generation of the word cloud, and the REST API for post-event counting of requests.
To achieve all this, I wrote a client (in Python) that listens on the Streaming API of Twitter in real time and records all tweets with the #U2Request hashtag.
I then repeated the same with the REST API.
I deployed the client on an Amazon Web Services (AWS) instance here in Dublin (the same place where the Word Cloud lived).
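As a rough illustration of the recording step (a sketch, not the actual client: the helper names are mine, and the network plumbing that keeps the long-lived Streaming API connection open is omitted), the Streaming API delivers one JSON-encoded tweet per line, and the client simply keeps those carrying the hashtag:

```python
import json

HASHTAG = "u2request"

def has_hashtag(tweet, tag=HASHTAG):
    """Check the tweet's parsed hashtag entities, falling back to the text."""
    entities = tweet.get("entities", {}).get("hashtags", [])
    if any(h.get("text", "").lower() == tag for h in entities):
        return True
    return ("#" + tag) in tweet.get("text", "").lower()

def record_stream(lines):
    """Keep every decoded tweet that carries the hashtag.

    `lines` stands in for the raw JSON lines the Streaming API delivers;
    in production each line arrives over a long-lived HTTP connection.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:  # the stream sends blank keep-alive lines
            continue
        tweet = json.loads(line)
        if has_hashtag(tweet):
            kept.append(tweet)
    return kept
```

In the real client each kept tweet would be written to disk immediately, so a crash or disconnect loses as little as possible.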
Processing the tweets
Fans were encouraged to tweet their requests on April 14th, between midnight and 23:59:59, their local time, to keep things simple.
The challenges here are:
- We are not counting a single 24-hour period, but one 24-hour period per timezone.
- What happens if someone tweets a request outside of their timezone?
How will I know whether the tweet is within the allowed time frame? It might be valid if tweeted from Australia,
but not if tweeted from Greece.
- What if someone tweets within their local timezone, say in New York, but I retweet it from Dublin?
Will my retweet still count?
- How do I know which tweets are valid and should be counted, and which ones were sent outside of the agreed times, and keep the results accurate?
This is another example where using Twitter Search or third party tools would not work for our purposes.
What I did was to use Coordinated Universal Time (UTC) as the reference timezone.
Every timezone on Earth falls between UTC-12 and UTC+14; Dublin, for example, is currently UTC+1.
With this in mind, I processed every single tweet in UTC and, using the UTC offset of its place of origin, shifted its timestamp back to the sender's local clock. A tweet counts if it satisfies the following check, which works for every timezone:
START_TIME_LOCAL <= TIME_OF_TWEET_IN_UTC + UTC_OFFSET < END_TIME_LOCAL
where START_TIME_LOCAL is midnight on April 14th and END_TIME_LOCAL is midnight on April 15th (exclusive).
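A minimal sketch of that per-timezone check in Python (the year and the function name are my own; the real scripts also had to parse each tweet's UTC timestamp and offset out of its metadata):

```python
from datetime import datetime, timedelta

# Event window in the sender's LOCAL time: all of April 14th.
# The year here is an assumption for illustration.
START_LOCAL = datetime(2015, 4, 14)
END_LOCAL = datetime(2015, 4, 15)  # exclusive upper bound

def is_valid_request(tweet_utc, utc_offset_hours):
    """A request counts if, shifted onto the sender's clock, it falls on
    April 14th locally. Offsets range from UTC-12 to UTC+14."""
    local = tweet_utc + timedelta(hours=utc_offset_hours)
    return START_LOCAL <= local < END_LOCAL
```

So a tweet sent at 23:30 UTC on April 13th counts for a fan in Dublin (00:30 local on the 14th), while the same UTC instant would not count for a fan further west.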
Producing the stats
All requests were matched against a database I built with all released and unreleased U2 tracks to date.
Some of the tweets did not follow the guidelines: they contained comments, lyrics instead of the song name, or typos, so they could not be matched against the songs in the database. I sanitized those tweets with a series of scripts that performed regular-expression filtering, to make as many tweets as possible count rather than discard them.
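Here is a toy version of that sanitize-and-match step (the track list, helper names, and exact regular expressions are illustrative stand-ins, not the real ones):

```python
import re

# A tiny stand-in for the full database of released and unreleased tracks.
TRACKS = ["One", "Love Is Blindness", "Where the Streets Have No Name"]

def normalize(text):
    """Strip hashtags, @mentions, URLs, and punctuation, collapse
    whitespace, and lowercase, so near-miss tweets can still match."""
    text = re.sub(r"#\w+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def match_track(tweet_text, tracks=TRACKS):
    """Return the first track title found as whole words inside the
    sanitized tweet, or None if nothing in the database matches."""
    cleaned = normalize(tweet_text)
    for title in tracks:
        if re.search(r"\b" + re.escape(normalize(title)) + r"\b", cleaned):
            return title
    return None
```

Word-boundary matching matters here: without it, the word "one" inside "everyone" would wrongly register a request for "One".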
- The word cloud may not match the final report exactly. This is normal: a word cloud counts how often individual words are mentioned, not song titles. So, for example, the word "love" on the word cloud could be referring to "Love Is Blindness", "Everlasting Love", or "Hold On To Love", "One" might also be getting points from "One Tree Hill", etc.
Moreover, since the word cloud was updated in real time, I enabled a spam filter on it, to keep words irrelevant to #U2Request out. Some legitimate words were also caught by the filter, "Mofo" being a good example, which explains why "Mofo" is not part of the word cloud.
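For illustration, a word-level tally with a blocklist might look like this (the blocklist below is a tiny, hypothetical stand-in for the real spam filter):

```python
import re
from collections import Counter

# Hypothetical blocklist: the hashtag itself, filler words, and filtered
# terms -- which is how a title like "Mofo" can vanish from the cloud.
BLOCKLIST = {"u2request", "please", "play", "the", "a", "to", "mofo"}

def cloud_counts(tweets):
    """Word-frequency tally feeding the cloud. It counts individual
    words, not song titles, which is why 'love' aggregates several tracks."""
    counts = Counter()
    for text in tweets:
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            if word not in BLOCKLIST:
                counts[word] += 1
    return counts
```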
- Twitter gives you the option to send your geo location along with your tweet. Some users had that feature enabled. I used that data to generate the Tweet Map above, which shows a small sample of requests, and where they came from.
I have intentionally kept the data anonymous: all you can see is a geo location and the song that was requested from it.
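Extracting those points is straightforward, since geo-enabled tweets carry a GeoJSON "coordinates" field (note that Twitter stores it as [longitude, latitude]; the matched_song field below is my own, attached during the matching step, not part of Twitter's payload):

```python
def geo_points(tweets):
    """Pull (lat, lon, song) triples from tweets whose senders opted in
    to sharing their location; everything else is dropped."""
    points = []
    for t in tweets:
        geo = t.get("coordinates")
        if geo and geo.get("type") == "Point":
            lon, lat = geo["coordinates"]  # Twitter order: [longitude, latitude]
            points.append((lat, lon, t.get("matched_song")))
    return points
```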
- The event's Facebook page also had around 1,000 requests. We will be adding those to the final report within the next few days, but we do not anticipate that they will shift the results, given their low volume.
Update: FB requests added.