Observing Malaysian Social Media

Twitter Search and the Streaming API for Research

Over the last 3 days we tracked mentions of the GOP convention currently being held in Tampa, Florida. One oversight we made was in estimating the number of tweets. In Malaysia, the volume of tweets for the topics we track is relatively low, so some missing tweets are acceptable. We had never tracked an American event before, and we underestimated the discrepancy between the number of tweets provided by Twitter’s public API and the real number of tweets.

Yesterday, Twitter’s official blog put up some statistics on the GOP convention, so we have a good comparison to make.

The Republican VP nominee also drove the top three peaks tonight in Tweets-per-minute, the highest coming at the conclusion of his speech: 6,669.

It is not clear if Twitter is only looking at #GOP2012 or taking other related tweets into account. From the data we collected yesterday, the highest peak across everything we tracked was 1,423 tweets-per-minute. For #GOP2012 alone the peak was 550 tweets-per-minute. Taking the #GOP2012 figure, we can estimate that the tweets collected from the Search API are (550/6669 × 100), or 8.2%, of the real total.

Twitter also stated:

Tweets about the #GOP2012 convention topped two million as Ryan took the stage—six times the Tweets sent about the 2008 conventions combined.

Do they mean two million on August 29th, or two million since August 28th, or since the weekend? Without a frame of reference we can’t make good use of that figure. We collected 212,810 mentions of #GOP2012 since Monday (August 27th). For August 29th alone, we have 89,308 tweets. In the worst case, where the two million refers to August 29th alone, we got (89308/2000000 × 100), or 4.5%, of the real total.
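The two percentages can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the figures quoted in this post; the function name `coverage_percent` is our own and not part of any Twitter tool.

```python
def coverage_percent(collected, reference):
    """Share of a reference count that we collected, in percent."""
    return collected / reference * 100

# Peak tweets-per-minute: our #GOP2012 peak vs. the peak Twitter reported.
peak_share = coverage_percent(550, 6669)

# Daily volume: our August 29th count vs. Twitter's two million figure,
# assuming (worst case) the two million refers to August 29th alone.
daily_share = coverage_percent(89308, 2000000)

print(f"{peak_share:.1f}% of the reported peak")   # 8.2% of the reported peak
print(f"{daily_share:.1f}% of two million")        # 4.5% of two million
```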

Reasons for Discrepancy

Public access to Twitter is limited to two application programming interfaces (APIs):

Twitter Search  

  1. Limited to tweets that Twitter considers ‘relevant’. Tweets and users that are considered spam are filtered out.
  2. What is obtained with this API is fewer than the real number of tweets, and the discrepancy increases with how ‘hot’ the topic is.
  3. This API limits you to a maximum of 1,500 tweets for a search query, broken up into pages of up to 100 tweets each. In practice we found the only stable way of working with the API was to limit pages to 70 tweets, giving us a maximum of 1,050 tweets per query. Requesting more resulted in timeouts or zero tweets returned.
  4. Search results go back in time for up to 6 days. This is good for tracking tweets after finding out about an event.
  5. Works by posting a search request to Twitter and parsing the results, then submitting more requests if there is paging to be done. Susceptible to network latency issues.
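The paging behaviour described in the list above can be sketched as follows. This is an illustration only, not our production code: `fetch_page` is a hypothetical callable standing in for the real HTTP request, and the limit of 15 pages is our inference from the numbers above (1,500 / 100 and 1,050 / 70 both give 15).

```python
def collect_search_results(fetch_page, per_page=70, max_pages=15):
    """Collect up to per_page * max_pages tweets for one search query.

    fetch_page is a hypothetical callable (page_number, per_page) -> list
    of tweets. It stands in for the real request to Twitter Search.
    """
    tweets = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        if not batch:
            # Empty page: no more results, or Twitter returned zero tweets.
            break
        tweets.extend(batch)
        if len(batch) < per_page:
            # Short page: this was the last page of results.
            break
    return tweets


def fake_fetch(page, per_page):
    """Pretend backlog of 5,000 matching tweets, for demonstration only."""
    backlog = list(range(5000))
    start = (page - 1) * per_page
    return backlog[start:start + per_page]


print(len(collect_search_results(fake_fetch)))                # 1050
print(len(collect_search_results(fake_fetch, per_page=100)))  # 1500
```

With a backlog of 5,000 tweets, everything beyond the first 1,050 (or 1,500) is never seen, which is the ceiling we describe in this post.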

The Streaming API

  1. Gives a small percentage of all tweets. Twitter has previously stated it gives 1% of all tweets, and this scales with the amount of tweets. So if there are 5 million tweets about all topics at the moment, the upper limit would be 50,000 tweets.
  2. Any search request will have this limit applied to it. If what we are searching for has only 10,000 tweets, we will get 100% of them because 10,000 < 50,000.
  3. It is meant for real-time use. There was a back-fill option to get old tweets but it is missing from the documentation so it may have been removed.
  4. No filtering is done. Spam and spammers will be included.
  5. A connection is made to Twitter, after which tweets come in on a constant stream. Less susceptible to network latency issues.
  6. Search queries need to be defined before connecting to Twitter. If there is a need to add/remove searchterms, you have to terminate the connection and reconnect. There is a limit of 400 keywords, 5000 ‘follow’ users and 25 geo-located areas. The ‘follow’ users return tweets written by those users but not all @mentions of the user.
  7. The downside is if you are tracking something very popular that takes up >1% of all tweets, you will run into the 1% limit. What you get with the Streaming API also includes spam, so you would need your own tweet spam-detection technology.
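The effect of the 1% limit can be sketched with the example figures from the list above. This is a simplified model based on our understanding of Twitter's statements, not documented behaviour; the function name and the `cap_percent` parameter are our own.

```python
def streaming_delivery(matching, total, cap_percent=1):
    """Estimate how many matching tweets the Streaming API would deliver.

    matching    -- tweets that match our search terms
    total       -- all tweets on Twitter in the same period
    cap_percent -- assumed cap as a whole percentage of total (1 = 1%)
    Returns (delivered, fraction_of_matching_delivered).
    """
    cap = total * cap_percent // 100      # integer arithmetic, no rounding issues
    delivered = min(matching, cap)
    fraction = delivered / matching if matching else 1.0
    return delivered, fraction


# 5 million tweets in total gives an upper limit of 50,000 tweets.
print(streaming_delivery(10000, 5000000))    # (10000, 1.0)
print(streaming_delivery(200000, 5000000))   # (50000, 0.25)
```

A topic with 10,000 tweets fits under the limit and is delivered in full, while a topic with 200,000 tweets would be cut down to a quarter of its real volume.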

When the real volume of tweets is high, tweets come in faster than they are indexed under Twitter Search, and what we end up collecting is only a sample of what was tweeted. We had only experienced this before during earthquake monitoring, when the pattern of data showed we were hitting a ceiling. This ceiling became more obvious during Day #3 of the GOP Convention (report pending). Based on this experience we can now expect results to be doubtful if they exceed 400 tweets-per-minute. In this situation we need to look at tweets-per-second and find gaps (and there were such gaps during #GOP2012). From the size and frequency of the gaps we can try to deduce the real volume of tweets.
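A rough sketch of the gap analysis is given below. It assumes tweet timestamps are available as whole seconds, treats any silent stretch of at least `min_gap` seconds as lost data, and assumes tweets kept arriving at the same rate during the gaps. Both function names, the 5-second threshold and the constant-rate assumption are ours for illustration; this is one naive way to estimate the real volume, not a validated method.

```python
def find_gaps(timestamps, min_gap=5):
    """Return (first_silent_second, last_silent_second, length) for each
    stretch of at least min_gap seconds with no tweets."""
    ordered = sorted(set(timestamps))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        silent = current - previous - 1   # whole seconds with no tweets
        if silent >= min_gap:
            gaps.append((previous + 1, current - 1, silent))
    return gaps


def estimate_volume(timestamps, min_gap=5):
    """Naive estimate: apply the rate seen outside the gaps to the whole span."""
    if not timestamps:
        return 0
    span = max(timestamps) - min(timestamps) + 1
    silent = sum(length for _, _, length in find_gaps(timestamps, min_gap))
    rate = len(timestamps) / (span - silent)   # tweets per active second
    return round(rate * span)


# Two tweets per second for 3 seconds, a 10-second gap, then 2 more seconds.
seen = [0, 0, 1, 1, 2, 2, 13, 13, 14, 14]
print(find_gaps(seen))         # [(3, 12, 10)]
print(estimate_volume(seen))   # 30
```

In this toy example we collected 10 tweets, but the estimate for the full 15 seconds is 30.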

After 3 years of usage, we find our system can collect up to 1,050 tweets-per-minute, per-searchterm. But the collection rate is influenced by how overloaded Twitter is. When Twitter is under heavy load, the Search API tends to return only 0-100 tweets-per-minute. Within seconds a thousand tweets go by, and that data is lost forever due to the 1,500 tweet limit. The Streaming API, however, has the 1% limit and the spam to filter out, and developing our own spam filter is not worthwhile.

Note that Twitter does not state how many of the 2 million tweets came from spammers/bots that would have been filtered out by Twitter Search. Using Twitter Search we obtained 4.5 – 8.2% of the total tweets, which is a decent sample considering we have data for every minute.

Moving Forward

There is no way to use Twitter’s public API to track popular searchterms and expect to get 100% of all tweets. The only way to get complete data (such as the 2 million #GOP2012 tweets) is to use paid services such as Gnip and Datasift. Datasift charges based on the amount of data collected, which can be very costly. Gnip does not have a pricing plan listed.

It is not practical for political campaigners, event managers or researchers to pay-per-tweet for 2 million tweets for one event. So working with a sample is the only way to get things done, and this is likely what many online paid tracking systems are using.

For research purposes, using the Search API is still acceptable, though it is preferable to use both APIs when it comes to events. So tonight for #GOP2012 Day 4 (August 30th) and tomorrow for Malaysia’s Independence Day (August 31st), we will use both and merge the data to get the total tweet count. For determining context we will use the Twitter Search data, to take advantage of Twitter’s spam filtering.
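Merging the two sources could look like the sketch below. It assumes each tweet is a dictionary with a unique `id` field, and the `source` field is a hypothetical tag added only for this example. Where both APIs return the same tweet, the Search copy is kept, since that is the data we rely on for context.

```python
def merge_tweets(search_tweets, stream_tweets):
    """Combine tweets from both APIs, keeping one copy per tweet id."""
    merged = {}
    for tweet in stream_tweets:
        merged[tweet["id"]] = tweet
    for tweet in search_tweets:
        merged[tweet["id"]] = tweet   # the Search copy replaces the Streaming copy
    return list(merged.values())


search = [{"id": 1, "source": "search"}, {"id": 2, "source": "search"}]
stream = [{"id": 2, "source": "stream"}, {"id": 3, "source": "stream"}]

combined = merge_tweets(search, stream)
print(len(combined))   # 3
```

The length of the merged list is the total tweet count, with tweets seen by both APIs counted once.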

Update #1 (31st August 2012)

Found the reference to back-fill in the ‘count’ parameter in the Twitter Streaming API docs. So it is possible to request historical tweets, though how far back in time is unknown. The stated limit is 150,000 tweets, and it is possible the 3-6 day limit from the Search API applies. But it is not available to the public as it requires elevated access.


Written by politweet

August 31, 2012 at 1:26 am
