Observing Malaysian Social Media


The Most Followed Twitter Users in Malaysia (Oct 2017)

Since 2014 we have been building up a database of profiled Twitter users in Malaysia. We currently have over 630,000 profiled user accounts that are location-based. This means that we can analyse opinions and interests not just by state, but by area (e.g. cities, constituencies, campuses, malls, suburbs / taman). We have demonstrated the application of this database for opinion analysis and for by-elections. We are currently working on improving the level of detail in our profiles and are now sharing part of our research results with the public.

Using a sample of 24,677 users from our database, we collected their lists of Twitter ‘friends’ (the user accounts that they follow). This resulted in a list of 2.07 million users. From this list we then summarised the 207,500 accounts most followed by users in Malaysia.

The Top 10 most-followed Twitter users are below:

| Rank | @ScreenName | Name | Market Reach (%) |
|---|---|---|---|
| 1 | instagram | Instagram | 39.385 |
| 2 | Khairykj | Khairy Jamaluddin | 35.665 |
| 3 | 9GAG | 9GAG | 32.577 |
| 4 | Matluthfi90 | Matluthfi90 | 27.459 |
| 5 | yunamusic | Yuna Zarai | 25.064 |
| 6 | 501Awani | Astro AWANI | 23.982 |
| 7 | NajibRazak | Mohd Najib Tun Razak | 23.625 |
| 8 | waktuSolatKL | Waktu Solat WP KL | 22.653 |
| 9 | SantapanMinda | Santapan Minda | 22.134 |
| 10 | ustazharidrus | Ust Azhar Idrus | 21.818 |

Market reach is defined as the percentage of users in Malaysia who follow that Twitter user. Based on this list, Khairy Jamaluddin (MP for Rembau, Minister of Youth and Sports, UMNO Youth Leader) is both the most-followed person and most-followed Malaysian in the country. But his market reach is only 35.665% of users in Malaysia. This shows that no single user on Twitter ‘owns’ the Malaysian market. Because we are using profiled users, the possibility of fake followers (or phantoms, fake accounts etc.) is a non-issue.
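As a rough illustration of the definition above, market reach can be computed by counting how many sampled users follow each account and dividing by the sample size. The sketch below is not the code used for this study; the `market_reach` function and the toy user names are hypothetical, and it assumes the friend lists are already available as a dictionary.

```python
from collections import Counter


def market_reach(friend_lists):
    """Return {account: percentage of sampled users who follow it}.

    friend_lists maps each sampled user to the accounts they follow
    (a hypothetical data shape chosen for this sketch).
    """
    sample_size = len(friend_lists)
    counts = Counter()
    for friends in friend_lists.values():
        # set() makes sure each sampled user is counted once per account
        counts.update(set(friends))
    return {acct: 100.0 * n / sample_size for acct, n in counts.items()}


# Toy sample of four hypothetical users and the accounts they follow.
sample = {
    "user_a": ["instagram", "Khairykj"],
    "user_b": ["instagram"],
    "user_c": ["9GAG", "instagram"],
    "user_d": ["Khairykj"],
}
reach = market_reach(sample)
print(reach["instagram"])  # 75.0 (3 of 4 users)
```

If the table's percentages were computed against the full sample of 24,677 users (an assumption), a reach of 39.385% would correspond to roughly 9,719 sampled followers.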

The Top 10 users have a combined market reach of 82.25%. Most Twitter users in Malaysia have a market reach that would be considered small. But a small market reach does not mean that a tweet has no chance of going viral. Due to the high degree of connectivity between Twitter users plus the Twitter Search factor, there is always a chance of a tweet being retweeted and spreading throughout the network.
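The combined figure is far lower than the sum of the ten individual percentages (about 274%), which suggests that each user is counted only once. Assuming that ‘combined market reach’ means the share of sampled users who follow at least one of the accounts (the post does not spell this out), it could be sketched as follows. The function name and toy data are hypothetical.

```python
def combined_reach(friend_lists, accounts):
    """Percentage of sampled users who follow at least one of `accounts`."""
    targets = set(accounts)
    hits = sum(1 for friends in friend_lists.values() if targets & set(friends))
    return 100.0 * hits / len(friend_lists)


# Five hypothetical users; user_a follows both target accounts.
sample = {
    "user_a": ["instagram", "Khairykj"],
    "user_b": ["instagram"],
    "user_c": ["9GAG"],
    "user_d": ["Khairykj"],
    "user_e": ["yunamusic"],
}
combined = combined_reach(sample, ["instagram", "Khairykj"])
print(combined)  # 60.0, although each account alone reaches 40%
```

Because user_a follows both accounts, the combined reach (60%) is less than the sum of the individual reaches (80%). The same effect applies to overlapping audiences among the Top 10.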

Using the data that we collected, we performed a network analysis on how the most-followed Twitter users are connected to each other based on their followers. For this analysis we used the top 4,704 users. This covers all user accounts followed by users in Malaysia with a minimum market reach of 0.61%.

Users that have a shared appeal (affinity) will have overlapping audiences, and the higher the overlap, the stronger the connection between them. For example, users that tweet primarily about football will draw followers from the same pool of people who like football.
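The post does not say which measure was used to score the overlap. One common choice, shown here purely as an assumption, is the Jaccard index: the number of shared followers divided by the number of distinct followers across both accounts. The function name and follower IDs are hypothetical.

```python
def audience_overlap(followers_a, followers_b):
    """Jaccard index of two follower sets: 0.0 (no overlap) to 1.0 (identical)."""
    a, b = set(followers_a), set(followers_b)
    union = a | b
    if not union:
        return 0.0  # neither account has followers in the sample
    return len(a & b) / len(union)


# Hypothetical follower IDs for two football accounts and one unrelated account.
football_x = {"u1", "u2", "u3", "u4"}
football_y = {"u2", "u3", "u4", "u5"}
cooking_z = {"u6", "u7"}

print(audience_overlap(football_x, football_y))  # 0.6 (3 shared of 5 distinct)
print(audience_overlap(football_x, cooking_z))   # 0.0
```

In a network graph, a score like this could serve as the edge weight between two accounts, so that accounts with similar audiences end up close together.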

Based on the network analysis we generated a map showing clusters of users with a strong affinity for each other. The position of popular users on the map shows the affinity they have with each other. Users with a greater market reach are shown in a larger font, coloured on a scale ranging from blue (least popular) through orange to red (most popular).


The full-size version can be viewed at our Flickr page here.

At a glance you can see that the top users are clustered close to each other, in the area where @Khairykj and @instagram are visible. As stated earlier, the Top 10 users have a combined market reach of 82.25%. Even though these users don’t tweet about the same topics, they sit close to each other because of their mass-market appeal.




Twitter Search and the Streaming API for Research

In the last 3 days we tracked mentions of the GOP convention currently being held in Tampa Bay, Florida. One oversight we made was in estimating the number of tweets. In Malaysia, the volume of tweets for the topics we track is relatively low, so some missing tweets are acceptable. We had never tracked an American event before, and underestimated the discrepancy between the number of tweets provided by Twitter’s public API system and the real number of tweets.

Yesterday, Twitter’s official blog put up some statistics on the GOP convention, so we have a good comparison to make.

The Republican VP nominee also drove the top three peaks tonight in Tweets-per-minute, the highest coming at the conclusion of his speech: 6,669.

It is not clear if Twitter is only looking at #GOP2012 or taking other related tweets into account. From the data we collected yesterday, the highest peak was 1,423 tweets-per-minute. For #GOP2012 alone the peak was 550 tweets-per-minute. We can assume the number of tweets collected from the Search API is an estimated (550/6669 *100), or 8.2% of the real total.

Twitter also stated:

Tweets about the #GOP2012 convention topped two million as Ryan took the stage—six times the Tweets sent about the 2008 conventions combined.

Do they mean two million on August 29th, or two million since August 28th, or since the weekend? Without a frame of reference we can’t make good use of that figure. We collected 212,810 mentions of #GOP2012 since Monday (August 27th). For August 29th alone, we have 89,308 tweets. We can assume the worst case is we got (89308/2000000*100), or 4.5% of the real total.
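The two estimates above are simple ratios of what we collected to what Twitter reported. A small sketch of the arithmetic follows; the function name is hypothetical, and the figures are the ones quoted in this post.

```python
def capture_rate(collected, reported):
    """Percentage of the reported tweets that were actually collected."""
    return 100.0 * collected / reported


# Peak tweets-per-minute: our #GOP2012 peak vs. Twitter's reported peak.
peak_rate = capture_rate(550, 6669)

# August 29th alone: our count vs. the two million reported by Twitter.
daily_rate = capture_rate(89308, 2000000)

print(round(peak_rate, 1), round(daily_rate, 1))  # 8.2 4.5
```

Both figures depend on what Twitter's numbers actually cover, so they are best read as a range rather than a precise measurement.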

Reasons for Discrepancy

Public access to Twitter is limited to 2 application programming interfaces (APIs):

Twitter Search  

  1. Limited to tweets that Twitter considers ‘relevant’. Tweets and users that are considered spam are filtered out.
  2. What is obtained with this API is less than the real amount of tweets, and the discrepancy increases with how ‘hot’ the topic is.
  3. This API limits you to a maximum of 1,500 tweets for a search query, broken up into pages of up to 100 tweets each. In practice we found the only stable way of working with the API was to limit pages to 70 tweets, giving us a maximum of 1,050 tweets per query. Requesting more resulted in timeouts or zero tweets returned.
  4. Search results go back in time for up to 6 days. This is good for tracking tweets after finding out about an event.
  5. Works by posting a search request to Twitter and parsing the results, then submitting more requests if there is paging to be done. Susceptible to network latency issues.
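The paging behaviour described in the list above can be sketched as a simple loop. This is only an illustration: `fetch_page` is a hypothetical callable standing in for the real search request, and the 15-page limit is inferred from the figures above (1,500 / 100 = 15 pages, and 15 × 70 = 1,050 tweets).

```python
PAGE_SIZE = 70   # page size reported as stable in practice
MAX_PAGES = 15   # inferred: 1,500-tweet limit / 100 tweets per page


def collect(query, fetch_page, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Request pages until the limit is reached or results run out.

    fetch_page(query, page, per_page) is a hypothetical function that
    returns a list of tweets for the given 1-based page number.
    """
    tweets = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(query, page, page_size)
        if not batch:
            break  # zero tweets returned: nothing more to page through
        tweets.extend(batch)
        if len(batch) < page_size:
            break  # a short page means this was the last one
    return tweets


def make_fake_fetch(pool):
    """Build a stand-in for the real search request, backed by a list."""
    def fetch_page(query, page, per_page):
        start = (page - 1) * per_page
        return pool[start:start + per_page]
    return fetch_page


busy_topic = collect("#GOP2012", make_fake_fetch(list(range(2000))))
quiet_topic = collect("#quiet", make_fake_fetch(list(range(100))))
print(len(busy_topic), len(quiet_topic))  # 1050 100
```

For a busy topic the loop stops at 1,050 tweets no matter how many matched, which is why anything tweeted beyond that window between two queries is lost.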

The Streaming API

  1. Gives a small percentage of all tweets. Twitter has previously stated it gives 1% of all tweets, and this scales with the amount of tweets. So if there are 5 million tweets about all topics at the moment, the upper limit would be 50,000 tweets.
  2. Any search request will have this limit applied to it. If what we are searching for has only 10,000 tweets, we will get 100% of all tweets because 10,000<50,000.
  3. It is meant for real-time use. There was a back-fill option to get old tweets but it is missing from the documentation so it may have been removed.
  4. No filtering is done. Spam and spammers will be included.
  5. A connection is made to Twitter, after which tweets come in on a constant stream. Less susceptible to network latency issues.
  6. Search queries need to be defined before connecting to Twitter. If there is a need to add/remove searchterms, you have to terminate the connection and reconnect. There is a limit of 400 keywords, 5000 ‘follow’ users and 25 geo-located areas. The ‘follow’ users return tweets written by those users but not all @mentions of the user.
  7. The downside is if you are tracking something very popular that takes up >1% of all tweets, you will run into the 1% limit. What you get with the Streaming API also includes spam, so you would need your own tweet spam-detection technology.
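The effect of the 1% limit in the list above can be worked through with a few lines of arithmetic. The function below is a sketch under the assumption that the cap is exactly 1% of all tweets in the same period; the function name is hypothetical.

```python
def streaming_coverage(topic_tweets, all_tweets, cap_percent=1):
    """Return (tweets delivered, percentage of the topic covered).

    Assumes the stream delivers at most cap_percent of all tweets
    posted in the same period, as described in the list above.
    """
    if topic_tweets <= 0:
        return 0, 0.0
    cap = all_tweets * cap_percent // 100
    delivered = min(topic_tweets, cap)
    return delivered, 100.0 * delivered / topic_tweets


# 5 million tweets overall, so the cap is 50,000 tweets.
small_topic = streaming_coverage(10000, 5000000)
hot_topic = streaming_coverage(200000, 5000000)
print(small_topic)  # (10000, 100.0)
print(hot_topic)    # (50000, 25.0)
```

A topic under the cap is delivered in full, while a topic four times the size of the cap is delivered at only 25%.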

When the real volume of tweets is high, then the rate of tweets coming in is faster than what gets indexed under Twitter Search, and what we end up collecting is only a sample of what was tweeted.  We only experienced this before during earthquake monitoring, when the pattern of data showed we were hitting a ceiling. This ceiling became more obvious during Day #3 of the GOP Convention (report pending). Based on this experience we can now expect results to be doubtful if they breach 400 tweets-per-minute. In this situation we need to look at tweets-per-second and find gaps (and there were such gaps during #GOP2012). From the size and frequency of the gaps we can try to deduce the real volume of tweets.
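Looking for gaps at the tweets-per-second level could be done along these lines. This is a sketch only: the function name is hypothetical, the timestamps are assumed to be sorted epoch seconds, and the 5-second threshold is an arbitrary value picked for illustration, not one taken from our system.

```python
def find_gaps(timestamps, min_gap=5):
    """Return (last_seen, next_seen, seconds) for each silent stretch.

    timestamps must be sorted tweet times in whole seconds.
    min_gap is an assumed threshold; during a busy event even a few
    seconds without any tweet is suspicious.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev >= min_gap:
            gaps.append((prev, cur, cur - prev))
    return gaps


# Two tweets per second, then ten seconds with nothing collected.
times = [0, 0, 1, 1, 2, 2, 12, 12, 13, 13]
gaps = find_gaps(times)
print(gaps)  # [(2, 12, 10)]
```

Turning the gaps into a volume estimate needs a further assumption, for example that tweets kept arriving during each gap at the same rate as just before it.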

After 3 years of usage, we find our system can collect up to 1050 tweets-per-minute, per-searchterm. But the collection rate is influenced by how overloaded Twitter is. When it is under heavy load it tends to return only 0-100 tweets-per-minute. Within seconds a thousand tweets go by and that data is lost forever due to the 1,500 tweet limit. However with the Streaming API there is the 1% limit and the spam to filter out. Developing our own spam filter is not worthwhile.

Note that Twitter does not state how many of the 2 million tweets include spammers/bots that would have been filtered out by Twitter Search. Using Twitter Search we obtained 4.5 – 8.2% of the total tweets, which is a decent sample considering we have data for every minute.

Moving Forward

There is no way to use Twitter’s public API to track popular searchterms and expect to get 100% of all tweets. The only way to get complete data (such as the 2 million #GOP2012 tweets) is to use paid services such as Gnip and Datasift. Datasift charges based on the amount of data collected, which can be very costly. Gnip does not have a pricing plan listed.

It is not practical for political campaigners, event managers or researchers to pay-per-tweet for 2 million tweets for one event. So working with a sample is the only way to get things done, and this is likely what many online paid tracking systems are using.

For research purposes, using the Search API alone is still acceptable, though it’s preferable to use both APIs when it comes to events. So tonight for #GOP2012 Day 4 (August 30th) and tomorrow for Malaysia’s Independence Day (August 31st), we will use both and merge the data to get the total tweet count. For determining context we will use the Twitter Search data, to take advantage of Twitter’s spam filtering.
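Merging the two sources mostly comes down to removing duplicates, since the same tweet can arrive through both APIs. A minimal sketch is shown below, assuming each tweet is a dictionary with a unique `id` field; the function name and sample tweets are hypothetical.

```python
def merge_tweets(search_tweets, stream_tweets):
    """Combine both sources, keeping one copy of each tweet ID."""
    merged = {}
    for tweet in stream_tweets:
        merged[tweet["id"]] = tweet
    for tweet in search_tweets:
        # When a tweet appears in both sources, keep the Search copy.
        merged[tweet["id"]] = tweet
    return merged


# Hypothetical tweets; ID 2 was collected by both APIs.
search = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
stream = [{"id": 2, "text": "b"}, {"id": 3, "text": "spam"}]

merged = merge_tweets(search, stream)
total_count = len(merged)                 # used for volume
context_ids = {t["id"] for t in search}   # used for context analysis
print(total_count, sorted(context_ids))   # 3 [1, 2]
```

The merged set gives the total tweet count, while context analysis stays restricted to the IDs that came through Twitter Search and therefore passed its spam filtering.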

Update #1 (31st August 2012)

We found the reference to back-fill under the ‘count’ parameter in the Twitter Streaming API docs. So it is possible to request historical tweets, though how far back in time is unknown. The stated limit is 150,000 tweets, and it’s possible the 3-6 day limit from the Search API applies. But it is not available to the public as it requires elevated access.

Written by politweet

August 31, 2012 at 1:26 am