May 17, 2011

Thoughts on Twitter's rate limit

This weekend I read an article about Twitter that, among other things, mentioned the limit on the number of tweets per user and the restriction it places on timeline searches. It made me think: really? Come on, it's just 140 characters! I have some space left on my drive, if they need it! :P

According to Wikipedia, Twitter currently has around 200 million users who together post 65 million tweets a day, with a limit of 140 characters per tweet. So I decided to do some math to find out how much storage space that many daily tweets require. The first step was to learn how the 140 characters are counted.

Counting characters

Basically, Twitter counts the length of a tweet using the NFC form of the text, which favors the combined (precomposed) character form. Special characters such as é or ñ are represented by a single code point, which is encoded as two bytes in UTF-8. You can find much more detail in their tutorial on counting characters and the references in it.
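
To make the counting rule concrete, here is a minimal sketch using Python's standard `unicodedata` module. It is not Twitter's actual implementation; the function names `tweet_length` and `utf8_size` are made up for this illustration, and the assumption is simply that length means the number of code points after NFC normalization.

```python
import unicodedata


def tweet_length(text: str) -> int:
    """Number of code points after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))


def utf8_size(text: str) -> int:
    """Number of bytes the NFC-normalized text occupies in UTF-8 (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text).encode("utf-8"))


# 'e' followed by a combining acute accent: five code points as typed,
# but NFC merges the last two into the single code point 'é'.
decomposed = "cafe\u0301"

print(tweet_length(decomposed))  # 4
print(utf8_size(decomposed))     # 5 (three 1-byte letters plus a 2-byte 'é')
print(utf8_size("日"))           # 3 (characters beyond U+07FF need more than two bytes)
```

The character count and the byte count are separate things: the first is what the 140 limit applies to, the second is what ends up on disk.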

The math

Assuming the worst-case scenario, in which all 140 characters are used and all of them are special ones, a single tweet would take 280 bytes of storage space. If 200 million users generate 65 million tweets a day, those tweets require almost 17GB of disk space a day. If each user's timeline is limited to 1000 tweets, the 200 million users require around 51TB of storage (200 million × 1000 × 280 bytes).

Conclusion

Twitter has to manage around 51TB of data, of which 17GB is added every day. Considering that 75% of the requests to twitter.com are API calls, that means a lot of work querying all that data. OK, leave my drive out of it!
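
For anyone who wants to redo the arithmetic, here is a small sketch of the estimate. The inputs are the figures quoted in this post (200 million users, 65 million tweets a day, 140 characters, two bytes per character, 1000 tweets per timeline); the constant and function names are made up for this example, and the sizes are printed in binary units (GiB, TiB).

```python
USERS = 200_000_000          # users, as quoted from Wikipedia
TWEETS_PER_DAY = 65_000_000  # tweets per day, as quoted from Wikipedia
CHARS_PER_TWEET = 140
BYTES_PER_CHAR = 2           # worst case assumed in this post
TIMELINE_LIMIT = 1000        # tweets kept per user, as assumed in this post


def tweet_bytes(chars: int = CHARS_PER_TWEET, per_char: int = BYTES_PER_CHAR) -> int:
    """Worst-case size of one tweet's text in bytes."""
    return chars * per_char


def daily_bytes() -> int:
    """Bytes of tweet text written per day."""
    return TWEETS_PER_DAY * tweet_bytes()


def timeline_bytes() -> int:
    """Bytes needed to keep every user's full timeline."""
    return USERS * TIMELINE_LIMIT * tweet_bytes()


def human(n: int) -> str:
    """Format a byte count using binary units."""
    size = float(n)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB", "PiB"):
        if size < 1024 or unit == "PiB":
            return f"{size:.2f} {unit}"
        size /= 1024


print(tweet_bytes())            # 280
print(human(daily_bytes()))     # 16.95 GiB, i.e. "almost 17GB"
print(human(timeline_bytes()))
```

Changing `BYTES_PER_CHAR` to 3 or 4 scales every result linearly, and none of this accounts for metadata or indexes.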

Do you feel that this isn't much and they should raise the limits? Follow me on Twitter and make me post more! :)

6 comments:

  1. I feel that it's not really an issue of space. It isn't called micro blogging for no reason. Twitter is more or less meant for quick status updates and such, that's its field.

  2. Technically, UTF-8 characters can take up to four bytes each:

    http://en.wikipedia.org/wiki/UTF-8#Description

    I had to look this up, but all codepoints above U+07FF and below U+010000 seem to require three bytes in UTF-8. This includes all Chinese, Japanese, and Korean characters:

    http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes

    Stuff that requires four bytes seems to be constrained to extinct languages.

  3. @kmps
    That's true! Also, I consider that Twitter is all about trends and hype, as they promote the trending topics. But a longer history would be interesting too for historical research purposes, for example.

    @leonsp
    Thanks for pointing that out! At four bytes per character, the amount of data could double!

  4. I think your estimate is more than a tad low. 280 bytes only gets 140 chars of the actual tweet text and includes no overhead for geo-location, user_ids, retweet count and all of the other data that is stored with each tweet.

    http://dev.twitter.com/doc/get/statuses/show/:id

    Yes, many of the fields shown at that URL are not repeated for every tweet; they are stored elsewhere in the DB and merged together for the final tweet record that is returned.

    Tweet ids and user ids themselves can now be longer than 14 digits.

    If you include indexes on just the tweet text and the few items I mentioned, I bet there is more space taken up overall by the overhead of storing and retrieving those 140chars than the actual space to store those few chars.

    Just my $0.02

  5. @Damon
    Very true, Damon. I didn't even think about all the tweet's metadata, which is even larger than the actual tweet! Also, the API allows you to query on that metadata too, so even more workload!

    Thanks a lot for the contribution!

  6. From what I read, the restrictions mainly apply to timeline searches, but they can be bypassed. Popular search engines like Google or Bing don't suffer from such a short memory span.
    This article explains how to circumvent those constraints.
