June 14, 2009


Google's Twitter problem

Some folks speculate that Google is launching a vertical search for micro-blogging services. Given that Twitter has no real competition, that effectively means Google is launching a vertical search for Twitter.

Let's find out how many tweets are generated each month. Twitter gives a unique identity to each message, which can be found at the end of this URL: http://twitter.com/kshashi/status/2156645977 That's a sequential number assigned to each message. From my own stream I found this number for the 14th of June (today), the 14th of May, ... the 14th of January. (I realize I should have picked the 1st of each month instead. Never mind.) This is not the most accurate method, but it should be close enough. Here is the chart of the number of tweets for each month starting on the 14th.
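The ID-delta trick can be sketched in a few lines. Only the 14 June ID is real (taken from the status URL above); the 14 May ID is a hypothetical value chosen to be consistent with the monthly figure discussed below.

```python
def tweets_between(id_earlier, id_later):
    """Status IDs are sequential, so the delta between two IDs
    sampled on different dates approximates the number of tweets
    posted in between."""
    return id_later - id_earlier

jun_14 = 2_156_645_977           # real ID, from the status URL above
may_14 = jun_14 - 364_000_000    # hypothetical 14 May sample
print(tweets_between(may_14, jun_14))  # -> 364000000 tweets in a month
```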

For the month preceding 14th Feb, the number of tweets was 91 million. Four months later this figure stands at 364 million. That's a 300% increase. In another 3-4 months, Twitter will hit the billion-tweets-a-month mark.
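For the curious, the arithmetic behind these claims, assuming roughly constant compound growth from month to month:

```python
import math

feb, jun = 91e6, 364e6               # monthly tweet counts from the chart
print(f"increase: {(jun - feb) / feb:.0%}")        # -> 300%
monthly_growth = (jun / feb) ** (1 / 4)            # ~1.41x per month
months_to_1b = math.log(1e9 / jun) / math.log(monthly_growth)
print(f"months to a billion: {months_to_1b:.1f}")  # ~2.9, i.e. about 3 months
```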

Come back to Google and spare a thought for how Google indexes. Google crawls each of these tweets as an HTML page. Each page is almost 8KB in size. That's 8KB for information that is closer to 140 bytes in most cases. Storage is not such an issue, as Google must already be handling exabytes of data. But there is a whole lot of processing on this data before one can extract meaningful information. Each of these HTML pages needs to be parsed to figure out the actual tweet. Almost none of these tweets will ever have an inlink from another page, so that processing is wasteful. This wasteful computing is not so much about the dollars lost; the added delay hampers the core offering of "real-timeness." And all this processing is unavoidable, because by the time you know a tweet is useless, you have already spent the time to discover its uselessness.
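To make the overhead concrete, here is a minimal sketch: parse a full (hypothetical) 8KB-ish page just to recover a tweet's worth of text. The markup and the "entry-content" class name are made up for illustration, loosely modeled on Twitter pages of the time.

```python
from html.parser import HTMLParser

# Hypothetical page: ~8KB of chrome wrapping a tweet-sized payload.
PAGE = ('<html><head><title>t</title></head><body><div>'
        + 'x' * 8000 +
        '</div><span class="entry-content">the actual tweet text</span>'
        '</body></html>')

class TweetExtractor(HTMLParser):
    """Pull out the text of the (assumed) tweet-holding span."""
    def __init__(self):
        super().__init__()
        self.in_tweet = False
        self.tweet = None

    def handle_starttag(self, tag, attrs):
        if tag == 'span' and ('class', 'entry-content') in attrs:
            self.in_tweet = True

    def handle_data(self, data):
        if self.in_tweet:
            self.tweet = data
            self.in_tweet = False

p = TweetExtractor()
p.feed(PAGE)
print(f"parsed {len(PAGE)} bytes to recover {len(p.tweet)} bytes")
```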

I wonder if Twitter will go out of its way to provide a fat pipe to Google through which all the tweets are transported. I suspect it's highly unlikely. (Why? I leave that as an exercise for you.) This "problem" is only going to get worse by the day as Twitter's popularity continues to rise. And the solution to this problem has been wildly and widely speculated about out there for a long time.

(Yours truly is contributing to this problem in a small way at @kshashi.)


>> Each of these html pages need to be parsed to figure out the actual tweet. Almost all of these tweets are never going to have any inlink from other page. So, that processing is wasteful.

I did not understand this completely. Since speculations are ripe about a vertical micro-blogging search, they could easily start with a list of XPath queries per domain (like twitter.com) to get at the individual tweet. In Twitter's case, the tweets are embedded within "li" tags. They have to create a DOM for the web page anyway, and I am sure they must currently be extracting features like images, bold text, titles, etc. So how much of a burden would one extra query impose?

In case they do not want hard-coded XPaths per domain, they could also exploit the fact that the tweets occur in some form of repeating pattern within the page, and write a generic extraction routine, because the scope of the extraction problem is well defined and limited.
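The per-domain idea can be sketched with the standard library's ElementTree on a toy, well-formed page; the path and the "entry-content" class name are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical per-domain table: domain -> path of the repeating
# element that holds each message.
EXTRACTION_PATHS = {
    'twitter.com': ".//span[@class='entry-content']",
}

PAGE = """<html><body><ul>
  <li><span class="entry-content">first tweet</span></li>
  <li><span class="entry-content">second tweet</span></li>
</ul></body></html>"""

def extract(domain, page):
    """Run the domain's canned query against a parsed page."""
    root = ET.fromstring(page)  # real pages would need a lenient HTML parser
    return [el.text for el in root.findall(EXTRACTION_PATHS[domain])]

print(extract('twitter.com', PAGE))  # -> ['first tweet', 'second tweet']
```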

The problem is creating the DOM object itself. Each tweet page needs to be parsed like a normal web page (which, I suppose, involves some heavy-duty analysis). I feel that's overkill.

Google has all the processing power to do this with "reasonable" speed, but it is obvious that this processing is not of much use for twitter search.

With more than 10 million tweets every day and growing fast, Google isn't going to have it easy.
About DOM object creation: when we were talking with Webaroo, I was told you guys were parsing 40 HTML pages a second. Ravindra Jaju's Hypar was also hitting the same benchmark. At that rate, 10 million tweets daily would take about 70 hours. Split the tasks in parallel and you can do it in far less time.
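The back-of-the-envelope numbers above, spelled out (the 100-machine figure is an arbitrary example of parallelism, not anything Google or Webaroo disclosed):

```python
tweets_per_day = 10_000_000
pages_per_sec = 40                # single-machine rate cited above

hours_single = tweets_per_day / pages_per_sec / 3600
print(f"one machine: ~{hours_single:.0f} hours")   # roughly 70 hours

machines = 100                    # hypothetical degree of parallelism
print(f"{machines} machines: ~{hours_single / machines * 60:.0f} minutes")
```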

I think that Google crawlers continuously tracking the twitter.com domain will add unnecessary burden. Like you mentioned, a fat pipe that Google pays Twitter for, along with a smarter RSS-like protocol, could be a better solution.
What was Summize (now Twitter search) doing? I guess they were using Twitter's API to provide real-time search results. No need to parse a page to find a tweet.

Or am I way off?

You are right that Summize was probably working with feeds. But Google can't do the same: for each search result they (need to) give a cached-page link, which means they actually crawl the HTML page.

Also, Summize was working on far smaller data. The volume on Twitter has gone up by an order of magnitude or two in the last 12 months.