Google Addresses Index Staleness

June 10, 2010

Next week, I am giving two lectures about what is now one of the touchstones of 2010: real time. I will put up some extracts from these lectures in the next week or so. What I want to do this morning is call your attention to a post from Google called “Our New Search Index: Caffeine.” I think the nod to the fizzy drinks that give club goers and sluggish 20 somethings a lift is interesting.

Most users of a search and retrieval system have zero clue about when the index was updated or assembled. The 20 something wizards assume that if an index is available from an electronic device, that index is up to the minute or even more current.

Most online system users have zero clue about when data were created, when those data were processed, when those index pointers were updated, or what other factors may have slammed on the search system’s air brakes. Ever hear this in an organization: “I know my version of the PowerPoint should be in the system but I can’t find it.” I do. Frequently.

The Google write up makes clear, in a Googley sort of way, that the company wants to try to cope with streams of information from Twitter and Facebook. Traffic from social sites either has reached parity with search traffic or has surpassed it. I have some information in Overflight, and I will post one or two items that document this usage shift. Users seem to prefer what looks to most people like “real time.” A traditional indexing system does not do real time with Mikhail Nikolaevich Baryshnikov’s agility.

Here’s what the Googlers said:

Caffeine lets us index web pages on an enormous scale. In fact, every second Caffeine processes hundreds of thousands of pages in parallel. If this were a pile of paper it would grow three miles taller every second. Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day. You would need 625,000 of the largest iPods to store that much information; if these were stacked end-to-end they would go for more than 40 miles. We’ve built Caffeine with the future in mind. Not only is it fresher, it’s a robust foundation that makes it possible for us to build an even faster and comprehensive search engine that scales with the growth of information online, and delivers even more relevant search results to you. So stay tuned, and look for more improvements in the months to come.

The idea is that Google, which has numerous ways of skinning the content processing cat, has grabbed the digital Red Bull.

image

© Google 2010.

I have no doubt that the freshness of certain types of content is going to benefit. However, I am not sure that Google will be able to handle its vast content processing needs with the ballet grace the nuclear logo in the blog post suggests. Furthermore, I don’t think that most users understand that whatever Google does to process content more quickly and update its indexes deals with only some of the thorny underlying issues. I address these in my lecture, but the user is unlikely to know about latency elsewhere in the content ecosystem.

The notion of “real time” is slippery. The notion of an index’s “freshness” is slippery. The problem is a complex one. Why do you think that financial institutions pay really big bucks for products from Exegy and Thomson Reuters to deal with freshness? The reason? Speed that can be documented from the moment of information creation through acquisition, processing, and availability. For freshness, be prepared to spend big money.
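To make “documented” concrete, here is a minimal Python sketch of how one might measure freshness across the four moments named above. The stage names, the `stage_latencies` helper, and the timestamps are all hypothetical illustrations of mine, not anything taken from Google, Exegy, or Thomson Reuters.

```python
from datetime import datetime, timedelta

# Hypothetical stage names, taken from the four moments named in the text.
STAGES = ["created", "acquired", "processed", "available"]


def stage_latencies(timestamps):
    """Return the delay between consecutive stages and the end-to-end lag."""
    delays = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        delays[f"{earlier}->{later}"] = timestamps[later] - timestamps[earlier]
    total = timestamps["available"] - timestamps["created"]
    return delays, total


# Made-up timestamps for a single document, for illustration only.
created = datetime(2010, 6, 10, 9, 0, 0)
example = {
    "created": created,
    "acquired": created + timedelta(minutes=12),
    "processed": created + timedelta(minutes=15),
    "available": created + timedelta(minutes=45),
}

delays, total = stage_latencies(example)
for name, delay in delays.items():
    print(name, delay)
print("total", total)
```

In this invented example the index update itself is the quick part. Most of the lag sits before acquisition and after processing, which is the latency the user never sees.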

For a temporary pick me up, guzzle the caffeine-laced beverages from the 7-11. I might just recommend that you turn to http://search.twitter.com and look for a tip on where to buy a Jolt at a discount. Just my opinion.

Stephen E Arnold, June 10, 2010

A freebie. No coupons for a complimentary can of Jolt.
