History of Web Indexing: BBC Style
September 4, 2013
I read “Jonathon Fletcher: Forgotten Father of the Search Engine.” I have no quibble with the claims that the first Web crawler was an invention spawned in the United Kingdom.
I did find several interesting factoids in the write up; for example:
- Google has indexed more than one trillion pages. On the surface, this sounds just super. However, what is the cost of maintaining the index of the alleged one trillion pages? Is Google cutting corners in its indexing to reduce costs? Perhaps the BBC will expand on this statement. A trillion is a big number and I wonder what percentage of those “pages” are indexed on a daily basis to keep the index fresh.
- “Because websites were added to the list manually, there was nothing to track changes to their content. Consequently, many of the links were quickly out-of-date or wrongly labeled.” Is this true today?
- “By June of 1994, JumpStation had indexed 275,000 pages. Space constraints forced Mr Fletcher to only index titles and headers of web pages, and not the entire content of the page, but even with this compromise, JumpStation started to struggle under the load.” Decades ago the black hole of Web indexing was visible. Now that Big Data have arrived, won’t indexing costs rise in lockstep? What cost savings are available? Perhaps indexing less content and changing the index refresh cycles are expedient actions? Have Bing, Google, and Yandex gone down this path? Perhaps the BBC will follow up on this issue.
- “But in my [Fletcher’s] opinion, the Web isn’t going to last forever. But the problem of finding information is.” Has progress been made in Web search?
One interesting aspect of the write up is the conflation of Web search with other types of search. The confusion persists, I believe.
Perhaps the BBC will look into the contributions to search of Dr. Martin Porter, the inventor of the Porter Stemmer. Dr. Porter’s Muscat search technology was important, arguably more important than Mr. Fletcher’s.
Stephen E Arnold, September 4, 2013
Comments
Porter’s Muscat search engine was also (briefly) the foundation of a web search engine, Webtop: we built this in 1999 or so and indexed around half a billion web pages. Although Webtop and the Muscat business are long gone, the core search software survives as the Xapian open source project, still in use by companies such as the Financial Times.