History of Web Indexing: BBC Style

September 4, 2013

I read “Jonathon Fletcher: Forgotten Father of the Search Engine.” I have no quibble with the claims that the first Web crawler was an invention spawned in the United Kingdom.

I did find several interesting factoids in the write up; for example:

Google has indexed more than one trillion pages. On the surface, this sounds just super. However, what is the cost of maintaining the index of the alleged one trillion pages? Is Google cutting corners in its indexing to reduce costs? Perhaps the BBC will expand on this statement. A trillion is a big number and I wonder what percentage of those “pages” are indexed on a daily basis to keep the index fresh.
“Because websites were added to the list manually, there was nothing to track changes to their content. Consequently, many of the links were quickly out-of-date or wrongly labeled.” Is this true today?
“By June of 1994, JumpStation had indexed 275,000 pages. Space constraints forced Mr Fletcher to only index titles and headers of web pages, and not the entire content of the page, but even with this compromise, JumpStation started to struggle under the load.” Decades ago the black hole of Web indexing was visible. Now that Big Data have arrived, won’t indexing costs rise in lockstep? What cost savings are available? Perhaps indexing less content and changing the index refresh cycles are expedient actions? Have Bing, Google, and Yandex gone down this path? Perhaps the BBC will follow up on this issue.
“But in my [Fletcher’s] opinion, the Web isn’t going to last forever. But the problem of finding information is.” Has progress been made in Web search?

One interesting aspect of the write up is the conflation of Web search with other types of search. The confusion persists, I believe.

Perhaps the BBC will look into the contributions Dr. Martin Porter, the inventor of the Porter Stemmer, made to search. Dr. Porter’s Muscat search technology was important, arguably more important than Mr. Fletcher’s.

Stephen E Arnold, September 4, 2013

Comments

One Response to “History of Web Indexing: BBC Style”

Charlie Hull on September 4th, 2013 9:30 am

Porter’s Muscat search engine was also (briefly) the foundation of a web search engine, Webtop: we built this in 1999 or so and indexed around half a billion web pages. Although Webtop and the Muscat business are long gone, the core search software survives as the Xapian open source project, still in use by companies such as the Financial Times.


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.