Why So Few Search Vendors Index the Web?
July 5, 2018
How many companies are indexing the Surface Web, the Dark Web, and the other bits and pieces which comprise the accessible Internet?
The answer is, “Not many that most people can name.”
Another question is, “Why don’t more companies just index the Internet?”
The answer is, “Money, resources, time, expertise, and the need to generate revenue.”
The write up from 2012, “How to Crawl a Quarter Billion Webpages in 40 Hours,” surfaced again after an absence of six years. The article remains valid even though the principal change in the last 72 months is the increased concentration of Web search in Google’s index. Microsoft, a company which insists that its Bing system provides an alternative to Google, has not significantly weakened Google’s market magnetism. Many of the systems which are marketed as Web indexes, like Duckduckgo.com and Startpage.com, are metasearch engines; that is, the users’ queries are passed to other services, and the results may be supplemented with some original crawling. A bit of fiddling ensures that the results lists seem to be different. But there is a sameness to the result sets, particularly on popular queries. Yandex, the Russian Web search system, does a good job of handling certain sets of domains, but the overall coverage is not that different from what one can find in Google or its country centric indexes.
What’s interesting about “How to Crawl” from 2012 is the use of the Amazon system. This is important because the plumbing required to index the Internet can be large, complicated, and expensive.
Does Amazon still operate its A9 Web index? We have heard yes and no as an answer to this question. With a significant number of queries seeking product information, it makes sense to consider Amazon as a potential competitor to Bing, Google, and Yandex.
After rereading the “How to Crawl” paper, one thing jumps out. The notion that a quarter of a billion pages is a nontrivial chunk of the Internet is interesting but a bit misleading. There may be upwards of 30 billion indexable Web pages, which makes a quarter of a billion pages less than one percent of the total. A large number of these content objects also exist in mobile forms; thus, deduplication becomes an interesting issue. That’s why Google has multiple indexes.
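To put the scale in perspective, here is a rough back-of-the-envelope sketch. It uses only two figures from this post: the quarter billion pages in 40 hours from the paper's title and the estimate of 30 billion indexable pages. The function names are made up for illustration, and the calculation assumes the crawl rate stays constant and that each page is fetched once, with no recrawling or deduplication. Neither assumption is likely to hold in practice, so treat the result as a lower bound at best.

```python
# Figures taken from the post; everything else is an illustrative assumption.
CRAWLED_PAGES = 250_000_000                  # pages crawled in the 2012 write up
CRAWL_HOURS = 40                             # duration of that crawl
ESTIMATED_INDEXABLE_PAGES = 30_000_000_000   # the post's rough estimate


def crawl_share(crawled: int, total: int) -> float:
    """Fraction of the estimated indexable Web covered by the crawl."""
    return crawled / total


def hours_to_crawl(total: int, crawled: int, hours: float) -> float:
    """Hours needed to fetch `total` pages at the same constant rate
    (a simplifying assumption: one pass, no recrawl, no slowdown)."""
    pages_per_hour = crawled / hours
    return total / pages_per_hour


share = crawl_share(CRAWLED_PAGES, ESTIMATED_INDEXABLE_PAGES)
hours = hours_to_crawl(ESTIMATED_INDEXABLE_PAGES, CRAWLED_PAGES, CRAWL_HOURS)

print(f"Share of estimated indexable Web: {share:.2%}")
print(f"Hours for one full pass at the same rate: {hours:,.0f} "
      f"(about {hours / 24:,.0f} days)")
```

Under these assumptions, the 2012 crawl covers well under one percent of the estimated total, and a single pass over everything would take months on the same setup. Keeping such an index fresh multiplies that cost, which is the money and plumbing problem described above.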
The big question becomes, “Is there another company able to compete with Google?”
After rereading “How to Crawl” following a lapse of six years, the answer may be,
“Very, very few companies. And some of the outfits indexing the Surface and Hidden Internet may not make their activities public.”
Monocultures are okay, but they can be vulnerable to a threat they cannot resist. Is Google like today’s banana? What happens if a blight attacks? One can shift to durian, I suppose.
Stephen E Arnold, July 5, 2018