Why So Few Search Vendors Index the Web?

July 5, 2018

How many companies are indexing the Surface Web, the Dark Web, and the other bits and pieces which comprise the accessible Internet?

The answer is, “Not many that most people can name.”

Another question: “Why don’t more companies just index the Internet?”

The answer is, “Money, resources, time, expertise, and the need to generate revenue.”

The write up from 2012 “How to Crawl a Quarter Billion Webpages in 40 Hours” surfaced again after an absence of six years. The article remains valid even though the principal change in the last 72 months is the increased concentration of Google’s index. Microsoft, a company which insists that its Bing system provides an alternative to Google, has not significantly weakened Google’s market magnetism. Many of the systems which are marketed as Web indexes, like Duckduckgo.com and Startpage.com, are metasearch engines; that is, the users’ queries are passed to other services, and the results may be supplemented with some original crawling. A bit of fiddling ensures that the results lists seem to be different. But there is a sameness to the result sets, particularly on popular queries. Yandex, the Russian Web search system, does a good job of handling certain sets of domains, but the overall coverage is not that different from what one can find in Google or its country centric indexes.

What’s interesting about “How to Crawl” from 2012 is the use of the Amazon system. This is important because the plumbing required to index the Internet can be large, complicated, and expensive.

Does Amazon still operate its A9 Web index? We have heard yes and no as an answer to this question. With a significant number of queries seeking product information, it makes sense to consider Amazon as a potential competitor to Bing, Google, and Yandex.

After rereading the “How to Crawl” paper, one thing jumps out. The notion that a quarter of a billion pages is a nontrivial chunk of the Internet is interesting but a bit misleading. There may be upwards of 30 billion indexable Web pages. A large number of these content objects also exist in mobile forms; thus, deduplication becomes an interesting issue. That’s why the Google has multiple indexes.
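To put the two numbers side by side, here is a back-of-the-envelope sketch. The 250 million pages and 40 hours come from the title of the 2012 write up; the 30 billion figure is the rough estimate mentioned above. The straight-line extrapolation is an assumption made only for illustration: it pretends the same setup keeps the same pace and ignores deduplication, recrawling, and politeness limits, so treat the result as a floor, not a forecast.

```python
# Back-of-the-envelope comparison of the 2012 crawl with the whole Web.
# All three inputs are rough figures, not measurements.
CRAWLED_PAGES = 250_000_000                 # "a quarter billion" pages
CRAWL_HOURS = 40                            # time reported in the write up's title
ESTIMATED_INDEXABLE_PAGES = 30_000_000_000  # rough estimate, could be higher


def share_of_web(crawled: int, total: int) -> float:
    """Fraction of the estimated Web covered by the crawl."""
    return crawled / total


def hours_at_same_rate(total: int, crawled: int, hours: float) -> float:
    """Hours needed to fetch `total` pages at the crawl's observed pace.

    Assumes (unrealistically) that throughput stays constant.
    """
    pages_per_hour = crawled / hours
    return total / pages_per_hour


share = share_of_web(CRAWLED_PAGES, ESTIMATED_INDEXABLE_PAGES)
hours = hours_at_same_rate(ESTIMATED_INDEXABLE_PAGES, CRAWLED_PAGES, CRAWL_HOURS)

print(f"Share of the estimated Web: {share:.2%}")
print(f"Time for one full pass at the same pace: {hours:,.0f} hours "
      f"(about {hours / 24:.0f} days)")
```

Under these assumptions the quarter billion pages come to less than one percent of the estimated total, and a single pass over everything would take roughly 200 days on the same plumbing. A real index also has to revisit pages to stay fresh, which is where the money and resources come in.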

The big question becomes, “Is there another company able to compete with Google?”

Rereading “How to Crawl” after a lapse of six years suggests the answer may be,

“Very, very few companies. And some of the outfits indexing the Surface and Hidden Internet may not make their activities public.”

Monocultures are okay, but they can be vulnerable to a threat the monoculture cannot resist. Is Google like today’s banana? What happens if a blight attacks? One can shift to durian, I suppose.

Stephen E Arnold, July 5, 2018

Written by Stephen E. Arnold · Filed Under News, Search

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.