An Interesting Hypothesis about Google Indexing

January 15, 2020

We noted the article “Google’s Crawl-Less Index.” The main idea is that something has changed in how Google indexes the Web. We circled in yellow this statement from the article:

[Google] can do this now because they have a popular web browser, so they can retire their old method of discovering links and let the users do their crawling.

The statement needs context.

The speculation is that Google indexes a Web page only when a user visits that page. Google notes the visit and indexes the page.

What’s happening, DarkCyber concludes, is that Google no longer brute-force crawls the public Web. Indexing takes place only when a signal (a human navigating to a page, presumably via Chrome) arrives; then the page is indexed.
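To make the speculation concrete, here is a minimal sketch of what signal-driven indexing might look like. Everything in it is hypothetical: the visit_signals queue, the index_page stub, and the reindex interval are illustrative stand-ins, not anything Google has documented.

```python
import queue
import time

# Hypothetical sketch: instead of a scheduler that brute-force crawls
# every known URL, the indexer waits for a "user visited this page"
# signal (e.g., relayed from a popular browser) and only then indexes.

visit_signals = queue.Queue()      # stand-in for the browser telemetry feed
last_indexed = {}                  # url -> timestamp of last indexing pass
REINDEX_INTERVAL = 7 * 24 * 3600   # skip pages indexed within the last week

def index_page(url):
    """Placeholder for fetching, parsing, and writing the page to the index."""
    print(f"indexing {url}")

def run_indexer():
    while True:
        url = visit_signals.get()  # block until a user visit is reported
        now = time.time()
        # A page nobody visits never reaches this point -- the
        # cost-saving (and coverage-losing) property speculated about.
        if now - last_indexed.get(url, 0) >= REINDEX_INTERVAL:
            index_page(url)
            last_indexed[url] = now
```

The telling property of the sketch is the omission: a page no user ever visits never enters the loop, which would explain the stale and missing sites itemized below.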

Is this user-behavior centric indexing a reality?

DarkCyber has noted these characteristics of Google’s indexing in the last year:

  1. Certain sites are in the Google indexes but are either not updated or updated selectively; for example, the Railroad Retirement Board, MARAD, and similar sites
  2. Large sites like the Auto Channel no longer have their backfiles indexed and findable unless the user resorts to Google’s advanced search syntax (see the example after this list). Even then, the results display more slowly than current content, probably because Google’s caches do not keep infrequently accessed content close to the user
  3. Current content for many specialist sites is not available when it is published. This is characteristic of commercial sites with unusual top-level domains like .co and of some blogs.
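To illustrate point 2: digging Auto Channel backfiles out of the index requires operators like site: plus the before:/after: date filters Google added in 2019. The domain and cutoff date below are examples only:

```
site:theautochannel.com "road test" before:2012-01-01
```

Without the site: restriction and a date filter, the older pages rarely surface in the default results.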

What’s going on? DarkCyber believes that Google is trying to reduce the growing and very difficult-to-control costs associated with indexing new content, indexing updated content (the deltas), and indexing the complicated pages Web sites generate in chasing the dream of becoming number one for a Google query.

Search efficiency, as we have documented in our write ups, books, and columns about Google, boils down to:

  1. Maximizing advertising value. That’s one reason query expansion is used: expanded results match more ads, so advertisers’ ads get broader exposure. (A toy sketch follows this list.)
  2. Getting away from the old-school approach of indexing the billions of Web pages. Ninety percent of these pages get zero traffic; therefore, index only what users actually want. Today’s Google is not focused on library science, relevance, precision, and recall.
  3. Cutting costs. Cost control at the Google is very, very difficult. The crazy moonshots, the free-form approach to management, the need for legions of lawyers and contract workers, the fines, the technical debt of a 20-year-old company, the salaries, and the extras: each of these has to be controlled. The job is difficult.
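Here is the toy sketch of the expansion mechanic in point 1. The synonym map and ad inventory are invented for illustration; the point is only that an expanded query overlaps more ad keyword sets than the literal query does.

```python
# Toy illustration: query expansion widens the set of ads a query can
# match. The synonym map and ad inventory below are made up.

SYNONYMS = {
    "car": {"car", "auto", "vehicle"},
    "cheap": {"cheap", "budget", "affordable"},
}

ADS = {
    "ad-001": {"auto", "insurance"},
    "ad-002": {"vehicle", "loan"},
    "ad-003": {"budget", "travel"},
}

def expand(query_terms):
    """Replace each term with itself plus its synonyms."""
    expanded = set()
    for term in query_terms:
        expanded |= SYNONYMS.get(term, {term})
    return expanded

def matching_ads(terms):
    """Ads whose keyword set overlaps the (possibly expanded) query."""
    return [ad for ad, keywords in ADS.items() if keywords & terms]

query = {"cheap", "car"}
print(matching_ads(query))          # [] -- no ad shares a literal query term
print(matching_ads(expand(query)))  # all three ads match after expansion
```

More matching ads means more bidders per query, which is the revenue angle the list item describes.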

Net net: Ever wonder why finding specific information is getting more difficult via Google? Money.

PS: Finding timely, accurate information and obtaining historical content are more difficult, in DarkCyber’s experience, than at any time since we sold our The Point service to Lycos in the mid-1990s.

Stephen E Arnold, January 15, 2020

Comments

One Response to “An Interesting Hypothesis about Google Indexing”

  1. Jed Grant on January 15th, 2020 1:12 pm

    It seems that this degrading of the Google index is giving a big leg up to DuckDuckGo – whose results are seeming more pertinent, recent and relevant by the day.
