Google: History? Backfiles Do Not Sell Ads

April 29, 2019

We spotted a very interesting article in Tablix: “Google Index Coverage”. We weren’t looking for the article, but it turned up in a list of search results and one of the DarkCyber researchers called it to my attention.

Background: Years ago we did a bit of work for a company engaged in data analysis related to the health and medical sectors. We had to track down the names of the companies who were hired by the US government to do some outsourced fraud investigation. We were able to locate the government statements of work and even some of the documents related to investigations. We noticed a couple of years ago that our bookmarks to some government documents did not resolve. With USA.gov dependent on Bing, we checked that index. We tried US government Web sites related to the agencies involved. Nope. The information had disappeared, but in one case we did locate documents on a US government agency’s Web site. The data were “there” but the data were not in Bing, Exalead, Google, or Yandex. We also checked the recyclers of search results: Startpage, the DuckDuck thing, and MillionShort.

We had other information about content disappearing from sites like the Wayback Machine too. From our work for assorted search companies and our own work years ago on ThePoint.com, which we sold to Lycos, we had considerable insight into the realities of paying for indexing that did not generate traffic or revenue. The conclusion we had reached, and assumed other vendors would reach, was:

Online search is not a “free public library.”

A library is/was/should be an archiving entity; that is, someone has to keep track of and store physical copies of books and magazines.

Online services are not libraries. Online services sell ads, as we did to Zima, which wanted its drink in front of our users. This means one thing:

Web indexes dump costs.

The Tablix article makes clear that some data are expendable. Delete them.

Our view is:

Get used to it.

There are some knock-on effects from the simple logic of reducing costs and increasing the efficiency of the free Web search systems. I have written about many of these, and you can search the 12,000 posts on this blog or pay to search commercial indexes for information in my more than 100 published articles related to search. You may even have a copy of one of my more than a dozen monographs; for example, the original Enterprise Search Reports or The Google Legacy.

  1. Content is disappearing from indexes on commercial and government Web sites. Examples range from the Tablix experience to the loss of the MIC contracts which detail exclusives for outfits like Xerox.
  2. Once the content is not findable, it may cease to exist for those dependent on free search and retrieval services. Sorry, Library of Congress, you don’t have the content, nor does the National Archives. The situation is worse in countries in Asia and Eastern Europe.
  3. Individuals, particularly the annoying millennials who want me to provide information for free, do not have the tools at hand to locate high value information. There are services which provide some useful mechanisms, but these are often affordable only by certain commercial enterprises, some academic research organizations, and law enforcement and intelligence agencies. This means that most people are clueless about the “accuracy,” “completeness,” and “provenance” of certain information.

Net net: If data generate revenue, they may be available online and findable. If the data do not, hasta la vista. The situation is one that gives me and my research team considerable discomfort.

Imagine how smart software trained on available data will behave. Probably in a pretty stupid way. Information is not what people believe it to be. Now we have a generation or two of people who think research is looking something up on a mobile device. Quite a combo: ill-informed humans and software trained on incomplete data.

Yeah, that’s just great.

Stephen E Arnold, April 28, 2019

Comments

One Response to “Google: History? Backfiles Do Not Sell Ads”

  1. John Sutherland on April 30th, 2019 8:38 am

    Interesting article. I think many of us suspected that the Internet library was being filtered by Google and others, but in my mind, the filtering was intentional and was intended to keep people from finding out the truth about certain historical facts, like facts about Germany in WW !!, and about the alleged holocaust. Apparently the problem is much worse.

    Deleting Internet resources for political reasons would tell us that government (both shadow and visible governments) and Internet companies work together to control us. Is that true? Many people, far smarter and more experienced than me seem to think that is the case.

    There was a time when Google simply provided search results for users, and yet today we all know it is actually controlling what we see on the Internet. Sad devolution of a once great idea.

    I started Highlander.com up once again early this year to try and resolve some of the history deletion problems that I saw happening due to political pressures. Silly me. I started out far too late. Much of the history is apparently already gone.
