Tribune Says: Google’s Automated Indexing Not Good

September 11, 2008

I have been a critic of Sam Zell’s Tribune since I tangled with the site on behalf of my 86-year-old father. You can read my negative views of the site’s usability, its indexing, and its method of displaying content here.

Now, on to my comments about this MarketWatch story titled “Tribune Blames Google for Damaging News Story” by John Letzing, a good journalist in my book. Mr. Letzing reports that Google’s automated crawler and indexing system could not figure out that a story from 2002 was old. As a result, the “old” story appeared in Google News, and the stock of United Airlines took a hit. The Tribune, according to the story, blames Google.

Hold your horses. This problem is identical to the one created by the folks who say, “Index my servers. The information on them is what we want indexed.” As soon as the index goes live, these same folks complain that the search engine has processed ripped-off music, software from mysterious sources, Cub Scout fund-raising materials, and some content I don’t want to mention in a Web log. How do I know? I have heard this type of rationalization many times. Malformed XML, duplicate content, and other problems mean content mismanagement, not bad indexing by a search system.
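Here is a minimal sketch, in Python, of the kind of housekeeping I have in mind. The ./public_html directory is a stand-in for a mirror of a public-facing server, and the two checks (byte-identical duplicates, malformed XML) are illustrative assumptions, not anyone’s production pipeline. The point is that this is the content owner’s job, done before a crawler ever shows up.

```python
# Sketch of a pre-indexing content audit over a hypothetical local mirror
# of a public-facing server. Catching duplicates and malformed XML here is
# content management; expecting the search engine to sort it out later is not.
import hashlib
import pathlib
import xml.etree.ElementTree as ET

def audit(root="public_html"):
    seen = {}  # content hash -> first path that published those bytes
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()

        # Flag exact duplicates: the same bytes exposed under two locations.
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            print(f"DUPLICATE: {path} repeats {seen[digest]}")
        else:
            seen[digest] = path

        # Flag malformed XML before a crawler has to guess what it means.
        if path.suffix.lower() == ".xml":
            try:
                ET.fromstring(data)
            except ET.ParseError as err:
                print(f"MALFORMED XML: {path} ({err})")

audit()
```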

Most people don’t have a clue what’s on their public-facing servers. The content management system may be at fault. The users might be careless. Management may not have policies, or may not create an environment in which those policies are observed. Most people don’t know that “dates” are often assigned by the system and may not correlate with the “date” embedded in a document. In fact, some documents contain many dates. Entity extraction can discover a date, but when there are multiple dates, which one is the “right” one? What’s a search system supposed to do? A search system processes what’s exposed on a public-facing server or on a source identified in the administrative controls of the content acquisition system.
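To make the date problem concrete, here is a small sketch. The story snippet and the two “rules” are made up for illustration; this is not Google’s method, just a demonstration that a document carrying several dates gives an extractor nothing definitive to pick.

```python
# Sketch of date ambiguity in entity extraction, using a made-up story snippet.
# The extractor finds several candidate dates; nothing in the text says which
# one is the publication date, so any simple rule ("take the newest", "take the
# first") can mislabel an old story as current.
import re
from datetime import datetime

story = """United Airlines Files for Bankruptcy
Filed under Business. Retrieved September 7, 2008.
CHICAGO, December 9, 2002 - United Airlines filed for Chapter 11 protection..."""

candidates = [
    datetime.strptime(m, "%B %d, %Y")
    for m in re.findall(r"(?:January|February|March|April|May|June|July|"
                        r"August|September|October|November|December)"
                        r" \d{1,2}, \d{4}", story)
]

print("candidate dates:", [d.date().isoformat() for d in candidates])
print("'newest' rule picks:", max(candidates).date())   # 2008: looks current
print("'oldest' rule picks:", min(candidates).date())   # 2002: the real filing
```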

Blaming a software system for lousy content management is a flashing yellow sign that says to me “Uninformed ahead. Detour around problem.”

Based on my experience with indexing content managed by people who were too busy to know what was on their machines, I think blaming Google is typical of the level of understanding in traditional media about how automated or semi-automated systems work. Furthermore, when I examined the Tribune’s for-fee service referenced above, it was clear that the level of expertise brought to bear on that service was, in my opinion, rudimentary.

Traditional media is eager to find fault with Google. Yet some of these outfits use automated systems to index content and cut headcount. The indexing generated by these systems is acceptable, but there are errors. Some traditional publishers not only index in a casual manner, they also charge for each query. A user may have to experiment in order to find relevant documents. Each search puts money in the publisher’s pocket. The Tribune charges for an online service that is essentially unusable by my 86-year-old father.

If a Tribune company does not know what’s on its servers and exposes those servers on the Internet, the problem is not Google’s. The problem is the Tribune’s.

Stephen Arnold, September 11, 2008

Comments

2 Responses to “Tribune Says: Google’s Automated Indexing Not Good”

  1. Miss cj on September 11th, 2008 5:43 am

    I agree with you, this is not a Google problem.

  2. Stephen E. Arnold on September 11th, 2008 7:04 am

    Miss cj

    Ah, traditional publishers. Snow falling around the dinosaurs
    leads to empty trumpeting in the dark night.

    Stephen Arnold, September 11, 2008
