Micro Mart’s Surprising Web Search Findings: Google Is an Also Ran

November 11, 2008

The trusty newsreader served up a link to a three-part article by Peter Hayes. He wrote a feature “The Secret Life of Search Engines” for Micromart.com. I have conflicting date information for this article. It may have been written yesterday or a year ago.

You can find the first part here. The second part here. And the third part here. The Micromart.com site search engine leaves a bit to be desired because its index does not contain a pointer to the first part of this article. Sigh. My own tools ferreted out the three parts, and I think you will find Mr. Hayes’ analysis surprising. The key point for me is that when a journalist runs benchmark queries across search systems, the gulf between those who understand what readers find interesting and those who build search engines becomes evident. In fact, if Mr. Hayes’ analysis were used as the definitive guide for finding information on the public Web, there would be considerable consternation at a number of high profile firms and cause for joy among a group of search engines that are going nowhere in terms of usage. I want to consider this point at the end of my Beyond Search post. Let’s look at the key points in each of the three parts of this analysis, shall we?

Part One: Outline Politics

Straight off let me say I don’t know what ‘outline politics’ means. I don’t think it matters much beyond privacy and the ambivalent nature of an index’s utility. I did not get the impression that the phrase is particularly significant in the flow of his argument. The series begins with the notion that you can make money offering a product people use every day. The idea is flawless when it comes to a fungible product, but I am not sure it applies to the somewhat more slippery world of information. Nevertheless, the point is that traffic is good. Furthermore, the Internet is changing. Content is tricky. Mr. Hayes introduces the notion of official content and unofficial content. That’s a useful distinction, but it did not resonate with me. Mr. Hayes then asserts that search engines have, and I quote:

two major functions. One is to teach, the other is to search. While both have a large positive side we shouldn’t pretend that there isn’t a downside to any tool. Any tool used for good can also be used for bad.

He is now in full stride and hitting a hot button almost guaranteed to whip up interest among European Web users–privacy. He then heads for the end of Part One with this comment:

My final thought is that search engines are only passengers on the Internet train and not the train itself. The growth of the Internet gives them the prospect of a healthy and prosperous future – but at the same time it is reliant on the safekeeping and update of the Internet to keep up with demand and to protect it from vandals. As our newspaper headlines tell us, the world is not totally a safe and law abiding place.

I must admit that I am not quite sure of the logic of this first section, but let’s move on to Part Two.

Part Two: Tools

Mr. Hayes dives in with location searching and touches upon Boolean logic, promising to tackle this topic elsewhere in his series. His first injunction is to keep a search simple. Web indexes are divided into systems dependent on software and systems dependent on humans. Mr. Hayes does not provide a context for the disparity in usage between these two types of systems, a distinction that will return to haunt him in Part Three of his series. He points out that search systems are not “born equal”. The promised analysis of Boolean arrives and I learn:

Boolean (which consists of the three words AND, OR, NOT, remember) is best explained by example. Some engines don’t allow it and some only use the NOT part. This follows the general rule that nothing to do with the Internet is ever totally straightforward! Typing NOT will take out examples that don’t fit the bill (‘Arsenal NOT soccer’, for example), but this is hard word to use and control. In Yahoo, double meanings are automatically divided out. Also the engine can easily come up with word connections that you would never think of in a million years – including simple names.

I think I understand even though Mr. Hayes’ own examples use symbols for AND, and he does not provide an example of a successful NOT search statement. NOT for Mr. Hayes is a “hard word to control”. I imagine that for him NOT may be troublesome. He points out that:

AND is the least useful of all because most of time, it is taken as read on all known engines that work via keywords. Type ‘Peter Hayes Writing Genius’ it will give the same result as ‘Peter+Hayes+Writing+Genius’ or ‘Peter AND Hayes AND Writing AND Genius’.

The statement confirms my suspicions that Mr. Hayes has taken a very different view of Boolean logic, its complexities, and the way in which logical operators work in his world. I quite like AND, NOT, OR, and even NAND in some systems. You may find AND and NOT useful as well.
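To make the point concrete, here is a minimal sketch of how the three operators behave on a toy inverted index. The documents, the `build_index` and `term` helpers, and the operator functions are all invented for illustration; none of this describes how any engine in Mr. Hayes’ test actually works.

```python
# Toy documents, made up for this illustration only.
DOCS = {
    1: "arsenal soccer club wins match",
    2: "arsenal of weapons found in warehouse",
    3: "soccer results from the weekend",
}


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


INDEX = build_index(DOCS)


def term(word):
    """Return the ids of documents containing the word (empty set if none)."""
    return INDEX.get(word.lower(), set())


def AND(a, b):
    """Documents matching both operands."""
    return a & b


def OR(a, b):
    """Documents matching either operand."""
    return a | b


def NOT(a, b):
    """Documents matching a but not b, as in 'a NOT b'."""
    return a - b


arsenal_and_soccer = AND(term("arsenal"), term("soccer"))
arsenal_or_soccer = OR(term("arsenal"), term("soccer"))
arsenal_not_soccer = NOT(term("arsenal"), term("soccer"))
```

In this toy collection, ‘Arsenal NOT soccer’ keeps only the document about weapons, which is exactly the kind of control Mr. Hayes finds hard to use.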

I am not certain what the sub section “Getting It Right” means. The resonance of AND and NOT inutility echoes in my mind. Part Two ends with an observation about how much of the Internet is indexed. That’s a good question, and I now turn to Part Three, where the intellectual rigor of Mr. Hayes meets the Information Superhighway, if I may indulge in a bit of metaphorical whimsy.

Part Three: The Best UK Web Search Engines

I knew I was in for a delightful few minutes after the first two parts of Mr. Hayes’ feature. In Part Three he lays out 10 test queries. I can’t reproduce the full list, but I can highlight two of his queries:

  • Bring me the site of the best selling newspaper in the UK (The Sun)
  • Find a local newspaper covering the Shetlands

I noted that each query is expressed as a string of text. Some vendors would rush to point out that Mr. Hayes is using natural language queries. Not many systems support natural language queries in particularly sophisticated ways. Some, for instance, create a Boolean query from whatever the user enters in the search box. Other systems consult a look up table of what’s been a satisfactory result for the query recently and deliver that result from a cache. Others dump stop words and go with the meaningful words joined by an implicit AND or OR Boolean operator. Others look at what’s available from an advertiser and dump those results directly to the user. Others predict what a user will prefer based on that user’s profile or the user’s usage history. This list is not exhaustive by any means.
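The stop word approach is easy to picture. What follows is a rough sketch, assuming a tiny stop word list and a function name (`to_boolean`) that I made up; real systems use much longer lists and more elaborate rewriting, and I am not claiming any vendor does it this way.

```python
import re

# Hypothetical stop word list, far shorter than anything a real system uses.
STOP_WORDS = {"a", "an", "the", "of", "in", "me", "to", "for"}


def to_boolean(query, operator="AND"):
    """Drop stop words and join the remaining words with one Boolean operator."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    # Lowercase the query and keep only runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", query.lower())
    kept = [word for word in words if word not in STOP_WORDS]
    return f" {operator} ".join(kept)


shetlands_query = to_boolean("Find a local newspaper covering the Shetlands")
# shetlands_query == "find AND local AND newspaper AND covering AND shetlands"
```

Notice that a word like “find” survives the stop word filter and becomes a required term. That is one reason a plain English request can return odd results from a system that quietly converts it to a Boolean statement.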

What did Mr. Hayes learn from his analysis of the 10 queries sent to the UK sites for Lycos, AltaVista, Dogpile, Excite, HotBot, Metacrawler, MSN, Yahoo, Ask, and Google? I have converted Mr. Hayes’ findings into the summary table below. Keep in mind that these are his data in a slightly different form. These are not my or my team’s findings:

Rank  Engine       Hayes’ Take
1     Lycos        Answered questions well
2     AltaVista    Useful but obscure results
3     Dogpile      Surprised it didn’t do better
4     Excite       Respectable performer
5     HotBot       Good all round performer; Mr. Hayes’ favorite
6     Metacrawler  Biggest surprise of the lot
7     MSN          Slick and impressive performer
8     Yahoo        Handpicked and categorized results a plus
9     Ask          Plain English queries
10    Google       Did not outperform the opposition

Mr. Hayes includes “scores” for each engine. The top rated engine Lycos received a Hayes number of 83%; the lowest rated engine Google received a Hayes number of 78%.

Observations

I came away from my reading of this three part series in a semi stunned state. I had a number of major and minor quibbles gallivanting around my cranial cavity. Let me highlight three points and move on:

  1. This article made it clear to me that people don’t know what they don’t know about Web search, its technology, and its nuances. Google is probably correct in sticking with its very simple interface and its behind the scenes functions to answer most users’ questions with “good enough” information. If Mr. Hayes is an informed user of Web search systems and finds the HotBot results more useful to him than other systems’ results, that’s well and good. The idea of using one system to conduct research of any type is anathema to me. Overlap, freshness, scope of index–these are essential factors for each Web indexing system. Insensitivity to these issues makes me downright nervous. I thought, “If Mr. Hayes can’t figure out the important parts, what about a less informed online user?”
  2. The queries Mr. Hayes formulated reveal why natural language systems are not understood. Forget semantic methods. I am not sure how to remediate Mr. Hayes’ test queries. The approach is foreign to me as is Mr. Hayes’ failure to differentiate each of the test systems with more precision. There is a big difference between a system that is federating results, one that indexes only frequently accessed pages, and one that operates with orphaned code on a shoestring.
  3. The failure to point out that Google serves about 70 percent of the queries in North America and more in Denmark, Germany, and the UK is an oversight. The giant gets the lowest score, which doesn’t make sense to me. Mr. Hayes uses subjective criteria to generate his Hayes numbers and provides zero detail about the method used to calculate a score. I think scoring these engines on freshness, features, relevance as measured by the number of on target hits in the first 10,000 results in a result set, and similar criteria would show that Lycos, AltaVista, and HotBot aren’t competitive in today’s market. Microsoft’s Live.com and Yahoo search are in some ways easier to benchmark against the Google. The other vendors are non starters in my mind because none has the technical or financial resources to index at the Google, Microsoft Live.com, and Yahoo levels.

Mr. Hayes omitted a Web search engine that I think is better than eight or nine of those on this list; namely, Exalead. I am well pleased with the results I obtain from Exalead.com here. In general, the French make me nervous with their math skills and sense of style, but Exalead is the functional equivalent of Google, operated by Europeans, and a country mile better on my relevance tests than the orphans AltaVista, Excite, and HotBot.

Keep in mind I am stating my opinion. I am an addled goose. I am sure the experts who organize search conferences will be delighted to feature Mr. Hayes as a keynote speaker. The conference organizers and Mr. Hayes’ understanding of search may be well matched.

Stephen Arnold, November 11, 2008
