Search: The Problem with Words and Their Misuse
January 30, 2008
I rely on several different types of alerts, including Yahoo’s service, to keep pace with developments in what I call “behind the firewall search”.
Today was particularly frustrating because the number of matches for the word “search” has been increasing, particularly since the Microsoft – Fast Search & Transfer acquisition and the Endeca cash injection from Intel and SAP. My alerts contain a large number of hits, and I realized that most of these are neither about “behind the firewall” search nor chock full of substantive information. Alerts are a necessary evil, but over the years, the primitive key word indexing offered by free services has not helped me.
The problem is the word search and its use or misuse. If you know of better examples to illustrate these types of search, please, post them. I’m interested in learning about sites and their search technology.
I have a so-so understanding of language drift, ambiguity, and POM (plain old marketing) work. For someone looking for information about search, the job is not getting easier. In fact, search has become such a devalued term that locating information about a particular type of search requires some effort. I’ve just finished compiling the Glossary for “Beyond Search”, due out in April 2008 from the Gilbane Group, a high-caliber outfit in the Boston, Massachusetts area. So, terminology is at the top of my mind this morning.
Let’s look at a few terms. These are not in alphabetical order. The order is by their annoyance factor. The head of the list contains the most annoying terms to me. The foot of the list holds terms that are less offensive to me. You may not agree. That’s okay.
Vertical search. Number one for 2008. Last year it was in second place. This term means that a particular topic or corpus has been indexed. The user of a vertical search engine like Sidestep.com sees only hits in the travel area. As Web search engines have done a better and better job of indexing horizontal content — that is, on almost every topic — vertical search engines narrow their focus. Think deep and narrow, not wide and shallow. As I have said elsewhere, vertical search is today’s 20-somethings rediscovering how commercial databases handled information in the late 1970s with success then but considerably less success today.
Search engine marketing. This is last year’s number one. Google and other Web engines are taking steps to make it harder to get junk sites to the top of a laundry list of results. The phrase search engine marketing is the buzzword for the entire industry of getting a site on the first page of Google results. The need to “rank high” has made some people “search gurus”. I must admit I don’t think too much of SEM, as it is called. I do a reasonable job of explaining SEM in terms of Google’s Webmaster guidelines. I believe that solid content is enough. If you match that with clean code, Web indexing bots will index the information. Today’s Web search systems do a good job of indexing, and there are value-added services such as Clusty.com that add metadata, whether the metadata exists on the indexed sites or not. When I see the term search used to mean SEM, I’m annoyed. Figuring out how to fool Google, Microsoft Live.com, or Yahoo’s indexing systems is not something that is of much interest to me. Much of the SEM experts’ guidance amounts to repeating Google’s Webmaster guidelines and fiddling with page elements until a site moves up in the rankings. Most sites lack substantive content and deserve to be at the bottom of the results list. Why do I want to have in my first page of results a bunch of links to sites without heft? I want links to pages significant enough to get to the top of a results list because of solid information, not SEM voodoo. For basics, check out “How Stuff Works.”
Guided, faceted, assisted, and discovery search. The idea that is difficult to express in words and phrases is a system that provides point-and-click access to related information. I’ve heard a variation on these concepts expressed as drill-down search or exploratory search. These are 21st-century buzzwords for “Use For” and “See Also” references. But by the time a vendor gets done explaining taxonomies, ontologies, and controlled term lists, the notion of search is mired in confusion. Don’t get me wrong. Rich metadata and exposed links to meaningful “See Also” and “Use For” information are important. I’m just burned out with companies using these terms when their technology can’t deliver.
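The point-and-click idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the result records, the field names (`topic`, `year`), and the helper functions are hypothetical and do not describe any vendor’s product. The system counts the metadata values in a result set, shows the counts as clickable links, and a click narrows the results.

```python
from collections import Counter

def facet_counts(results, field):
    """Count how many results carry each value of a metadata field."""
    return Counter(doc[field] for doc in results if field in doc)

def drill_down(results, field, value):
    """Keep only the results whose field matches the value the user clicked."""
    return [doc for doc in results if doc.get(field) == value]

# Hypothetical result set with already-assigned metadata.
RESULTS = [
    {"title": "Cheap flights to Boston", "topic": "travel", "year": 2008},
    {"title": "Boston hotel guide", "topic": "travel", "year": 2007},
    {"title": "Boston search vendors", "topic": "software", "year": 2008},
]

print(facet_counts(RESULTS, "topic"))
print([doc["title"] for doc in drill_down(RESULTS, "topic", "travel")])
```

The sketch assumes the metadata is already there. Producing that metadata reliably is the hard part, and that is where the taxonomies and controlled term lists come in.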
Enterprise search. I do not know what “enterprise search” is. I do know that there are organizations of all types. Some are government agencies. Some are non-profit organizations. Some are publicly-traded companies. Some are privately held companies. Some are professional services corporations. Some are limited liability corporations. Each has a need to locate electronic information. There is no one-size-fits-all content processing and retrieval system. I prefer the phrase “behind the firewall search.” It may not be perfect, but it makes clear that the system must function in a specific type of setting. Enterprise search has been overused, and it is now too fuzzy to be useful from my point of view. A related annoyance is the word “all”. Some vendors say they can index “all the organization’s information.” Baloney. Effective “behind the firewall” systems deliver information needed to answer questions, not run afoul of federal regulations regarding health care information, incite dissatisfaction by exposing employee salaries, or let out vital company secrets that should be kept under wraps.
Natural language search. This term means that the user can type a question into a system. A favorite query is, “What are the car dealerships in Palo Alto?” You can run this query on Google or Ask.com. The system takes this “natural language question”, converts it to Boolean, and displays the results. Some systems don’t do anything more than display a cached answer to a frequently asked question. The fact is that most users (exceptions include lawyers and expert intelligence operatives) don’t do “natural language queries”. Most users type some words like weather 40202 and hit the Enter key. NLP sounds great and is often used in the same sentence with latent semantic indexing, semantic search, and linguistic technology. These are useful technologies, but most users type their 2.3 words and take the first hit on the results list.
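Here is a rough sketch of the “convert it to Boolean” step, written in Python. It is an assumption about the simplest possible approach (drop stop words, join what is left with AND), not a description of how Google or Ask.com actually process a question. The stop-word list and the function name are hypothetical.

```python
# Hypothetical stop-word list; production systems use larger, tuned lists.
STOP_WORDS = {"what", "are", "the", "in", "is", "a", "an", "of", "to", "for"}

def to_boolean_query(question: str) -> str:
    """Strip punctuation and stop words, then join the remaining terms with AND."""
    words = [w.strip("?.,!\"'").lower() for w in question.split()]
    terms = [w for w in words if w and w not in STOP_WORDS]
    return " AND ".join(terms)

print(to_boolean_query("What are the car dealerships in Palo Alto?"))
# prints: car AND dealerships AND palo AND alto
print(to_boolean_query("weather 40202"))
# prints: weather AND 40202
```

Once the question words are stripped away, the “natural language” query and the two-word query end up in the same form.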
Semantic search. See natural language search. Semantic technologies are important and finally practical in everyday business operations. Used inside search systems, today’s fast processors and cheap storage make it possible to figure out some nuances in content and convert those nuances to metatags. It’s easier for vendors to bandy about the terms semantic and Semantic Web than to explain what the technology delivers in terms of precision and recall. There are serious semantic-centric vendors, and there are a great many who use the phrase because it helps make sales. An important vendor of semantic technology is Siderean Software. I profile others in “Beyond Search”.
Value-added search. This is a coinage that means roughly, “When our search system processes content, we find and index more stuff.” “Stuff”, obviously, is a technical word that can mean the file type or concepts and entities. A value-added search system tries to tag concepts and entities automatically. Humans used to do indexing but there is too much data and not enough skilled indexers. So, value-added search means “indexing like a human used to do.” Once a result set has been generated, value-added search systems will display related information; that is, “See Also” references. An example is Internet the Best. Judge for yourself if the technique is useful.
Side search. I like this phrase. It sounds nifty and means nothing to most people in a vendor’s marketing presentation. What I think the vendors who use this term mean is additional processes that run to generate “Use For” and “See Also” references. The implication is that the user gets a search bonus or extra sugar in their coffee. Some vendors have described a “more like this” function as a side search. The idea is that a user sees a relevant hit. When the user clicks the “more like this” hot link, the system uses the relevant hit as the basis of a new, presumably more precise, query. A side search to me means any automatic query launched without the user having to type in a search box. The user may have to click the mouse button, but the heavy lifting is machine-assisted. Delicious offers a side search labeled as related terms. Just choose a tag from the list on the right side of the Web page, and you see more hits like these. The idea is that you get related information without reentering a query.
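A “more like this” function can be sketched as follows. This is a minimal Python illustration, assuming the simplest similarity measure available: the share of words two documents have in common (Jaccard overlap). Commercial systems use more elaborate weighting. The document set and function names are hypothetical.

```python
def terms(text):
    """Split a document into a set of lowercase words."""
    return set(text.lower().split())

def more_like_this(hit_id, docs, top_n=2):
    """Use the clicked hit as the query and rank other documents by word overlap."""
    seed = terms(docs[hit_id])
    scored = []
    for doc_id, text in docs.items():
        if doc_id == hit_id:
            continue
        other = terms(text)
        score = len(seed & other) / len(seed | other)  # Jaccard overlap
        scored.append((score, doc_id))
    # Highest score first; ties broken by document id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:top_n] if score > 0]

# Hypothetical collection keyed by document id.
DOCS = {
    "a": "cheap flights to boston",
    "b": "cheap hotels in boston",
    "c": "flights to chicago",
    "d": "parametric search engines",
}

print(more_like_this("a", DOCS))
# prints: ['c', 'b']
```

The user types nothing. The click supplies the document, and the document supplies the query.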
Sentiment search. I have just looked at a new search system called Circos. This system lets me search in “color”. The idea is that emotions or feelings can be located. People want systems that provide a way to work emotion, judgment, and nuance into their results. Lexalytics, for example, offers a useful, commercial system that can provide brand managers with data about whether customers are positive or negative toward the brand. Google, based on their engineering papers, appears to be nosing around in this sentiment search as well. Worth monitoring because using algorithms to figure out if users like or dislike a person, place, or thing can be quite significant to analysts.
Visual search. I don’t know what this means. I have seen the term used to describe systems that allow the user to click on pictures in order to see other pictures that share some colors or shapes of the source picture. If you haven’t seen Kartoo, it’s worth a look. Inxight Software offers a “search wall”. This is a graphic representation of the information in a results list or a collection as a three-dimensional brick wall. Each brick is a content object. I liked the idea when I first saw it five or six years ago, but I find visual search functionality clunky. Flying hyperbolic maps and other graphic renderings have sizzle, but instead of steak I get boiled tofu.
Parametric search. Structured search or SQL queries with training wheels are loose synonyms for parametric search and close enough for horse shoes. The term parametric search has value, but it is losing ground to structured search. Today, structured data are fuzzed with unstructured data by vendors who say, “Our system supports unstructured information and structured data.” Structured and unstructured data are treated as twins, thus making it hard for a prospect to understand what processes are needed to achieve this delightful state. These data can then be queried by assisted, guided, or faceted search. Some of the newer search systems are, at their core, parametric systems. These systems are not positioned in this way. Marketers find that customers don’t want to be troubled by “what’s under the hood.” So, “fields” become metatags, and other smoothing takes place. It is no surprise to me that content processing procurement teams struggle to figure out what a vendor’s system actually does. Check out Thunderstone’s offering and look for my Web log post about parametric (structured) search in a day or two. In Beyond Search, I profile two vendors’ systems, each with different but interesting parametric search functionality. Either of these two vendors’ solutions can help you deal with the structured – unstructured dichotomy. You will have to wait until April 2008 when my new study comes out. I’m not letting these two rabbits out of my hat yet.
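“SQL queries with training wheels” can be taken almost literally. The Python sketch below is an illustration under stated assumptions: the table, the field names, and the `parametric_search` helper are hypothetical, and SQLite stands in for whatever engine a vendor actually uses. The user fills in field and value pairs, and the system builds the structured query.

```python
import sqlite3

def parametric_search(conn, **params):
    """Turn field=value pairs into a parameterized SQL WHERE clause."""
    allowed = {"author", "year", "doc_type"}  # hypothetical searchable fields
    unknown = set(params) - allowed
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    # With no parameters, match everything.
    where = " AND ".join(f"{field} = ?" for field in params) or "1 = 1"
    sql = f"SELECT title FROM documents WHERE {where} ORDER BY title"
    return [row[0] for row in conn.execute(sql, tuple(params.values()))]

# Small in-memory table of hypothetical documents.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (title TEXT, author TEXT, year INTEGER, doc_type TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        ("Budget memo", "lee", 2007, "memo"),
        ("Search audit", "lee", 2008, "report"),
        ("Vendor review", "kim", 2008, "report"),
    ],
)

print(parametric_search(conn, year=2008, doc_type="report"))
# prints: ['Search audit', 'Vendor review']
print(parametric_search(conn, author="lee", year=2008))
# prints: ['Search audit']
```

Rename the “fields” as metatags and put a form on top, and the user never sees the SQL.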
Unstructured search. This usually implies running a query against text that has been indexed for its key words because the source lacks “tags” or “field names”. Email, PDFs, and some Word documents are unstructured. A number of content processing systems can also index bound phrases like “stock market” and “white house”. Others include some obvious access points such as file types. Today, unstructured search blends into other categories. But unstructured search has less perceived value than flashier types of search or a back office ERP (enterprise resource planning) application. Navigate to ArnoldIT.com and run a query in my site’s search box. That’s an unstructured search, provided by Blossom Software, which is quite interesting to me.
Hyperbolic search. There are many variations of this approach, which is called “buzzword fog”. Hyperbolic geometry and modular forms play an important role in some vendors’ systems. But these functions are locked away out of sight and fiddling by licensees. When you hear terms other than plain English, you are in the presence of “fog rolling in on little cat’s feet.” The difference is that this fog doesn’t move on. You are stuck in an almost-impenetrable mist. When you see the collision coming, it is almost always too late to avoid it. I think the phrase means, “Our engineers use stuff I don’t understand, but it sure sounds good.”
Intuitive search. This is a term used to suggest that the interface is easy enough for the marketer’s mother to use without someone telling her what to do. The interface is one visible piece of the search system itself. Humans like to look at interfaces and debate which color or icon is better for their users. Don’t guess on interfaces. Test different ones and use what gets the most clicks. Interfaces that generate more usage are generally better than interfaces designed by the senior vice president’s daughter who just graduated with an MFA from the University of Iowa. Design opinion is not search; it’s technology decoration. For an example, look at this interface from Yahoo. Is it intuitive to you?
Real-time search. This term means that the content is updated frequently enough to be perceived as real time. It’s not. There is latency in search systems. The word “search,” therefore, doesn’t mean real-time by definition. Feed means “near real time”. There are a lot of tricks to create the impression of real time. These include multiple indexes, caching, content boosting, and time stamp fiddling. Check out ZapTXT. Next compare Yahoo News, AllTheWeb.com news, and Google News. Okay, which is “real time”? Answer: none.
Audio, video, image search. The idea is that a vendor indexes a particular type of non-text content. The techniques range from indexing only metadata and not the information in the binary file to converting speech to ASCII, then indexing the ASCII. In Japan, I saw a demonstration of a system that allowed a user to identify a particular image, for example, a cow. The system then showed pictures the system thought contained cows. These types of search systems address a real need today. The majority of digital content is in the form of digitized audio, video, and image files. Text is small potatoes. We don’t do a great job on text. We don’t do very well at all on content objects such as audio, video, and images. I think Blinkx does a reasonably good job, not great, reasonable.
Local search. This is a variation on vertical search. Information about a city or particular geographic area is indexed and made available. This is Yellow Pages territory. It is the domain of local newspaper advertising. A number of vendors want to dominate this sector; for example, Google, Microsoft, and Yahoo. Incumbents like telcos and commercial directory firms aren’t sure what actions to take as online sites nibble away at what was a $32 billion paper directory business. Look at Ask City. Will this make sense to your children?
Intelligent search. This is the old “FOAI” or familiar old artificial intelligence. Most vendors use artificial intelligence but call it machine learning or computational intelligence. Every major search engine uses computational intelligence. Try Microsoft’s Live.com. Now try Google’s “ig” or Individualized Google service. Which is relying more on machine learning?
Key word search. This is the ubiquitous, “naked” search box. You can use Boolean operators, or you can enter free text and perform a free text search. Free text search means no explicit Boolean operators are required of a user. Enlightened search system vendors add an AND to narrow the result set. Other system vendors, rather unhelpfully, add an OR, which increases the number of results. Take a look at the key word search from Ixquick, an engine developed by a New York City investment banker and now owned by a European company. What’s it doing to your free text query?
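The difference between an implied AND and an implied OR is easy to see with a toy inverted index. The Python below is a minimal sketch, assuming whitespace tokenization and a three-document collection invented for illustration. It does not describe how Ixquick or any other engine treats a free text query.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query, operator="AND"):
    """Run a free text query with an implied AND (narrow) or OR (broad)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)

# Hypothetical three-document collection.
DOCS = {
    1: "enterprise search behind the firewall",
    2: "search engine marketing",
    3: "enterprise resource planning",
}
INDEX = build_index(DOCS)

print(sorted(search(INDEX, "enterprise search", "AND")))
# prints: [1]
print(sorted(search(INDEX, "enterprise search", "OR")))
# prints: [1, 2, 3]
```

The same two words return one document or the whole collection, depending on an operator the user never sees.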
Search without search. Believe me, this is where the action is. The idea is that a vendor (Google, for example) will use information about information, user behavior, system processes, and other bits and pieces of data to run queries for a user automatically and in the background. Then when the user glances at his or her mobile device, the system is already displaying the information most likely to be wanted at that point in time by that user. An easy way to think of this is to imagine yourself rushing to the airport. The Google approach would look at your geospatial coordinates, check your search history, and display flight departure delays or parking lot status. I want this service because anyone who has ridden with me knows that I can’t drive, think about parking, and locate my airline reliably. I can’t read the keyboard on my mobile phone, so I want Google to convert the search result to text, call me, and speak the information as I try to make my flight. Google has a patent application with the phrase “I’m feeling doubly lucky.” Stay tuned to Google and its competitors for more information on this type of search.
This short list of different types of search helps explain why there is confusion about which systems do what. Search is no longer something performed by a person trained in computer science, information science, or a similar discipline. Search is something everyone knows, right? Wrong. Search is a service that’s readily available and used by millions of people each day. Don’t confuse using an automatic teller machine with understanding finance. The same applies to search. Just because a person can locate information about a subject does not mean that person understands search.
Search is among the most complex problems in computer science, cognitive psychology, information retrieval, and many other disciplines. Search is many things, but it definitely is not easy, well understood, or widely recognized as the next application platform.
Stephen Arnold, January 30, 2008
Comments
7 Responses to “Search: The Problem with Words and Their Misuse”
This is very useful too: http://www.PolyCola.com Search Engine: Google, Yahoo, Live, Ask, AOL, Dogpile, Altavista…
As the search engines have become not only the source for finding useful information on the World Wide Web but also a medium of advertisement for publishers, they are not able to provide the accurate use of words. In the coming time the search engines will definitely play major roles in the growth of networks and also overcome the problems of the present.
This article is very informative and you are well-informed about modern search techniques. The links are first rate too. Well done.
The insight you expressed about the use and misuse of the term is telling of the crux of the problem. I think a lot of developers, like Circos, for example, have cleverly designed interfaces that are attractive, functional and appealing. That on top of the choice of sources they are indexing is interesting and obviously marketable. Still, I think you would agree, the indexing is largely by keyword and the search techniques are matching strings in the index. This is the predominant search technique; stemming is used to help the coverage. Even information-theoretic search techniques boil down to a bag of words, so all of them suffer from the same malady of superficiality.
You must have heard of the independence assumption in information science. This is the cause of search affliction, in my opinion. And this is the source of the disconnect between what people are searching for and what search engines find for them.
If this continues to be the case, computers will not become intelligent allies in weeding through the stagnating pile of useless trivia. If the search is to be so robotic as to be useless, I wouldn’t want my computer calling me with stupid messages it guesses are interesting to me.
Also of interest to you, Stephen, might be another perspective on semantic search, one that is not tied to NLP. This approach is unique as it maps text using the semantics of interpersonal relationships. Take a look at my blog for more on that.
Finally, I think search is widely recognized as the next application platform. Google’s present portfolio and business and Microsoft’s and IBM’s obvious investments in their search products show that big business thinks that way. I think Google secretly enjoys that it is not easy. The big players have shown that you do not need to understand search to make money in advertising; you just need to field the product and be there for the advertisers. Faster is better; it’s a zipity-zip world. They all want you to buy right now, make your choice instantly.
I don’t see it changing soon. Critical thinking takes time. Understanding search takes time. Understanding your choices from a search engine takes time and exegesis. Hardly anyone wants to take the time for that.
-Ken Ewell
Beyond Search and Search…
As many of you know from our press release at Gilbane Boston, two of the reports we will be publishing in the next few months have to do with search. Lynda Moulton, who runs our Enterprise Search consulting practice……
[…] Original post by Stephen E. Arnold and posted by Alfred Moya […]
Search Behind the Firewall aka Enterprise Search…
A search engine does not need to be exclusive of all other search engines, nor must it be deployed to crawl and index every single repository in its path to be referred to as enterprise search. There are good and justifiable reasons to leave select rep…
I think it is possible to do “real” real-time search; the question is whether it will be sufficiently valuable versus near-real-time search.