Study Spam: Caveat Emptor

January 31, 2008

My interest in Google is evident from my speeches, articles, and studies about the company. Not surprisingly, I receive a number of emails, telephone calls, and snail mail solicitations. Most of these are benign. I respond to the email, take the call, and toss the snail mail junk. Most people are hoping that I will review their work, maybe engage in some Google gossip, or try to sell me their studies or expertise. No problem 99 percent of the time. I post my telephone number and email (I know it’s not a great idea), but most people are courteous to me and respectful of my requests.

Recently I have been getting email solicitations from an entity known as Justyna Drozdzal from Visiongain Intelligence. The email address is justyna.drozdzal at Maybe you will have better luck interacting with this outfit? I certainly haven’t had any success getting Ms. Drozdzal to stop sending me email solicitations about a “new” study about Google’s Android. The study’s come-hither title is Google’s Android and Mobile Linux Report 2008: A Google-Led Initiative to Reshape the Mobile Market Environment. I didn’t think Google “led” much of anything, but that’s my Kentucky silliness surfacing. I accept Google’s “controlled chaos” theory of building its business. (This is the subject of my April 2008 column in KMWorld, where some of my “official” work appears.)

The Visiongain study is about, according to the information available to me, Google’s Android — sort of. I did review the table of contents, and I drew three conclusions, but I urge you to make your own decision, not accept the opinion of a geek in rural Kentucky with the squirrels and horses. Here’s what I decided:

First, the report seems to recycle information available on Google’s own Web site and from other readily available open sources. Summaries are useful. I do them myself, but my summaries have not been able to do much to clear the shroud of secrecy that surrounds Google’s intentions. When it comes to telco, I think Google has “won”. The company has destabilized traditional telecom operations in the US, floated this open platform notion, and managed to round up a couple dozen partners. I’m reasonably confident that Google knows exactly what it will do — be opportunistic. That’s because Google reacts to clicks and follows the data. A company with Google’s Gen-X fluidity defies conventional wisdom and makes predictions about Google wobbly. Unlike me, Visiongain has figured out Google, Android, and telecommunications. Impressive.

Second, the study appears to include chunks of other Visiongain reports, such as the material on the use of Linux on mobile devices. Again, my spot checks suggested that most of the information is available with routine queries passed to metasearch engines like Dogpile or consumer-oriented Web search services such as Microsoft or Yahoo. Maybe I’m jaded or plain wrong, but “experts” and “high-end consultancies” with names that convey so much more meaning than their reports deliver have been quick to exploit Google mania. Is Visiongain guilty of this? I’m not sure.

Third, the study seems to suggest that Visiongain is not baffled by Google’s telecommunications strategy. Is Dell making a Google phone? Is Apple the next Google target? Will Google build a telecommunications company? I was unable to see through Google’s and the Web’s Sturm und Drang. My conclusion: remove my telecommunications analysis from my study Google Version 2.0.
Now, back to Visiongain. My thought is that spamming me to buy a report in an area in which I have conducted research for several years is silly. When I asked the company to stop, I received another missive from Ms. Drozdzal asking me to buy the study. That was even sillier. My personal opinion is that this particular Visiongain report seems to be more fool’s gold than real gold.

To cite one example, there is chatter circulating that Google and Dell have teamed to produce a Google phone. Not surprisingly, neither Dell Computer nor Google is talking. Similarly, there are rumors that Google continues to bid for spectrum, and there are rumors that Google is not serious. Until the actual trajectory of these Google mobile activities is made clear by Google itself, I think spending money on reports from companies purporting to know how Google will “reshape the mobile market environment” is not for me. You make your own decision about how to spend your money.

Stephen Arnold, January 31, 2008

Autonomy: Right but for the Wrong Reasons

January 30, 2008

I try to turn a blind eye to the PR chaff wafted by software companies. An Ovum story caught my eye on January 30, 2008, as I waited for another dose of red-eye airline abuse. As I sat in Seattle’s airport, I started to think about the Ovum article “Sub-Prime Will Provide Two-Year Boost for Autonomy.” I realized that Autonomy would have a strong 2008, but I did not agree that the “sub-prime issue” would be the driver of Autonomy’s juicy 2008 revenue.

As I shivered in the empty departure hall, I replayed in my aging mind some of the conversations I had in the previous 36 hours. Let me be clear: I think Ovum does good work. As a former consultant at a blue-chip firm, I do understand the currents and eddies of maintaining client relationships, seeming “smart” on crucial issues, and surfing on big, intellectual waves.

I think Autonomy is a very good information platform. But as readers of this Web log know, I think search and content processing is a tough business, so everyone can and should improve. So, I’m okay with Autonomy IDOL and its various moving parts.

Autonomy’s Reasoning for a Great 2008

What I want to do is quote a snippet of the Ovum essay. Please, read the original. I cannot do justice to the Ovum wordsmithing. Here’s a portion that I found interesting, almost like a melody that keeps rattling around my mind when I’m trying to relax:

“After declaring record earnings for the quarter and the year, Mike Lynch CEO of Cambridge, UK-based Autonomy said in addition to a good pipeline for 2008 he is expecting a positive bounce resulting from the US sub-prime debacle through banks adopting Autonomy’s technology.”

Then a few sentences further on in the Ovum essay, I read:

“When describing the year ahead Lynch highlighted that the fall-out of the sub-prime issue in the US is that banks were having to secure and analyse very large amounts of disparate information in a very short timescale, exactly where Autonomy positions its Meaning Based Computing message, and reflected in the recently announced $70m deal … with a global bank. To illustrate how significant the demand being seen by Autonomy was, [Sir Michael] Lynch stated that the cycle time for that deal was two months, and that the company was in the first instance moving people across to support the sales and deployment activity, and it would be supporting its partners to undertake the work in the near future.”

I think I follow this line of reasoning, but it doesn’t strike to the heart of what may well be Autonomy’s best revenue-generating year in the firm’s history. Let me tell you what I think I heard, and you judge what’s more important: financial crises or the information you are about to read.

Alternate Reasoning for a Great 2008

In my chit-chats — all off the record and without attribution — in Seattle I picked up two pieces of what may be reliable information. I urge you to verify my information before concluding that I know what I am talking about. I invite you to provide corrections, additions, or emendations to these points. I had not heard these two ideas expressed before, and I find them thought provoking. I find looking at events from different viewpoints helpful. Here are the two pieces of information. Proceed at your own risk.

First point: the Microsoft deal was done by the firm’s Office group’s senior leadership. That unit concluded that it needed to take a bold step. The Fast Search & Transfer acquisition made sense because it delivered revenue (~$150 to $200 million), 2,000 customers, lots of technology, and smart people. The deal was pushed forward in the period between Thanksgiving and the New Year. When the news broke, some inside Microsoft were surprised. I thought I heard something along the lines of: “What? We own Fast Search & Transfer?”

Second point: When other units of Microsoft started pooling their knowledge of Fast Search & Transfer, there was concern that the guts of Fast Search did not share Microsoft’s DNA or the Microsoft “agenda”. Fast Search has a SharePoint adaptor, but the rest of the technology looked like Amazon, Google, or Yahoo “stuff”. I thought I heard: “That is going to be an interesting integration chore for the Office group.”

When I hear the word interesting, my ears quiver just like my dog Tyson’s when he hears me open the treat jar. Interesting can mean good things or bad things, but rarely dull things. I have some experience with Microsoft frameworks and some with Fast Search’s ESP (Enterprise Search Platform). Integrating these two frameworks is not something I could do. I’m too old and slug-like for super-wizardry.

Back to 2008

How do these two unsubstantiated pieces of information relate to Autonomy?

What I think is that Autonomy will win in those head-to-head competitions where Autonomy must sell against Fast Search & Transfer (maybe Micro-Fast?). I think that in large account face-offs, procurement teams will not know how the merger will play out. Uncertainty about the future may tip the scales in favor of Autonomy. The company is stable, not at this moment being acquired, and has oodles of mostly happy customers. It has name recognition. The Microsoft – Fast team can only offer assurances that everything will be okay. In my view, Autonomy can go beyond okay and will, therefore, win most deals.

But many search procurements involve three or more vendors. In those situations, Autonomy will win some and lose some. So its win rate won’t be much different from what it was in 2007, which, as I understand the financial reports, continued to nose upwards. With or without the Microsoft – Fast deal, Autonomy has been charging forward.
Financial factors seem tangentially significant, but Autonomy’s golden-goose 2008 is going to be attributable to the Microsoft – Fast merger.

The Microsoft – Fast Options (Hypothetical, Speculative, Thought Exercise)

Let’s assume that I am right and consider as a thought experiment what Microsoft can do to win sales from Autonomy and keep Fast Search’s revenues on the trail to financial health. Here are three possible scenarios that seem insightful to me in the Seattle airport at 12:15 am Pacific time. I invite you to weigh in. Attorneys, consultants, and share churners can chime in too. This is an essay, an attempt, not much more than my opinion based on the aforementioned, unsubstantiated chit-chat. Okay? Now the options to consider:

  1. Microsoft – Fast leaves the two platforms separate. Microsoft provides management expertise, leadership, and marketing horsepower and goes flat out to wrest business from Autonomy in its key accounts. Microsoft ignores Autonomy’s response (price cutting, PR chaff, etc.) and uses Microsoft billions to neuter IDOL, Virage, and any other search technology Autonomy offers.
  2. Microsoft – Fast hunkers down, integrates the two platforms. Using its reseller and Certified Partner network, Microsoft – Fast combines a better search solution plus slick technology plus high-powered marketing. Although Microsoft concedes some battles, when the company hits the street with its offering, it executes a Netscape-type (think free or really low license fees) strategy and becomes the dominant player in the behind-the-firewall search market.
  3. Microsoft – Fast does the integration work well. Instead of fighting Autonomy, Microsoft uses a variation of its customer relationship management strategy. Free trials and low cost introductory rates are used to get existing Microsoft-centric customers to remain loyal to Microsoft. Applied globally for a year or more, Autonomy may be slowly deprived of oxygen. Autonomy loses its agility, becomes weaker, and fades into a distant second place behind the Microsoft super-platform.

As I think about these hypothetical scenarios, I see a win for Microsoft in any of these paths. If the company were to mix and match strategies — for example, all-out assault and long-term oxygen deprivation — Autonomy would have its cash reserves depleted. The end game, of course, is that another super-platform steps forward to acquire Autonomy. Then two super-platforms would fight for the behind-the-firewall search customers.

Who are candidate super-platforms at a time when the US economy is teetering toward a recession? Here’s my shortlist, and may I ask, “Who are your candidates in this high-stakes poker game?”

  • Oracle. This company already inked a deal with Google to hawk the Google Search Appliance. Oracle bought Triple Hop, but so far has not been able to leverage that technology. Mr. Ellison is a buyer, and I hear that Oracle has looked closely at Autonomy in the past.
  • SAP. This company has the TREX search system. A shiny new search system is on the horizon. Buying Autonomy brings several thousand customers, revenue, and engineers. Microsoft tried to buy SAP, and SAP fought back. Maybe this is the next step for SAP?
  • An investment bank — maybe Carlyle Group, an outstanding outfit. Carlyle could work a deal to convert Autonomy into several companies and start selling various units off to the highest bidder. There’s real money in buyouts and breakups.
  • IBM. IBM has more search solutions than any other vendor I track. Buying Autonomy brings customers and revenue. IBM then implements one of the Microsoft options and goes after Microsoft. IBM still remembers the great business relationship IBM enjoyed with Microsoft in the DOS and OS/2 era.

Note that none of these hypotheticals is greatly influenced by the sub-prime tempest. The stakes are now sufficiently high in behind-the-firewall search to make secondary forces — well — secondary. I don’t want to disagree with Ovum, but I think my analysis may add some useful “color”, as the financial analysts like to say, to their look at Autonomy.

Stephen Arnold, January 31, 2008

Search: The Problem with Words and Their Misuse

January 30, 2008

I rely on several different types of alerts, including Yahoo’s service, to keep pace with developments in what I call “behind the firewall search”.

Today was particularly frustrating because the number of matches for the word “search” has been increasing, particularly since the Microsoft – Fast Search & Transfer acquisition and the Endeca cash injection from Intel and SAP. My alerts contain a large number of hits, and I realized that most of these are not about “behind the firewall” search, nor chock full of substantive information. Alerts are a necessary evil, but over the years, the primitive key word indexing offered by free services hasn’t helped me.

The problem is the word search and its use or misuse. If you know of better examples to illustrate these types of search, please, post them. I’m interested in learning about sites and their search technology.

I have a so-so understanding of language drift, ambiguity, and POM (plain old marketing) work. For someone looking for information about search, the job is not getting easier. In fact, search has become such a devalued term that locating information about a particular type of search requires some effort. I’ve just finished compiling the Glossary for “Beyond Search”, due out in April 2008 from the Gilbane Group, a high-caliber outfit in the Boston, Massachusetts area. So, terminology is at the top of my mind this morning.

Let’s look at a few terms. These are not in alphabetical order. The order is by their annoyance factor. The head of the list contains the terms most annoying to me. The foot of the list contains terms that are less offensive to me. You may not agree. That’s okay.

Vertical search. Number one for 2008. Last year it was in second place. This term means that a particular topic or corpus has been indexed. The user of a travel-focused vertical search engine sees only hits in the travel area. As Web search engines have done a better and better job of indexing horizontal content — that is, on almost every topic — vertical search engines narrow their focus. Think deep and narrow, not wide and shallow. As I have said elsewhere, vertical search is today’s 20-somethings rediscovering how commercial databases handled information in the late 1970s, with success then but considerably less success today.

Search engine marketing. This is last year’s number one. Google and other Web engines are taking steps to make it harder to get junk sites to the top of a laundry list of results. The phrase search engine marketing is the buzzword for the entire industry of getting a site on the first page of Google results. The need to “rank high” has made some people “search gurus”. I must admit I don’t think too much of SEM, as it is called. I do a reasonable job of explaining SEM in terms of Google’s Webmaster guidelines. I believe that solid content is enough. If you match that with clean code, Web indexing bots will index the information. Today’s Web search systems do a good job of indexing, and there are value-added services that add metadata, whether the metadata exists on the indexed sites or not. When I see the term search used to mean SEM, I’m annoyed. Figuring out how to fool Google, Microsoft, or Yahoo’s indexing systems is not something that is of much interest to me. Much of the SEM experts’ guidance amounts to repeating Google’s Web master guidelines and fiddling with page elements until a site moves up in the rankings. Most sites lack substantive content and deserve to be at the bottom of the results list. Why do I want to have in my first page of results a bunch of links to sites without heft? I want links to pages significant enough to get to the top of the results list because of solid information, not SEM voodoo. For basics, check out “How Stuff Works.”

Guided, faceted, assisted, and discovery search. The idea, difficult to express in a word or phrase, is a system that provides point-and-click access to related information. I’ve heard a variation on these concepts expressed as drill-down search or exploratory search. These are 21st-century buzzwords for “Use For” and “See Also” references. But by the time a vendor gets done explaining taxonomies, ontologies, and controlled term lists, the notion of search is mired in confusion. Don’t get me wrong. Rich metadata and exposed links to meaningful “See Also” and “Use For” information are important. I’m just burned out with companies using these terms when their technology can’t deliver.

Enterprise search. I do not know what “enterprise search” is. I do know that there are organizations of all types. Some are government agencies. Some are non-profit organizations. Some are publicly-traded companies. Some are privately held companies. Some are professional services corporations. Some are limited liability corporations. Each has a need to locate electronic information. There is no one-size-fits-all content processing and retrieval system. I prefer the phrase “behind the firewall search.” It may not be perfect, but it makes clear that the system must function in a specific type of setting. Enterprise search has been overused, and it is now too fuzzy to be useful from my point of view. A related annoyance is the word “all”. Some vendors say they can index “all the organization’s information.” Baloney. Effective “behind the firewall” systems deliver information needed to answer questions, not run afoul of federal regulations regarding health care information, incite dissatisfaction by exposing employee salaries, or let out vital company secrets that should be kept under wraps.

Natural language search. This term means that the user can type a question into a system. A favorite query is, “What are the car dealerships in Palo Alto?” You can run this query on Google. The system takes this “natural language question”, converts it to Boolean, and displays the results. Some systems don’t do anything more than display a cached answer to a frequently asked question. The fact is that most users (exceptions include lawyers and expert intelligence operatives) don’t do “natural language queries”. Most users type some words like weather 40202 and hit the Enter key. NLP sounds great and is often used in the same sentence with latent semantic indexing, semantic search, and linguistic technology. These are useful technologies, but most users type their 2.3 words and take the first hit on the results list.
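To make that conversion step concrete, here is a toy sketch of my own devising, not any vendor’s actual pipeline: strip the question words, then glue the surviving terms together with AND. The stop word list and function name are invented for illustration.

```python
# A toy illustration of reducing a "natural language" question to a
# Boolean query: drop the question scaffolding, AND the rest together.

STOPWORDS = {"what", "are", "the", "in", "is", "a", "an", "of", "who", "where"}

def to_boolean(question):
    """Convert a natural language question to a naive Boolean AND query."""
    terms = [w.strip("?.,").lower() for w in question.split()]
    kept = [t for t in terms if t and t not in STOPWORDS]
    return " AND ".join(kept)

print(to_boolean("What are the car dealerships in Palo Alto?"))
# car AND dealerships AND palo AND alto
```

Real systems do far more (entity recognition, query classification, cached answers), but the skeleton is this humble.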

Semantic search. See natural language search. Semantic technologies are important and finally practical in everyday business operations. Used inside search systems, today’s fast processors and cheap storage make it possible to figure out some nuances in content and convert those nuances to metatags. It’s easier for vendors to bandy about the terms semantic and Semantic Web than to explain what the technology delivers in terms of precision and recall. There are serious semantic-centric vendors, and there are a great many who use the phrase because it helps make sales. An important vendor of semantic technology is Siderean Software. I profile others in “Beyond Search”.

Value-added search. This is a coinage that means roughly, “When our search system processes content, we find and index more stuff.” “Stuff”, obviously, is a technical word that can mean the file type or concepts and entities. A value-added search system tries to tag concepts and entities automatically. Humans used to do indexing, but there is too much data and not enough skilled indexers. So, value-added search means “indexing like a human used to do.” Once a result set has been generated, value-added search systems will display related information; that is, “See Also” references. An example is Internet the Best. Judge for yourself if the technique is useful.
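What “tagging concepts and entities automatically” looks like at its crudest can be sketched with a few hand-written patterns. This is my own illustration; real value-added systems use trained models and big gazetteers, and the labels and sample sentence here are invented.

```python
# A crude sketch of "value-added" processing: tag entities in raw text
# with hand-written patterns, the poor cousin of real entity extraction.
import re

PATTERNS = {
    "MONEY":   re.compile(r"\$\d+(?:\.\d+)?\s*(?:million|billion)?"),
    "COMPANY": re.compile(r"\b(?:Google|Microsoft|Yahoo|Autonomy)\b"),
}

def tag_entities(text):
    """Return (label, matched_text) pairs found in the text."""
    tags = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            tags.append((label, match.group().strip()))
    return tags

sample = "Autonomy signed a $70 million deal while Google watched."
print(tag_entities(sample))
# [('MONEY', '$70 million'), ('COMPANY', 'Autonomy'), ('COMPANY', 'Google')]
```

The tags become the extra metadata that drives those “See Also” references.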

Side search. I like this phrase. It sounds nifty and means nothing to most people in a vendor’s marketing presentation. What I think the vendors who use this term mean is additional processes that run to generate “Use For” and “See Also” references. The implication is that the user gets a search bonus or extra sugar in their coffee. Some vendors have described a “more like this” function as a side search. The idea is that a user sees a relevant hit. By clicking the “more like this” hot link, the system uses the relevant hit as the basis of a new, presumably more precise, query. A side search to me means any automatic query launched without the user having to type in a search box. The user may have to click the mouse button, but the heavy lifting is machine-assisted. Delicious offers a side search labeled as related terms. Just choose a tag from the list on the right side of the Web page, and you see more hits like these. The idea is that you get related information without reentering a query.
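One simple way a “more like this” side search can work is to pull the most frequent content words out of the relevant hit and launch them as a fresh query, no typing required. This is my own hypothetical sketch; the stop word list and sample document are invented.

```python
# A sketch of a "more like this" side search: the top content words of
# a relevant hit become an automatic follow-on query.
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}

def more_like_this(document, top_n=3):
    """Pick the top content words of a hit for an automatic new query."""
    words = [w.strip(".,").lower() for w in document.split()]
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_n)]

hit = ("Android is a mobile platform. The Android platform targets mobile "
       "handsets, and the platform is open.")
print(more_like_this(hit))
```

The returned terms would then be fed straight back into the engine as a new, presumably tighter, query.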

Sentiment search. I have just looked at a new search system called Circos. This system lets me search in “color”. The idea is that emotions or feelings can be located. People want systems that provide a way to work emotion, judgment, and nuance into their results. Lexalytics, for example, offers a useful, commercial system that can provide brand managers with data about whether customers are positive or negative toward the brand. Google, based on its engineering papers, appears to be nosing around in sentiment search as well. This area is worth monitoring because using algorithms to figure out if users like or dislike a person, place, or thing can be quite significant to analysts.

Visual search. I don’t know what this means. I have seen the term used to describe systems that allow the user to click on pictures in order to see other pictures that share some colors or shapes of the source picture. If you haven’t seen Kartoo, it’s worth a look. Inxight Software offers a “search wall”. This is a graphic representation of the information in a results list or a collection as a three-dimensional brick wall. Each brick is a content object. I liked the idea when I first saw it five or six years ago, but I find visual search functionality clunky. Flying hyperbolic maps and other graphic renderings have sizzle, but instead of steak I get boiled tofu.

Parametric search. Structured search or SQL queries with training wheels are loose synonyms for parametric search, and close enough for horseshoes. The term parametric search has value, but it is losing ground to structured search. Today, structured data are fuzzed with unstructured data by vendors who say, “Our system supports unstructured information and structured data.” Structured and unstructured data are treated as twins, making it hard for a prospect to understand what processes are needed to achieve this delightful state. These data can then be queried by assisted, guided, or faceted search. Some of the newer search systems are, at their core, parametric systems. These systems are not positioned in this way. Marketers find that customers don’t want to be troubled by “what’s under the hood.” So, “fields” become metatags, and other smoothing takes place. It is no surprise to me that content processing procurement teams struggle to figure out what a vendor’s system actually does. Check out Thunderstone‘s offering and look for my Web log post about parametric (structured) search in a day or two. In Beyond Search, I profile two vendors’ systems, each with different but interesting parametric search functionality. Either of these two vendors’ solutions can help you deal with the structured – unstructured dichotomy. You will have to wait until April 2008 when my new study comes out. I’m not letting these two rabbits out of my hat yet.
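Stripped of marketing, parametric search is a keyword match combined with constraints on fields (parameters). A minimal sketch, with records and field names invented for illustration:

```python
# A minimal sketch of parametric (fielded) search: keyword match plus
# optional constraints on structured fields.

records = [
    {"title": "Enterprise search overview", "year": 2007, "price": 450},
    {"title": "Mobile search report",       "year": 2008, "price": 900},
    {"title": "Search market sizing",       "year": 2008, "price": 300},
]

def parametric_search(keyword, year=None, max_price=None):
    """Return titles matching a keyword plus any field constraints given."""
    hits = []
    for r in records:
        if keyword.lower() not in r["title"].lower():
            continue                      # keyword filter
        if year is not None and r["year"] != year:
            continue                      # field constraint: year
        if max_price is not None and r["price"] > max_price:
            continue                      # field constraint: price ceiling
        hits.append(r["title"])
    return hits

print(parametric_search("search", year=2008, max_price=500))
# ['Search market sizing']
```

Relabel the fields as metatags and wrap the constraints in a point-and-click interface, and you have what many vendors market as faceted search.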

Unstructured search. This usually implies running a query against text that has been indexed for its key words because the source lacks “tags” or “field names”. Email, PDFs, and some Word documents are unstructured. A number of content processing systems can also index bound phrases like “stock market” and “white house”. Others include some obvious access points such as file types. Today, unstructured search blends into other categories. But unstructured search has less perceived value than flashier types of search or a back office ERP (enterprise resource planning) application. Run a query in my site’s search box. That’s an unstructured search, provided by Blossom Software, which is quite interesting to me.
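The bones of “indexing text for its key words” fit in a few lines: build an inverted index from term to document ids, then intersect the posting lists to answer a query. My own sketch, with invented documents:

```python
# A bare-bones inverted index: map each term to the set of documents
# containing it, then intersect posting lists to answer a query.
from collections import defaultdict

docs = {
    1: "memo about the stock market outlook",
    2: "email on white house press schedule",
    3: "stock market commentary in an email",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Every query term must occur in the document (implicit AND)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search("stock", "market"))   # [1, 3]
print(search("email"))             # [2, 3]
```

Everything else in a commercial system (phrase indexing, relevance ranking, file type filters) is elaboration on this core.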

Hyperbolic search. There are many variations of this approach, which I call “buzzword fog”. Hyperbolic geometry and modular forms play an important role in some vendors’ systems. But these functions are locked away, out of sight and beyond the fiddling of licensees. When you hear terms other than plain English, you are in the presence of “fog rolling in on little cat’s feet.” The difference is that this fog doesn’t move on. You are stuck in an almost-impenetrable mist. When you see the collision coming, it is almost always too late to avoid it. I think the phrase means, “Our engineers use stuff I don’t understand, but it sure sounds good.”

Intuitive search. This is a term used to suggest that the interface is easy enough for the marketer’s mother to use without someone telling her what to do. The interface is one visible piece of the search system itself. Humans like to look at interfaces and debate which color or icon is better for their users. Don’t guess on interfaces. Test different ones and use what gets the most clicks. Interfaces that generate more usage are generally better than interfaces designed by the senior vice president’s daughter who just graduated with an MFA from the University of Iowa. Design opinion is not search; it’s technology decoration. For an example, look at this interface from Yahoo. Is it intuitive to you?

Real-time search. This term means that the content is updated frequently enough to be perceived as real time. It’s not. There is latency in search systems. The word “search,” therefore, doesn’t mean real-time by definition. Feed means “near real time”. There are a lot of tricks to create the impression of real time. These include multiple indexes, caching, content boosting, and time stamp fiddling. Check out ZapTXT. Next compare Yahoo News, news, and Google News. Okay, which is “real time”? Answer: none.

Audio, video, image search. The idea is that a vendor indexes a particular type of non-text content. The techniques range from indexing only metadata and not the information in the binary file to converting speech to ASCII, then indexing the ASCII. In Japan, I saw a demonstration of a system that allowed a user to identify a particular image — for example, a cow. The system then showed pictures the system thought contained cows. These types of search systems address a real need today. The majority of digital content is in the form of digitized audio, video, and image files. Text is small potatoes. We don’t do a great job on text. We don’t do very well at all on content objects such as audio, video, and images. I think Blinkx does a reasonably good job — not great, reasonable.

Local search. This is a variation on vertical search. Information about a city or particular geographic area is indexed and made available. This is Yellow Pages territory. It is the domain of local newspaper advertising. A number of vendors want to dominate this sector; for example, Google, Microsoft, and Yahoo. Incumbents like telcos and commercial directory firms aren’t sure what actions to take as online sites nibble away at what was a $32 billion paper directory business. Look at Ask City. Will this make sense to your children?

Intelligent search. This is the old “FOAI” or familiar old artificial intelligence. Most vendors use artificial intelligence but call it machine learning or computational intelligence. Every major search engine uses computational intelligence. Try Microsoft’s service. Now try Google’s “ig”, or Individualized Google, service. Which is relying more on machine learning?

Key word search. This is the ubiquitous, “naked” search box. You can use Boolean operators, or you can enter free text and perform a free text search. Free text search means no explicit Boolean operators are required of the user. Enlightened search system vendors add an AND to narrow the result set. Other system vendors, rather unhelpfully, add an OR, which increases the number of results. Take a look at the key word search from Ixquick, an engine developed by a New York City investment banker and now owned by a European company. What’s it doing to your free text query?
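The difference an implicit AND or OR makes can be shown with a toy example of my own (the documents are invented): AND demands every term, OR accepts any term, and the result set sizes diverge accordingly.

```python
# A toy demonstration of implicit AND (narrows) versus implicit OR
# (broadens) when a free text query is interpreted by the engine.

docs = {
    1: {"weather", "forecast", "louisville"},
    2: {"weather", "report", "lexington"},
    3: {"traffic", "report", "louisville"},
}

def free_text(query, operator="AND"):
    """Match documents by requiring all terms (AND) or any term (OR)."""
    terms = set(query.lower().split())
    if operator == "AND":
        return sorted(d for d, words in docs.items() if terms <= words)
    return sorted(d for d, words in docs.items() if terms & words)

print(free_text("weather louisville", "AND"))  # [1], the narrow result set
print(free_text("weather louisville", "OR"))   # [1, 2, 3], the broad one
```

Run your 2.3-word query through both and you see why the vendor’s silent choice of operator shapes what lands on the first page.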

Search without search. Believe me, this is where the action is. The idea is that a vendor — for example, Google — will use information about information, user behavior, system processes, and other bits and pieces of data to run queries for a user automatically and in the background. Then when the user glances at his or her mobile device, the system is already displaying the information most likely to be wanted at that point in time by that user. An easy way to think of this is to imagine yourself rushing to the airport. The Google approach would look at your geospatial coordinates, check your search history, and display flight departure delays or parking lot status. I want this service because anyone who has ridden with me knows that I can’t drive, think about parking, and locate my airline reliably. I can’t read the keyboard on my mobile phone, so I want Google to convert the search result to text, call me, and speak the information as I try to make my flight. Google has a patent application with the phrase “I’m feeling doubly lucky.” Stay tuned to Google and its competitors for more information on this type of search.

This short list of different types of search helps explain why there is confusion about which systems do what. Search is no longer something performed only by a person trained in computer science, information science, or a similar discipline. Search is something everyone knows, right? Wrong. Search is a service that’s readily available and used by millions of people each day. Don’t confuse using an automatic teller machine with understanding finance. The same applies to search. Just because a person can locate information about a subject does not mean that person understands search.

Search is among the most complex problems in computer science, cognitive psychology, information retrieval, and many other disciplines. Search is many things, but it definitely is not easy, well understood, or widely recognized as the next application platform.

Stephen Arnold, January 30, 2008

Vivisimo’s Remix

January 29, 2008

I’ve been interested in Vivisimo since I learned about the company in 2000. Disclaimer: my son worked for Vivisimo for several years, and I was involved in evaluating the technology for the U.S. Federal government. A new function, called “Remix”, caught my attention and triggered this essay.


Carnegie Mellon University ranks among the top five or six leading universities in computer science. Lycos was a product of the legendary Michael “Fuzzy” Mauldin and his team. Disclaimer: my partner (Chris Kitze) and I sold search technology to Lycos in the mid-1990s. Dr. David Evans has practiced his brand of innovation with several successful search-centric start-ups; a chunk of his technology is now used in JustSystems‘ XML engine. (Disclaimer: I have done some work for JustSystems in Tokyo, Japan.) Vivisimo, founded by Raul Valdes-Perez and Jerome Pesenti, was among the first of the value-added processing search systems. I have been paying attention to Vivisimo since its earliest days.

I’ve been impressed with Vivisimo’s innovations, and I have appropriated Mr. Valdes-Perez’s coinage, “information overlook,” into my verbal arsenal. As I understand the term, an “overlook” is a way for a person looking for information to get a broader view of the results list. I think of it in terms of standing on a bluff and being able to see the lay of the land. As obvious as an overlook may be, it is a surprisingly difficult problem in information retrieval. You’ve heard the expression “We can’t see the forest for the trees.” Information overlook attempts to get the viewer into a helicopter. From that vantage point, it’s easier to see the bigger picture.

A Demonstration Query

Vivisimo’s technology has kept that problem squarely in focus. With each iteration and incremental adjustment to the Vivisimo technology, overlook has been baked into the Vivisimo approach to search-and-retrieval. Here’s an example.

Navigate to Clusty, Vivisimo’s public facing search system. Note that Clusty is a metasearch system. Your query is passed to other search systems such as Yahoo. The results are retrieved and processed before you see them. Now enter the query ArnoldIT. You will see a main results page and a list of folders in the left-hand column of your screen. You can browse the main results. Note that Vivisimo removes the duplicates for you, so you are looking at unique items. Now scan the folder names.

Those names represent the main categories or topics in that query’s result list. For ArnoldIT, you can see that my Web site has information about patents, international search, and so on. Let me highlight several points about the foundation of Vivisimo:

First, I’ve been impressed with Vivisimo’s on-the-fly clustering. It’s fast, unobtrusive, and a very useful way to get a view of what topics occur in a query’s result set. I use Vivisimo when I begin a research project to help me understand what topics can be researched via the Web and which will require the use of analysts making telephone calls.
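Vivisimo’s actual clustering algorithm is proprietary, so the best I can offer is a toy version that conveys the flavor of on-the-fly clustering: label clusters with terms shared across result titles, skipping stopwords.

```python
# A greatly simplified sketch of on-the-fly result clustering (not
# Vivisimo's algorithm, which the company has not disclosed).
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "for"}

def cluster_results(titles):
    clusters = defaultdict(list)
    for title in titles:
        for term in title.lower().split():
            if term not in STOPWORDS:
                clusters[term].append(title)
    # Keep only terms shared by 2+ results, largest clusters first.
    return sorted(((t, docs) for t, docs in clusters.items() if len(docs) > 1),
                  key=lambda kv: -len(kv[1]))

titles = ["Google patent filings", "International search trends",
          "Google search patents", "Enterprise search report"]
for label, docs in cluster_results(titles):
    print(label, len(docs))  # "search 3" then "google 2"
```

Even this crude version shows why the folder list is useful: the labels summarize the result set without anyone reading every item.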

Second, in the early days of online, deduplication was impossible. Dialog and Orbit, two of the earliest online systems, manipulated fielded flat files. A field name variation made it computationally expensive to recurse through records to identify and remove duplicate entries. When I was paying for results from commercial online systems, these duplicates cost me money. When I learned about Vivisimo’s duplicate detection function, I looked at it closely. No one at Vivisimo would give me the details of the approach, but it worked and still works well. Other systems have introduced deduplication, but Vivisimo made this critical function a must-have.
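Since Vivisimo never disclosed its method, here is one illustrative way duplicate detection in merged metasearch results can work: normalize each URL and keep only the first record seen for each normalized key.

```python
# An illustrative sketch of metasearch deduplication (not Vivisimo's
# undisclosed method): normalize URLs, keep first occurrence of each.

def normalize(url):
    url = url.lower().rstrip("/")
    for prefix in ("https://", "http://", "www."):
        if url.startswith(prefix):
            url = url[len(prefix):]
    return url

def dedupe(records):
    seen, unique = set(), []
    for title, url in records:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append((title, url))
    return unique

merged = [
    ("ArnoldIT", "http://www.arnoldit.com/"),
    ("Arnold IT Home", "http://arnoldit.com"),   # same site, different form
    ("Beyond Search", "http://arnoldit.com/wordpress"),
]
print(len(dedupe(merged)))  # -> 2
```

In the pay-per-record days, every record that hash check discarded was money saved.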

Third, Vivisimo’s implementation of metasearch remains speedy. There are a number of interesting approaches to metasearch, including the little-known system developed by a brother and sister team working in the south of France. I also admire the Devilfinder search engine that is now one of the faster metasearch systems available. But in terms of features, Vivisimo ranks at the top of the list, easily outperforming Ixquick, Dogpile, and other very useful tools.

Fourth, like Exalead, Vivisimo has been engineered using the Linux tricks of low-cost scaling and clustering for high performance. These engineering approaches are becoming widely known, but many of these innovations originated at Stanford, University of Waterloo, MIT, and Carnegie Mellon University.

The Shift to the Enterprise

Three years ago, Vivisimo made the decision to expand its presence in organizations. In effect, the company wanted to move from a specialist provider of clustering technology to delivering behind-the-firewall search. When Vivisimo’s management told me about this new direction, I explained that the market for behind-the-firewall search was a contentious, confused sector. Success would require more marketing, more sales professionals, and a tougher hide. Mr. Valdes-Perez looked at me and said, “No problem. We’re going to do it.”

The company’s first high-profile win was the contract for indexing the U.S. Federal government’s unclassified content. This contract was originally held by Inktomi from 2000 to 2001. Then Fast Search & Transfer with its partner AT&T held the contract from 2001 to 2005. When Vivisimo displaced Fast Search’s technology, the company was in a position to pursue other high-profile search deals.

Today, Vivisimo is one of the up-and-coming vendors of behind-the-firewall search solutions. I have learned that the company has just won another major search deal. I’m not able to reveal the name of the new client, but the organization touches the scientific and technical community worldwide. Based on my understanding of the information to be processed, Vivisimo will be making the research work of most US scientists and engineers more productive.


This essay is a direct result of my learning about a new Vivisimo function, Remix. You can use the Remix function when you have a result set visible in your results display. In our earlier sample query, ArnoldIT, you see the top 10 topics or clusters of results for that query. When you select Remix, the system reclusters the results. According to Vivisimo: “With a single click, remix clustering answers the question: What other, subtler topics are there? It works by clustering again the same search results, but with an added input: ignore the topics that the user just saw. Typically, the user will then see new major topics that didn’t quite make the final cut at the last round, but may still be interesting.”

The function is important for three reasons:

First, Vivisimo has made drill down easy. Some systems perform a similar function, but the user is not always aware of what’s happened or where the result list originated. Vivisimo does a good job of keeping the user in control and aware of his / her location in the results review sequence.

Second, Remix allows one-click access to categories that otherwise would not be seen by the Clusty user. The benefit of Remix is that the result sets do not duplicate any topics the user saw before clicking the Remix button. Just as Vivisimo’s original deduplication function worked invisibly, so does Remix. The function just happens.

Third, the function is speedy. Vivisimo has a number of innovations in its system to make on-the-fly processing of search results take place without latency–the annoying delays some systems impose upon me. Vivisimo’s value-added processing occurs almost immediately. Like Google, Vivisimo has focused on delivering fast response time and rocket science for the busy professional.
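The Remix idea, as Vivisimo describes it, reduces to a simple recipe: cluster the same results again, but suppress the topic labels the user already saw. Here is my sketch of that recipe; the real implementation is not public, and the sample documents are invented.

```python
# A sketch of the Remix concept: same results, but first-round topic
# labels are hidden so subtler topics surface. Not Vivisimo's code.
from collections import Counter

def top_topics(docs, ignore=frozenset(), k=3):
    """Most frequent terms across docs, excluding any the user already saw."""
    terms = Counter(t for d in docs for t in d.lower().split() if t not in ignore)
    return [t for t, _ in terms.most_common(k)]

docs = ["google search patents", "google android phones",
        "android linux kernel", "linux search appliances"]
first = top_topics(docs)              # the first round of topics
remix = top_topics(docs, set(first))  # Remix: same docs, first round hidden
print(first, remix)
```

The guarantee that matters to the user is the one in the second reason above: the remixed topics never repeat the ones already shown.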

Some Challenges

Companies like Vivisimo will have to deal with the marketing challenges of today’s search-and-retrieval marketplace. The noise created by Microsoft’s acquisition of Fast Search and Endeca‘s injection of cash from Intel and SAP means that interesting companies like Vivisimo have to make themselves known. I don’t envy the companies trying to get traction in the search sector.

If you are looking for a behind-the-firewall system, you will want to take a look at Vivisimo’s system. In fact, you will want to spend additional time reviewing the search solutions available from the up-and-comers I profile in my new study “Beyond Search”, due out in April 2008. You will find that you can deliver a robust solution without the teeth-rattling licensing fees required by some of the higher-profile vendors.

I can’t say that any one search system will be better for you than another. In fact, when you compare ISYS Search Software, Siderean Software, and Exalead with Vivisimo, you may find that each is an exceptionally robust solution. Which system you find is best for you comes down to your requirements. The key point is that the up-and-coming systems must not be excluded from your short list because the companies are not making headlines on a daily basis.

If you have the impression that Vivisimo is not up to an enterprise-scale content processing job, you have flawed information. Give Vivisimo’s technology a test drive. Judge for yourself. I wrote about Vivisimo in the first, second, and third editions of The Enterprise Search Report. I won’t be repeating that information in Beyond Search. You can explore Vivisimo and learn more about the system from the company’s useful white papers and case studies.

Stephen E. Arnold, January 29, 2008

Will Search “Save eBay”?

January 29, 2008

The Monday, January 28, 2008, New York Times contained a short item that originally appeared in Bits, the Times’s technology blog, which is updated all day on the paper’s Web site.

The article in question carries this provocative headline: “The Plan to Save eBay: Better Search.” The author is Saul Hansell, whose writing I admire. I was tickled to learn his Web log entry made the leap from the Times‘s Web site to its printed newspaper. This revealed two facts to me: [a] the editors read what’s on the New York Times‘s Web site, a very good sign, and [b] the Web log itself contained newsprint-worthy information.

I want to quote a small snippet in which John Donahoe, the new eBay boss, is the primary source of information for the story. (I urge you to read the original posting or newspaper article.) I will reference some of these words in my discussion of eBay.

“Let’s say you wanted to buy a BlackBerry,” he [Donahoe] said. “Last time I [Donahoe] checked, we had 25,000 BlackBerry listings. This is a fairly confusing experience. A year from now, you will be able to say, ‘I want a BlackBerry. Boom. Show me the latest models at the cheapest price.’” The same screen, he added, will show used and older models as well. “We also want to surface the six-month-old version, still brand new, that may be in an auction format because its value is less certain,” he said.

As I understand this statement, eBay is going to [a] improve search because getting 25,000 results is confusing, and I agree, [b] maybe support voice search because the phrase “you will be able to say” seems to suggest that functionality, and [c] eBay “will show used and older models as well”. The word “show” connotes some graphical representation or interface innovation to help users make sense of 25,000 BlackBerry listings.

Any one of these technology-centric fixes would be a lot of work and could take considerable time. A year seems too little time to get these innovations planned, procured, debugged, and online for customers.

I know from my search work that most users don’t feel comfortable with laundry lists of results. I generally look at the auctions closing within a few minutes or check out the “Buy It Now” products. I no longer rummage through eBay’s sorting and filtering functions. For me, those functions are too hard to find. I prefer Google’s approach to its “Sort by Price” function. I also like eCost’s sending an email with time-sensitive deals. I want what I want now with the least hassle, the lowest price at that moment, and the simplest possible interface. Let’s look at some of the words I highlighted in the article.

I’m puzzled by Mr. Donahoe’s use of the word “surface”. I am not sure about its meaning in this context. “Surface” makes me think of whales, as in “Save the whales.”

When Mr. Donahoe uses the word “say”, I think of my mobile phone’s speech recognition function. I talk to my phone now, mostly unsuccessfully. I do use Google’s voice recognition service for 411, and it’s pretty good. My mobile phone has a small screen, and I can’t figure out how eBay will be able to display some of the 25,000 results so I can read them. I use the new Opera mobile browser. I don’t like its miniature rendering of a Web page. When I want to look at something, Opera uses a zoom function that is a hassle for me to use on my mobile’s Lilliputian keyboard. eBay has to do better than Opera’s interface.

Most of the gizmos I look for on eBay or Google’s shopping service come in quantities of a couple dozen if the product is even available. For example, I recently scoured the Web for a replacement fan for one of my aging Netfinity 5500 servers. Zero hits for me on eBay the day I ran the query. I fixed the fan myself. Last week, I tried to buy a Mac Mini on eBay, but I got a better deal through Craigslist.

Enough old-guy grumpiness.

I knew that eBay’s search system was and is a work in progress. Years ago, the eBay Web site carried a Thunderstone logo. I assumed that Thunderstone, a vendor of search systems, provided search technology to eBay. Then one day the little blue Thunderstone logo vanished. No one at Thunderstone would tell me what happened. Somewhere along the line, Louis Monier (a search wizard) joined eBay. Then he jumped to Google, and I don’t know who had to fill his very big shoes. I asked eBay to fill me in, but eBay’s team did not respond to me. I call this search churn. In my experience, it’s expensive, and it underscores a lack of certainty about how to deliver a foundation service.

But I really wasn’t surprised at the lack of response to my email.

When an eBay vendor snookers me, I face time-consuming work to get to a human. Someone told me that I have a negative reputation score because I am a “bad buyer”. I don’t sell anything on eBay, but I suppose eBay rates customers who grouse when a sale goes out of bounds.

Search won’t fix a flawed business model, shore up customer support, or mollify annoyed customers handed a grade of “D” in buying. To my way of thinking, search is not eBay’s only problem. Search is not eBay’s major problem.

One of my colleagues in San Francisco told me that eBay was reluctant to license his software because eBay’s system at that time was “flaky”. His word, not mine. At that time eBay was relying on Sun Microsystems’ servers. I was a Sun Catalyst reseller and a Sun cheerleader. I know that, in general, Sun boxes are reliable, fast, stable, and scalable when properly set up and resourced. Ignore Sun’s technical recommendations, and you will definitely have excitement on your hands. When I hear rumors of high-end systems being “flaky”, I’m inclined to believe that some technical and management mistakes were made. Either money or expertise is in short supply, so a problem gets a temporary fix, not a real fix.

After reading the New York Times‘s article, I asked myself, “Is eBay so sick it has to be saved?”

For example, I bought a watchband on eBay not long ago, and everything worked as I expected. I used the search engine to find “brown watchband 20mm”. I got a page or two of results. I picked a watchband, won the auction, and I paid via PayPal — actually tried to pay. I had registered a new credit card a few weeks before the purchase. Before I could consummate my purchase, I had to locate a secret four digit number printed next to a $2 eBay transaction on my last credit card statement. After coming home from an 18-day trip, I had a hefty credit card statement. Hunting down the secret code definitely put a hitch in my getalong, but I found the number after some hunting. About a week later my watchband arrived. I liked its hot pink color and its 18mm width. Yep, another eBay purchase that went awry. I lived with the error. I’ve learned that when I file a negative comment about a transaction, I get emails from the offending merchant asking me to revise my opinion. I don’t need busy work.

Now let’s think about this “save eBay” effort. When I was in Australia in November 2007, the Australian government said it would take action to protect the whales. I didn’t think this would help. There are more whale hunters than Australian patrols. The Pacific Ocean is a big expanse. Whale hunters with radar, infrared sensors, Google Earth, and super-tech harpoons can find and kill whales more easily than Australian patrols can find the whale killers.

If “save eBay” is like saving the whales, eBay has a thankless job to do. But just fixing search won’t save eBay. The business processes and the business model need some attention. eBay’s new president (the former blue-chip consultant) is putting in place a one-year program in which search plays a leading role. eBay is going to use its “closed transaction data” to be smarter about using those data to “provide the most relevant search experience.”

I am confident that a Bain consultant can deliver on his agenda. What bothers me is that I think his timeline lacks wiggle room. He has to clear some hurdles:

First, annoyed sellers who are looking for other ways to move products.

Second, annoyed buyers who look for other places to get their goods.

Third, the “new” or “better” search system.

Fourth, increasingly complex security actions that remind me that maybe eBay is not as secure as I believed it to be.

If the new president can’t revivify eBay, we might be looking at an eAmazon or a Google-Bay. If eBay swings for a search home run, it won’t be enough. eBay has to make more informed decisions about its customers, security, sellers, and business model. Otherwise, eBay may be an ecommerce whale pursued by some hungry sushi lovers.

Stephen Arnold, January 28, 2008

Lucene: Merits a Test Drive and a Close Look

January 27, 2008

On Friday, I gave an invited lecture at the Speed School, the engineering and computer science department of the University of Louisville. After the talk on Google’s use of machine learning, one of the Ph.D. candidates asked me about Lucene. Lucene, as you may know, is the open source search engine authored by one of the Excite developers. If you want background on Lucene, the Wikipedia entry is a good place to start, and I didn’t spot any egregious errors when I scanned it earlier today. My interest is behind-the-firewall search and content processing. My comments, therefore, reflect my somewhat narrow view of Lucene and other retrieval systems. I told the student that I would offer some comments about Lucene and provide him with a few links.


Lucene’s author is Doug Cutting, who worked at Xerox’s Palo Alto Research Center and eventually landed at Excite. After Excite was absorbed into Excite@Home, he needed to learn Java. He wrote Lucene as an exercise. Lucene (his wife’s middle name) was contributed to the Apache project, and you can download a copy, documentation, and sample code here. An update — Java Version 2.3.0 — became available on January 24, 2008.

What It Does

Lucene permits key word and fielded search. You can use Boolean AND, OR, and NOT to formulate complex queries. The system permits fuzzy search, useful when searching text created by optical character recognition. You can also set up the system to display similar results, roughly the same as See Also references. You can set up the system to index documents. When a user requests a source document, that document must be retrieved over the local network. If you want to minimize the bandwidth hit, you can configure Lucene to store an archive of the processed documents. If the system processes structured content, you can search by the field tags, sort these results, and perform other manipulations. There is an administrative component which is accessed via a command line.
In a nutshell, you can use Lucene as a search and retrieval system.
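Lucene itself exposes these query types through its Java API; the following standalone Python sketch merely mimics the behavior just described — Boolean combination over an index plus fuzzy matching, which is handy for OCR-mangled text. The three-term index is invented:

```python
# Not Lucene code: a Python sketch of the query behavior described above,
# Boolean AND / NOT plus fuzzy term matching for OCR errors.
from difflib import SequenceMatcher

index = {
    "android": {1, 2},
    "google": {1, 3},
    "telephone": {2, 3},
}

def fuzzy_terms(term, threshold=0.8):
    """Indexed terms close enough to a (possibly misspelled) input term."""
    return [t for t in index
            if SequenceMatcher(None, term, t).ratio() >= threshold]

def search(must=(), must_not=()):
    """Boolean AND over `must` terms, minus any `must_not` postings."""
    hits = set.intersection(*(index[t] for t in must)) if must else set()
    for t in must_not:
        hits -= index.get(t, set())
    return hits

print(search(must=["android", "google"]))               # AND -> {1}
print(search(must=["google"], must_not=["telephone"]))  # NOT -> {1}
print(fuzzy_terms("telefone"))                          # OCR-style misspelling
```

The fuzzy lookup is the feature that saves scanned-document collections: “telefone” still finds documents indexed under “telephone”.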

Selected Features

You will want to have adequate infrastructure to operate the system, serve queries, and process content. When properly configured, you will be able to handle collections that number hundreds of millions of documents. Lucene delivers good relevancy when properly configured. Like a number of search and content processing systems, the administrative tools allow the search administrator to tweak the relevance engine. Among the knobs and dials you can twirl are document weights so you can boost or suppress certain results. As you dig through the documentation, you will find guidance for run time term weights, length normalization, and field weights, among others. A bit of advice — run the system in the default mode on a test set of documents so you can experiment with various configuration and administrative settings. The current version improves on the system’s ability to handle processes in parallel. Indexing speed and query response time, when properly set up and resourced, are as good as or better than some commercial products’ responsiveness.
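To show what those knobs and dials do, here is a miniature, hypothetical scoring function with per-field weights, length normalization, and a per-document boost. Lucene’s real Similarity formula is more involved; the weights and sample documents below are mine.

```python
# A toy relevance formula (not Lucene's actual Similarity): field weights,
# length normalization, and a per-document boost, as described above.
import math

FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}   # boost title matches

def score(doc, query_terms, boost=1.0):
    s = 0.0
    for field, text in doc.items():
        tokens = text.lower().split()
        tf = sum(tokens.count(t) for t in query_terms)  # term frequency
        norm = 1.0 / math.sqrt(len(tokens))             # length normalization
        s += FIELD_WEIGHTS.get(field, 1.0) * tf * norm
    return s * boost

doc_a = {"title": "google android", "body": "a long article about phones"}
doc_b = {"title": "mobile news", "body": "google android google android"}
q = ["google", "android"]
print(score(doc_a, q), score(doc_b, q))  # title hits outweigh body hits
```

Twirl the field weights or the document boost and the ranking flips — which is exactly why I suggest experimenting on a test collection first.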

Strengths and Weaknesses

The Lucene Web site provides good insight into the strengths and weaknesses of a Lucene search system. The biggest plus is that you can download the system, install it on a Linux, UNIX, or Windows server and provide a stable, functional key word and fielded search system. In the last three or four years, the system has made significant improvements in processing speed, reducing the size of the index footprint (now about 25 percent of the source documents’ size), incremental updates, support for index partitions, and other useful enhancements.

The downside of Lucene is that a non-programmer will not be able to figure out how to install, test, configure, and deploy the system. Open source programs are often quite good technically, but some lack the graphical interfaces and training wheels that are standard with some commercial search and content processing systems. You will be dependent on the Lucene community to help you resolve some issues. You may find that your request for support results in a Lucene aficionado suggesting that you use another open source tool to resolve a particular issue. You will also have to hunt around for some content filters, or you will be forced to code your own import filters. Lucene has not been engineered to deliver the type of security found in Oracle’s SES 11g system, so expect to spend some time making sure users can access only content at their clearance level.
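The security trimming Lucene leaves to you amounts to filtering hits against each user’s clearance before display. A minimal sketch, with clearance levels and field names I invented for illustration:

```python
# Post-query security trimming, which Lucene does not provide out of the
# box: drop any hit above the user's clearance level. Levels are invented.
CLEARANCE = {"public": 0, "internal": 1, "secret": 2}

def trim(results, user_level):
    """Return only hits the user is cleared to see."""
    return [r for r in results if CLEARANCE[r["acl"]] <= CLEARANCE[user_level]]

hits = [{"title": "Press release", "acl": "public"},
        {"title": "HR memo", "acl": "internal"},
        {"title": "M&A plan", "acl": "secret"}]
print([r["title"] for r in trim(hits, "internal")])  # -> ['Press release', 'HR memo']
```

Trivial as this looks, wiring it to a real directory service and keeping it fast is where the time goes.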

When to Use Lucene

If you have an interest in behind-the-firewall search, you should download, install, and test the system. Lucene provides an excellent learning experience. I would have no hesitation installing Lucene in an organization where money for a brand name search system was not available. The caveat is that I am confident in my ability to download, install, debug, configure, and deploy the system. If you lack the requisite skills, you can still use Lucene. In September 2007, I met the founders of Tesuji, a company with business offices in Rome, Italy, and technical operations in Budapest, Hungary. This company provides technical support for Lucene, offers customization services, and provides a build that the company has configured. Information about Tesuji is here. Another option is to download SOLR, which is a wrapper for Lucene. SOLR provides a number of features, but the most useful is the Web-based administrative interface. When you poke under the SOLR hood, you will find tools to replicate indexes and perform other chores.

What surprises a number of people is the wide use of Lucene. IBM makes use of it. Siderean Software can include it in their system if the customer requires a search system as well as Siderean’s advanced content processing tools. Project Gutenberg uses it as well.
Some organizations have yet to embrace open source software. If you have the necessary expertise, give it a test drive. Compared to the $300,000 or higher first-year license fees some search and content processing vendors demand, Lucene is an outstanding value.

Stephen Arnold, January 27, 2008

ESR, 4th Edition – The Definitive Atlas of Search

January 27, 2008

My colleague, Tony Byrne, publisher of The Enterprise Search Report, reminded me to inform you that the fourth edition of The Enterprise Search Report is available. It contains explanatory information essential for a successful search deployment, and you get 18 in-depth profiles of the leading vendors of enterprise search systems. I’m proud to have been part of the original study, which, since 2004, has provided information about what’s turned out to be one of the hottest sectors in enterprise software. This new edition builds upon my original work for the first, second, and third editions. The fourth edition contains information useful to system administrators, procurement teams, investors, and organizations involved in any way with search and retrieval. I strongly urge you to visit the publisher’s site, look at the sample chapter available without charge, and then order your copy. A site license is available. If you are waiting for my new study Beyond Search, buy ESR. The two studies are complementary, and in Beyond Search I refer readers to the ESR so I don’t have to repeat myself. If you are serious about search, content processing, and value-adds to existing systems, you will find both studies useful.

Stephen Arnold, January 27, 2008

Up or Out? Probably Neither. Go Diagonal to the Future

January 27, 2008

I’ve been working on the introduction to Beyond Search. My fancy-dan monitor still makes my eyes tired. I’m the first to admit that next-generation technology is not without its weaknesses. To rest, I sat down and started flipping through the print magazines that accumulate each week.

Baseline is a Ziff Davis Enterprise magazine. I want you to know that I worked at Ziff Communications, and I have fond memories of Ziff Davis, one of Ziff’s most important units, at its peak. ZD’s products were good. Advertisers flocked to our trade shows, commercial online databases, and, of course, the magazines. I remember when Computer Shopper had so many pages, the printer complained because his binding unit wasn’t designed to do what he called “telephone books.” My recollection about that issue, which I saved for years, was a newsprint magazine with more than 600 pages that month. The Baseline I’m holding has 62 pages and editorial copy on the inside back cover, not an ad.

Baseline is a computer business magazine with the tagline “where leadership meets technology.” The Ziff of old was predicated on product reviews, product information, and product comparisons. This Baseline magazine doesn’t follow the old Ziff formula. Times change, and managers have to adapt. The original Ziff formula was right for the go-go years of the PC industry when ad money flowed to hard copy publications. It’s good that Baseline has a companion Web site. The information on the Web site is more timely than the articles in the print magazine, but maybe because of my monitor, I found the site difficult to read and confusing. Some of the news is timely and important; for example, Baseline carried the story about Google’s signing up the University of Phoenix, another educational scalp in its bag. That’s an important story largely unreported and not included in the Google News index. I like the idea of a different, thoughtful approach to information technology. I also use the Baseline Web site.

The story in the January 2008 issue — “Scaling Up or Out” by David F. Carr — tackles an important subject. The question of how to scale to meet growing demand is one that many organizations now face. (I would provide a link to the article, but I could not locate it on the magazine’s Web site. The site lacks a key word search box, or I couldn’t find it. If you want to read the hard copy of this article, you will find it on pages 57, 58, 59, and 60.)

The subject addresses what IT options are available when systems get bogged down. The article correctly points out that you can buy bigger machines and consolidate activity. Traditional database bottlenecks can be reduced with big iron and big money. I think that’s scaling up. Another approach is to use more servers the way Google and many other Web sites do. I think that’s scaling out. The third option is to distribute the work over many commodity machines. But distributed processing brings some new headaches, and it is not a cure-all. There’s another option that walks a middle path. You “scale diagonally.” I think this means some “up” and some “out.” I’m sure some fancy Harvard MBA created apt terminology for this approach, but I think the phrase “technology grazing” fits the bill. The Baseline editors loved this story; the author loved it; and most readers of Baseline will love it. But when I read it, three points jabbed me in the snoot.

First, pages 58 and 59 feature pictures of three high-end servers. Most readers will not get their hands on these gizmos, but for techno-geeks these pix are better than a Sports Illustrated swimsuit issue. But no comparative data are offered. I don’t think anyone at ZD saw these super-hot computers or actually used them. With starting prices in six figures and soaring effortlessly to $2 million or more for one server, some product analysis would be useful. It is clear from the article that for really tough database jobs, you will need several of these fire breathers. The three servers are the HP Integrity Superdome, the Unisys ES7000/one, and the IBM p5 595. And page 60 has a photo of the Sun SPARC Enterprise M9000. From these graphics, I knew that the article was going to make the point that for my enterprise data center, I would have to have these machines. By the way, HP, IBM, and Sun are listed as advertisers on page 8. Do you think an ad sales professional at ZD will suggest to Unisys that it too should advertise in Baseline? The annoyance: product fluff presented as management meat.

Second, the reason to buy big, fast iron is the familiar RDBMS or relational database management system. The article sidesteps the ubiquitous Codd architecture. Today, Dr. Codd’s invention is asked to do more, faster. The problem is that big iron is a temporary fix. As the volume of data and transactions rise, today’s hot iron won’t be hot enough. I wasn’t reading about a solution; I was getting a dose of the hardware sales professional’s Holy Grail–guaranteed upgrades. I don’t think bigger iron will resolve transaction bottlenecks with big data. The annoyance: the IT folks embracing a return to the mainframe may be exacerbating the crisis.

Third, I may be too sensitive. I came away from the article with the sense that distributed, massively parallel systems are okay for lightweight applications. Forget it for the serious computing work. For real work, you need HP Integrity Superdome, Unisys ES7000/one, IBM p5 595, or the Sun SPARC Enterprise M9000. Baseline hasn’t suggested how to remove the RDBMS handcuffs limiting the freedom of some organizations. Annoyance: no solution presented.
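For what it’s worth, here is how I would make the up / out / diagonal vocabulary concrete with a toy cost model. Every number in it is invented for illustration, not taken from the Baseline article or any vendor price list:

```python
# A toy cost/throughput model (all numbers invented): "up" buys one bigger
# box, "out" adds commodity boxes, "diagonal" mixes the two approaches.

def capacity(big_boxes=0, small_boxes=0):
    # Assume a big server does 10x a commodity server's work at 30x its cost.
    throughput = big_boxes * 10 + small_boxes * 1
    cost = big_boxes * 30 + small_boxes * 1
    return throughput, cost

for label, cfg in [("up", dict(big_boxes=2)),
                   ("out", dict(small_boxes=20)),
                   ("diagonal", dict(big_boxes=1, small_boxes=10))]:
    t, c = capacity(**cfg)
    print(f"{label:8} throughput={t:3} cost={c:3}")
```

Under these made-up numbers all three configurations deliver the same throughput at very different prices — which is exactly the analysis the article’s server glamour shots never provide.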

Google’s enterprise boss is Dave Girouard. Not long ago, I heard him make reference to a coming “crisis in IT.” This Baseline article makes it clear that IT professionals who keep their eyes firmly on the past have to embrace tried and true solutions.

In my opinion, here’s what’s coming and not mentioned or even hinted at in the Baseline article. The present financial downturn is going to accelerate some significant changes in the way organizations manage and crunch their data. Economics will start the snowball rolling. The need for speed will stimulate organizations’ appetite for a better way to handle mission-critical data management tasks.

Amazon and Google are embracing different approaches to “old” data problems. If I read the Baseline article correctly, lots of transactions in real time aren’t such a good idea with distributed, massively parallel architecture built on commodity hardware. What about Amazon and Google? Both run their respective $15 billion annual businesses on this type of platform. And dozens of other companies are working like beavers to avoid a 1970s computer center using multi-core CPUs.

Finally, the problem is different from those referenced in the Baseline article. In Beyond Search, I profile one little-known Google initiative which may or may not become a product. But the research is forward looking and aims to solve the database problem. Not surprisingly, the Google research uses the commodity hardware and Googley distributed, massively parallel infrastructure. Could it be time for companies struggling with older technologies to look further ahead than having a big rack stuffed with several score multi-core CPUs? Could it be time to look for an alternative to the relational database? Could it be time to admit that the IT crisis has arrived?

Baseline seems unwilling to move beyond conventional wisdom. True, the article does advise me to “scale diagonally.” The problem is that I don’t know what this means. Do you?

Stephen Arnold, January 27, 2008

Twisty Logic

January 25, 2008

I live in a hollow in rural Kentucky. The big city newspaper is The Courier-Journal, and I look at it every day, and I even read a story or two. The January 25, 2008, edition ran a story called “Comair Passengers Blamed in Crash.” The story by Tom Loftus contained the phrase contributory negligence. This phrase might be useful or not too useful if you find yourself taking your search engine vendor to court.

The phrase contributory negligence is used in the context of a tragic aircraft mishap in Lexington, Kentucky. The attorney defending the airline in the matter “has claimed that passengers were partly to blame for their own deaths in the August 2006 crash…” The logic is that passengers knew the weather was bad and that the runway was under construction.

According to the article, the attorney did not include this argument in his defense of his client.

The notion that a vendor can blame the customer for failure is an interesting idea, and it is one that, I must admit, I had not considered. In Kentucky, this logic makes sense. But when you have a million-dollar investment in a search system, you may want to make sure that you document your experience with the search system.

If you don’t, or you take shortcuts in the implementation, you might find yourself on the wrong side of the logic that asserts you — the customer — caused the problem. The vendor wins.

You can read the full Courier-Journal story here. The Web site rolls off older stories, so you might encounter a dead link. Hurry. The story was still live at 9:30 pm, Friday, January 25, 2008.

Stephen Arnold, January 25, 2008

Transformation: An Emerging “Hot” Niche

January 25, 2008

Transformation is a five-dollar word that means changing a file from one format to another. The trade magazines and professional writers often use data integration or normalization to refer to what amounts to taking a Word 97 document with a Dot DOC extension and turning it into a structured document in XML. These big words and phrases refer to a significant gotcha in behind-the-firewall search, content processing, and plain old moving information from one system to another.

Here’s a simple example of the headaches associated with what should be a seamless, invisible process after a half century of computing. The story:

You buy a new computer. Maybe a Windows laptop or a new Mac. You load a copy of Office 2007, write a proposal, save the file, and attach it to an email that says, “I’ve drafted the proposal we have to submit tomorrow before 10 am.” You send the email and go out with some friends.

In the midst of a friendly discussion about the merits of US democratic presidential contenders, your mobile rings. You hear your boss saying over the background noise, “You sent me a file I can’t open. I need the file. Where are you? In a bar? Do you have your computer so you can resend the file? No? Just get it done now!” Click here to read what ITWorld has to say on this subject. Also, there’s some user vitriol over the Word-to-Word compatibility hassle itself here. A work around from Tech Addict is here.

Another scenario is to have a powerful new content processing system that churns through, according to the vendor’s technical specification, “more than 200 common file types.” You set up the content processing gizmo, aim it at the marketing department’s server, and click “Index.” You go home. When you arrive the next morning at 8 am, you find that the 60,000 documents in the folders containing what you wanted indexed had become an index with 30,000 documents. Where are the other 30,000 documents? After a bit of fiddling, you discover the exception log and find that half of the documents you wanted indexed were not processed. You look up the error code and learn that it means, “File type not supported.”

The culprit is the inability of one system to recognize and process a file. The reasons for the exceptions are many and often subtle. Let’s troubleshoot the first problem, the boss’s inability to open a Word 2007 file sent as an attachment to an email.

The problem is that the recipient is using an older version of Word. The sender saved the file in the most recent version of Word’s XML format. You can recognize these files by their extension Dot DOCX. What the sender should have done is save the proposal [a] as a Dot DOC file in an older “flavor” of Word’s DOC format; [b] as a file in the now long-in-the-tooth RTF (rich text format) type; or [c] as a file in Dot TXT (ASCII) format. The fix is for the sender to resend the file in a format the recipient can view. But that one file can cost a person credibility points or the company a contract.

The second scenario is more complicated. The marketing department’s server had a combination of Word files, Adobe Portable Document Format files with Dot PDF extensions, some Adobe InDesign files, some QuarkXPress files, some Framemaker files, and some database files produced on a system no one knows much about except that the files came from a system no longer used by marketing. A bit of manual exploration revealed that the Adobe PDF files were password protected, so the content processing system rejected them. The content processing system lacked import filters to open the proprietary page layout and publishing program files. So it rejected them. The mysterious files from the disused system were data dumps from an IBM CICS system. The content processing system opened them and then found them unreadable, so those were exceptions as well.

Now the nettles, painful nettles:

First, fixing the problem with any one file is disruptive but usually doable. The reputation damage done may or may not be repaired. At the very least, the sender’s evening was ruined, but the high-powered vice president was with a gaggle of upper crust types arguing about an election’s impact on trust funds. To “fix” the problem, she had to redo her work. Time consuming, and annoying to have to leave her friends. The recipient — a senior VP — had to jiggle his plans in order to meet the 10 am deadline. Instead of chilling with The Simpsons TV show, he had to dive into the proposal and shape the numbers under the pressure of the looming deadline.

We can now appreciate a 30,000 file problem. It is a very big problem. There’s probably no way to get the passwords to open some of the PDFs. So, the PDFs’ content may remain unreadable. The weird publishing formats have to be opened in the application that created them and then exported in a file format the content processing system understands. This is a tricky problem, maybe another Web log posting. An alternative is to print out hard copies of the files, scan them, use optical character recognition software to create ASCII versions, and then feed the ASCII versions of the files to the content processing system. (Note: some vendors make paper-to-ASCII systems to handle this type of problem.) Those IBM CICS files can be recovered, but an outside vendor may be needed if the system producing the files is no longer available in house. When the costs are added up, these 30,000 files can represent hundreds of hours of tedious work. Figure $60 per hour and a week’s work if everything goes smoothly, and you can estimate the minimum budget “hit”. No one knows the final cost because transformation is dicey. Cost naivety is the reason my blood pressure spikes when a vendor asserts, “Our system will index all the information in your organization.” That’s baloney. You don’t know what will or won’t be indexed unless you perform a thorough inventory of files and their types and then run tests on a sample of each document type. That just doesn’t happen very often in my experience.

Now you know what transformation is. It is a formal process of converting lead into content gold.

One Google wizard — whose name I will withhold so Google’s legions of super-attorneys don’t flock to rural Kentucky to get the sheriff to lock me up — estimated that up to 30 percent of information technology budgets is consumed by transformation. So for a certain chicken company’s $17 million IT budget, the transformation bill could be in the $5 to $6 million range. That translates to selling a heck of a lot of fried chicken. Let’s assume the wizard is wrong by a factor of two. This means that $2 to $3 million is gnawed by transformation.
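
The arithmetic behind those figures is easy to check. A back-of-envelope sketch in Python, taking the wizard’s “up to 30 percent” estimate and the $17 million budget at face value:

```python
# Back-of-envelope check of the transformation cost estimates above.
# Assumption: the wizard's "up to 30 percent" share and the $17 million
# IT budget are taken at face value.
it_budget = 17_000_000
high_share = 0.30            # wizard's upper estimate
low_share = high_share / 2   # assume the wizard is wrong by a factor of two

print(f"High estimate: ${it_budget * high_share:,.0f}")  # High estimate: $5,100,000
print(f"Low estimate:  ${it_budget * low_share:,.0f}")   # Low estimate:  $2,550,000
```

Either way the number lands in the millions, which is the point: halving the assumption does not make the line item disappear.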

As organizations generate and absorb more digital information, what happens to transformation costs? The costs will go up. Whether the Google wizard is right or wrong, transformation is an issue that needs experienced hands minding the store.

The trigger for these two examples is a news item that the former president of Fast Search & Transfer, Ali Riaz, has started a new search company. Its USP (unique selling proposition) is data integration plus search and content processing. You can read Information Week‘s take on this new company here.

In Beyond Search, I discuss a number of companies and their ability to transform and integrate data. If you haven’t experienced the thrill of a transformation job, a data integration project, or a structured data normalization task — you will. Transformation is going to be a hot niche for the next few years.

Understanding of what can be done with existing digital information is, in general, wide and shallow. Transformation demands narrow and deep understanding of a number of esoteric and almost insanely diabolical issues. Let me identify three from my own personal experience, learned at the street academy called Our Lady of Saint Transformation.

First, each publishing system has its own peculiarities about files produced by different versions of itself. InDesign 1.0 and 2.0 cannot open the most recent version’s files. There’s a work around, but unless you are “into” InDesign, you have to climb this learning curve and fast. I’m not picking on Adobe. The same intra-program incompatibilities plague Quark, PageMaker, the moribund Ventura, Framemaker, and some high-end professional publishing systems.

Second, data files spit out by mainframe systems can be fun for a 20-something. There are some interesting data formats still in daily use. EBCDIC or Extended Binary-Coded Decimal Interchange Code is something some readers can learn to love. It is either that or figuring out how to fire up an IBM mainframe, reinstalling the application (good luck on that one, 20 somethings), restoring the data from a DASD or flat file backup tapes (another fun task for a recent computer science grad), and then outputting something the zippy new search or content processing system can convert in a meaningful way. (Note: “meaningful way” is important because when a filter gets confused, it produces some interesting metadata. Some glitches can require you to reindex the content if your index restore won’t work.)
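
For the character data, at least, modern tools can help: Python’s standard codecs include the common EBCDIC code pages. A minimal sketch, assuming code page 037 (real mainframe dumps often mix in packed-decimal and binary fields that no codec pass alone can handle):

```python
# Minimal sketch: decode an EBCDIC (code page 037) text record to a
# Python string. Caveat: COMP-3 packed-decimal and binary fields in
# real CICS dumps need record-layout knowledge, not just a codec.
ebcdic_record = b'\xc8\x85\x93\x93\x96\x6b\x40\xe6\x96\x99\x93\x84'

text = ebcdic_record.decode('cp037')
print(text)  # Hello, World
```

The hard part is never the codec; it is knowing the record layout, which is why the outside vendor so often gets the call.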

Third, the Adobe PDFs with their two layers of security can be especially interesting. If you have one level of password, you can open the file and maybe print it and copy some content from it. Or, not. If not, you print the PDFs (if printing has not been disabled) and go through the OCR-to-ASCII drill. In my opinion, PDFs are like a digital albatross. These birds hang around one’s neck. Your colleagues want to “search” for the PDFs’ content in their behind-the-firewall system. When asked to produce the needed passwords, I often hear something discomforting from the marketing department. So it is no surprise to learn that some system users are not too happy.

You may find this post disheartening. I don’t. This post is chock full of really good news. It makes clear that companies in the business of transformation are going to have more customers in 2008 and 2009. It’s good news for off-shore conversion shops. Companies that have potent transformation tools are going to have a growing list of prospects. Young college grads get more chances to learn the mainframe’s idiosyncrasies.

The only negative in this rosy scenario is for the individual who:

  • Fails to audit the file types and the amount of content in those file types
  • Skips determining which content must be transformed before the new system is activated
  • Ignores the budget implications of transformation
  • Assumes that 200 or 300 filters will do the job
  • Does not understand the implications behind a vendor’s statement along these lines: “Our engineers can create a custom filter for you if you don’t have time to do that scripting yourself.”
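
The first item on that list, the file-type audit, need not be elaborate. A minimal sketch: walk the content directories, tally extensions, and flag anything outside the filter list your vendor claims to support (the supported set below is illustrative, not any vendor’s actual list):

```python
import os
from collections import Counter

# Illustrative only: substitute the filter list your vendor actually ships.
SUPPORTED = {'.doc', '.docx', '.txt', '.rtf', '.html', '.pdf'}

def audit_file_types(root):
    """Tally file extensions under root and flag the ones the content
    processing system is unlikely to handle."""
    counts = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            counts[ext] += 1
    unsupported = {ext: n for ext, n in counts.items()
                   if ext not in SUPPORTED}
    return counts, unsupported
```

Run it against the marketing server before you click “Index,” and the 30,000-document surprise becomes a line item in the project plan instead of an 8 am discovery.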

One final point: those 200 or more file types. Vendors talk about them with gusto. Check to see if the vendor is licensing filters from a third party. In certain situations, the included file type filters don’t support some of the more recent applications’ file formats. Other vendors “roll their own” filters. But filters can vary in efficacy because different people write them at different times with different capabilities. Try as they might, vendors can’t squash some of the filter nits and bugs. When you do some investigating, you may be able to substantiate my data that suggest filters work on about two thirds of the files you feed into the search or content processing system. Your investigation may prove my data incorrect. No problem. When you are processing 250,000 documents, the exception file becomes chunky from the system’s two to three percent rejection rate. A thirty percent rate can be a show stopper.
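
To see why the rejection rate matters, run the numbers on that 250,000-document collection:

```python
# Exception counts at the rejection rates discussed above.
docs = 250_000

# A well-behaved system's two to three percent rejection rate:
print(int(docs * 0.02), int(docs * 0.03))  # 5000 7500

# Filters that work on only about two thirds of the files:
print(int(docs - docs * 2 / 3))            # roughly 83,000 exceptions
```

Five to seven thousand exceptions is a cleanup task; eighty-odd thousand is a second project.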

Stephen E. Arnold, January 25, 2008
