Google Sites, Publishing, and Search

March 1, 2008

In the Seattle airport, I fielded a telephone call on February 28, 2008, about Google Sites, the reincarnation of Joe Kraus’s JotSpot, which Google acquired in 2006.

The caller, whom I won’t name, wanted my view on Google Sites as a “SharePoint killer”. As you know, Microsoft SharePoint is a content management system, a search and retrieval engine, and a “hub” for other Microsoft servers. SharePoint is a digital Popeil Pocket Fisherman or Leatherman tool.

Google Sites is definitely not SharePoint, nor is it a SharePoint killer. SharePoint has upwards of 65 million users, and it is — whether the users like it or not — going to be with us for a long time. SharePoint is complex, requires care and feeding by Microsoft Certified Professionals, and requires a number of other Microsoft server products before it hums.

The person who called me wanted me to agree with the assertion that Google Sites is the stake through SharePoint’s bug-riddled heart. SharePoint and I have engaged in a number of alley fights, and I think SharePoint left me panting and bruised.

What is Google Sites? If you have read other essays on this “no news” Web log, you know that I try to look at issues critically and unencumbered by the “received wisdom” of the crowds of Internet pundits.

I included a chapter in my September 2007 study Google Version 2.0 that summarized a few of Google’s content-centric inventions. The JotSpot acquisition provided Google with software and engineers “up to speed” on a system and method for users to create structured information. The reference to structure means that content keyed into the JotSpot interface is tagged. Tagged information can be indexed, and the metadata allow the information to be sliced and diced. (Sliced and diced means manipulated programmatically.)

So, JotSpot is a component in a broader information initiative at Google. The JotSpot interfaces are fairly generic, and you can review them here. There’s an employee profile, a student club, and a team project. Availability is limited to users who sign up for Google Apps. You can read about these here.

What I want to do is direct your attention to this diagram, which I developed in 2005 for the Google Business Strategy seminars I gave between 2004 and 2006.

Publishing chain

Notice that this diagram doesn’t make any reference to the enterprise. The solid blue arrows indicate that Google has projects underway with these entities. Underway, as I use the word, means with or without the cooperation of the identified organizations. For example, Google is indexing US government content for its US government information search service. You can access this service here. The other light yellow boxes name Google services, including Google’s scanning and indexing services, among others.

The dotted line connecting Google to authors is the Google Sites’ function that I think is more important than SharePoint features, the well-known and often controversial deals for information, and the Google Base “upload” function.

I think Google Sites makes something significant possible. Let me emphasize that this is my opinion, and I have zero interaction with Google. Google ignores my requests for comments and information. So internalize that caveat before reading the next paragraph.

Google Sites makes it possible for Google to go directly to authors, have them enter their information into the Google Sites interface, and make that original, primary information available to Google users. With the flip of a bit, Google morphs into a publisher. Google Sites has the potential — if Google wishes to move in this direction — to disintermediate traditional information middle “men” (I’m not being sexist; I’m just using jargon, gentle readers).

Now let me tell you what I told the person who called me at 10 pm on Thursday, February 28, 2008, as I waited for a red eye to wing me back to the bunker in Harrod’s Creek. I said (and I’m paraphrasing):

“Google Sites may impinge on SharePoint over time. Google Sites may make Google Apps more appealing to enterprise customers. But I think the real significance of Google Sites is that Google is edging ever closer to getting authors to create information for Google. Google can index that content. Slice it. Dice it. Sell it. Authors have been getting the short end of the royalty and money sticks since Gutenberg. If Google meshes selling information via Google Checkout with Google advertising, Google can offer authors a reasonable percentage of the revenue from their work. In a flash, some authors would give Google a whirl. If the authors get reasonable money from their Google deal, it is the beginning of a nuclear winter for traditional publishers. I’m an author. I actually like my publishers Harry Collier, Tony Byrne, Tom Hogan, and Frank Gilbane. But if Google offered me a direct deal, I would take it in a heartbeat. This author wants money.”

My caller did not want to hear this. She works for a large, well known publisher. My take on Google Sites pushed her cherished SharePoint argument aside. My suggestion that Google Sites could generate money faster and with greater long-term impact than mud wrestling with Microsoft was one she had not considered.

I know from my work with traditional publishers that the majority of those business magnates don’t think Google could lure an author under contract without a great deal of work. I don’t agree. The traditional publishing industry is panting between rounds. Many of its digital swings are going wide of the mark.

Google Sites might be a painful blow, worsened as the publishing industry watches the authors swarm to the Google. Google has search and eyeballs. Google has ads and money. Google has original content and people who want math and health information in real time, not after a 12-month peer review process. Times are changing, and most traditional publishing operations are moving deck chairs on a fragile ocean liner of a business model. SharePoint might be collateral damage. The real targets are the aging vessels in the shipping lanes of traditional publishing.

Agree? Disagree? Let me know.

Stephen Arnold, March 1, 2008

Google Pressure Wave: Do the Big Boys Feel It?

February 25, 2008

In 2004, I began work on The Google Legacy: How Google’s Internet Search Is Transforming Application Software. The study grew from a series of research projects I did starting in 2002. My long-time colleague, friend, and publisher — Harry Collier, Infonortics Ltd. in Tetbury, Gloucestershire — suggested I gather together my various bits and pieces of information. We were not sure if a study going against the widely-held belief that Google was an online ad agency would find an audience.

The Google Legacy focused on Google’s digital “moving parts” — the sprockets and gears that operate out of sight for most. The study’s major finding was that Google set out to solve the common problems of performance and relevance in Web search. By design or happenstance, the “solution” was a next-generation application platform.

The emergence of this platform — what I called the Googleplex, a term borrowed from Google’s own jargon for its Mountain View headquarters — took years. Its main outlines were discernible by 2000. At the time of the initial public offering in 2004, today’s Googleplex was a reality. Work was not finished, of course, and probably never will be. The Googleplex is a digital organism, growing, learning, and morphing.

The hoo hah over Google’s unorthodox IPO, the swelling ad revenue, the secrecy of the company, and the alleged arrogance of Googlers (Google jargon for those good enough to become full-time employees) generated a smoke screen. Most analysts, pundits, and Google watchers saw the swirling fog, but when The Google Legacy appeared, few had tried to get a clearer view.

Google provided some tantalizing clues about what its plumbing was doing. Today, you can download Hadoop and experiment with an open source framework similar to Google’s combo of MapReduce and the Google File System. You can also experiment with Google’s “version” of MySQL. Of course, your tests only provide a partial glimpse of Google’s innards. You need the entire complement of Google software and hardware to replicate Google.
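
To make the MapReduce idea concrete, here is a minimal sketch of the map-and-reduce pattern in plain Python. This is my own illustration, not Google’s code and not the Hadoop API: a map step emits key-value pairs, and a reduce step aggregates them by key.

    from collections import defaultdict

    def map_phase(documents):
        """Map step: emit a (word, 1) pair for every word in every document."""
        for doc in documents:
            for word in doc.lower().split():
                yield word, 1

    def reduce_phase(pairs):
        """Reduce step: sum the counts emitted for each word."""
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count
        return dict(totals)

    docs = [
        "the google file system",
        "mapreduce simplified data processing on large clusters",
        "the google legacy",
    ]
    print(reduce_phase(map_phase(docs)))  # word counts, e.g. {'the': 2, 'google': 2, ...}

Hadoop runs the same pattern across many commodity machines; that distribution, not the word counting, is where the engineering lives.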

Google also makes available a number of technical papers, instructional videos, lectures, and code samples. A great place to start your learning about Google technical innovations is here. If you have an interest in the prose of folks more comfortable with flashy math, you can read Google technical papers here. And, if you want to dig even more deeply into Google’s mysteries, you can navigate to the US Patent & Trademark Office and read the more than 250 documents available here. The chipper green and yellow interface is a metaphor for the nausea you may experience when trying to get this clunky, Bronze Age search system to work. But when it does, you will be rewarded with a window into what makes the Google machine work its magic.

The Google Legacy remains for some an unnerving look at Google. Even today, almost three years after The Google Legacy appeared, many people still perceive Google as an undisciplined start up, little more than an ersatz college campus. The 20-somethings make money by selling online advertising. I remember reading somewhere that a Microsoft executive called the Google “a one-trick pony”.

You have to admit that, for a company that will be 10 years old in a few months, a canny thinker like Steve Ballmer has sized it up neatly. But why not ask this question: “Has Microsoft really understood Google?” And a larger, more interesting one: “Have such companies as IBM, Oracle, Reed Elsevier, and Goldman Sachs grasped Google in its entirety?”

In this essay, I want to explore this question. My method will be to touch upon some of the information my research uncovered in writing the aforementioned The Google Legacy and my September 2007 study, Google Version 2.0. Then I want to paraphrase a letter shared with me by a colleague. This letter was a very nice “Dear John” epistle. In colloquial terms, a very large technology company “blew off” my colleague because the large technology company believed it understood Google and didn’t need my colleague’s strategic counsel about Google’s enterprise software and service initiatives.

I want to close by considering one question, “If Microsoft is smart enough to generate more than $60 billion in revenue in 2007, why hasn’t Microsoft been clever enough to derail Google?” By extension, if Microsoft didn’t understand Google, can we be confident that other large companies have “figured out Google”?

Microsoft Should Stalk Other Prey, Says New York Times

Today is February 25, 2008, and there’s still a “cloud of unknowing” around Google. One Sunday headline, “Maybe Microsoft Should Stalk Different Prey”, caught my eye. The article here, penned by Randall Stross, includes this sentence:

Having exhausted its best ideas on how to deal with Google, Microsoft is now working its way down the list to dubious ones — like pursuing a hostile bid for Yahoo.

Now Microsoft has been scrutinizing Google for years. Microsoft has power, customers, and money. Microsoft has thousands of really smart people. Google — in strictly financial measures — is a dwarf to Microsoft’s steroid stallion. Yet here was the outstanding, smart reporter Randall Stross revealing that the mouse (Google) has frightened the elephant (Microsoft). Furthermore, the elephant can’t step on the mouse. The elephant cannot move around the mouse. The elephant has to do the equivalent of betting the house and the children’s college fund to have a chance to escape the mouse.

Microsoft seems to be faced with some stark choices.

SWAT

For me, the amusing part of this Sunday morning “revelation” is that by the time The Google Legacy appeared in 2005, Microsoft was between a rock and a hard place with regard to Google. One example from my 2005 study will help you understand my assertion.

Going Fast Cheaply

In the research for The Google Legacy, I read several dry technical papers about Google’s read speed on Google’s commodity storage devices. A “read speed” is a measure of how much data can be moved from a storage device to memory in one second. Your desktop computer can move megabytes a second pretty comfortably. To go faster, you need the type of engineering used to make a race car outperform your family sedan.
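
If you want to see what a read-speed number means on your own hardware, a rough measurement is easy to take. The sketch below is mine, in Python, and the file path is a placeholder; note that operating system caching can inflate the result on a second run.

    import time

    def read_speed_mb_per_s(path, chunk_size=1024 * 1024):
        """Time a sequential read of one file and return throughput in megabytes per second."""
        total_bytes = 0
        start = time.perf_counter()
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total_bytes += len(chunk)
        elapsed = time.perf_counter() - start
        return (total_bytes / (1024 * 1024)) / elapsed

    # Hypothetical example; point it at any large local file.
    # print(read_speed_mb_per_s("/tmp/big_test_file.bin"))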

These papers, still available at “Papers Written by Googlers”, included equations, data tables, and graphs. These showed how much data Google could “read” in a second. When I read these papers, I had just completed a test of read speed on what were, in 2004, reasonably zippy servers. These were IBM NetFinity 5500s. Each server had four gigabytes of random access memory, six internal high-speed SCSI drives, IBM Serveraid controllers with on-board caching, and an EXP3 storage unit holding 10 SCSI III drives. For $20,000, these puppies were fast, stable, and reliable. My testing revealed that a single NetFinity 5500 server could read 65 megabytes per second. I thought that this was good, not as fast as the Sun Microsystems’ fiber server we were testing but very good.

The Google papers reported that, using IDE drives identical to the ones available at the local Best Buy or Circuit City, Google engineers achieved read speeds of about 600 megabytes per second. Google was using off-the-shelf components, not exotic stuff like IBM Serveraid controllers, IBM-proprietary motherboards, IBM-certified drives, and even IBM FRU (field replaceable unit) cables. Google was using the low-cost stuff in my father’s PC.

A Google server comparable to my NetFinity 5500 cost about $600 in 2004. The data left me uncertain of my analysis. So, I had two of my engineers retrace my tests and calculations. No change. I was using a server that cost 33 times as much as Google’s test configuration server. I was running at one-tenth the read speed of Google’s server. One-tenth the speed and spending $19,400 more per server.

You don’t have to know too much about price-performance ratios to grasp the implications of these data. A Google competitor trying to match Google’s “speed” has to spend more money than Google. If Google builds a data center and spends $200 million, a competitor in 2004 using IBM or other branded server-grade hardware would have to spend orders of magnitude more to match Google performance.
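
Run the arithmetic on those figures and the gap is even starker when expressed as dollars per megabyte per second of read throughput:

    # Rough price-performance comparison using the figures cited above.
    netfinity_cost, netfinity_mb_s = 20000, 65   # IBM NetFinity 5500, my 2004 test
    google_cost, google_mb_s = 600, 600          # commodity Google-style box, per the papers

    print(netfinity_cost / netfinity_mb_s)       # about 308 dollars per MB/s
    print(google_cost / google_mb_s)             # 1 dollar per MB/s
    print((netfinity_cost / netfinity_mb_s) / (google_cost / google_mb_s))  # roughly a 300x gap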

The gap in 2004 and 2005, when my study The Google Legacy appeared, was so significant as to be unbelievable to those who did not trouble to look at the Google data.

This is just one example of what Google has done to have a competitive advantage. I document others in my 2007 study Google Version 2.0.

In the months after The Google Legacy appeared, it struck me that only Amazon of Google’s Web competitors seemed to have followed a Google-like technical path. Today, even though Amazon is using some pretty interesting engineering short cuts, Amazon is at least in the Google game. I’m still watching the Amazon S3 meltdown to see if Amazon’s engineers have what it takes to keep pace. Amazon’s technology and research budget is a pittance compared to Google’s. Is Amazon able to out-Google Google? It’s too early to call this horse race.

Do Other Companies See Google More Clearly than Microsoft Did?

Now, let me shift to the connection I made between Mr. Stross’s article and the letter I mentioned.

Some disclaimers. This confidential letter was not addressed to me. A colleague allowed me to read the letter. I cannot reveal the name of the letter’s author or the name of my colleague. The letter’s author is a senior executive at a major computer company. (No, the company is not Microsoft.)

My colleague proposed a strategy analysis of Google to this big company. The company sent my colleague a “go away” letter. What I remember are these points: [a] (I am paraphrasing based on my recollection) our company has a great relationship with Google, and we know what Google is doing because our Google contacts are up front with us. [b] Our engineers have analyzed Google technology and found that Google’s engineering poses no challenge to us. [c] Google and our engineers are working together on some projects, so we are in daily contact with Google. The letter concluded by saying (again I paraphrase), “Thanks, but we know Google. Google is not our competitor. Google is our friend. Get lost.”

I had heard similar statements from Microsoft. But when wrapping up Google Version 2.0, I spoke with representatives of Oracle and other companies. What did I hear? Same thing: Google is our partner. Even more interesting to me was that each of these insightful managers told me their companies had figured out Google.

How Did Microsoft Get It Wrong?

This raises the question, “How did Microsoft and its advisors not figure out Google?” Microsoft has known about Google for a decade. Microsoft has responded on occasion to Google’s hiring of certain Microsoft wizards like Kai Fu Lee. Microsoft made significant commitments to search and retrieval well before the Yahoo deal took shape. Microsoft has built an advanced research capability in information retrieval. Microsoft has invested in an advertising platform. Microsoft has redesigned Microsoft Network (MSN) a couple of times and created its own cloud-computing system for Live CRM, among other applications.

I don’t think Microsoft got it wrong. I think Microsoft looked at Google through the Microsoft “agenda”. Buttressed by the received wisdom about Google, Microsoft did not appreciate that Google’s competitive advantage in ads was deeply rooted in engineering, cost control, and its application platform. Executives in other sectors may want to step back and ask themselves, “Have we really figured out Google?”

Let’s consider Verizon’s perception of Google.

I want to close by reminding you of the flap over the FCC spectrum bid. The key development in that process was Verizon’s statement that it would become more open. Since the late 1970s, I have worked for a number of telcos, including the pre-break up AT&T, Bell Labs as a vendor, USWest before it became Qwest, and Bell Communications Research. When Verizon used the word open, I knew that Google had wrested an important concession from Verizon. Here’s Business Week’s take on this subject. At that moment, I lost interest in the outcome of the spectrum auction. Google got what it wanted: openness. Google’s nose was in Verizon’s tent. Oh, and Verizon executives told me that Google was not an issue for them as recently as June 2007.

What’s happening is far larger than Microsoft “with wobbly legs, scared witless”, to quote Mr. Stross. Microsoft, like Verizon, is another of the established, commercial industrial giants to feel Google’s pressure wave. Here’s a figure from The Google Legacy and the Bear Stearns report The Google Ecosystem to illustrate what Google’s approach was between 1998 and 2004. More current information appears in Google Version 2.0.

Pressure wave

You can figure out some suspects yourself. Let me give you a hint. Google is exerting pressure in its own Googley way on the enterprise market. Google is implementing thrusts using these pressure tactics in publishing, retail, banking, entertainment, and service infrastructure. Who are the top two or three leaders in each of these sectors? These are the organizations most likely to be caught in the Google pressure wave. Some will partner with Google. Others will flee. A few will fight. I do hope these firms know what capabilities Google can bring to bear on them.

The key to understanding Google is setting aside the Web search and ad razzle dazzle. The reality of Google lies in its engineering. Its key strength is its application of very clever math to make tough problems easy to resolve. Remember Google is not a start up. The company has been laboring for a decade to build today’s Google. It’s also instructive to reflect on what Google learned from the former AltaVista.com wizards who contributed much in the 1999 – 2004 period; many continue to fuel Google’s engineering prowess today. Even Google’s relationship with Xooglers (former employees who quit) extends Google’s pressure wave.

I agree that it is easy, obvious, and convenient to pigeon hole Google. Its PR focuses on gourmet lunches, Foosball, and the wacky antics of 20-year-old geniuses. Too few step back and realize that Google is a supra-national enterprise the likes of which has not been experienced for quite a while.

My mantra remains, “Surf on Google.” The alternative is putting the fate of your organization in front of Googzilla and waiting to see what happens. Surfing is a heck of a lot more fun than watching the tsunami rush toward you.

This two-and-a-half-year effort yielded two key findings:

  1. Google has morphed into a new type of global computing platform and services firm. The implications of this finding mystify wizards at some very large companies. The perception of Google as a Web search and online ad company is so strongly held that no other view of Google makes sense to these people.
  2. The application platform is more actively leveraged than most observers realize. Part of the problem is that Google is content to be “viral” and low key. Pundits see the FCC spectrum bid as a huge Google initiative. In reality, it’s just one of a number of equally significant Google thrusts. But pundits “see” the phone activities and make mobile the “next big thing” from Google.

Stephen Arnold, February 25, 2008

Search Is a Threat. You’ve Been Warned!

February 23, 2008

It’s Saturday, February 23, 2008. It’s cold. I’m on my way to the gym to ensure that my youthful figure defies time’s corrosive forces. I look at the headlines in my newsreader, and I am now breaking my vow of “No News. No, Really!”

Thomas Claburn, an Information Week journalist, penned a story with the headline, “Google-Powered Hacking Makes Search a Threat.” Read the story for yourself. Do you agree with the premise that information is bad when discoverable via a search engine?

With inputs from Cult of the Dead Cow and a nod to the Department of Homeland Security, the story flings buzzwords about security threats and offers some observations about “defending against search”. The article has a pyramid form, a super headline, quotes (lots of quotes), and some super tech references such as “the Goolag Scan”, among others. This is an outstanding example of technical journalism. I say, “Well done, sir.”

My thoughts are:

  • The fix for this problem of “bad” information is darn easy. Get one or two people to control information. The wrong sort of information can be blocked or the authors arrested. Plus, if a bad “data apple” slips through the homogenization process, we know with whom to discuss the gaffe.
  • The payoff of stopping “bad information” is huge. Without information, folks won’t know anything “bad”, so the truth of “If ignorance is bliss, hello, happy” is realized. Happy folks are more productive. Eliminating bad information boosts the economy.
  • The organizations and individuals responsible for “threats” can be stopped. Bad guys can’t harm the good guys. Good information, therefore, doesn’t get corroded by the bad information. No bad “digital apples” can spoil the barrel of data.

I’m no Jonathan Swift. I couldn’t edit a single Cervantes sentence. I am a lousy cynic. I do, however, have one nano-scale worry about a digital “iron maiden”. As you may know, the iron maiden was a way to punish bad guys. When tricked out with some inward-facing spikes, the bad guy was impaled. If the bad guy was unlucky, death was slow and agonizing, I assume. The iron maiden, I think, was a torture gizmo. Some historical details are murky, but I am not too keen on finding out via a demo in “iron” or in “digital” mode.

I think that trying to figure out what information is “good” and what information is “bad” is reasonably hard to do. Right now, I prefer systems that don’t try to tackle these particular types of predictive tasks for me. I will take my chances figuring out what’s “good” and what’s “bad”. I’m 64, and so far, so good.

In behind-the-firewall systems, determining what to make available and to whom is an essential exercise. An error can mean a perp walk in an orange suit for the CEO and a pack of vice presidents.

Duplicating this process on the Web is — shall we say — a big job. I’m going to the gym. This news stuff is depressing me.

Stephen Arnold, February 23, 2008

Context: Popular Term, Difficult Technical Challenge

February 13, 2008

In April 2008, I’m giving a talk at Information Today’s Buying & Selling Econtent conference.

When I am designated as a keynote speaker, I want to be thought provoking and well prepared. So I try to start thinking about the topic a month or more before the event. As I was ruminating about my topic, I was popping in and out of email. I was doing what some students of human behavior might call context shifting.

The idea is that I was doing one thing (thinking about a speech) and then turning my attention to email or a telephone call. When I worked at Booz, Allen, my boss described this behavior as multi-tasking, but I don’t think I was doing two or three things at once. He was, like Einstein, not really human. I’m just a guy from a small town in Illinois, trying to do one thing and not screw it up. So I was doing one thing at a time, just jumping from one work context to another. Normal behavior for me, but I know from observation that my 86-year-old father doesn’t handle this type of function as easily as I do. I also know that my son is more adept at context shifting than I am. Obviously it’s a skill that can deteriorate as one’s mental acuity declines.

What struck me this morning was that in the space of a half hour, one email, one telephone call, and one face-to-face meeting each used the word “context”. Perhaps the Nokia announcement and its use of the word context allowed me to group these different events. I think that may be a type of meta tagging, but more about that notion in a moment.

Context seemed to be a high-frequency term in the last 24 hours. I don’t need a Markov procedure to flag the term. The Google Trends report seems to suggest that context has been in a slow decline since the fourth quarter of 2004. Maybe so, but “context” was le mot du jour for me.

What’s Context in Search?

In my insular world, most of the buzzwords I hear pertain to search and retrieval, text processing, and online. After thinking about the word context, I jotted down the different meanings the word had in each of the communications I noticed.

The first use of context referenced the term as I defined it in my 2007 contributions to the Bear Stearns analyst note, “Google and the Semantic Web.” I can’t provide a link to this document. You will have to chase down your local Bear Stearns broker to get a copy. This report describes the inventions of Ramanathan Guha. The PSE, or Programmable Search Engine, discerns and captures context for a user’s query, the information satisfying that query, and other data that provide clues to interpret a particular situation.

The second use of context was as a synonym for personalization. The idea is that a user profile provides useful information about the meaning of a query. Suppose a user looks for consumer information about gasoline mileage. When the system “knows” this fact, a subsequent query for “green fuel” is processed in the context of an automobile. In this case, “green” means environmentally friendly. Personalization makes it possible to predict a user’s likely context based on search history and implicit or explicit profile data.

The third use of context came up in a discussion about key word search. My colleague made the point that most search engines are “pretty dumb.” “The key words entered in a search box have no context,” he opined. The search engine, therefore, has to deliver the most likely match based on whatever data are available to the query processor. A Web search engine gives you a popular result for many queries. Type Spears into Google and you get pop star hits and few manufacturing and weapon hits.

When a search engine “knows” something about a user — for example, search history, factual information provided when the user registered for a free service, or the implicit or explicit information a search system gathers from users — search results can be made more on point. The idea is that the relevance of the hits matches the user’s needs. The more the system knows about a user and his context, the more relevant the results can be.
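
Here is a minimal sketch of the idea, assuming a user profile reduced to a set of topic tags inferred from search history. The scoring rule and the data are mine, invented purely for illustration; real systems use far richer signals.

    def rerank(results, user_interests):
        """Boost results whose topic tags overlap the user's inferred interests."""
        def score(result):
            overlap = len(result["tags"] & user_interests)
            return result["base_score"] + 0.6 * overlap
        return sorted(results, key=score, reverse=True)

    results = [
        {"title": "Pop star tour dates", "tags": {"music", "celebrity"}, "base_score": 0.9},
        {"title": "Forged spears and pole arms", "tags": {"weapons", "manufacturing"}, "base_score": 0.4},
    ]

    # A user whose history suggests an interest in manufacturing sees the second result first.
    print(rerank(results, {"manufacturing", "metalwork"})[0]["title"])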

Sometimes the word context, when used in reference to search and retrieval, means “popping up a level” in order to understand the bigger picture for the user. Context, therefore, makes it possible to “know” that a user is moving toward the airport (geo spatial input), has a history of looking at flight departure information (user search history), and is making numerous data entry errors (implicit monitoring of user misspellings or query restarts). These items of information can be used to shape a results set. In a more extreme application, these context data can be used to launch a query and “push” the information to the user’s mobile device. This is the “search without search” function I discussed in my May 2007 iBreakfast briefing, which — alas! — is not available online at this time.

Is Context Functionality Ubiquitous Today?

Yes, there are many online services that make use of context functions, systems, and methods today.

Even though context systems and methods add extra computational cycles, many companies are knee deep in context and its use. I think the low profile of context functions may be due, in part, to privacy issues becoming the target of a media blitz. In my experience, most users accept implicit monitoring if they perceive that their identity is neither tracked nor used. The more fuzzification — that is, statistical blurring — of a single user’s identity, the less anxiety the user feels about implicit tracking used to make results more relevant. Other vendors have not figured out how to add additional computational loads to their systems without introducing unacceptable latency, and these vendors offer dribs and drabs of context functionality. As their infrastructure becomes more robust, look for more context services.

The company making good use of personalization-centric context is Yahoo. Its personalized MyYahoo service delivers news and information selected by the user. Then there is Yahoo’s forthcoming OneConnect, announced this week at the telco conference in Barcelona, Spain. Based on the news reports I have seen, Yahoo wants to extend its personalization services to mobile devices.

Although Yahoo doesn’t talk about context, a user who logs in with a Yahoo ID will be “known” to some degree by Yahoo. The user’s mobile experience, therefore, has more context than a user not “known” to Yahoo. Yahoo’s OneConnect is a single example of context that helps an online service customize information services. Viewed from a privacy advocate’s point of view, this type of context is an intrusion, perhaps unwelcome. However, from the vantage point of a mobile device user rushing to the airport, Yahoo’s ability to “know” more about the user’s context can allow more customized information displays. Flight departure information, parking lot availability, or weather information can be “pushed” to the Yahoo user’s mobile device without the user having to push buttons or make finger gestures.

Context, when used in conjunction with search, refers to additional information about [a] a particular user or group of users identified as belonging to a cluster of users, [b] information and data in the system, [c] data about system processes, and [d] information available to Yahoo though not residing on its servers.

Yahoo and T-Mobile are not alone in their interest in this type of context sensitive search. Geo spatial functions are potential enablers of news services and targeted advertising revenue. Google and Nokia seem to be moving on a similar vector. Microsoft has a keen awareness of context and its usefulness in search, personalization, and advertising.

Context has become a key part of reducing what I call the “shackles of the search box.” Thumb typing is okay but it’s much more useful to have a device that anticipates, personalizes, and contextualizes information and services. If I’m on my way to the airport, the mobile device should be able to “know” what I will need. I know that I am a creature of habit as you probably are with regard to certain behaviors.

Context allows disambiguation. Disambiguation means figuring out which of two or more possibilities is the “right” one. A good example comes up dozens of times a day. You are in line to buy a bagel. The clerk asks you, “What kind of bagel?” with a very heavy accent, speaking rapidly and softly. You know you want a plain bagel. Without hesitation, you are able to disambiguate what the clerk uttered and reply, “Plain, please.”

Humans disambiguate in most social settings, when reading, when watching the boob tube, or just figuring out weird road signs glimpsed at 60 miles per hour. Software doesn’t have the wetware humans have. Disambiguation in search and retrieval systems is a much more complex problem than looking up string matches in an index.

Context is one of the keys to figuring out what a person means or wants. If you know a certain person looks at news about Kolmogorov axioms, next-generation search systems should know that if the user types “Plank”, that user wants information about Max Planck, even though the intrepid user mistyped the name. Google seems to be pushing forward to use this type of context information to minimize the thumb typing that plagues many mobile device users today.
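
A toy version of that correction step, assuming the system keeps a small vocabulary of names drawn from the user’s reading history (the list below is hypothetical), might look like this:

    import difflib

    # Hypothetical vocabulary harvested from this user's reading history.
    user_vocabulary = ["planck", "kolmogorov", "markov", "bayes", "hilbert"]

    def correct(query_term, vocabulary):
        """Return the closest vocabulary term, or the term unchanged if nothing is close."""
        matches = difflib.get_close_matches(query_term.lower(), vocabulary, n=1, cutoff=0.8)
        return matches[0] if matches else query_term

    print(correct("Plank", user_vocabulary))  # -> "planck"
    print(correct("zebra", user_vocabulary))  # -> "zebra" (no close match, left alone)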

These types of context awareness seem within reach. Though complex, many companies have technologies, systems, and methods to deliver what I call basic context metadata. Let me note that context-aware services are in wide use, but rarely labeled as “context” functions. The problem with naming is endemic in search, but you can explore some of these services at their sites. You may have to register and provide some information to take advantage of the features:

  • Google ig (Individualized Google) — Personalized start page, automatic identification of possibly relevant information based on your search history, and tools for you to customize the information
  • Yahoo MyYahoo — content customization, email previews, and likely integration with the forthcoming OneConnect service
  • MyWay — IAC’s personalized start page. One can argue that IAC’s implementation is easier to use than Yahoo’s and more graphically adept than Google’s ig service.

If you are younger than I or young at heart, you will be familiar with the legions of Web 2.0 personalization services. These range from RSS (really simple syndication) feeds that you set up to NetVibes, among hundreds of other mashy, nifty, sticky services. You can explore the most interesting of these services at Tech Crunch. It’s useful to click through the Tech Crunch Top 40 here. I have set up a custom profile on Daily Rotation, a very useful service for people in the information technology market.

An Even Tougher Context Challenge

As interesting and useful as voice disambiguation and automatic adjustment of search results are, I think there is a more significant context issue. At this time, only a handful of researchers are working on this problem. It probably won’t surprise you that my research has identified Google as the leader in what I call “meta-context systems and methods.”

The term meta refers to “information about” a person, process, datum, or other information. The term has drifted a long way from its Latin meaning of a turn in a hippodrome; for example, meta prima was the first turn. Mathematicians and scientists use the term to mean related to or based upon. When a vendor talks about indexing, the term metadata is used to mean those tags or terms assigned to an information object by an automated indexing system or a human subject matter expert who assigns index terms.

The term is also stretched to reference higher levels in nested sets. So, when an index term applies to other index terms, that broader index term performs a meta-index function. For example, if you have an index of documents on your hard drive, you can index a group of documents about a new proposal as “USDA Proposal.” The term does not appear in any of the documents on your hard drive. You have created a meta-index term to refer to a grouping of information. You can create meta-indexes automatically. Most people don’t think of naming a folder or creating a new directory as assigning an index term, but that is what it is. Software that performs automatic indexing can assign these meta-index terms. Automatic classification systems can perform this function. I discuss the different approaches in Beyond Search, and I won’t rehash that information in this essay.
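
In code, a meta-index is nothing more exotic than a label pointing at a group of documents whose text never contains the label. The file names and the term below are hypothetical, my own sketch of the idea:

    # A meta-index term groups documents even though the term appears in none of them.
    meta_index = {
        "USDA Proposal": ["budget_draft.xls", "staffing_plan.doc", "cover_letter.doc"],
    }

    full_text_index = {
        "budget_draft.xls": {"cost", "tables", "overhead", "rates"},
        "staffing_plan.doc": {"project", "roles", "hours"},
        "cover_letter.doc": {"transmittal", "letter", "contracting", "officer"},
    }

    def search(term):
        """Check the meta-index first, then fall back to the full-text index."""
        if term in meta_index:
            return meta_index[term]
        return [doc for doc, words in full_text_index.items() if term.lower() in words]

    print(search("USDA Proposal"))  # all three documents, via the meta-index
    print(search("overhead"))       # only budget_draft.xls, via full text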

The “real context challenge” then is to create a meta context for available context data. Recognize that context data is itself a higher level of abstraction than a key word index. So we are now talking about taking multiple contexts, probably from multiple systems, and creating a way to use these abstractions in an informed way.

You, like me, get a headache when thinking about these Russian doll structures. Matryoshka (матрёшка) dolls are made of wood or plastic. When you open one doll, you see another inside. You open each doll and find increasingly small dolls inside the largest doll. The Russian doll metaphor is a useful one. Each meta-context refers to the larger doll containing smaller dolls. The type of meta-context challenge I perceive is finding a way to deal with multiple matryoshkas, each containing smaller dolls. What we need, then, is a digital basket into which we can put our matryoshkas. A single item of context data is useful, but having access to multiple items and multiple context containers opens up some interesting possibilities.
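
As a data structure, the “digital basket” is simply a container of containers. The sketch below is my own illustration; the keys and values are invented to show the shape of the problem, not any vendor’s schema:

    # Each "matryoshka" is one context container produced by a different system.
    device_context = {"location": "en route to the airport", "input": "thumb keyboard"}
    history_context = {"recent_queries": ["flight status", "parking"], "topic": "travel"}
    calendar_context = {"next_event": "flight departing 11:40 pm"}

    def context_basket(*containers):
        """Fold several context containers into one nested structure for downstream use."""
        return {"contexts": list(containers)}

    basket = context_basket(device_context, history_context, calendar_context)

    # A downstream service can now reason across containers, e.g. infer "push flight status".
    print([c.get("topic") or c.get("next_event") or c.get("location") for c in basket["contexts"]])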

In Beyond Search, I describe one interesting initiative at Google. In 2006, Google acquired a small company that specialized in systems and methods for manipulating these types of information context abstractions. There is interesting research into this meta context challenge underway at the University of Wisconsin — Madison as well as at other universities in the U.S. and elsewhere.

Progress in context is taking place at two levels. At the lowest level, commercial services are starting to implement context functions in their products and services. Mobile telephony is one obvious application, and I think the game of musical chairs involving Google, Yahoo, and their respective mobile partners is an indication that jockeying is underway. Also at this lowest level are the Web 2.0 and various personalization services that are widely available on Web sites or in commercial software bundles. In the middle, there is not much high-profile activity, but that will change as entrepreneurs sniff the big payoffs in context tools, applications, and services. The most intense activity is taking place out of sight of most journalists and analysts. Google, one of the leaders in this technology space, provides almost zero information about its activities. Even researchers at major universities keep a low profile.

That’s going to change. Context systems and methods may open new types of information utility. In my April 2008 talk, I will provide more information about context and its potential for igniting new products, services, features, and functions for information-centric activities.

Stephen Arnold, February 13, 2008

Trapped by a Business Model, Not Technology

February 12, 2008

The headline “Reuters CEO sees Semantic Web in its Future” triggered an immediate mouse click. The story appeared on O’Reilly’s highly regarded Radar Web log.

Tim O’Reilly, who wrote the article, noted: “Adding metadata to make that job of analysis easier for those building additional value on top of your product is a really interesting way to view the publishing opportunity.”

Mr. O’Reilly noted that: “I don’t think he [Devin Wenig, a Reuters executive] should discount the statistical, computer-aided curation that has proven so powerful on the consumer Internet.”

Hassles I’ve Encountered

The Reuters comment about the Semantic Web did underscore the often poor indexing done by publishing and broadcasting companies. In my experience, I have had to pay for content that needed considerable post-processing and massaging.

For example, if you license a news feed from one of the commercial vendors, some of the feeds will:

  • Send multiple versions of the stories “down the wire”, often with tags that make it difficult to determine which is the more accurate version. Scripts can delete previous versions, but errors can occur, and when noticed, some have to be corrected by manual inspection of the feed data.
  • Deliver duplicate versions of the same story because the news feed aggregator does not de-duplicate variants of the story from different sources (see the sketch after this list). Some systems handle de-duplication gracefully and efficiently. Examples that come to mind are Google News and Vivisimo. Yahoo’s approach with tabs to different news services is workable as well, but it is not “news at a glance”. Yahoo imposes additional clicking on me.
  • Insert NewsXML plus additional tags without alerting downstream subscribers. When this happens, the scripts can crash or skip certain content. The news feed services try to notify subscribers about changes, but in my experience there are many “slips betwixt cup and lip.”
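
For the de-duplication point above, here is a bare-bones sketch of the kind of near-duplicate collapsing an aggregator has to do. The headlines and the similarity threshold are my own, purely illustrative; production systems use shingling, clustering, and source metadata rather than a single ratio.

    import difflib

    # Hypothetical incoming wire items; in practice these arrive from several feeds.
    stories = [
        "Acme Corp. reports record quarterly earnings",
        "Acme Corp reports record quarterly earnings, beats estimates",
        "Storm delays flights across the Midwest",
    ]

    def dedupe(items, threshold=0.8):
        """Keep the first story in each group of near-identical headlines."""
        kept = []
        for item in items:
            is_duplicate = any(
                difflib.SequenceMatcher(None, item.lower(), seen.lower()).ratio() >= threshold
                for seen in kept
            )
            if not is_duplicate:
                kept.append(item)
        return kept

    print(dedupe(stories))  # the two Acme variants collapse to one story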

Now the traditional powerhouses in the news business face formidable competition on multiple fronts. There are Web logs. There are government “news” services, including the remarkably productive US Department of State, the largely unknown Federal News Service, and the often useful Government Printing Office listserv. There are news services operated by trade associations. These range from the American Dental Association to the Welding Technology Institute of Australia. Most of these organizations are now Internet savvy. Many use Web logs, ping servers, and RSS (really simple syndication) to get information to constituents, users, and news robots. Podcasts are just another medium for grass roots publishers to use at low cost or no cost.

We are awash in news — text, audio, and video.

Balancing Three Balls

Traditional publishers and broadcasters, therefore, are trying to accomplish three goals at the same time. I recall from a lecture that the legendary president of General Motors, Alfred P. Sloan (1875 – 1966) is alleged to have said: “Two objectives is no objective.” Nevertheless, publishers like Reuters and its soon-to-be owner are trying to balance three balls on top of one another:

First, maintain existing revenues in the face of the competition from governments, associations, individual Web log operators, and ad-supported or free Internet services.

Second, create new products and services that generate new revenue. The new revenue must not cannibalize any traditional revenue.

Third, give the impression of being “with it” and on the cutting edge of innovation. This is more difficult than it seems, and it leads to some executives’ talking about an innovation that is no longer news. Could I interpret the Reuters comment as an example of faux hipness?

Publishers can indeed leverage the Semantic Web. There’s a published standard. Commercial systems are widely available to perform content transformation and metatagging; for example, in Beyond Search I profile two dozen companies offering different bundles of the needed technology. Some of these are well known (IBM, Microsoft); others are less well known (Bitext, Thetus). And as pre-historic as it may seem to some publishing and broadcast executives, even skilled humans are available to perform some tasks. As good as today’s semantic systems are, humans are sometimes needed to do the knowledge work required to make content more easily sliced and diced, post-processed, and “understood”.

It’s Not a Technology Problem

The fact is that traditional publishers and broadcasters have been slow to grasp that their challenge is their business model, not technology. No publisher has to be “with it” or be able to exchange tech-geek chatter with a Google, Microsoft, or Yahoo wizard.

Nope.

What’s needed is a hard look at the business models in use at most of the traditional publishing companies, including Reuters and the other companies with roots in professional publishing, trade publishing, newspaper publishing, and magazine publishing. While I’m making a list, I want to include radio, television, and cable broadcasting companies as well.

These organizations have engineers who know what the emerging technologies are. There may be some experiments that are underway and yielding useful insights into how traditional publishing companies can generate new revenues.

The problem is that the old business models generate predictable revenue. Even if that revenue is softening or declining, most publishing executives understand the physics of their traditional business model. Newspapers sell advertising. Advertisers pay to reach the readers. Readers pay a subscription to get the newspaper with the ads and a “news hole”. Magazine publishers either rely on controlled circulation to sell ads or on a variant of the newspaper model. Radio and other broadcast outlets sell air time to advertisers.

These business models are deeply ingrained, have many bells and whistles, and deliver revenue reasonably well in today’s market. The problem is that the revenue efficiency in many publishing sectors is softening.

Now the publishers want to generate new revenues while preserving their traditional business models, and the executives don’t want to cannibalize existing revenues. Predictably, the cycle repeats itself. How hard is it to break the business model handcuffs of traditional publishing? Rupert Murdoch has pulled in his horns at the Wall Street Journal. Not even he can get free of the business model shackles that are confining once powerful organizations and making them sitting ducks for competitive predators.

Semantic Web — okay. I agree it’s hot. I am just finishing a 250-page look at some of the companies doing semantics now. A handful of these companies are almost a decade old. Some, like IBM, were around when Albert Einstein was wandering around Princeton in his house slippers.

I hope Reuters “goes semantic”. With the core business embedded in numeric data, I think the “semantic” push will be more useful when Reuters’ customers have the systems and methods in place to make use of richer metatagging. The Thomson Corporation has been working for a decade or more to make its content “smarter”; that is, better indexing, automated repurposing of content, and making it possible for a person in one of Thomson’s more than 100 units to find out what another person in another unit wrote about the same topic. Other publishers are genuinely confused and understandably uncertain about the Internet as an application platform. Buggy whip manufacturers could not make the shift to automotive seat covers more than 100 years ago. Publishers and broadcasters face the same challenge.

Semantic technology may well be more useful inside a major publishing or broadcasting company initially. In my experience, most of these operations have data in different formats, systems, and data models. It will be tough to go “semantic” until the existing data can be normalized and then refreshed in near real time. Long updates are not acceptable in the news business. Wait too long, and you end up with a historical archive.

Wrap Up

To conclude, I think that new services such as The Issue, the integration of local results into Google News, and the wide range of tools that allow anyone to create a personalized news feed are going to make life very, very difficult for traditional publishers. Furthermore, most traditional publishing and broadcast companies have yet to understand the differences between TV and cable programming and what I call “YouTube” programming.

Until publishing finds a way to get free of its business model “prison”, technology — trendy or not — will not be able to work revenue miracles.

Update February 13, 2008, 8:34 am Eastern — Useful case example about traditional publishing and new media. The key point is that the local newspaper is watching the upstart without knowing how to respond. Source: Howard Downs.

Stephen Arnold, February 12, 2008

Is the Death Knell for SEO Going to Sound?

February 9, 2008

Not long ago, a small company wondered why its Web site was the Avis to its competitor’s Hertz. The company’s president checked Google each day, running a query to find out if the rankings had changed.

I had an opportunity to talk with several of the people at this small company. The firm’s sales did not come from the Web site. Referrals had become the most important source of new business. The Web site was — in a sense — ego-ware.

I shared some basic information about Google’s Web master guidelines, a site map, and error-free code. These suggestions were met with what I would describe as “grim acceptance.” The mechanics of getting a Web site squared away were work but not unwelcome. My comments articulated what the Web team already knew.

The second part of the meeting focused on the “real” subject. The Web team wanted the Web site to be number one. I thanked the Web team and said, “I will send you the names of some experts who can assist you.” SEO work is not my cup of tea.

Then, yesterday, as Yogi Berra allegedly said, “It was déjà vu all over again.” Another local company found my name and arranged a meeting. Same script, different actors.

“We need to improve our Google ranking,” the Web master said. I probed and learned that the company’s business came within a 25 mile radius of the company’s office. Google and other search engines listed the firm’s Web site deep in the results lists.

I replayed the MP3 in my head about clean code, sitemaps, etc. I politely told the local Web team that I would email them the names of some SEO experts. SEO is definitely an issue. Is the worsening economy the reason?

Here’s a summary of my thinking about these two opportunities for me to bill some time, make some money:

  1. Firms want to be number one on Google and somehow have concluded that SEO tactics can do the trick.
  2. There is little resistance to mechanical fixes, but there is little enthusiasm for adding substantive content to a Web site.
  3. In the last year, interest in getting a Web site to the top of Live.com or Yahoo.com has declined, based on my observations.

Content, the backbone of a Web site, is important to site visitors. When I do a Web search, I want links to sites that have information germane to my query. Term stuffing, ripped-off content, and other “tricks” don’t endear certain sites to me.

I went in search of sources and inspiration for ranking short cuts. Let me share with you some of my more interesting findings:

You get the idea. There are some amazing assertions about getting a particular Web site to the top of the Google results list. Several observations may not be warranted, but here goes:

First, writing, even planning, high-impact, useful content is difficult. I’m not sure if it is a desire for a short cut, a lack of confidence, laziness, or inadequate training. There’s a content block in some organizations, so SEO is the way to solve the problem.

Second, a Web site can fulfill any need its owner may have. The problem is that certain types of businesses will have a heck of a time appearing at the top of a results list for a general topic. Successful, confident people expect a Web indexing system to fall prey to their charms as their clients do. Chasing “number one on Google” can be expensive and a waste of time. There are many “experts” eager to help make a Web site number one. But I don’t think the results will be worth the cost.

Third, there are several stress points in Web indexing. The emergence of dynamic sites that basic crawlers cannot index is a growing trend. Some organizations may not be aware that their content management system (CMS) generates pages that are difficult, if not impossible, for a Web spider to copy and crunch. Google’s programmable search engine is one response, and it has the potential to alter the relevance landscape if Google deploys the technology. The gold mine that SEO mavens have discovered guarantees that baloney sites will continue to plague me. Ads are sufficiently annoying. Now more and more sites in my results list are essentially valueless in terms of substantive content.

The editorial policy for most of the Web sites I visit is non-existent. The Web master wants a high ranking. The staff is eager to do mechanical fixes. Recycling content is easier than creating solid information.

The quick road to a high ranking falls off a cliff when a search system begins to slice and dice content, assign “quality” scores to the information, and build high-impact content pages. Doubt me? Take a look at this Google patent application, US20070198481, and let me know what you think.

Stephen Arnold, February 9, 2008

Taxonomy: Search’s Hula-Hoop®

February 8, 2008

I received several thoughtful comments on my Beyond Search Web log from well-known search and content processing experts (not the search engine optimization type or the MBA analyst species). These comments addressed the topic of taxonomies. One senior manager at a leading search and content processing firm referenced David Weinberger’s quite good book, Everything is Miscellaneous. My copy has gone missing, so join me in ordering a new one from Amazon. Taxonomy and taxonomies have attained fad status in behind-the-firewall search and content processing. Every vendor has to support taxonomies. Every licensee wants to “have” a taxonomy.

Oracle Pressroom screen shot, February 2008

This is a screen shot of the Oracle Pressroom. Notice that a “taxonomy” is used to present information by category. The center panel presents hot links by topics with the number of documents shown for each category. The outside column features a tag cloud.

A “taxonomy” is a classification of things. Let me narrow my focus to behind-the-firewall content processing. In an organization, a taxonomy provides a conceptual framework that can be used to organize the organization’s information. Synonyms for taxonomy include classification, categorization, ontology, typing, and grouping. Each of these terms can be used with broader or narrower meanings, but for my purpose, we will assume each can be used interchangeably. Most vendors and consultants toss these terms around as interchangeable Lego blocks in my experience.
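
If it helps to see one, a taxonomy is just a tree of headings and sub-headings with documents hung on the nodes. The categories and file names below are my own illustrative choices, not anyone’s shipping product:

    # A minimal taxonomy: a tree of categories with documents attached to the leaves.
    taxonomy = {
        "Finance": {
            "Accounts Payable": ["invoice_0142.pdf"],
            "Budgets": ["fy08_budget.xls"],
        },
        "Legal": {
            "Contracts": ["vendor_msa.doc"],
        },
    }

    def browse(tree, depth=0):
        """Print the heading / sub-heading / document view a user would click through."""
        for node, children in tree.items():
            if isinstance(children, dict):
                print("  " * depth + node)
                browse(children, depth + 1)
            else:
                print("  " * depth + node + ": " + ", ".join(children))

    browse(taxonomy)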

A fad, as you know, is an interest that is followed for some period of time with intense enthusiasm. Think Elvis, bell bottoms, and speaking Starbucks coffee language.

A Small Acorn

A few years ago, a consultant approached me to write about indexing content inside an organization. This individual had embarked on a consulting career and needed information for her Web site. I dipped into my files, collected some useful information about the challenges corporate jargon presented, and added some definitions of search-related terms.

I did work for hire, so my client could reuse the information to suit specific needs. Imagine my pleasant surprise when I found my information recycled multiple times and used to justify a custom taxonomy for an enterprise. I was pleased to have become a catalyst for a boom in taxonomy seminars, newsgroups, and consulting businesses. One remarkable irony was that a person who had recycled the information I sold to consultant A thousands of miles away turned up as consultant B at a company in which I was an investor. I sat in a meeting and heard my own information delivered back to me as a way to orient me about classifying an organization’s information.

Big Oak

A taxonomy revolution had taken place, and I was only partially aware. A new industry had taken root, flowered, and spread like kudzu around me.

The interest in taxonomies continues to grow. After completing the descriptions of companies offering what I call rich content processing, I can report that organizations looking for taxonomy-centric systems have many choices. Of the 24 companies profiled in the Beyond Search study, all 24 “do” taxonomies. Obviously there are greater and lesser degrees of stringency. One company has a system that supports American National Standards Institute guidelines for controlled terms and taxonomies. Other companies “discover” categories on the fly. Between these two extremes there are numerous variations. One conclusion I drew after this exhausting analysis is that it is difficult to locate a system that can’t “do” taxonomies.

What’s Behind the Fad?

Let me consider briefly a question that I don’t tackle in Beyond Search: “Why the white-hot interest in taxonomies?”

Taxonomies have a long and distinguished history in library science, philosophy, and epistemology. For those of you who are a bit rusty, “epistemology” is the theory of knowledge. Taxonomies require a grasp, no matter how weak, of knowledge. No matter how clever, a person creating a taxonomy must figure out how to organize email, proposals, legal documents, and the other effluvia of organizational existence.

I think people have enough experience with key word search to realize its strengths and limitations. Key words — either controlled terms or free text — work wonderfully when I know what’s in an electronic collection, and I know the jargon or “secret words” to use to get the information I need.

Boolean logic (implicit or explicit) is not too useful when one is trying to find information in a typical corpus today. There’s no editorial policy at work. Anything the indexing subsystem is fed is tossed into an inverted index. This is the “miscellaneous” in David Weinberger’s book.

A taxonomy becomes a way to index content so the user can look at a series of headings and subheadings. A series of headings and sub-headings makes it possible to see the forest, not the trees. Clever systems can take the category tags and marry them to a graphical interface. With hyperlinks, it is possible to follow one’s nose — what some vendors call exploratory search or search by serendipity.
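
Here is a toy sketch, in Python, of the difference between a bare inverted index and category-tagged browsing. The documents and tags are invented; real systems are vastly more sophisticated.

    from collections import defaultdict

    # Invented documents: (text, category tags assigned from a taxonomy).
    docs = {
        1: ("Q3 market share analysis", ["Marketing"]),
        2: ("email retention policy memo", ["Legal"]),
        3: ("focus group findings for the new product", ["Marketing", "Market research"]),
    }

    inverted = defaultdict(set)      # term -> document ids ("miscellaneous")
    by_category = defaultdict(set)   # category tag -> document ids (the forest view)

    for doc_id, (text, categories) in docs.items():
        for term in text.lower().split():
            inverted[term].add(doc_id)
        for category in categories:
            by_category[category].add(doc_id)

    print(sorted(inverted["market"]))         # keyword search needs the right secret word
    print(sorted(by_category["Marketing"]))   # browsing follows headings instead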

Taxonomy Benefits

A taxonomy, when properly implemented, yields payoffs:

First, users like to point-and-click to discover information without having to craft a query. Believe me, most busy people in an organization don’t like trying to outfox the search box.

Second, the categories — even when hidden behind a naked search box interface — are intuitively obvious to a user. An accountant may (as I have seen) enter the term finance and then point-and-click through results. When I ask users if they know specific taxonomy terms, I hear, “What’s a taxonomy?” Intuitive search techniques should be a part of behind-the-firewall search and content processing systems.

Third, management is willing to invest in fine-tuning a taxonomy. Unlike a controlled vocabulary, a suggestion to add categories meets with surprisingly little resistance. I think the intuitive usefulness of cataloging and categorizing is obvious even to people who have others do their searching for them.

Some Pitfalls

There are some pitfalls in the taxonomy game. The standard warnings apply: “Don’t expect miracles when you categorize modest volumes of content” and “Be prepared for some meetings that feel more like a graduate class in logic than a working session on delivering what the marketing department needs in a search system.”

On the whole, the investment in a system that automatically indexes is a wise one. It becomes ever wiser when the system can use knowledge bases, word lists, taxonomies, and other information inputs to index more accurately.
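
As a rough sketch of what I mean, consider how even a small “use for” word list can sharpen automatic indexing. The term list below is invented for illustration, not taken from any product.

    # Map variant phrases to preferred index terms (a tiny knowledge base).
    use_for = {
        "share of market": "market shares",
        "market share": "market shares",
        "esop": "employee stock ownership plans",
    }

    def auto_index(text, term_map):
        """Assign preferred terms when a known variant appears in the text."""
        lowered = text.lower()
        return sorted({preferred for variant, preferred in term_map.items()
                       if variant in lowered})

    print(auto_index("Our share of market grew after the ESOP announcement.", use_for))
    # ['employee stock ownership plans', 'market shares']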

Keep in mind that “smart” systems can be right most of the time and then without warning run into a ditch. At some point, you will have to hunker down and do the hard thinking that a useful taxonomy requires. If you are not sure how to proceed, try to get your hands on the taxonomies that were once available from Convera. Oracle at one time offered vertical term lists. You can also Google for taxonomies. A little work will turn up some useful examples.

To wrap up, I am delighted that so many individuals and organizations have an interest in taxonomies — whether a fad or something epistemologically more satisfying. The content processing industry is maturing. If you want to see a taxonomy in action, check out:

HMV, powered by Dieselpoint

Oracle’s Pressroom, powered by Siderean Software’s system

US government portal powered by Vivisimo (Microsoft)

Stephen Arnold, February 8, 2008

Simple Math = Big Challenge: MSFT & YHOO

February 4, 2008

I have only a few sections of Beyond Search to wrap up. Instead of being able to think about updating my description of Access Innovations’ MAIstro, I am distracted by the jibber jabber about the Microsoft (NSDQ:MSFT) Yahoo (NSDQ:YHOO) tie up.

Where We Are

First, it’s an offer, isn’t it? Maybe a trial balloon? No cash or stock has changed hands as I write this in the wee hours of Monday, February 4, 2008. Yet many are in a frenzy over a hostile takeover. Think about this word “hostile.” It means antagonistic, unfriendly, adversarial. The reason for the bold move? Google, a company that has outfoxed Microserfs and Yahooligans for almost a decade.

The number of articles in my various alerts, RSS feeds, and emails is remarkable. Worldwide, a Microsoft – Yahoo marriage (even if it is helped along with a shotgun) ignites folks’ imagination. Neither Microsoft nor Yahoo will be able to recruit tech wizards, one pundit asserts. Innovation in Silicon Valley will be forever changed, posits another. Sigh.

Sorry. I’m not that excited. I’m interested, but I’m too old, too pragmatic, and too familiar with the vagaries of acquisitions to jump up and down.

Judging from some grousing from Yahooligans, some Yahoo professionals aren’t too keen about working for Microsoft. I have had a hint that some Microsoft wizards aren’t too excited about fiddling with Yahoo’s mind-numbing array of products, services, technologies, search systems, partnerships, and research initiatives.

I think the root concern is trying to figure out how to fit two large operations together, a 1 + 1 = 3 problem. For example, there’s Yahoo Mail and Hotmail Live; Yahoo Panama and Microsoft Ad Center; and Yahoo News and Microsoft’s news services, etc., etc. One little-considered consequence is that Microsoft may end up owning more search systems than any other company. That’s a technology can of worms worthy of a separate essay.

I will tell you who is excited, and, please, keep in mind that this is my opinion. And, once I express my view, I want to offer another very simple (probably too simple for an MBA wizard) math problem. I will end this essay with my now familiar observations. Let’s begin.

Who Benefits?

This is an easy question to answer, and you will probably think that I am stating the obvious. Bear with me because the answer explains why some at Microsoft may not be able to get the right prescription for their deal bifocals. Without the right eye glasses, it’s tough to discern some smaller environmental factors obscured in the billion dollar fusillade fired at Yahoo’s board of directors’ meeting.

  1. Shareholders who can make some money with the Microsoft offer. When there’s money to be made, concerns about technology, culture, and market opportunity are going to finish last. Most shareholders don’t think much beyond the answers to two questions: “How much did I make?” and “What are the tax implications?”
  2. Investment bankers who earn money three ways on a deal of this magnitude. There are, of course, other ways for those in the financial loop to make money, but I’m going to focus on the ones that keep these professionals in blue suits, not orange jump suits. [a] Commissions. Where there is churn, there is a commission. For many investment advisors, buying and selling equals a bigger payday. [b] Bonuses. The mechanics of an investment banker’s bonus are complex. After all, it is a banker dealing with a fellow banker. Mere mortals should steer clear. The idea is simple. Generate churn or a fee, and you get more bonus money. The first three months of a calendar year are bonus and job hopping time on Wall Street. Anyone who can get a piece of the action for a big deal gets cash. [c] Involvement in a big deal acts like a huge electromagnet for more deals. Once Microsoft “thought” of the acquisition, significant positive input about the upside of the deal poured into the potential acquirer.
  3. Consultants. Once a big deal is announced, the consultants leap into action. The buyer needs analyses, advice, and strategic counsel. The buyer’s minions need tactical advice to answer such questions as “How can we maximize our tax benefits?” and “How can we pay for this with cheap money?” The buyer becomes hungry for advisors of every species. Blue-chip outfits like Bain, Booz Allen Hamilton, Boston Consulting Group, and McKinsey & Co. drool in eagerness to provide guidance on lofty strategy matters such as answering the questions “How can I maximize my pay-out?” and “What are the tax consequences of my windfall profit?” Tactical advisors from these firms can provide support on human resource issues and real estate leases, among other matters. In short, buyers throw money at “the problem” in order to be prepared to negotiate or find a better deal.

These three constituencies want the deal to go through. If Microsoft is the buyer, that’s fine. If another outfit with cash shows up, that’s okay too. The deal now has a life of its own. Money talks. To get the money, these constituencies have no desire to help Microsoft “see” some of the gaps and canyons that must be traversed. Let’s turn to one practical matter and the aforementioned simple math. Testosterone and money — these are two ways to cloud perception and jazz logic.

More Simple Math

Let’s do a thought experiment, what some German philosophers call Gedankenexperiment. I am not talking about the proposed Microsoft – Yahoo deal, gentle attorneys.

Accordingly, we have two companies, Company Alpha and Company Beta (hereinafter Company A and Company B), neither of which is a real company or should be construed as having any similarity to any company now in existence.

Company Alpha has a dominant position in a market and wants to gain a larger share of a newer, tangential market. Company A has a proven, well-tuned, aging business model. That business model is a variation on selling subscriptions and generating annuity income from renewals. Company A’s business model works this way. Company A offers a product and then, on a periodic basis, Company A makes a change to an existing product, assessing a fee for customers to get the “new” or “enhanced” version of the product (service).

The idea is that once a subscription base is in place, Company A can predict a certain amount of revenue from standing orders and new orders. Company A has an excellent, stable cash flow based on this well-crafted business model and periodic fee increases. Although there are environmental factors that put pressure on the proven business model, the customer base is large, and the business model continues to work in Company A’s traditional markets. Company A, aware of exogenous factors — for instance, the emergence of cloud computing and other non-subscription business models — has learned through trial and error that its subscription-based business model does not work in certain new markets. These new markets are potentially lucrative, representing “new” revenue and a threat to Company A’s existing revenue stream. Company A wants to acquire a company to increase its chances for success in the new and emerging markets. Company A’s goal is to [a] protect its existing revenue, [b] generate new revenue, and [c] prevent other companies from dominating the new market(s).

Company A has performed a rational market analysis. Company A’s management has determined that one company only — our Company B — represents a mechanism for achieving Company A’s goals. Company A, by definition, has performed its analyses through Company A’s “eye glasses”; that is, Company A’s proven business model and business culture. “Walking in another person’s moccasins” is easy to say and difficult, if not impossible, to do. Everyone views the world through his own experiential frame. Hence, Company A “sees” Company B as having characteristics, attributes, and capabilities that are, despite some acceptable risks, significant benefits to Company A. Having made this decision about the upside of buying Company B, the management of Company A becomes less able to accept alternative inputs, facts, information, perceptions, and opinions. Company A’s reasoning in its decision space is closed. Company A vivifies what William James called “a certain blindness.” The idea is that each person is “blind” in some way to reality that others can perceive.

The implications of “a certain blindness” in this hypothetical acquisition warrant further discussion:

Culture

Company A has a culture built around a business model that allows incremental product enhancements so that subscription revenue is generated. Company B has a business model built around acquisitions. Company A has a more or less homogeneous atmosphere engendered by the business model or what Company A calls the agenda. Company B is more like a loose federation of separate companies — what some MBAs might call a Ling Temco Vought framework. Each entity within Company B retains its own identity, enjoys wide scope of action, and preserves its own culture. “We do our own thing” characterizes these units of Company B. Company A, therefore, has several options to consider:

  • Company A can leave Company B as it is. The plus is that not much will change in Company B’s operations in the short term. The downside is that the technical problems will not be resolved.
  • Company A can impose its culture on Company B. You don’t need me to tell you that this will go over like the former Soviet Union’s intervention in Poland in the late 1950s.
  • Company A can try to make changes gradually. (This is a variation of the option in bullet 2 and will simply postpone rebellion.)

Technology

Company A has a different and relatively homogeneous technology base. Company B has a heterogeneous technology base. Maintaining multiple heterogeneous systems is in general more costly than maintaining a homogeneous environment. Upon inspection, the technical staff needed to maintain these different systems have specialized to deal with particular technical problems in the heterogeneous environment. Technical people can learn new skills, but this takes time and adds cost. Company A has to find a way to streamline technical operations, reduce costs, and not waste time achieving rationalization. There are at least two ways to do this:

  • Shift to a single platform, ideally Company A’s
  • Retrain existing staff to have broader technical skills. With Company B’s staff able to perform more generalized work, Company A can reduce headcount at Company B, thus streamlining work processes and reducing cost.

Competitive Arena

The desirable new market for Company A has taken on the characteristics of what I call a “natural monopoly.” When I reflect on notable events in American business history, I note monopolistic behavior. Some monopolies were spawned by force of will; for example, JP Morgan and finance (this guy bailed out the US Treasury) and Andrew Carnegie and steel (this fellow thought of libraries for little people after pistol-whipping his competitors and antagonists).

Other monopolies — like Bell Telephone and your local electric company — came into being because some functions are more appropriately delivered by one organization. Water and Internet search / advertising, for instance, are subject to economies of scale, quality of service expectations, and standardization. In short, these may be “natural monopolies” due to numerous demand and cost forces.

In our hypothetical example, Company A wants to enter a market which is coalescing and which now, based on my research, appears to be forming into a “natural monopoly”. The nameless competitor at the center of that market seems to be following a trajectory similar to that of the original Bell Telephone – AT&T life cycle.

Company A’s race, then, is against time and money. Untoward delay at any point going forward with regard to leveraging Company B means coming in second, maybe a distant second or losing out on the new market.

Instead of owning Park Place (a desirable property in the Parker Brothers’ game Monopoly), Company A ends up with Baltic and Mediterranean Avenues (really lousy properties in the Parker Brothers’ game). If Company A doesn’t get Company B, Company A is trapped in its old, deteriorating business model.

If Company A does acquire Company B, Company A has to challenge the competitor. Company B already has a five-year track record of being a day late and a dollar short. Company A, therefore, has to do everything in its power to make the Company B deal work, which appears to be an all-or-nothing proposition.

Now the math: Action by Company A = unknown, variable, escalating costs.

I told you math geeks would not like this analysis. Company A is betting the farm against long odds. Here’s why:

First, the cultures are not amenable to staff reductions or technological efficiencies; that is, using software and automation, not people, while increasing revenues. Company A, regardless of the money invested, cannot be certain of success. Company B’s culture – business model duality is investment insensitive. In short, money won’t close this gap. Company A’s resistance to cannibalizing its old, though still functioning, business model will be significant. Company A’s own employees will resist watching their money and jobs sacrificed to a greater good.

Second, the competitive space is now being captured by the increasingly monopolistic competitor. Unchallenged for some period of time, the monopolistic competitor enjoys momentum and a significant lead in refining its own business model.

In the lingo of Wall Street, Company A can’t get enough “oxygen”; that is, revenue, despite its best efforts to rein in the market leader.

Observations

If we assume a kernel of truth in my hypothetical analysis, we can now apply this hypothetical discussion to the Microsoft – Yahoo deal.

First, Microsoft’s business model (not its technology) is the company’s strength. The business model is also its Achilles’ heel. Just as IBM’s mainframe-centric view of the world made its executives blind to Microsoft, now Microsoft can’t perceive today’s world from outside the Microsoft business model. The Microsoft business model is perhaps the most efficient subscription-based revenue generator in history. But that business model has not worked in the new markets Microsoft covets, so the Yahoo deal becomes the “obvious” play to Microsoft’s management. Its obviousness makes it difficult for Microsoft to see other options.

Second, the Microsoft business model is woven into the company’s culture. Cultures are ethnocentric. Ethnocentricity often manifests itself in conflict. Microsoft will have to make prescient, correct cultural decisions quickly and repeatedly. Microsoft’s culture, however, does not typically evidence excellent, rapid-fire decision-making.

Microsoft seems to be putting the company in a situation guaranteed to spark conflict within its own walls, between itself and Yahoo, and between Microsoft and Google. This is a three-front war. Even those with little exposure to military history can see that the costs and risks of a three-front conflict will be high, open-ended, and difficult to estimate.

The hostile bid itself suggests that Microsoft could not catch Google on its own; the notion that Microsoft can catch Google with the acquisition requires tremendous confidence in Microsoft’s management. I think Microsoft can make the deal work, but execution must be flawless and favorable winds must push Microsoft along.

If Google continues to race forward, Microsoft has to spend more money to implement efficiencies more quickly. The calculus of catching a moving target can trigger a cost crisis. If costs go up too quickly, Microsoft must fall back on its proven business model. Taking a step backward when resolving the calculus of catching Google is not a net positive.

As you read this essay, you are wondering, “How can this doom and gloom be real?” The buzz about the deal is mostly positive. If you don’t believe me, call your broker and ask him how much your mutual fund will benefit from the MSFT – YHOO tie up.

I’ve spent some time around money types, and I can tell you making money is akin to blood in the water for sharks.

I’ve also been acquired and done the acquiring. Regardless of being the buyer or the bought, tie ups are tricky. The larger the stakes, the trickier the tie ups become. When the tie up is designed to halt the Google juggernaut, the calculus of time – cost is hard.

Please recall that I’m not saying a Microsoft – Yahoo tie up cannot stop Google. I am saying that making the tie up work will be difficult.

Don’t agree? That’s okay. Use the comments to set me straight. I’m willing to listen and learn. Just don’t overlook my core points; namely, business models, cultures, and technologies. One final thought: don’t factor out the Google (NSDQ:GOOG).
Stephen Arnold, February 4, 2008

Lotsa Search at Yahoo!

February 3, 2008

Microsoft’s hostile takeover bid for Yahoo! did not surprise me. Rumors about Micro – hoo or Ya – soft have floated around for a couple of years. I want to steer clear of the newsy part of this takeover, ignore the share-pumping behind the idea that Mr. Murdoch will step in to buy Yahoo, and sidestep Yahoo’s 11th hour “we’re not sure we want to sell” Web log posting.

I prefer to do what might be called a “catalog of search engines,” a meaningless exercise roughly equivalent to Homer’s listing of ships in The Iliad. Scholars are still arguing about why he included the information, and centuries later they continue to puzzle over who these men were and why such an odd collection of vessels was necessary. You may have a similar question about Yahoo’s search fleet after you peruse this short list of Yahoo “findability” systems:

  • InQuira. This is the Yahoo natural language customer support system. InQuira was formed from three smaller search outfits that ran aground. InQuira seems stable, and it provides NLP systems for customer support functions. Try it. Navigate to Yahoo. Click Help and ask a question, for example, “How do I cancel my premium mail account?” Good luck, but you have an opportunity to work with an “intelligent” agent who won’t tell you how to cancel a for-fee Yahoo service. When I learned of this deal, I asked, “Why don’t you just use Inktomi’s engine for this?” I didn’t get an answer. I don’t feel too bad. Google treats me the same way.
  • Inktomi. Yahoo bought this Internet indexing company in 2002. We used the Inktomi system for the original US government search service, FirstGov.gov (now USA.gov). The system worked reasonably well, but once in the Yahooligans’ hands, not much was done with the system, and Inktomi was showing its age. In 2002, Google was motoring, just drawing even with Yahoo. Yahoo seemed indifferent or unaware that search had more potential than Yahoo’s portal approach.
  • Stata Labs. When Gmail entered semi-permanent beta, it offered two key features. First, there was one gigabyte of storage; second, you could search your mail. Yahoo couldn’t search email at all. The fix was to buy Stata Labs in 2004. When you use the Yahoo mail search function, the Stata system does the work. Again I asked, “Why not use one of your Yahoo search systems to search mail?” Again, no response.
  • Fast Search & Transfer. Yahoo, through the acquisition of Overture, ended up with the AllTheWeb.com Web site. The spidering and search technology are operated by Fast Search & Transfer (the same outfit that Microsoft bought for $1.2 billion in January 2008). Yahoo trumpeted the “see results as you type feature” in 2007, maybe 2006. The idea was that as you key your query, the system shows you results matching what you have typed. I find this function distracting, but you may love it. Try it yourself here. I heard that Yahoo has outsourced some data center functions to Fast Search & Transfer, which, if true, contradicts some of the pundits who assert that Yahoo has its data center infrastructure well in hand. If so, why lean on Fast Search & Transfer?
  • Overture. When Yahoo acquired Overture (the original pay-for-traffic service) in 2003, it got the ad service and the Overture search engine. Overture purchased AllTheWeb.com and ad technology from Fast Search & Transfer. When Yahoo bought Overture, Yahoo inherited Overture’s Sun Microsystems’ servers with some Linux boxes running a home brew fraud detection service, the original Overture search system, and the AllTheWeb.com site. Yahoo still uses the Overture search system when you look for key words to buy. You can try it here. (Note: Google was “inspired” by the Overture system, and paid about $1.2 billion to Yahoo to avoid a messy lawsuit about its “inspiration” prior to the Google IPO in 2004. Yahoo seemed happy with the money and did little to impede Google.)
  • Delicious. Yahoo bought Delicious in 2005. Delicious came with its weird url and search engine. If you have tried it, you know that it can return results with some latency. When it does respond quickly, I find it difficult to locate Web sites that I have seen. As far as I know, the Delicious system still uses the original Delicious search engine. You can try it here.
  • Flickr. Yahoo bought Flickr in 2005, another cog in its social, Web 2.0 thing. The Flickr search engine runs on MySQL. At one trade show, I heard that the Flickr infrastructure and its search system were a “problem”. Scaling was tough. Based on the sketchy information I have about Yahoo’s search strategy, Flickr search is essentially the same as it was when it was purchased and is in need of refurbishing.
  • Mindset. Yahoo, like Google and Microsoft, has a research and development group. You can read about their work on the recently redesigned Web site here. If you want to try Mindset, navigate to Yahoo Research and slide the controls. I’ve run some tests, and I think that Mindset is better than the “regular” Yahoo search, but it seems unchanged over the last six or seven months.

I’m going to stop my listing of Yahoo’s search systems, although I could continue with the Personals search, Groups search, News search, and more. I may comment on AltaVista.com, another oar in Yahoo’s search vessel, but that’s a topic that requires more space than I have in this essay. And I won’t beat up on Yahoo Shopping search. If I were a Yahoo merchant, I would be hopping mad. I can’t figure out how to limit my query to just Yahoo merchants. The results pages are duplicative and no longer useful to me. Yahoo has 500 million “users” but Web statistics are mushy. Yahoo must be doing something right as it continues to drift with the breeze as a variant of America Online.

In my research for my studies and journal articles, I don’t recall coming across a discussion of Yahoo’s many different search systems. No one, it seems, has noticed that Yahoo lacks an integrated, coherent approach to search. I know I’m not the only person who has observed that Yahoo cannot mount a significant challenge to Google.

As Google’s most capable competitor, Yahoo stayed out of the race. It baffles me that a sophisticated, hip, with-it Silicon Valley outfit like Yahoo collected different search systems the way my grandmother collected weird dwarf figurines. Like Yahoo, my grandmother never did much with her collection. I may have to conclude that Yahoo hasn’t done much with its collection of search systems either. The cost of licensing, maintaining, and upgrading a fleet of search systems is not trivial. What baffles me is why on earth Yahoo couldn’t index its own email. Why couldn’t Yahoo use one of its own search systems to index Delicious bookmarks and Flickr photos? Why does Yahoo have a historical track record of operating search systems in silos, thus making it difficult to rationalize costs and simplify technical problems?

Compared to Yahoo, Google has its destroyer ship shape — if you call squishy purple pillows, dinosaur bones, and a keen desire to hire every math geek with an IQ of 165 on the planet “ship shape”. But Yahoo is still looking for the wharf. As Google churned past Yahoo, Yahoo watched Google sail without headwinds to the horizon. Over the years, I’ve been in chit-chats with some Yahoo wizards. Let me share my impressions without using the wizards’ names:

  1. Yahoo believes that its generalized approach is correct even as Google has made search the killer app of cloud computing. Yahoo’s very smart people seem to live in a different dimension.
  2. Yahoo believes that its technology is superior to Google’s and Microsoft’s. When I asked about a Google innovation, Yahoo’s senior technologist told me that Yahoo had “surprises for Google.” I think the surprise was the hostile takeover bid last week.
  3. Yahoo sees its future in social, Web 2.0 services. To prove this, Yahoo hired economists and other social scientists. While Yahoo was recruiting, the company muffed the Facebook deal and let Yahoo 360 run aground. Yo, Yahoo, Google is inherently social. PageRank is based on human clicks and human-created Web pages. Google’s been social since Day One.

To bring this listing of Yahoo search triremes (ancient wooden war ships) to a close, I am not sure Microsoft, if it is able to acquire Yahoo, can integrate the fleet of search systems. I don’t think Mr. Murdoch can either, given the MySpace glitches. Fixing the flotilla of systems at Yahoo will be expensive and time consuming. The catch is that time is running out. Yahoo appears to me to be operating on pre-Internet time. Without major changes, Yahoo will be remembered for its many search systems, leaving pundits and academics to wonder where they came from and why. Maybe these investigators will use Google to find the answer? I know I would.

Stephen Arnold, February 3, 2008

Search Frustration: 1980 and 2008

February 2, 2008

I have received two telephone calls and several emails about user satisfaction with search. The people reaching out to me did not disagree that users were often frustrated with systems. I think the contacts were meant to underscore the complexity of “getting search right”.

Instead of falling back on bell curves, standard deviations, and more exotic ways to think about populations, let’s go back in time. I then want to jump back to the present, offer some general observations, and conclude with several of my opinions expressed as “observations”. I don’t mind push back. My purpose is to set forth facts as I understand them and stimulate discussion.

I’m quite a fan of Thucydides. If you have dipped into his sometimes stream-of-consciousness approach to history, you know that after a few hundred pages the hapless protagonists and antagonists just keep repeating their mistakes. Finally, after decades of running around the hamster wheel, resolution is achieved by exhaustion.

My hope is that with regard to search we arrive at a solution without slumping into torpor.

The Past: 1980

A database named ABI / INFORM (pronounced as three separate letters ay-bee-eye followed by the word inform) was a great online success. Its salad days are gone, but for one brief shining moment, it was white hot.

The idea for ABI (abstracted business information) originated at a university business school, maybe Wisconsin but I can’t recall. It was purchased by my friend Dennis Auld and his partner Greg Payne. There was another fellow involved early on, but I can’t dredge his name up this morning.

The database summarized and indexed journals containing information about business and management. Human SMEs (subject matter experts) read each article and wrote a 125-word synopsis. The SMEs paid particular attention to making the abstract meaty; that is, a person could read the abstract and get the gist of the argument and garner the two or three key “facts” in the source article. (Systems today perform automatic summarization, so the SMEs are out of a job.)

ABI / INFORM was designed to allow a busy person to ingest the contents of a particular journal like the Harvard Business Review quickly, or collect some abstracts on a topic such as ESOPs (Employee Stock Ownership Plans) and learn quickly what was in the “literature” (a fancy word for current management thinking and research on a subject).

Our SMEs would write their abstracts on special forms that looked a lot like a 5″ by 8″ note card (about the amount of text on a single IBM mainframe green screen input form). SMEs would also enter the name of the author or authors, the title of the article, the source journal, and the standard bibliographic data taught in the 7th grade.

SMEs would also consult a printed list of controlled terms. A sample of a controlled term list appears below. Today, these controlled term lists are often called knowledge bases. For anyone my age, a list of words is pretty much a list of words. Flashy terminology doesn’t always make points easier to understand, which will be a theme of this essay.

Early in the production cycle, the index and abstract for each article would be typed twice: once by an SME on a typewriter and then by a data entry operator into a dumb terminal. This type of information manufacturing reflected the crude, expensive systems available a quarter century ago. Once the data had been keyed into a computer system, it was in digital form, proofed, and sent via eight-track tape to a timesharing company. We generated revenue by distributing the ABI / INFORM records via Dialog Information Services, SDC Orbit, BRS, and other systems. (Perhaps I will go into more detail about these early online “players” in another post.) Our customers used the timesharing service to “search” ABI / INFORM. We split the money with the timesharing company and generally ended up with the short end of the stick.

Below is an example of the ABI / INFORM controlled vocabulary:

ABI / INFORM controlled vocabulary snippet

There were about 15,000 terms in the vocabulary. If you look closely, you will see that some terms are marked “rt” and “uf”. These are “related terms” and “use for” terms. The idea was that a person assigning index terms would be able to select a general term like “market shares” and see that the related terms “competition” and “market erosion” would provide pertinent information. The “uf” or “use for” reminded the indexer that “share of market” was not the preferred index term. Our vocabulary could also be used by a customer or user, whom we then called a searcher, in 1980.
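
A minimal sketch of how such an entry can be represented in software appears below. The structure mirrors the “rt” and “uf” relationships just described; the single entry is only an example.

    # One controlled vocabulary entry with related terms and use-for terms.
    vocabulary = {
        "market shares": {
            "rt": ["competition", "market erosion"],
            "uf": ["share of market"],
        },
    }

    # Send a non-preferred term to its preferred heading.
    use_for_index = {variant: preferred
                     for preferred, entry in vocabulary.items()
                     for variant in entry["uf"]}

    def preferred_term(term):
        return use_for_index.get(term, term)

    print(preferred_term("share of market"))    # -> market shares
    print(vocabulary["market shares"]["rt"])    # related terms broaden a search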

A person searching for information in the ABI / INFORM file (database) of business abstracts could use these terms to locate precisely the information desired. You may have heard the terms precision and recall used by search engine and content processing vendors. The idea originated with the need to allow users (then called searchers) to narrow results; that is, make them more precise. There was also a need to allow a user (searcher) to get more results if the first result set contained too few hits or did not have the information the user (searcher) wanted.
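
For readers who want the arithmetic behind those two words, here is the standard calculation, shown with made-up document identifiers.

    # Precision and recall for a single result set (illustrative numbers).
    retrieved = {"doc1", "doc2", "doc3", "doc4"}   # what the system returned
    relevant = {"doc2", "doc4", "doc7"}            # what the searcher actually needed

    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)   # share of the result set that is on point
    recall = len(hits) / len(relevant)       # share of the needed material that was found

    print(precision, recall)   # 0.5 and roughly 0.67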

To address this problem, we created classification codes and assigned these to the ABI / INFORM records as well. As a point of fact, ABI / INFORM was one of the first, if not the first, commercial databases to reindex every record in its database, manually assigning six to eight index terms and classification codes as part of a quality assurance project.

When we undertook this time-consuming and expensive job, we had to use SMEs. The business terminology proved to be so slippery that our primitive automatic indexing and search-and-replace programs introduced too many indexing red herrings. My early experience with machine-indexing and my having to turn financial cartwheels to pay for the manual rework has made me suspicious of vendors pushing automated systems, especially for business content. Business content indexing remains challenging, eclipsed only by processing email and Web log entries. Scientific, technical, and medical content is tricky but quite a bit less complicated than general business content. (Again, that’s a subject for another Web log posting.)

Our solution to broadening a query was to make it possible for the SME indexing business abstracts to use a numerical code to indicate a general area of business; for example, marketing, and then use specific values to indicate a slightly narrower sub-category. The idea was that the controlled vocabulary was precise and narrow and the classification codes were broader and sub-divided into useful sub-categories. A snippet of the ABI / INFORM classification codes appears below:

ABI / INFORM classification code snippet

If you look at these entries for the classification code 7000 Marketing, you will see terms such as “sn”. That’s a scope note, and it tells the indexer and the user (searcher) specific information about the code. You also see the “cd”. That means “code description”. A “code description” provides specific guidance on when and how to use the classification code, in this case “7000 Marketing”.

Notice too that the code “7100 Market research” is a sub-category of 7000 Marketing. The idea is that while 7000 Marketing is broad and appropriate for general articles about marketing, the sub-category allows the indexer or user to identify articles about “Market research.” While “Market research” is broad, it sits in a useful middle ground between the very broad classification code 7000 Marketing and the very specific terminology of the controlled vocabulary. We also had controlled term lists for geography or what today is called “geo spatial coding”, document type codes, and other specialized index categories. These are important facets of the overall indexing scheme, but not germane to the point I want to make about user satisfaction with search and content processing systems.
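
Here is a sketch of how the broad code and its sub-categories can work together at query time. The code numbers follow the 7000 Marketing example above; the document assignments are invented.

    # Classification codes: broad 7000, narrower 7100 with a parent pointer.
    codes = {
        "7000": {"label": "Marketing"},
        "7100": {"label": "Market research", "parent": "7000"},
    }

    assignments = {
        "article-17": ["7000"],   # a general marketing piece
        "article-42": ["7100"],   # a specific market research article
    }

    def matches(doc_codes, query_code):
        """True if any assigned code is the query code or one of its descendants."""
        for code in doc_codes:
            while code:
                if code == query_code:
                    return True
                code = codes.get(code, {}).get("parent")
        return False

    hits = [doc for doc, doc_codes in assignments.items() if matches(doc_codes, "7000")]
    print(hits)   # the broad code pulls in the sub-category as well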

Let’s step back. Humans created abstracts of journal articles. Humans then completed bibliographic entries for each selected article. Then an SME would index the abstracts, selecting terms that, in their judgment and according to the editorial policy inherent in the controlled term lists, best described the article. These index terms became the building blocks for locating a specific article among hundreds of thousands, or for identifying a subset of all the articles in ABI / INFORM directly on point to the topic on which the user wanted information.

The ABI / INFORM controlled vocabulary was used at commercial organizations to index internal documents or what we would today call “behind-the-firewall content.” One customer was IBM. Another was the Royal Bank of Canada. The need for a controlled vocabulary such as ABI / INFORM’s is rooted in the nature of business terminology. When business people speak, jargon creeps into almost every message. On top of that, new terms are coined for old concepts. For example, you don’t participate in a buzz group today. You participate in a focus group. Now you know why I am such a critic of the baloney used by search and content processing vendors. Making up words (neologisms) or misappropriating a word with a specific meaning (semantic, for example) and then gluing that word with another word with a reasonably clear meaning (processing, for example) creates the jargon semantic processing. Now I ask you, “Who knows what the heck that means?” I don’t, and that’s the core problem of business information. The language is slippery, fast moving, jargon-riddled, and fuzzy.

Appreciate that creating the ABI / INFORM controlled vocabulary, capturing the editorial policy in those lists, and then applying them consistently to what was then the world’s largest index to business and management thought was a big job. Everyone working on the project was exhausted after two years of researching, analyzing, and discussing. What made me particularly proud of the entire Courier-Journal team (organized by the time we finished into a separate database unit called Data Courier) was that library and information science courses used ABI / INFORM as a reference document. At Catholic University in Washington, DC, the entire vocabulary was used as a text book for an advanced information science class. Even today, ABI / INFORM’s controlled vocabulary stands as an example of:

  1. The complexity of creating useful, meaningful knowledge bases
  2. Proof that it is possible to index content so that it can be sliced and diced with few “false drops” or what we today call an “irrelevant hit”.
  3. Evidence that a difficult domain such as business can be organized and made more accessible via good indexing.

Now here’s the kicker, actually a knife in the heart to me and the entire ABI / INFORM team. We did user satisfaction surveys on our customers before the reindexing job and then after the reindexing job. But our users (searchers) did not use the controlled terms. Users (searchers) keyed one or two terms, hit the Enter key, and used what the system spit out.

Before the work, two-thirds of the people we polled who were known users of ABI / INFORM said our indexing was unsatisfactory. After the work, two-thirds of the people we polled who were known users of ABI / INFORM said our indexing was unsatisfactory. In short, bad indexing sucked, and better indexing sucked. User behavior was responsible for the dissatisfaction, and even today, who dares tell a user (searcher) that he / she can’t search worth a darn?

I’ve been thinking about these two benchmark studies performed by the Courier-Journal every so often for 28 years. Here’s what I have concluded:

  1. Inherent in the search and retrieval business is frustration with finding the information a particular user needs. This is neither a flaw in the human nor a flaw in the indexing. Users come to a database looking for information. Most of the time — two thirds to be exact — the experience disappoints.
  2. Investing person years of effort in constructing an almost-perfect epistemological construct in the form of controlled vocabularies is a great intellectual exercise. It just doesn’t pay huge dividends. Users (searchers) flounder around and get “good enough” information which results in the general dissatisfaction with search.
  3. As long as humans are involved, it is unlikely that the satisfaction scores will improve dramatically. Users (searchers) don’t want to work hard to formulate queries or don’t know how to formulate queries that deliver what’s needed. Humans aren’t going to change at least in my lifetime or what’s left of it.

What’s this mean?

Simply stated, algorithmic processes and the use of sophisticated mathematical procedures will deliver better results.

The Present: 2008

In my new study Beyond Search, I have not included much history. The reason is that today most procurement teams looking to improve an existing search system or replace one system with another want to know what’s available and what works.

The vendors of search and content processing systems have mastered the basics of key word indexing. Many have integrated entity extraction and classification functions into their content processing engines. Some have developed processes that look at documents, paragraphs, sentences, and phrases for clues to the meaning of a document.

Armed with these metatags (what I call index terms), the vendors can display the content in point-and-click interfaces. A query returns a result list, and the system also displays Use For references or what vendors call facets, hooks, or adjacent terms. The naked “search box” is surrounded with “rich interfaces”.

You know what?

Survey the users and you will find two-thirds of the users dissatisfied with the system to some degree. Users overestimate their ability and expertise in finding information. Many managers are too lazy to dig into results to find the most germane information. Search has become a “good enough” process for most users.

Rigorous search is still practiced by specialists like pharmaceutical company researchers and lawyers paid to turn over every stone in hopes of getting the client off the legal hook. But for most online users in commercial organizations, search is not practiced with diligence and thoroughness.

In May 2007, I mentioned in a talk at an iBreakfast seminar that Google had an invention called “I’m feeling doubly lucky.” The idea is that Google can look at a user’s profile (compiled automatically by the Googleplex), monitor the user’s location and movement via a geo spatial function in the user’s mobile device, and automatically formulate a query to retrieve information that may be needed by the user. So, if the user is known to be a business traveler and the geo spatial data plot his course toward La Guardia Airport, then the Google system will push information to the user’s phone about which parking lot is available and whether the user’s flight is late. The key point is that the user doesn’t have to do anything but go on about his / her life. This is “I’m feeling doubly lucky” because it raises the convenience level of the “I’m feeling lucky” button on Google pages today. Press I’m feeling lucky, and the system shows you the one best hit as defined by Google’s algorithmic factory. Some details of this invention appear in my September 2007 study, Google Version 2.0.
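
The sketch below is my own, purely hypothetical, illustration of implicit query formulation in the spirit of that invention. None of the data structures or functions represent Google’s actual system.

    # Hypothetical profile and context signals (not real Google data structures).
    profile = {"role": "business traveler", "home_airport": "SDF"}
    context = {"heading_toward": "LGA", "local_time": "16:40"}

    def implicit_query(profile, context):
        """Build a query the user never has to type."""
        airport = context.get("heading_toward")
        if profile.get("role") == "business traveler" and airport:
            return f"{airport} parking availability OR {airport} departure delays"
        return None

    print(implicit_query(profile, context))
    # LGA parking availability OR LGA departure delays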

I’m convinced that automatic, implicit searching is the direction that search must go. Bear in mind that I really believe in controlled vocabularies, carefully crafted queries, and comprehensive review of results lists. But I’m a realist. Systems have to do most of the work for a user. When users have to do the searches themselves or at least most of the work, their level of dissatisfaction will remain high. The dissatisfaction is not with the controlled vocabulary, the indexing, or the particular search system. The dissatisfaction is with the work associated with finding and using the information. I think that most users are happy with the first page or first two or three results. These are good enough or at least assuage the user’s conscience sufficiently to make a decision.

The future, therefore, is going to be dominated by systems that automate, analyze, and predict what the mythical “average” user wants. These results will then be automatically refined based on what the system knows about a particular user’s wants and needs. The user profile becomes the “narrowing” function for a necessarily broad set of results.

Systems can automatically “push” information to users or at least keep it in a cache ready for near-zero latency delivery. In an enterprise, search must be hooked into work flow. The searches must be run for the user and the results displayed to the user. If not automatically, then the user need only click a hot link and the needed information is displayed. A user can override an automatic system, but I’m not sure most users would do it or care, even if the override were like the knob on a hotel’s air conditioner. You feel better turning the knob. You feel a lack of control if you can’t turn it.
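
A sketch of the “narrowing” idea, again my own illustration rather than any vendor’s method: a deliberately broad result set is re-ranked by overlap with a user’s profile.

    # Invented profile interests and a deliberately broad result set.
    profile_interests = {"accounting", "esop", "tax"}

    broad_results = [
        {"title": "ESOP tax implications for 2008", "terms": {"esop", "tax"}},
        {"title": "New cafeteria menu", "terms": {"food"}},
        {"title": "Accounting close calendar", "terms": {"accounting"}},
    ]

    ranked = sorted(broad_results,
                    key=lambda result: len(result["terms"] & profile_interests),
                    reverse=True)

    for result in ranked:
        print(result["title"])   # profile overlap floats relevant items to the top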

Observations

Let me offer several observations after this journey back in time and a look at the future of search and content processing. If you are easily upset, grab your antacid, because here we go:

  1. The razzle-dazzle about taxonomies, ontologies, and company-specific controlled term lists hides the fact that specific terms have to be identified and used to automatically index documents and information objects found in behind-the-firewall search systems. Today, these terms can be generated by processing a representative sample of existing documents produced by the organization (see the sketch after this list). The key is a good-enough term list, not doing what was done 25 years ago. Keep in mind the phrase “good enough.” There are companies that offer software systems to make this list generation easier. You can read about some vendors in Beyond Search, or you can do a search on Google, Live.com, or Yahoo.
  2. Users will never be satisfied. So before you dump your existing search system because of user dissatisfaction, you may want to get some other ammunition, preferably cost and uptime data. “Opinion” data are almost useless because no system will test better than another in my experience.
  3. Don’t believe the business jargon thrown at you by vendors. Inherent in business itself is a tendency to create a foggy understanding. I think the tendency to throw baloney has been around since the first caveman offered to trade a super-sharp flint for a tasty banana. The flint is not sharp; it’s like a Gillette four-track razor. The banana is not just good; it is mouth-watering, by implication a great banana. You have to invest time, effort, energy, and money in figuring out which search or content processing system is appropriate for your organization. This means head-to-head bake-offs. Few do this, and the results are clear. Most people are unhappy with their vendor, with search, and with the “information problem”.
  4. Background processes, agent-based automatic searching, and mechanisms that watch what your information needs and actions are will make search better. A trained searcher can enter ss cc=71? AND ud=9999 to get recent material about market research, but most people don’t and won’t.
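
Here is the sketch promised in item 1: a rough, frequency-based way to pull a “good enough” candidate term list from a sample of an organization’s own documents. The sample text and stop word list are invented; commercial systems do far more.

    import re
    from collections import Counter

    sample_docs = [
        "Quarterly market research brief for the marketing committee",
        "Market research vendors shortlisted for the marketing budget review",
    ]

    stopwords = {"for", "the", "of", "and", "a"}

    counts = Counter(
        token
        for doc in sample_docs
        for token in re.findall(r"[a-z]+", doc.lower())
        if token not in stopwords
    )

    # Keep only terms that appear more than once as candidate index terms.
    candidate_terms = [term for term, n in counts.most_common(10) if n > 1]
    print(candidate_terms)   # e.g. ['market', 'research', 'marketing']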

In closing, keep these observations in mind when trying to figure out what vendors are really squabbling about. I’m not sure they themselves know. When you listen to a sales pitch, are the vendors saying the same thing? The answer is, “Yes.” You have to rise to the occasion and figure out the differences between systems. I guarantee you the vendors don’t know, and if they do know, they sure won’t tell you.

Stephen Arnold, February 2, 2008
