Not So Fast, Folks
July 6, 2008
My news reader alerted me to an essay “How Fast Is Attivio?” The author is Adrian Bloem, a contributing analyst to CMSWatch.com and its Enterprise Search Report. I try to keep track of ESR and its new shepherds.
Some readers may know I suffered a “heart event” and had to step away from Enterprise Search Report in February 2007. This was a tough decision because ESR was in a way my “baby.” I wrote the first three editions (2004 to 2006).
If there was a mistake in the math of the cost analysis, I made it. I was responsible.
But when I quit, I was no longer responsible. When someone asks me why the number of profiles has been reduced, what do I know? When I stepped away from ESR, I gave up any control over what CMSWatch.com includes or excludes from the ESR fourth edition.
Sure, I still get a pittance in royalties, but if there is a mistake, I sure as heck am not responsible because I quit, allowing others to take control. I tell people, “Take it up with CMSWatch.com, not me.” I think you follow my logic here.
How does this apply to Attivio, a start up?
Attivio: Moving Beyond Search
Some history: My interest in Attivio arose from the research I did for my April 2008 monograph Beyond Search: What to Do When Your Enterprise Search System Doesn’t Work. When I was recovering from my little health problem, I did some hard thinking about the sameness of the enterprise search vendors, the many problems I had documented, and survey data that said, “60 percent of the users of enterprise search were dissatisfied with their enterprise search system.” 60 percent! That is a big number, now backed up by other survey findings. But in 2007 few knew what I knew about the problems with traditional enterprise search systems.
The data said to me, “Key word and legacy systems are not what users looking for information needed.”
My study Beyond Search explains these needs and some options, consuming about 300 pages to summarize my data. As I worked on the study, I became more vocal about user dissatisfaction with enterprise search than some of the other speakers on the search conference circuit.
The root of the problem is that users’ needs and expectations were changing. Most search systems were not changing fast enough or simply could not change.
A colleague alerted me to Attivio. I recall her saying, “Attivio is trying to leapfrog the many problems of traditional enterprise search.”
So, I played telephone tag and finally chased down the Attivio senior management team. I do have a tenacious streak. I used the same technique on executives from Silobreaker (Stockholm, Sweden), MarkLogic (San Carlos, California), Exalead (Paris), and 24 other companies with systems that move “beyond search”. A phase change is taking place, and I wanted first-hand intelligence about what was happening. You can read some of this information in my free Search Wizards Speak series on ArnoldIT.com.
When I read Mr. Bloem’s well-crafted essay, published on July 5, 2008, I was struck by the skepticism with regard to Attivio and, by extension, toward other companies with pre-alpha or early-stage systems; for example, Powerset (now part of Microsoft), Radar Networks (Twine), and Tigerlogic (ChunkIt), among dozens of others.
These are works in progress, and each of these companies is working hard to sign up customers and make sales.
My information about Attivio comes from my knowing a couple of the founders, several of whom worked at Fast Search’s offices near Boston, Massachusetts. Also, I know a bit about Fast Search technology because I had an engineering oversight role for the US Federal government on the Fast Search implementation for FirstGov.gov, a US government-wide index of citizen-facing content.
In my push forward way, I tracked down Attivio’s founder–Ali Riaz–and interviewed him about his new company and its technical approach.
Mr. Riaz made it clear to me that he and his former colleagues wanted to move “beyond search”; that is, these professionals knew that key word and legacy information retrieval systems were not what users wanted. Mr. Riaz wanted to follow another path, moving to next-generation information access methods. He said:
From all our accumulated experiences in the industry, we realized that search is simply not enough to solve the problems that search has been trying to solve. We realized that today’s search platforms–specifically enterprise search–have become legacy technologies. The market needed a fresh approach, and that’s why we created Attivio.
Please, read the complete interview with Mr. Riaz, which took place in May 2008, here.
In my primary research with Attivio’s management, I learned that Attivio is not a reseller of Fast Search & Transfer’s Enterprise Search Platform. I also learned that Attivio–like IBM, Siderean Software, Tesuji, and other information access companies–was using the open source search system Lucene as a base upon which to build.
The idea was, according to Mr. Riaz:
The Active Intelligence Engine, or AIE. Our AIE enables enterprises to blend their structured data and unstructured content without compromising the richness of either, offering the precision of SQL and the fuzziness of search by “mashing up” search and business intelligence data warehouse technologies.
Attivio, like six or seven of the companies profiled in my Beyond Search study for the Gilbane Group, was designed as a blend of components–some open source, others proprietary, and a lot of their own intellectual property.
Mr. Riaz told me that he was not reselling any vendor’s technology, preferring to “do his own thing”. No big surprise here. Mr. Riaz left Fast Search in mid-2006, and I was talking to him in May 2008.
What Gnaws at Me
The issue that gnaws at me is the implication, as I understand Mr. Bloem’s essay, that Attivio and its executives are tainted by the problems that have surfaced after they left Fast Search in 2006.
Mr. Riaz and the other Fast Search professionals were employees, reporting to the management team in Norway.
My research indicates that Fast Search’s problems came about because of a dearth of engineers who could install and customize the Fast Search system. I have written extensively on this subject in this Web log. The posts are here.
The core of my analysis pivots on the gap between the ability of the Fast Search sales team to close major deals and the difficulty Fast Search encountered hiring enough qualified engineers to implement these enterprise systems.
Without qualified engineers, it is tough to install any enterprise search system. Expertise in search and content processing remains in short supply just as it has been since the ascendance of Google.
In my experience, this type of staffing problem begins slowly and then gains momentum as a shortage of talent slows hiring.
From 2006 to the present, Fast Search’s engineering staff problem accelerated like a bobsled racing down a mountain run.
SeeWhy: Real Time Business Intelligence without Search
July 6, 2008
SeeWhy came on my radar with its “no search” marketing angle. I poked around and was, at first, confused. The company appeared to occupy a no-man’s-land between search engine optimization and business intelligence that I avoid. A quick look revealed that the company has a business event system with some interesting twists.
Real Time and My Concern with the Phrase
“Real time” has been promoted from technical impossibility to buzz word. The general notion of “real time” among computer scientists is that simultaneity across linked systems is impossible outside of the bizarre world of high-energy physics. No matter how minute, latencies exist even if measured in picoseconds. But to a marketer, “real time” connotes a softer, gentler world far from the “batch oriented” or human-intermediated world familiar to most professionals.
Now, real time is coming to the enterprise. Exegy, based in St. Louis, Missouri, offers an appliance that can ingest content by the megabyte per second and spit out processed content without much latency. To achieve this, Exegy has done some hardware engineering, but the gizmo works. When you shift to “real time” in the types of server environments found in a trucking company or a consulting company where capital investment is mostly out of the question, “real time” is not in Exegy’s league.
Let me be clear: to deliver near real time content processing Exegy style, you need specialized infrastructure. The average Dell server is not able to deliver no matter how insistent Bill Trucking Company’s information technology consultant becomes.
A number of text and content processing companies are asserting that their systems operate in “real time”. They don’t. Against this background, let’s look at one interesting company. I will not comment on this firm’s emphasis on real time processing, preferring to provide some basic information about this single firm and then offering, as a wrap up, a handful of generalized observations.
SeeWhy Software: Operational Business Intelligence
SeeWhy is one of the first “open source” real time Business Intelligence platforms for the event driven enterprise. SeeWhy continuously analyzes and interprets streams of individual business events, to alert you immediately to opportunities and risks and enable everyday decisions to be automated.
This is the marketing angle that snared my attention.
Incorporated in 2003 by BI industry veteran Charles Nicholls, SeeWhy is backed by several venture capital investors, including LogiSpring, Pentech Ventures, Delta Partners, and a handful of private investors. SeeWhy is headquartered in Windsor, United Kingdom.
Charles Nicholls, founder and CEO, said here:
I began to ponder on the Business Intelligence industry with all its unfulfilled promise, often long on vision and short on delivery. The more that you challenge the status quo, the faster that you can see the opportunities to make the world a better place. It was this process that started me on a journey that led inevitably to create SeeWhy.
The basic premise of the company is summarized in this diagram from “In Search of Insight,” a 43-page document from Mr. Nicholls:
The Web 2.0 Angle
You can download a monograph, “In Search of Insight,” about the company’s approach to business intelligence here. There is no annoying registration. Thank you, SeeWhy.
Email Analysis
July 5, 2008
This summer I have been asked about email analysis on two different occasions. In order to respond to these requests, I had to grind through my archive of email-related information. I wrote about Clearwell Systems and its approach earlier this year. You can read this essay here.
I cannot reproduce the information my paying customers received. I can take a representative company–in this case, Stratify, a unit of Iron Mountain–and show you two different screen shots. These layouts and representations are the property of Stratify, and I am including them in this essay for two reasons:
- Stratify has been one of the early players in text analytics. First as Purple Yogi and then as Stratify, the company was engaged in the difficult missionary marketing needed to make non-believers into believers
- The company has gained some traction in the legal market, which in the US, is a booming sector. The problems of the economy translate into a harvest of riches for some legal firms. Email is a big deal in discovery, and few have the resources to get a human to read all the baloney that zooms around an organization involved in a legal matter.
The Problem
You know the problem. Email was once ASCII shot between two people on Arpanet. Today email is the bane of the knowledge worker. The volume is high. The storage systems are antiquated. The attachments madden the sane. The people using email forget that the messages live on different servers and can, in the process of discovery, be copied to a storage device and delivered to the attorney or attorneys who have to find something germane to the legal matter in the terabytes of digital data.
To summarize the challenges:
- Email volume (lots of it, maybe a billion messages in a mid-sized organization every year)
- Email attachments (tough to find the “right” one)
- Email crashes (restores don’t always work, which you probably know first hand)
- Email sent as if it were a one-time, secret communication
- Email with recipients who, by definition, have some relationship.
For a lawyer, email is good and bad. It’s good if one finds a smoking gun or better yet a gun in the act of shooting. It’s bad if the bullets are coming at the opposing side’s legal eagles, worse if the bullet shoots a legal eagle out of the sky with a slug through the brain.
Ergo: email is a big, big deal in the information world of litigation.
The Solution
The fix is obvious–search. Actually to be precise, the conundrums of email invite text processing, text analytics, link analysis, relationship extraction, entity extraction, and other nifty methods.
The basics of email analysis are actually simple on the surface, more complicated under the hood and out of sight of non-technical types like lawyers: [a] copy email to a storage device that is fast, [b] tell the email analysis program to index the email, [c] key word search or browse outputs, [d] make notes, print out email, and read individual documents of interest, [e] repeat, taking care to bill for the time. (That’s the best part of email analysis. It’s quicker than manual methods, but the systems have to have a baby sitter. Those operating these systems can bill without working up too much of a mental headache. Automated processes do make some legal thinking less painful. The best part is billing for this less stressful time.)
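To make steps [b] and [c] concrete, here is a minimal sketch in Python, assuming the messages have already been copied to a local folder of .eml files (step [a]). The folder name, the sample query, and the simple AND matching are my own illustrative choices, not any vendor’s implementation.

```python
# Minimal sketch: index a folder of .eml files, then run a key word search.
import os
import re
from collections import defaultdict
from email import policy
from email.parser import BytesParser

index = defaultdict(set)   # term -> set of file names containing it
messages = {}              # file name -> (sender, subject)

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Step [b]: index each message's subject and plain-text body.
for name in os.listdir("mailbox"):
    if not name.endswith(".eml"):
        continue
    with open(os.path.join("mailbox", name), "rb") as fh:
        msg = BytesParser(policy=policy.default).parse(fh)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body else ""
    messages[name] = (msg.get("From", ""), msg.get("Subject", ""))
    for term in tokenize(f"{msg.get('Subject', '')} {text}"):
        index[term].add(name)

# Step [c]: a simple AND key word search over the index.
def search(query):
    terms = tokenize(query)
    hits = set.intersection(*(index.get(t, set()) for t in terms)) if terms else set()
    return [(name, *messages[name]) for name in sorted(hits)]

for name, sender, subject in search("contract termination"):
    print(name, "|", sender, "|", subject)
```

Steps [d] and [e], the reading and the billing, remain stubbornly human.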
What do these systems show the user? The illustration below shows a Stratify search screen. Since I obtained this screen shot, Stratify has probably updated the interface. The main features are what interest us. Take a look at what the Stratify system user sees when analyzing processed email:
Stratify’s email visualization
The principal features of this display are:
- Simplicity. You don’t want to confuse attorneys
- A picture showing people and their relationships as discerned by the system. Remember, an email can be sent to a person unrelated to a subject either by accident or for some other reason such as a “this is what I am doing” courtesy
- Links on the right hand panel to make it easy for the user to poke around by sender, topic, etc.
Let’s assume that the email is one part of a discovered collection of information. Stratify provides a richer interface. This one includes the bells and whistles that warrant the Stratify system’s price tag, which is in six figures in case you want to license the system.
Fast Cash, Faster Crash
July 4, 2008
On July 3, 2008, Erick Schonfeld summarized the continuing saga of Fast Search & Transfer’s fastest move ever. The story “Did the Enron of Norway Pull a Fast One on Microsoft? More Details about the Mess at Fast Search & Transfer” is here.
The story is quite thorough, according to my sources in Norway, and there is little I can add to the TechCrunch write up.
I would like to highlight one point, provide the links to my analysis of the Fast Search saga, and offer several observations about the nature of enterprise search. Before I start, take a look at this graphic because this is the wild bobsled ride that many vendors are queued to take:
Once a vendor starts down the sales bobsled run, it is tough to stop. The vendor has to ride to the bottom of the hill, hoping that he will not crash, risking serious injury and maybe death.
The Key Point for Me
After reading the TechCrunch essay, one segment gnawed at me; specifically:
…It [Microsoft’s paying $1.2 billion for Fast Search & Transfer] does point to a certain blindness on the part of Microsoft, or at least a willingness to look the other way, in its obsessive quest to become a player in search (see Yahoo and Powerset). It also raises questions about Fast’s underlying search technology. If Fast was having trouble closing deals for its products, how good can its technology really be?
Yes, this is the key question. The Fast Search & Transfer core technology was purpose built to index static Web sites. At the time Google started operations, AltaVista.com was an orphan, quickly losing its leadership position due to the voracious appetite for resources that public Web search engines have. The mantra is “Feed me computing resources or I die”.
Fast Search offered a Web site called AllTheWeb.com, and it was pretty good. At the time of 9/11, the AllTheWeb.com news indexing system was among the first to have reasonably timely information. Fast Search made a fateful decision in 2002 which led to Fast Search & Transfer’s exiting the Web indexing business. Fast Search sold its Web indexing business to Overture for $70 million with more money promised if certain goals were achieved. Fast Search took the money and focused on enterprise search.
The decision, as I recall from my conversations with Fast Search & Transfer executives when I was involved in the Fast Search deployment for a government project, was that enterprise search was a great opportunity. Fast Search’s executives suggested to me that the company could move quickly to dominate the search market. At the time, there was little reason to doubt the confidence of the Fast Search team. A Fortune 50 company was backing the Fast Search system in the government-wide indexing program. In the 2002-2003 time period, there were not too many systems that could demonstrate an index of 40 million documents. Even today, licensees of search systems do not grasp the hurdles that indexing large amounts of text puts in front of an organization. I have written extensively about this elsewhere, and there is little I can add about the ignorance of search scaling that continues to plague organizations.
Business Intelligence: Growth but Is It Really Delivering?
July 4, 2008
The fireworks have started in rural Kentucky. Oh, wait. That’s the neighbors firing shotguns at squirrels. Think of it as a way to give squirrels a fighting chance.
Amidst the gunfire, I was chugging through my trusty news reader and came across two stories about business intelligence. Both are well written and, in a way, complementary.
The IDC Business Intelligence Study
The first essay was by a solid journalist, Doug Henchen, who writes for Intelligent Enterprise. “IDC Report See Steady Growth for BI, Pent-Up Demand for Analytics” summarizes data about the business intelligence market, or “BI” for short. You can read the full essay here. (Note: the url is a complex one, which often means a story can be tough to locate after a few days. Read Mr. Henchen’s article promptly, please.)
The essay is lengthy, and it is not possible to summarize it. Mr. Henchen crams a large amount of information into this two-page post. For me, the most important point in the article was:
Another technology seeing increased demand is text mining… with applications blossoming in areas such as voice-of-the-customer analysis. Vendors including Business Objects, SAS and SPSS have responded with recent acquisitions and product releases aimed at combining text mining and data mining techniques. The two camps of structured and unstructured data analysis remain very separate. It’s important for vendors to respond because if the products aren’t there, it makes it harder for practitioners to invest in the technology. [Some minor edits for readability made. SEA]
This observation underscores the assault on enterprise search vendors that users and business intelligence vendors are now making. Enterprise search is in a “circle the wagons” mode with significant pressure on high profile vendors from many quarters. Now business intelligence vendors see an opportunity to push applications that may be perceived as higher value.
One of the highlights of the essay is its charts. Mr. Henchen has reproduced graphics, presumably from the for-fee report. Here’s an example:
So business intelligence is growing. Good news in a sinking economic ship.
Answering Questions: Holy Grail or Wholly Frustrating
July 2, 2008
The cat is out of the bag. Microsoft has acquired Powerset for $100 million. You can read the official announcement here. The most important part of the announcement to me was:
We know today that roughly a third of searches don’t get answered on the first search and first click…These problems exist because search engines today primarily match words in a search to words on a webpage [sic]. We can solve these problems by working to understand the intent behind each search and the concepts and meaning embedded in a webpage [sic]. Doing so, we can innovate in the quality of the search results, in the flexibility with which searchers can phrase their queries, and in the search user experience. We will use knowledge extracted from webpages [sic] to improve the result descriptions and provide new tools to help customers search better.
I agree. The problem is that delivering on these results is akin to an archaeologist finding the Holy Grail. In my experience, delivering “answers” and “better results” can be wholly frustrating. Don’t believe me? Just take a look at what happened to AskJeeves.com or any of the other semantic / natural language search systems. In fact, doubt is not evident in the dozens of posts about this topic on Techmeme.com this morning.
So, I’m going to offer a different view. I think the same problems will haunt Microsoft as it works to integrate Powerset technology into its various Live.com offerings.
Answering Questions: Circa 1996
In the mid 1990s, Ask Jeeves differentiated itself from the search leaders with its ability to answer questions. Well, some questions. The system worked for this query which I dredged from my files:
What’s the weather in Chicago, Illinois?
At the time, the approach was billed as natural language processing. Google does not maintain comprehensive historical records in its public-facing index. But you can find some information about the original system here or in the Wikipedia entry here.
How did a start up in the mid-1990s answer a user’s questions online? Computers were slow by today’s standards and expensive. Programming was time consuming. There were no tools comparable to Python or Web services. Bandwidth was expensive, and modems chugged along south of 56 kilobits per second, slowing down further in the course of a dial up session.
I have no inside knowledge about AskJeeves.com’s technology, but over the years, I have pieced together some information that allows me to characterize how AskJeeves.com delivered NLP (natural language processing) magic.
Humans.
AskJeeves.com compiled a list of frequently asked questions. Humans wrote answers. Programmers put data into database tables. Scripts parsed the user’s query and matched it to the answers in the tables. The real magic, from my point of view, was that AskJeeves.com updated the weather table, so when the system received my query “What is the weather in Chicago, Illinois?”, the system would pull the data from the weather table and display an answer. The system also showed links to weather sites in case the answer part was incorrect or not what the user wanted.
Over time, AskJeeves.com monitored what questions users asked and added these to the system.
What happened when the system received a query that could not be matched to a canned answer in a data table? The system picked the closest question to what the user asked and displayed that answer. So a question such as “What is the square of aleph zero plus N?” generated an answer along the lines of “The Cubs won the pennant in 1918” or some equally crazy answer.
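Here is a minimal sketch of that canned question-answering pattern: hand-built answer tables, a periodically refreshed weather table, and a fall back to the closest known question. The tables, the template matching, and the similarity scoring are my own illustrative assumptions, not AskJeeves.com’s actual code.

```python
# Minimal sketch of table-driven "natural language" question answering.
from difflib import SequenceMatcher

# Hand-built answer tables maintained and refreshed by humans.
weather_table = {"chicago, illinois": "Partly cloudy, 78F"}

faq_table = {
    "who won the pennant in 1918": "The Cubs won the pennant in 1918.",
    "what is natural language processing": "NLP lets software act on questions phrased in ordinary language.",
}

def answer(query):
    q = query.lower().strip("?! .")
    # 1. Templated questions hit a live table, e.g. the weather table.
    for prefix in ("what is the weather in ", "what's the weather in "):
        if q.startswith(prefix):
            city = q[len(prefix):]
            return weather_table.get(city, "No weather data for " + city)
    # 2. Everything else falls back to the closest canned question, even when
    #    the match is poor, which is how nonsense queries get nonsense answers.
    best = max(faq_table, key=lambda known: SequenceMatcher(None, q, known).ratio())
    return faq_table[best]

print(answer("What's the weather in Chicago, Illinois?"))   # table lookup
print(answer("What is the square of aleph zero plus N?"))   # closest, wrong, canned answer
```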
AskJeeves.com discovered several facts about its approach to natural language processing:
- Humans were expensive. AskJeeves.com burned cash. The company tried to apply its canned question answering system to customer support and ended up part of the Barry Diller empire. Humans can answer questions, but the expense of paying humans to craft templates, create answer tables, and code the system was too high then and remains too high today.
- Humans asked questions but did not really mean what they asked. Humans are perverse. A question like “What’s a good bar in San Francisco?” can go off the rails in many ways. For example, what type of bar does the user require? Biker, rock, blue collar? And which part of San Francisco: the Mission, the Sunset, or Powell Street? The problem with answering questions, then, is that humans often have a tough time formulating the right question.
- Information changes. The answer today may not be the answer tomorrow. A system, therefore, has to have some way of knowing what the “right” answer is in the moment. As it turns out, the notion of “real time”–that is, accurate information at this moment–is an interesting challenge. In terms of stock prices, the “now quote” costs money. The quote from yesterday’s closing bell is free. Not only is it tricky to keep the index fresh, to have current information may impose additional costs.
This mini-case sheds light on several challenges in natural language processing.
Search: An Old Taxi with a Faux Cow Hide Interior
July 2, 2008
The last time I was in a big city I hailed a taxi. What a clunker. It smelled of fast food, incense, and hot plastic. One fender was dented and the curb side door would not open. The window would not go down. “She dead,” smiled the driver. The interior of the taxi had a set of blinking lights popular at holiday times. The taxi was a mess, but the faux cow hide interior was unusual. At least the lights were working.
Thanks to ABC Australia for the photo. The original is here. http://www.abc.net.au/news/newsitems/200610/s1770336.htm
I have been clicking and scanning the opinions about the Microsoft Powerset deal. Scanning the links at Congoo.com, Megite.com, and Techmeme.com will take a long time. I have been a slacker, clicking at random and looking for some substantive news.
Why is search like a lousy taxi with a useless faux cow hide interior?
My thought for this evening is that search is string matching. The other functions are ways to:
- Make it easier for a busy person who does not have time or the desire to read a traditional document; that is, a multi page report.
- Show the user what is available and push the user toward that information. The user, who doesn’t want to make this effort, will let the software do the work.
- Support a user who is not too swift when it comes to thinking about abstract digital data.
- Reduce the time a user spends fumbling for information.
- Put training wheels on a worker who forgets work processes the way I forget where I put my automobile keys five minutes ago.
What’s happening is that key word search, string matching, and its kissing cousin Boolean are the lousy taxi. Good enough but not too pleasant.
The cow hide interior for search consists of these types of enhancements:
- Assisted navigation, a fancy term for Use For and See also references
- Clustering, putting like things together in a folder or under a heading
- Discovery, an interface that provides an overview of information
- Semantic search, a system that figures out what you mean when you type a two word query
- Natural language processing, a term that now means answering a question, assuming that someone takes the time to think up a question and type it into a search box
- Dashboards, a report that has panels or containers, each containing different information. Some dashboards look like speedometers with text; others can be quite fanciful.
- Access to metadata about what person in an organization gets the most email about a specific technical issue. This type of monitoring and analysis is now called social search because surveillance is not politically correct in many circles.
You get the idea.
Possible impacts
Let’s consider the consequences.
First, enterprise search is complicated. Today I spoke with an enthusiastic and young professional. The call touched upon creating a plan for enterprise search. Like most organizations, this outfit has three separate enterprise search systems. None works all that well, so the phone rings. This is a common situation, and I am not too optimistic that enterprise search will work very well when there are competing factions, each with a favorite search engine to support. Adding whizzy new functionality adds to the cost and complexity, and I am not convinced users want to do much more than find the needed information and move on to another task.
IBM Search: Circling Back
July 1, 2008
I learned that an engineer named Michael Moran worked on IBM’s public facing search system for many years. You can read about this person’s contributions here. (Click this link quickly. The Yahoo news disappears faster than Yahooligans resign.) Mr. Moran has left IBM to join Converseon, a social media company. I hope there was no connection between his departure and my critique of IBM’s Web search system, Planetwide. But the system is pretty terrible. Because Mr. Moran will not join Converseon until September 2008, he has time to tweak Planetwide and IBM’s e-commerce sub system as well.
To be fair to Big Blue, I dived back into the Web site. This time I focused on buying something. Providing an e-commerce function seems a reasonable expectation. Plus I own NetFinity 5500 servers, and I sometimes need parts.
Let’s take a look. You can look at these tiny WordPress processed screen shots or navigate to the e-commerce splash page and run this sample query.
Finding the Store Front
On the www.ibm.com splash page is a tab labeled “Shop For”. So far, so good. I click the tab and the drop down bar displays my choices.
I decide to shop for a workstation. Years ago I owned a ZPro workstation, and it was a workhorse. The case fell apart, but the guts kept on ticking for years.
Here’s the page for workstations. Remember. I want to buy something.
Instead of an Amazon or eBay like listing, I see a picture of a workstation. Okay, I click on the smaller workstations. The system shows me more text. Here is the product information for the Unix workstations that I wanted to buy, but I am now getting frustrated. Where are the products? Dell Computer in its darkest days with its sluggish e-commerce search system does better than this. Amazon, despite the baloney promoting the Kindle and showing me crazy recommendations, lets me get to products. Not IBM. The pages look alike. In fact, I am not sure that the display has changed. I like consistency, but I also like to see products.
I wade through the text in the center column under the picture and I click on IBM Intellistation POWER 265 Express. I get this screen:
More choices and more text. I scroll to the bottom of the page and I get a list of features. I am convinced. I scroll back to the top of the page where the “Browse and Buy” button is. I click it, and finally I get some bite-sized information and a price in red no less.
Microsoft Research Search Research: Not a Typo
June 29, 2008
I heard two earnest 20-somethings in the Starbucks on Lincoln and Greenview in Chicago arguing about Microsoft search. The two whiz kids wanted to locate information about Microsoft’s Web Data Management Group. Part of Microsoft’s multi-billion dollar research and development program, WDMG (sometimes abbreviated WSM) works to crack tough problems in Web search.
The problem with Web search is that content balloons with each tick of the hyper fast Internet clock. The problem boils down to several hundred megabytes of new content every time slice. To make the problem more interesting, Web data changes. One example ignored by researchers is the facility with which a Web log author can change a posting. Some changes are omissions such as forgetting to assign a tag. Others are more Stalinesque. An author deletes, rewrites, or supplements an original chunk of a Web log. Today, I find more and more Web sites render pages in response to an action that I take. The example which may resonate with you is the operation of a meta search or federating system like Kayak.com. Until I set parameters for a trip, the system offers precious little content. Once I fill in the particulars of my trip, the rendered pages provide some useful information.
If you plan on indexing the Web, you have to figure out these dynamic pages, versions, updates, and new content. The problem has three characteristics. First, timeliness. When I do a query, I want current information. Speed, then, requires an efficient content identification and indexing system. If I lack the computing horsepower for brute force indexing, I have to use user cues such as indexing only the most frequently requested content. In effect, I am indexing less information in order to keep that index current.
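A minimal sketch of that “index only what users ask for” heuristic appears below. The request log, the crawl budget, and the fetch and index calls are placeholders of my own, not any search engine’s scheduler.

```python
# Given a fixed crawl budget, re-index the most frequently requested URLs first
# so the freshest part of the index matches user demand.
from collections import Counter

request_log = ["/news", "/news", "/jobs", "/news", "/about", "/jobs"]
CRAWL_BUDGET = 2  # how many pages we can afford to re-index this cycle

def fetch(url):           # placeholder for a real HTTP fetch
    return f"content of {url}"

def index(url, content):  # placeholder for a real indexing call
    print(f"indexed {url} ({len(content)} bytes)")

popularity = Counter(request_log)
for url, hits in popularity.most_common(CRAWL_BUDGET):
    index(url, fetch(url))
```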
Second, I have to be able to get dynamic content into my index. If I miss the information that becomes available in response to a query, I am omitting a good chunk of the content. My tests show that more than half the sites in my test set are dynamic. The static HTML of the good old days makes up a smaller portion of the content that must be processed. Google’s work with Google Forms is that company’s first step into this type of data. Microsoft has its own approaches, and some of this work is handled by the wizards at WSM, or the Web Search and Mining Group, here.
Third, I also have to figure out how to deal with queries. When I talk about search, there are two sides to the coin. On one side is indexing. On the other side is converting the query to something that can be passed against the index. If a system purports to understand natural language as Hakia and Powerset assert, then the system has to figure out what the user means. Intent is not such a simple problem. In fact, deciphering a user’s query can be more difficult than indexing dynamic content. Human language is ambiguous. You would not understand my mother if you heard her say to me, “Quilling.” She means something quite specific, and the likelihood any system could figure out that this single word means, “Bring me my work basket” is close to zero unless the system in some way has considerable information about her specific use of language.
As you probably have surmised, natural language processing is complicated. NLP is resource intensive. I need a capable indexing system, and I need a powerful, clever way to clear up ambiguities. Humans don’t type long queries, nor do professionals evidence much enthusiasm for crafting query strings that retrieve exactly what that professional needs. Users type 2.3 words and take what the system displays. Others prefer to browse an interface with training wheels; that is, Use For and See Also references, and explore. The two approaches differ, but they share one common element: a honking big computer with smart algorithms is needed to make search work.
Web Search and Mining
This Microsoft group works on a number of interesting projects related to content processing, text mining, and search. The group’s Web page identifies data management, dynamic data indexing, and search quality as current topics of interest.
More detail about the group’s activities appear in the list of publicly available research papers. You can browse and download these. I want to comment about three aspects of the research identified on this Web site and then close with several observations about Microsoft research into search.
First, the sample papers date from 2004. I don’t know if the group has filtered its postings of papers, or if the group has been redirected.
Second, a number of papers discuss clustering. A representative paper is Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Analysis. The full paper is here. The paper explains a system that accepts a query and then outputs a result. Each row is a cluster. Microsoft’s researchers are parsing a query and retrieving images. The images are displayed in a clustered visual display. You will notice that the lead Microsoft researcher worked with a Yahoo researcher and a University of Chicago researcher. You can browse the other clustering papers.
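For readers unfamiliar with the technique the paper names, here is a minimal sketch of hierarchical clustering over toy feature vectors, assuming the visual, textual, and link signals have already been reduced to numbers. It illustrates the general approach only, not Microsoft’s system.

```python
# Hierarchical clustering of toy "image" feature vectors.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each row stands in for one image returned by the query.
features = np.array([
    [0.90, 0.10, 0.00],
    [0.85, 0.15, 0.05],
    [0.10, 0.90, 0.40],
    [0.05, 0.95, 0.50],
])

tree = linkage(features, method="average")          # build the hierarchy
labels = fcluster(tree, t=2, criterion="maxclust")  # cut it into 2 clusters
print(labels)  # e.g. [1 1 2 2]: each cluster becomes one display row
```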
Third, another group of papers touches upon the notion of “information manifolds”. In the 1990s, the phrase “information manifold” enjoyed some buzz. The notion is that a “space” contains indexes which can be queried. One Microsoft paper–“Learning an Image Manifold for Retrieval”–applies the notion to images. Other papers touch upon the topic as well. I found this interest suggestive. Google has some activity in this subject as well.
I want to pick up the thread of WSM and research into “manifolds”. I turned first to Search.Live.com, Microsoft’s own search system, and then to Google’s Microsoft-centric search sub system. You can find Microsoft’s search here and Google’s search sub system here. You may want to stray into specialist Microsoft systems such as Libra here, a showcase for some new Microsoft technology. I tried several queries on the Microsoft Live.com search site and was able to locate the paper referenced above. One of the two hits I was able to track down returned a null set.
The Whale and the Walrus: Two Views of Sergey and Larry
June 28, 2008
The purpose of this essay is to describe the life trajectory of two technology-centric companies. I don’t want to mention the firms by name, but you may be able to guess which company is the whale and which is the walrus.
The whale is a big creature, a whale of a company. Wherever the whale goes, it gets its way. More accurately, the whale used to get its way. Now the whale is lying on its side near the Seattle waterfront close to upscale boutiques and a Starbucks.
The second is a walrus, now quite old for a semi-leviathan. The walrus prefers to sit on a rock not far from Half Moon Bay, soak up the sun and snag whatever fish get too close. The walrus prefers to conserve its energy. Oh, the walrus will stretch and sometimes roar. Most of the time, the walrus half sits, half reclines looking — well –disconnected from the world beyond the sand bar. The walrus has some new friends named Sergey and Larry.
Let’s look at three aspects of each creature and then think about the future of each powerful beastie.
The Whale
The whale is the largest mammal. Not surprisingly, the whale is never sure if a sucker fish is tagging along for a free ride. The whale is also not really aware of its surroundings. The whale sings and tries to find other whales, but whales get together once in a while. Think of it as a Warren Buffet cocktail party with only whales allowed. Otherwise whales think whale thoughts, oblivious to their world.
Our whales know that tiny creatures can annoy a whale, but tiny creatures rarely hurt a whale. This whale believes it is master of all the known universe. The trick is to stay away from tiny creatures with weapons that can make life difficult. Every once in a while, the whale can gobble a tasty morsel like Fast Search & Transfer. Life has been good, but the whale senses trouble in a restless ocean.
The Walrus
The walrus is tired. The old game of providing tips to lost dolphins and tuna is not working any more. So, the walrus kicks back and thinks about what might have been.
The walrus is old, and the new ways of finding young fish eager to learn the old ways are tiring. This walrus prefers to lie down, make some noise, and wait for the next meal. Think of this walrus living in an assisted-living facility. The real world is too unfamiliar. The walrus has two new friends, Sergey and Larry. Sergey and Larry bring the walrus fish once a day. Getting fish is better than catching fish. The walrus likes not working too hard. The rock is a fine place. The waves lapping the beach in Half Moon Bay soothe the walrus. The walrus changes position but does not move.
Interpreting the Two Stories
The whale is a company that is disconnected from the world beyond the ocean. The whale is, for the first time in its life, unnerved, maybe frightened. Sergey and Larry have a different business model. Customers use software and information, and an advertiser pays the bill. The whale wants to swat Sergey and Larry with its tail. Sergey and Larry dance out of the way. The whale is frustrated and getting tired of carrying the old business model into every skirmish and chase.
The walrus is an old timer in the digital world. The spring and bounce have been weighted down by wild and crazy decisions. Walrus friends are leaving the walrus more and more alone. The walrus is isolated. The old ways have lost their zip. The walrus remembers reading about automobiles and buggy whip manufacturers. The walrus believes that he might become a wallet, maybe a pair of shoes. Change, however, is hard at the walrus’ age. The walrus stays where it is, moving to catch the rays of the setting sun. Sergey and Larry will bring another fish today.
The message is clear. The whale is going to fight to survive. The walrus has given up. Sergey and Larry have the ability to deal with both the whale and the walrus with equal aplomb.
Observations
Neither creature has many years left. You have to admire the fighting whale. Too bad its own weight and mass will sap its strength. Not much of a future unless the whale sheds some pounds like Subway’s Jared, the tuna eater. The walrus has found a new best friend and does not want to work too hard. The walrus will gladly do what Googzilla says. Those free fish are really tasty, thinks the walrus.
And what about Sergey and Larry in their “we’re just guys” outfit? Sergey and Larry want to out think the whale. The walrus seems happy as long as he gets a couple of fish every day.
In the great theater of business, the whale and the walrus are sushi.
Stephen Arnold, June 28, 2008