Microsoft Research Search Research: Not a Typo

June 29, 2008

In Chicago, I heard two earnest 20-somethings in the Starbucks on Lincoln and Greenview arguing about Microsoft search. The two whiz kids wanted to locate information about Microsoft’s Web Data Management Group. Part of Microsoft’s multi-billion dollar research and development program, the group (sometimes abbreviated WSM) works to crack tough problems in Web search.

The problem with Web search is that content balloons with each tick of the hyper fast Internet clock. The problem boils down to several hundred megabytes of new content every time slice. To make the problem more interesting, Web data changes. One example ignored by researchers is the facility with which a Web log author can change a posting. Some changes are omissions such as forgetting to assign a tag. Others are more Stalinesque. An author deletes, rewrites, or supplements an original chunk of a Web log. Today, I find more and more Web sites render pages in response to an action that I take. The example that may resonate with you is the operation of a meta search or federating system like Kayak.com. Until I set parameters for a trip, the system offers precious little content. Once I fill in the particulars of my trip, the rendered pages provide some useful information.

If you plan on indexing the Web, you have to figure out these dynamic pages, versions, updates, and new content. The problem has three characteristics. First, timeliness. When I run a query, I want current information. Speed, then, requires an efficient content identification and indexing system. If I lack the computing horsepower for brute force indexing, I have to use user cues such as indexing only the most frequently requested content. In effect, I am indexing less information in order to keep that index current. (I sketch this idea in code below, after the third point.)
Second, I have to be able to get dynamic content into my index. If I miss the information that becomes evident only in response to a query, I am omitting a good chunk of the content. My tests show that more than half the sites in my test set are dynamic. The static HTML of the good old days makes up a smaller portion of the content that must be processed. Google’s work with Google Forms is that company’s first step into this type of data. Microsoft has its own approaches, and some of this work is handled by the wizards at WSM, the Web Search and Mining Group, here.

Third, I also have to figure out how to deal with queries. When I talk about search, there are two sides to the coin. On one side is indexing. On the other side is converting the query to something that can be passed against the index. If a system purports to understand natural language, as Hakia and Powerset assert, then the system has to figure out what the user means. Intent is not such a simple problem. In fact, deciphering a user’s query can be more difficult than indexing dynamic content. Human language is ambiguous. You would not understand my mother if you heard her say to me, “Quilling.” She means something quite specific, and the likelihood any system could figure out that this single word means, “Bring me my work basket” is close to zero unless the system in some way has considerable information about her specific use of language.

As you probably have surmised, natural language processing is complicated. NLP is resource intensive. I need a capable indexing system, and I need a powerful, clever way to clear up ambiguities. Humans don’t type long queries, nor do professionals evidence much enthusiasm for crafting query strings that retrieve exactly what they need. Users type 2.3 words and take what the system displays. Others prefer to browse an interface with training wheels; that is, Use For and See Also references, and explore. The two approaches share one common element: a honking big computer with smart algorithms is needed to make search work.
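
Here is the sketch promised in the first point above: a minimal, hypothetical Python illustration of demand-driven recrawling. The class, the demand-times-staleness score, and the crawl budget are my own inventions for illustration, not a description of Microsoft’s or anyone else’s crawler.

```python
import heapq
import time

class RecrawlScheduler:
    """Toy demand-driven recrawl queue. Hypothetical, for illustration only."""

    def __init__(self, budget_per_cycle=100):
        self.request_counts = {}   # url -> how often user queries touched it
        self.last_crawled = {}     # url -> timestamp of the last fetch
        self.budget = budget_per_cycle

    def record_request(self, url):
        # Called whenever a user query surfaces this URL.
        self.request_counts[url] = self.request_counts.get(url, 0) + 1

    def record_crawl(self, url):
        self.last_crawled[url] = time.time()

    def next_batch(self):
        # Score each known URL by demand * staleness and recrawl the top
        # scorers. Popular, stale pages get refreshed first; unpopular
        # pages wait. The index stays small and current at the cost of
        # coverage, which is exactly the trade-off described above.
        now = time.time()
        scored = [(self.request_counts.get(url, 0) * (now - ts), url)
                  for url, ts in self.last_crawled.items()]
        return [url for _, url in heapq.nlargest(self.budget, scored)]
```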

Web Search and Mining

This Microsoft group works on a number of interesting projects related to content processing, text mining, and search. The group’s Web page identifies data management, dynamic data indexing, and search quality as current topics of interest.

More detail about the group’s activities appears in the list of publicly available research papers. You can browse and download these. I want to comment on three aspects of the research identified on this Web site and then close with several observations about Microsoft research into search.
First, the sample papers date from 2004. I don’t know whether the group has filtered its postings of papers or whether the group has been redirected.

Second, a number of papers discuss clustering. A representative paper is Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Analysis. The full paper is here. The paper explains a system that accepts a query and outputs a clustered result; each row in the display is a cluster. Microsoft’s researchers are parsing a query, retrieving images, and presenting them in a clustered visual display. You will notice that the lead Microsoft researcher worked with a Yahoo researcher and a University of Chicago researcher. You can browse the other clustering papers. A toy sketch of the clustering idea appears below.
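
To make the mechanics concrete, here is a toy rendering of the core idea: cluster result items by a weighted blend of visual, textual, and link similarity. The feature names, weights, threshold, and the naive single-linkage loop are my own choices for brevity; the paper’s actual method is considerably more sophisticated.

```python
import numpy as np

def combined_distance(a, b, weights=(0.5, 0.3, 0.2)):
    # a and b are dicts holding one feature vector per evidence type.
    return sum(w * np.linalg.norm(a[k] - b[k])
               for k, w in zip(("visual", "text", "links"), weights))

def cluster(items, threshold=1.0):
    # Naive single-linkage agglomerative clustering, roughly O(n^3).
    # Fine for a demo; far too slow for a production image engine.
    clusters = [[i] for i in range(len(items))]
    while True:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = min(combined_distance(items[i], items[j])
                        for i in clusters[x] for j in clusters[y])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, x, y)
        if best is None:
            return clusters  # each cluster becomes one row in the display
        _, x, y = best
        clusters[x].extend(clusters.pop(y))
```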

Third, another group of papers touches upon the notion of “information manifolds”. In the 1990s, the phrase “information manifold” enjoyed some buzz. The notion is that a “space” contains indexes which can be queried. One Microsoft paper–“Learning an Image Manifold for Retrieval”–applies the notion to images. Other papers touch upon the topic as well. I found this interest suggestive. Google has some activity in this area as well.

I want to pick up the thread of WSM and research into “manifolds”. I turned first to Search.Live.com, Microsoft’s own search system, and then to Google.com’s Microsoft-centric search subsystem. You can find Microsoft’s search here and Google’s search subsystem here. You may want to stray into specialist Microsoft systems such as Libra here, a showcase for some new Microsoft technology. I tried several queries on the Microsoft Live.com search site and was able to locate the paper referenced above. One of the two hits I tracked down returned a null set.


Silobreaker Rumor

June 28, 2008

With Powerset off the search chess board, Really Simple Sidi asks: Will Silobreaker be the next information access vendor to be acquired? The question did not spring from thin air. Silobreaker executives have spoken with a number of companies about the firm’s technology. I will track Silobreaker more closely. You can read an interview with one of the company’s founders here.

Stephen Arnold, June 28, 2008

Text Analytics Summit Summary Sparks UIMA Thoughts

June 22, 2008

Seth Grimes posted a useful series of links about the Text Analytics Summit, held in Boston the week of June 16, 2008. You can read his take on the conference here. I was not at the conference. I was on the other side of the country at the Gilbane shindig. To make up for my non-attendance, I have been reading about the summit.

From what I can deduce from the Web log posts, the conference attracted the Babe Ruths and Ty Cobbs of text analysis, a market that nestles between enterprise search and business intelligence. I am not too certain about the boundaries of either of these markets, but text analytics is polymorphic and can appear searchy or business intelligency depending upon the context.

I clicked through the links Mr. Grimes provides, and I recommend that you spend a few minutes with each of the presentations. I learned a great deal. Please review his short essay.

One point stuck in my mind. The purpose of this essay is to call your attention to this comment and offer several observations about its implications for those who want to move beyond key word retrieval. Keep in mind that I am offering my opinion.

Here’s the comment. Mr. Grimes writes:

I’ll conclude with one disappointing surprise on the technical front, that UIMA — the Unstructured Information Management Architecture, an integration framework created by IBM and released several years ago as open source to the Apache — has not been more broadly accepted. IBM software architect Thomas Hampp spoke about his company’s use of the framework in the OmniFind Analytics edition, but Technology Panel participants said that their companies — Attensity (David Bean), Business Objects (Claire Thomas), Clarabridge (Justin Langseth), Jodange (Larry Levy), and SPSS (Olivier Jouve) — simply do not perceive user demand for the interoperability that UIMA can offer.

My understanding of this statement and the supporting evidence in the form of high-profile industry executives is that an open standard developed by IBM has little, if any, market traction. In short, if the UIMA standard were gasoline, your automobile would not run or would just sputter along.

Let us assume that this lack of UIMA demand is accurate. Now I know this is a big assumption, and I am confident that an IBM wizard will tell me that I am wrong. Nevertheless, I want to follow this assumption in the next part of the essay.

Possible Causes

[Please, keep in mind that I am offering my opinion in a free Web log. If you have not read the editorial policy for this Web log, click on the About link on any page of Beyond Search. Some readers forget that I am using this Web log as a journal and a container for the information that does not appear in my for-fee reports and my paid writings such as my monthly column in KMWorld. Some folks are reading my musings and ignoring or forgetting what I am trying to capture for myself in these posts. Check out the disclaimer here.]

What might be causing the lack of interest in UIMA, which, as you know, is an open source framework that allows different software gizmos to talk to one another? For a more precise definition of UIMA, you can give the IBM search engine a whirl or click this Wikipedia link, http://en.wikipedia.org/wiki/UIMA.
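
For readers who have not looked under UIMA’s hood: the core idea is that independent annotators read from and write to one shared analysis structure, so components from different vendors can be chained. The snippet below is a conceptual Python toy of that pattern only. The real Apache UIMA is a Java framework with XML type descriptors; none of these names come from its actual API.

```python
# Conceptual sketch of UIMA's "common analysis structure" idea.
# Not the real Apache UIMA API; invented names, Python for brevity.
class CAS:
    def __init__(self, text):
        self.text = text
        self.annotations = []  # (begin, end, type)

    def add(self, begin, end, type_):
        self.annotations.append((begin, end, type_))

def tokenizer(cas):
    # First annotator: mark token boundaries.
    pos = 0
    for tok in cas.text.split():
        start = cas.text.index(tok, pos)
        cas.add(start, start + len(tok), "Token")
        pos = start + len(tok)

def shouting_detector(cas):
    # Second annotator: builds on the tokens the first one produced,
    # without knowing anything about how they were made.
    for begin, end, type_ in list(cas.annotations):
        if type_ == "Token" and cas.text[begin:end].isupper():
            cas.add(begin, end, "Shout")

cas = CAS("UIMA lets annotators share results")
for annotator in (tokenizer, shouting_detector):
    annotator(cas)
```

The pitch is interoperability: swap either annotator for a competitor’s and the pipeline still runs. That is precisely the benefit the panelists said their customers are not asking for.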

Here is my short list of the causes for the UIMA excitement void. I am not annoyed with IBM. I own IBM servers, but I want to pick up Mr. Grimes’s statement and perform a thought experiment. If this type of writing troubles you, please click away from Beyond Search. Also, I am reacting to a comment about IBM, but I want to use IBM as an example of any large company’s standards or open source initiative.

First, IBM is IBM. IBM has an obligation to its shareholders to deliver growth. Therefore, IBM’s promulgating a standard is, in some way large or small, a way to sell IBM products and services. Maybe potential UIMA users are not interested in the potential upsell that may follow.

Second, open source and standards have proven to be incredibly useful. Maybe IBM needs to put more effort into educating partners, vendors, and customers about UIMA? Maybe IBM has invested in UIMA and found that marketing did not produce the expected results, so IBM has moved on.

Third, maybe today IBM lacks clout in the search and content processing sector. In 1960, IBM could dictate what was hot and what was not. UIMA’s underwhelming penetration might be evidence that the IBM of today lacks the moxie the company enjoyed almost a half century ago.

And one fourth possibility is that no one really wants to embrace UIMA. Enterprise software is not a level playing field. The vendor wants to own the customer, locking out any other vendor who might suck dollars from the account. IBM and other enterprise vendors want to build walls, not create open doors.

I have several other thoughts on my list, but these four provide insight into my preliminary thinking.

Observations

Now let’s consider the implications of these four points, assuming, of course, that I am correct.

  1. Big companies and standards do not blend as well as a peanut butter and jelly sandwich. The two ingredients may not yet be fully in harmony. Big companies want money, and open standards do not have the revenue-to-risk ratio that makes financial officers comfortable.
  2. Open source is hard to control. Vendors and buyers want control. Vendors want to control the technology. Buyers want to control risk. Open source may reduce the vendor’s control over a system and buyers lose control over the risk a particular open source system introduces into an enterprise.
  3. Open source appeals to those willing to break with traditional information technology behavior. IBM, despite its sporty standards garb, is a traditional vendor selling traditional solutions. Open source is making headway, but it is most successful when youthful blood flows through the enterprise. Maybe UIMA needs more time for the old cows to leave the stock pen?

What is your view? Is your organization ready to embrace UIMA, big company standards, and open source? Agree? Disagree? Let me know.

Stephen Arnold, June 22, 2008

Gilbane Chats Up a Silly Goose: The Arnold Interview

June 18, 2008

On Wednesday, June 18, 2008, I will be interviewed in front of an audience completely unaware of why a fellow from Harrod’s Creek, Kentucky, is sitting on a stage answering questions. No one is more baffled than I. Based on my knowledge of the big city, I anticipate confusion, torpor, and indifference to my comments.

In this essay, which will become available on June 18, 2008, the curious will have a reference document that summarizes my thoughts on issues about which I may be asked. There has been no dry run for this interview. The last one in which I participated–the Associated Press’s invitation-only gathering last year–left the audience with little appetite for food. Some found the beverage table a more welcome destination.

Anticipated Question 1: What’s “beyond search” mean?

In research conducted by me and others, about two-thirds of the users of an enterprise search system are dissatisfied with that system. “Beyond search” implies that we have to move to another approach because what is now available in the organizations that I and other researchers have investigated is not well liked. Given the cost of some systems, annoying two-thirds of the users is tantamount to getting a D or an F on a report card.

Anticipated Question 2: What’s “behind the firewall search” mean?

I wrote about the search elephant here. Many different functions involving information access are made available to an employee, contractor, or authorized user. The idea is that “behind the firewall search” is not public; an organization makes it available to a select group of users. The “search elephant” refers to the many different ways in which search is understood and perceived within an organization.

Anticipated Question 3: Why are there so many search vendors and more coming each day?

There is a belief that existing systems are not tapping into what I have estimated to be a $2.5 billion market for information access in the enterprise. Entrepreneurs and people with money look at Google and think, “We should be able to make gains like that in the enterprise market.” I also think that the market itself is trying to figure out the search elephant. Buyers don’t know what is needed. When entrepreneurs, money, and confused customers with severe information access problems come together, we have the type of marketplace that exists today.

Anticipated Question 4: What about Microsoft and Fast Search & Transfer?

I understand that it is business as usual at Microsoft and Fast Search. For Microsoft, this means trying to get 10,000 motorboats to go in roughly the same direction. For Fast Search, the company continues to license its Enterprise Search Platform and service customers. There are many bits of grit in the working parts where Microsoft and Fast Search mesh. It is too soon to tell if these inhibitors are trivial or whether the machine will sputter, maybe stop. What I tell people is to ignore the Microsoft-Fast Search tie up, and get a solution for a SharePoint environment that works. There are good choices ranging from a lower cost solution like dtSearch to a competitively priced system from Coveo, Exalead, ISYS Search Software, or another Microsoft Certified vendor.

Anticipated Question 5: What’s the impact of the Google Search Appliance?

Many vendors will tell you that Google has delivered a second-class system. That’s not exactly true. With the OneBox API, Google has a very solid solution. The impact is that Google has about 10,000 enterprise customers. These are sales made, in many cases, under the noses of incumbent vendors. Google’s a player in the enterprise market and a serious one. I have uncovered one impactful bit of research at Google that could–note, I said, could–change the search landscape. I have tried to ask Google about this development, but the GOOG thinks I do not merit its attention. Too bad for me, I guess.

Anticipated Question 6: What’s the impact of text processing, semantic search, and other new technologies on enterprise search?

These are hot terms that will open doors. Some vendors will make sales because of their ability to mesh trendy concepts with more traditional search.

Stephen Arnold, June 18, 2008

Search Rumor Round Up, Summer 2008

June 14, 2008

I am fortunate to receive a flow of information, often completely wacky and erroneous, in my redoubt in rural Kentucky. The last six months have been a particularly rich period. Compared to 2007, 2008 has been quite exciting.

I’m not going to assure you that these rumors have any significant foundation. What I propose to do is highlight several of the more interesting ones and offer a broader observation about each. My goal is to provide some context for the ripples that are shaking the fabric of search, content processing, and information retrieval.

The analogy to keep in mind is that we are standing on top of a jello dessert.

The substance itself has a certain firmness. Try to pick it up or chop off a hunk, and you have a slippery job on your hands. Now, the rumors:

Rumor 1: More Consolidation in Search

I think this is easy to say, but it is tough to pull off in the present economic environment. Some companies have investors who have pumped millions into a search and content processing business. These kind souls want their money back. If the search vendor is publicly traded, the set-up of the company or its valuation may be a sticky wicket. There have been some stunning buyouts so far in 2008. The most remarkable was Microsoft’s purchase of Fast Search & Transfer. SAS snapped up the little-known Teragram. But the wave of buyouts across the more than 300 companies in the search and content processing sector has not materialized.

Rumor 2: Oracle Will Make a Play in Enterprise Search

I receive a phone call or two a month asking me about Oracle SES10g. (When you access the Oracle Web site, be patient. The system was sluggish for me on June 14, 2008.) The drift of these calls boils down to one key point: “What’s Oracle’s share of the enterprise search market?” The answer is that its share can be whatever Oracle’s accountants want it to be. You see, Oracle SES10g is linked to the Oracle relational database and other bits and pieces of the Oracle framework. Oracle’s acquisitions in search and retrieval, from Artificial Linguistics more than a decade ago to TripleHop in more recent times, have given Oracle capability. As a superplatform, Oracle is a player in search. So far this year, Oracle has been moving forward slowly: an experiment with Bitext here, a deployment with Siderean Software there. Financial mavens want Oracle to start acquiring search and content processing companies. There are rumors, but so far no action, and I don’t expect significant changes in the short term.


Hakia: Pulled by Medical Information Magnetism

June 13, 2008

A colleague and I visited with the Hakia team last summer after the Bear Stearns Internet conference. I’ve tracked the company with my crawlers, but I have not made an effort to contrast Hakia’s approach with that of Powerset, Radar Networks, and the other “semantic” engines now in the market.

I received a Hakia news release today (June 12, 2008), and I noticed that Hakia is following the well-worn path of many commercial databases in the 1980s. The point that jumped out at me is that Hakia is adding content to its index; specifically, the PubMed metadata and abstracts. This is a US government database, and it has a boundary. The information is about health, medicine, and closely related topics. Another advantage is that PubMed, like most editorially-controlled scientific, technical, and medical databases, has reasonably consistent indexing. Compared to the wild and uncontrolled content available on Web sites and from many “traditional” publishers, this content makes text processing [a] less computationally intensive because algorithms don’t have to figure out how to reconcile schema, find concepts, and generate consistent metadata. [b] Data sets like PubMed have some credibility. For example, we created a test Web site five years ago. We processed some general newspaper articles, posted them, and used the content for a test of a system called ExploreCommerce. Then we forgot about the site. Recently someone called objecting to a story. The story was a throwaway, not intended to be “real”. But “if it’s on the Internet, it must be true” echoed in this caller’s mind. PubMed has editorial credibility, which makes a number of text processing functions somewhat more efficient.

Kudos to Hakia for adding PubMed. You can read the full news release here. You can try the Hakia health and medical search here.

Several observations will highlight my thoughts about this Hakia announcement:

  1. The PR fireworks about semantic search have made the concept familiar to many people. The problem is that semantic search for me is a misnomer. Semantic technology, I think, can enhance certain content processing operations. I am still looking for a home run semantic search system. Siderean’s system is pretty nifty, and its developers are careful to explain its functionality without the Powerset-Hakia type of positioning. I know vendors will want to give me demonstrations and WebEx presentations to show me that I am wrong, but I don’t want any more dog and pony shows.
  2. My hunch is that using bounded content sets–Wikipedia, specific domains, or vertical content–allows the semantic processes to operate without burdening the companies with Google-scaling challenges. Smaller content domains are more economical to index and update. Semantic technology works. Some implementations are just too computationally costly to be applicable to unbounded content collections and the data management problems these collections create.
  3. Health is a hot sector. Travel, automobiles, and finance offer certain benefits for the semantic technology company. The idea is to find a way to pay the bills and generate enough surplus to keep the venture cats from consuming the management team. I anticipate more verticalization or narrow content bounding. It is cheaper to index less content more thoroughly and target a content domain where there is a shot at making money.
  4. It’s back to the past. I find the Hakia release a gentle reminder of our play at the Courier Journal & Louisville Times Co. with Pharmaceutical News Index. We chose a narrow set of content with high value to an easily identified group of companies. The database was successful because it was narrow and had focus. Hakia is rediscovering the business tactics of the 1980s and may not even know about PNI and why it was a money maker.

I’m quite enthusiastic about the Hakia technology. I think there is enormous lift in semantics in the enterprise and Web search. The challenge is to find a way to make semantics generate significant revenue. Tackling content niches may be one component of financial success.

Stephen Arnold, June 13, 2008

Silobreaker: Breaking through Information Access Silos

June 12, 2008

Silobreaker is an information access system that pushes the limits of search, content processing, and text analysis. The company makes its system available here. You can launch queries and manipulate a range of features and functions to squeeze meaning and insights from information.

Mats Bjore–former Swedish intelligence officer and McKinsey & Co. knowledge management consultant–asserts that certain types of “real world” questions may be difficult for search systems to answer. Echoing Google’s Dr. Peter Norvig, Mr. Bjore believes that human intelligence is needed when dealing with facts and data. He told Beyond Search:

We always emphasize the importance of using our technology for decision-support, not to expect the system to perform the decision-making for you. The problem today is that analysts and decision-makers spend most of their time searching and far too little time learning from and analyzing the information at hand. Our technology moves the user closer to the many possible “answers” by doing much of the searching and groundwork for them and freeing up time for analysis and qualified decision-making.

The low-key Mr. Bjore demonstrated the newest Silobreaker features to Beyond Search. Among the features that caught our attention was the point-and-click access to link analysis, mapping, and a useful “trends search” function.

Mr. Bjore said:

The whole philosophy behind Silobreaker is to move away from the traditional keyword based search query which generates just page after page of headline results and forces the user into a loop of continually adjusting the query to find relevance and context. We see the keyword-based query as a possible entry point, but the graphical search results enable the user to discover, navigate and drill down further without having to type in new keywords. No-one can imagine managing numerical data without the use of descriptive graphical representations, so why do we believe that we can handle vast quantities of textual data in any other way? Well we don’t think we can, and traditional search is proving the point emphatically. Today’s Silobreaker is just giving you a first glimpse of how we (and I’m sure others) will use graphics to bring meaning to search results.

Explaining sophisticated information access systems is difficult. Mr. Bjore drew an analogy that provides a glimpse of how technology extends the human decision mechanism. He said:

Silobreaker works like one of our dogs. Their eyes see what is in front of you, the ears hear the tone of voice, the nose smells what has happened, what is now, and what’s around the corner.

[Media trends graphic]

This graphic shows the key trends in the content processed by the system in a period specified by the user. When the system processes an organization’s proprietary information, a user can see at a glance what the key issues are. Silobreaker can combine internal and external data so that trend lines reflect trends from multiple sources.
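
Out of curiosity about how such a trends view might be computed, here is a hypothetical sketch: bucket term counts by day across all sources, then flag terms whose recent volume outruns their own history. The seven-day window and the scoring ratio are guesses at a generic approach, not Silobreaker’s actual method.

```python
from collections import Counter, defaultdict

def trend_scores(docs, recent_days=7):
    # docs: iterable of (day_index, source, text) from internal and
    # external feeds alike, which is what lets the trend lines blend
    # proprietary and public content.
    daily = defaultdict(Counter)
    for day, _source, text in docs:
        daily[day].update(text.lower().split())
    horizon = max(daily)
    recent, history = Counter(), Counter()
    for day, counts in daily.items():
        (recent if day > horizon - recent_days else history).update(counts)
    history_days = max(1, horizon - recent_days + 1)
    # Score = recent daily rate vs. historical daily rate; terms with
    # high scores are "trending" and would drive the graphic.
    return {term: (recent[term] / recent_days) /
                  (1 + history[term] / history_days)
            for term in recent}
```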

The system is available to commercial organizations via software as a service or an on-premises installation. Mr. Bjore characterized the pricing of the service as “very competitive.” You can contact the company by telephoning either the firm’s London, England, office at +44 (0) 870 366 6737 or the firm’s Stockholm, Sweden, office at +46 (0) 8 662 3230. If you prefer email, write sales at silobreaker dot com. More information about the company is here. Like Cluuz.com, Silobreaker ushers in the next generation in information access and analysis.

Silobreaker: Sophisticated Intelligence

June 12, 2008

An Interview with Mats Bjore, Silobreaker

I met Mats Bjore at a conference seven, maybe eight years ago. The majority of the attendees were involved in information analysis. Most had worked for government entities or commercial organizations with a need for high-quality information and analysis.

Mr. Bjore and two of his dogs near his home in Sweden.

I caught up with Mats Bjore, one of the wizards behind the Silobreaker service which I profiled in my Web log, in the old town’s Café Gråmunken. Since I had visited Stockholm on previous occasions, I asked the waiter for a cheese plate and tea. No herring for me.

Over the years, I learned that Mr. Bjore shared two of my passions: high-value intelligence and canines. One of my contacts told me that his prized possession is a shirt emblazoned with “Dog Father”, crafted by his military client.

Before the waitress brought our order, I asked Mr. Bjore about his interest in dogs and Sweden’s pre-occupation with herring in a mind-boggling number of guises. He laughed, and I turned the subject to Silobreaker.

Silobreaker is one of a relatively small number of firms offering a combination intelligence-centric solution to clients and organizations worldwide. One facet of the firm’s capabilities stems from its content processing system. The word “search” does not adequately describe the system. Silobreaker generates reports. The other facet of the company is its deep expertise in information itself.

The full text of my conversation with Mats Bjore appears below:

Where did the idea for Silobreaker originate?

Silobreaker actually has a long history in the sense of the word Silobreaker. When I was working in the intelligence agency and later at McKinsey & Co, I was amazed at the knowledge silos that existed totally isolated from each other. I saw the promise of technology to assist in unlocking those silos; however, the big names at that time–Autonomy, Verity, Convera, etc.–failed to deliver, big time. Disappointed and waiting for the technology of the future, I registered the name Silobreaker.com, more like a wish for the perfect system. A couple of years later, in 2003-2004, I was approached by a team of amazing people–Per Lindh, Björn Löndahl, Jimmy Mardell, Joakim Mårlöv and Kristofer Mansson. These professionals wanted to further develop their software into an intelligence platform. In 2005 my company Infosphere and the software company Elucidon joined forces, and we created Silobreaker Ltd as a joint venture. One year later we consolidated software, service, and consulting into one brand–Silobreaker.

Today, Silobreaker enables the breaking down of silos built from informational, knowledge, or mental bricks and mortar.

What’s your background?

I am a former lieutenant colonel in the Swedish Army. I was detailed to the Swedish Military intelligence Agency where I founded the Open Source Intelligence function in 1993.

After leaving the government, I became the Scandinavian Knowledge Manager for McKinsey & Company. After several years at McKinsey, I started my own company, Infosphere, and the service Able2Act.com.

I am also a former musician in a group called Camping with Penguins. You know that I am a lover of dogs. Too bad you like boxers. You need to get a couple of my friends so you have a real dog. I’m just joking.

I know. I know. What are the needs that traditional search engines like Autonomy, Endeca, and Fast Search (now Microsoft) are not meeting?

Meaning and context. I would also say that traditional engines require that you always know what to search for. You need to be an expert in your field to take full advantage of the information in databases and unstructured text. With the Silobreaker technology the novice becomes an expert and the expert becomes a discoverer. It might sound like a sales pitch, but it’s true. Every day in my work I need to jump into new areas, new industries, and new topics. There is no way that I can formulate a keyword search, nor do I have the time to digest 100 or 1,000 articles in the mode of click and read, click and read. With Silobreaker and its technology I start very broadly, and the system directly helps me understand the context of a large set of articles in different formats, from different repositories, on different topics. We call this a View 360 with an In Focus summary. Note: here’s an In Focus example provided to me after the interview.

[In Focus example]

When I search in traditional systems based on the search/read philosophy, I spend too much time searching and reading and too little time on sense making and analysis. With Silobreaker, I start directly with that process, and I create new value for me and for my clients.

In a conversation with one of the Big Brands in enterprise search, the senior VP told me that services producing answers are “just interfaces”. Do you agree?

“Just interfaces” might be a bit harsh on the companies that actually try to provide direct answers to searches – they actually have some impressive algorithms, but to a certain extent we agree. We simply don’t think that an “answer engine” solves any real information overload problem.

If you want to know “What’s the population of Nigeria” – fine, but Wikipedia solves that problem as well. But how do you “answer” the question “What’s up with iPhone”? There are many opinions, facts, news items, and blogs “out there”. Trying to provide an “answer” to any “question” is very hard to do, maybe futile.

We always emphasize the importance of using our technology for decision-support, not to expect the system to perform the decision-making for you. The problem today is that analysts and decision-makers spend most of their time searching and far too little time learning from and analyzing the information at hand. Our technology moves the user closer to the many possible “answers” by doing much of the searching and groundwork for them and freeing up time for analysis and qualified decision-making. Note: This is a 360 degree view of news from Silobreaker provided after the interview.

[360 degree view of an article]

There’s significant dissatisfaction among users of traditional key word search systems. What’s at the root of this annoyance?

The more information that is generated, duplicated, recycled, edited, and abstracted, in combination with the rapid proliferation of “I never use a spell checker and I write in your language with my own set of grammar”, the more the need for smarter systems to actually find what you are looking for will increase. A couple of years from now, we will also see the demise of the mouse and keyboard and the emergence of other means of input; the keyword approach is just not it.

Keyword based search works reasonably well for some purposes (like finding your nearest Swedish herring restaurant), but as soon as you take a slightly more analytical approach it becomes very blunt as a tool.

There is no real discovery or monitoring aspect to keyword based search. The paradox is that you’ll need to know what you’re looking for in order to discover.

Matching keywords to documents doesn’t bring any meaning to the content nor does it put the content in context for the user.

Keyword based search is a bottom-up approach to relevance. The burden is put entirely on the user to dissect large result sets in order to find the relevant articles, who the key players are, how they relate to each other, and other factors.

This burden creates the annoyance and “research fatigue”, and as a result users rarely go beyond the first page of results – hence the desperate hunt amongst providers for PageRank, which may have little or no bearing on the user’s real needs.

The intelligence agencies in many ways are the true professionals in content analysis. Why have the systems funded by IN-Q-TEL, Interpol, and MI5/MI6 not caught on in the enterprise world?

These systems are often complex, and their “end solutions” are often a mix of different software that is not well integrated. We already see a change with our technology. Some government customers look at our free service at Silobreaker.com and have a chance to explore how Silobreaker works without sales people hovering over them.

We want our clients to see one technology with its pieces smoothly integrated. We want the clients to experience information access that, we believe, is far beyond our competitors’ capabilities.

Intelligence agencies have often acquired systems that are too complex and too expensive for commercial enterprises. Some of these systems have been digital Potemkins. These systems provide the user with no proof about why a certain result was generated.

Now, this “black box” approach might be okay when you have a controlled content set, like on the classified side within the intelligence community. But the “real world” needs to make sense of unstructured information here and now.

You have more than 100,000 major companies in the world, and you have 200 or so countries. Basically the need for technology solutions is the same. For me it’s totally absurd that governments complicate their systems instead of looking at what is working here and now.

Furthermore, I think one of the reasons that government can pursue complex and sometimes fruitless projects is that some agencies don’t have to make money to survive. The taxpayers will solve that.

In the commercial sector, profit and time are essential. Corporations also take into account such factors as ease of use when investing in a system.

With the usually high turnover in any industry, a system must be easy to use in order to reduce training time and training costs. In some government sectors, turnover is much lower. People can spend a great deal of time learning how to use the systems. Does this match your experience?

Yes, and I agree with your analysis.

I had an email exchange with the chief technical officer of a major enterprise search vendor. He asserted that social search was the next big thing. When I pointed out that social search worked when the system ingested a large amount of information, much of it available covertly, he argued that general social information was “good enough”. Do you agree?

No, I don’t. Now we are talking about the quality of the information. If you were to index and cross-reference XING, Facebook, and LinkedIn, you could produce fantastic displays of the connections. However, how many of these links between people are actually true (in the sense that the people have actually met or even have some common ground)?

There is a very large set of people who try to get as many connections as possible, thus diluting the value of true connections. I agree that you need a significant amount of information in order to get a baseline. You also need to validate this kind of data with reality checks in other kinds of information sources – offline and online.

My main company, Infosphere, did some research into the financial networks in the Middle East. The fact-based search (ownership, shareholdings, etc.) provided one picture; then you have to add the family and social connections, the view from media, then look at resident clusters and other factors. We had more than 8,000 dots (people) that we connected. But we were just scratching the surface.

The graphic displays in Silobreaker are quite useful. In a general description, what are you doing to create on the fly different types of information displays?

The whole philosophy behind Silobreaker is to move away from the traditional keyword based search query which generates just page after page of headline results and forces the user into a loop of continually adjusting the query to find relevance and context.

We see the keyword-based query as a possible entry point, but the graphical search results enable the user to discover, navigate and drill down further without having to type in new keywords. No-one can imagine managing numerical data without the use of descriptive graphical representations, so why do we believe that we can handle vast quantities of textual data in any other way? Well, we don’t think we can, and traditional search is proving the point emphatically. Today’s Silobreaker is just giving you a first glimpse of how we (and I’m sure others) will use graphics to bring meaning to search results.

Is Silobreaker available for on-premises installation and as SaaS (software as a service)? What do you see as the future access and use case for Silobreaker?

That’s a good question. Let me say that Silobreaker’s business model is divided into three parts.

First, we have a free news search service that eventually will be ad-supported but whose equally important role is to showcase the Silobreaker technology and function as a lead generator for the enterprise offerings.

Second, our Enterprise Service, which is due to be released in September or October 2008, is an online, real-time “clipping service” aimed at companies, banks, and consultants as well as government agencies. It will offer a one-stop shop for news and media monitoring, from defining what you are monitoring to in-depth content aggregation, analysis, and report generation. This service will come with a SaaS facility that enables the enterprise to upload its own content and use the Silobreaker technology to view and analyze it.

Third, we offer a Technology Licensing option. This could range from a license to embed Silobreaker widgets in your own site to a fully operational local Silobreaker installation behind your firewall and customized for your purposes and for your content.

Furthermore, parts of the Silobreaker technology are available as SaaS on request.

Let’s talk about content. Most search systems assume the licensee has content. Is this your approach?

Yes and no. We can facilitate access to some content and also integrate crawling with third-party suppliers or, if it’s very specific, assist with specialty crawling.

On top of that we can, of course, integrate the fact sheets, profiles, and other content from my other venture, Able2Act.com which gives any system and any content set some contextual stability.

What are the content options that your team offers? Is it possible to merge proprietary content and the public content from the sources you have mentioned?

Yes, the ideal blend is internal and external content. And that really sets our team apart. Most of the Silobreaker group works with information as the key focus on a daily basis, sometimes 24×7 on certain projects. In other words, we are end users who keep our ears to the ground for information. Most companies out there are either tech people or content aggregators that just sell. We are both.

When you look forward, what is the importance of mobile search? Does Silobreaker have a mobile interface?

Mobile “search” is an extremely important field where traditional keyword-based search just doesn’t cut it. The small screen size of mobile devices and their limited (and sometimes cumbersome) input capabilities are just not suitable for sifting through pages of search results only to find that you need another Boolean operator and have to start all over again. We believe that users must be given a much broader 360 view of what they’re searching for in order to get to the “nugget” information faster. Silobreaker does not currently offer a mobile interface, but needless to say we’re working on it.

What are the major trends that you see emerging in the next nine to 12 months in content processing?

That’s a difficult question. I can identify several areas that seem important to my clients: Contextual processing, cross media integration, side-by-side translations, and smart visualization. Note: I have inserted a Silobreaker link view screen shot Mr. Bjore provided me after our conversation.

[Silobreaker link map]

Observations

Silobreaker caught my attention when I saw a demonstration of the system before it was publicly available. The system has become more useful to intelligence professionals with each enhancement. Compared to laundry lists of results, the Silobreaker approach allows a person working in a time-compressed environment to size up, identify, and obtain the information needed. The system’s “smart software” shows that Silobreaker’s learning and unlearning function is part of the next generation of information tools. After accessing information with Silobreaker, I am reminded that key word search is a medieval approach to 21st century problems. Silobreaker’s ability to assist a decision maker makes it clear that technology, properly applied, becomes a force multiplier without pushing human judgment to the sidelines. In one of our conversations, Mr. Bjore drew a parallel between Silobreaker and the canines for which he and I share respect and affection. He said, “Silobreaker works like one of our dogs. Their eyes see what is in front of you, the ears hear the tone of voice, the nose smells what has happened, what is now, and what’s around the corner.” I agree. Silobreaker is more than search; it’s an extension of the information envelope. Take a close look at this extraordinarily good system here.

Stephen E. Arnold, June 12, 2008

The Semantic Chimera

June 8, 2008

GigaOM has a very good essay about semantic search. What I liked was the inclusion of screen shots of results of natural language queries–that is, queries without Boolean operators. Two systems indexing Wikipedia are available in semantic garb: Cognition here and Powerset here. (Note: there is another advanced text processing company called Cognition Technologies whose url is www.cognitiontech.com. Don’t confuse these two firms’ technologies.) GigaOM does a good job of making posts findable, but I recommend navigating to the Web log immediately.

Nitin Karandikar reviews both Cognition’s and Powerset’s approach, so I don’t need to rehash that material. For me the most important statement in the essay is this one:

There are still queries (especially when semantic parsing is not involved) in which Google results are much better than [sic] either Powerset or Cognition.

Let me offer several observations about semantic technology applied to constrained domains of content like the Wikipedia:

  1. Semantic technology is extremely important in text processing. By itself, it is not a silver bullet. A search engine vendor can say, “We use semantic technology”. The payoff, as the GigaOM essay makes clear, may not be immediately evident. Hence, the “Google is better” type statement.
  2. Semantic technology is in many search systems, just not given center stage. Like Bayesian maths, semantic technology is part of the search engine vendors’ toolkits. Semantic technology delivers very real benefits in functions from disambiguation to entity extraction. (A toy illustration of disambiguation appears after this list.) As this statement implies, there are many different types of semantics in the semantic technology spectrum. Picking the proper chunk of semantic technology for a particular process is complicated stuff, and most search engine vendors don’t provide much information about what they do, where they get the technology, or how the engineers determined which semantic widget to use in the first place. In my experience, the engineers arrive at their job with academic and work experience. Those factors often play a more important part than rigorous testing.
  3. Google has semantic technology in its gun sights. In February 2007, information became available about Google’s programmable search engine, which has semantics in its plumbing. These patent applications state that Google can discern context from various semantic operations. Google–despite its sudden willingness to talk in fora about its universal search and openness–doesn’t say much about semantics, and for good reason. It’s plumbing, not a service. Google has pretty good plumbing, and its results are relevant to many users. Google doesn’t dwell on the nitty gritty of its system. It’s a secret ingredient, and no user really cares. Users want answers or relevant information, not a lab demo of a single text processing discipline.
  4. Most users don’t want to type more than 2.2 words in a query. Forget typing well-formed queries in natural language. Users expect the system to understand what is needed and the situation into which the information fits. Semantic technology, therefore, is an essential component of figuring out meaning and intention. Properly functioning semantic processes produce an answer. The GigaOM essay makes it clear that when the answers are not comprehensive, on point, or what the user wanted, semantic technology is just another buzz word. Semantic technology is incredibly important, just not as an explicit function for the user to access.
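
To show what I mean by disambiguation in point two above, here is a toy, dictionary-based example. The mini knowledge base and the overlap scoring are invented for illustration; commercial systems lean on far richer evidence such as co-occurrence statistics, ontologies, and user context.

```python
# Invented two-sense knowledge base for one ambiguous term.
KB = {
    "jaguar": {
        "Jaguar (car)":    {"drive", "engine", "sedan", "dealer"},
        "Jaguar (animal)": {"jungle", "prey", "cat", "habitat"},
    }
}

def disambiguate(term, sentence):
    # Pick the sense whose cue words overlap the sentence the most.
    context = set(sentence.lower().split())
    senses = KB.get(term.lower(), {})
    return max(senses, key=lambda s: len(senses[s] & context), default=None)

print(disambiguate("jaguar", "The jaguar stalked its prey in the jungle"))
# -> Jaguar (animal)
```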

I talk about semantic technology, linguistic technologies, and statistical technologies in this Web log and in my new study for the Gilbane Group. The bottom line is that search doesn’t pivot on one approach. Marketers have a tough time explaining how their systems work, and these folks often fall back on simplifications that blur quite different things. Mash ups are good in some contexts, but in understanding how Powerset integrates a licensed technology from Xerox PARC and how that differs from Cognition’s approach, simplifications are of modest value.

In my experience, a company which starts out as statistics only quickly expands the system to handle semantics and linguistics. The reason: there’s no magic formula that makes search work better. Search systems are dynamic, and the engineers bolt new functions on in the hope of finding something that will convert a demo into a Google killer. That has not happened yet, but it will. When a better Google emerges, describing it as a semantic search system will not tell the entire story. Plumbing that runs compute-intensive processes to crunch log data, and smart software, are important too.

A demo is not a scalable commercial system. By definition a service like Google’s incorporates many systems and methods. Search requires more than one buzz word. You may also find the New York Times’s Web log post by Miguel Helft about Powerset helpful. It is here.

Stephen Arnold, June 8, 2008

Google: No Game Changer … Just Yet

June 5, 2008

Imagine my surprise when Computerworld picked up on information in my April 2008 Gilbane Group study, Beyond Search. You can read the Computerworld story here. (Hurry. Computerworld content can be hard to find if you dally. I won’t try to summarize the article nor will I comment on it beyond one modest observation.)

The GOOG bought Transformic. Transformic has some very prescient innovations. These are not new. In fact, the core insights date from the early 1990s. With the Google plumbing in place and XML and semi-structured content processing in the bag, Google has to look beyond today. Never mind that Google’s competitors don’t have a clue what Google does on a day-to-day operational basis. The GOOG is the future.

The killer comment in the nice article by Chris Kanaracus is:

“Inside an enterprise, and maybe unlike the Internet, you can know a lot about a user,” such as who they report to, said Matthew Glotzbach, director of product management for Google’s enterprise division. “There’s a lot of empirical information you can derive. All of that can be used to create a very, very rich profile about the user, which can then be used to create a really rich search experience.” Do not expect Google to suddenly bring a game-changing product to market, according to Glotzbach. “The model is not these kind of big-bang approaches where we work for multiple years and then roll something out. In terms of what we do in enterprise search, you’ll see a constant flow, as opposed to one sort of big bang — here’s a whole new thing,” he said.

Mr. Glotzbach was on a panel billed as a debate late last year. Ah, he’s a canny wordsmith that wizard be.

Mr. Glotzbach’s comment comes from the belly of a company planning to start building housing in the year 2013 on prime NASA real estate in Mountain View, Calif.

Time, to Google, means right now and really fast. Time also means the drip drip of incremental functions slipstreamed in apparently meaningless droplets. The pace will be Googley slow. You will need a time lapse camera to note the changes.

Should IBM, Oracle, and other giants in data management worry? Nope, executives at the companies told me that their knowledge of Google is rich, deep, and wide. I do have a nifty briefing about the Transformic technology. Interested? Write me at sa at arnoldit dot com.

A chipper quack to Computerworld for the reference to my new study.

Stephen Arnold, June 5, 2008
