Powerset Nails Search: A Very Bold Assertion

June 29, 2008

Chris Gaylord, writing for the Christian Science Monitor, updates a May 2008 essay, and emphasizes this point:

Google has been a bit dismissive of semantic search, preferring (for now at least) its quick keyword approach. But this Microsoft news puts a lot of weight – and $100 million – behind the notion that web users want to ask questions to a search engine, not just feed it keyword clues. We have yet to see if Microsoft will keep the Powerset name or, more likely, integrate the technology into its Live Search. That site certainly needs some help. The company has fought a losing battle against Google and Yahoo for years now. Despite its best efforts and even cash incentives, Microsoft has not been able to distinguish itself. Offering a strong semantic search option is a good way to reboot the challenge.

You can read the full document here.

You may recall that the original Ask Jeeves answered questions. Humans figured out answers, put them in a file, and the Ask Jeeves system converted the user’s query to a form that could be matched against the canned answers. The buzz about this surged in the late 1990s, but the cost of the Ask Jeeves approach was high, and in my view, the system did not work very well.

The desire of information retrieval mavens to take a question, any question, and have software answer it makes some folks darned excited. The technology to answer questions continues to advance, and it is possible to get answers from a number of different systems. I have participated in meetings where smart people much more enthusiastic than I argued about the importance of having a system answer a question.

I have written about NLP or natural language processing in the first three editions of the Enterprise Search Report, and I added some information in my April 2008 Beyond Search study for Gilbane Group. Let me offer some observations:

  1. I don’t type questions into search engines. I prefer Boolean statements and point-and-click interfaces that let me “see” what’s in an indexed corpus. My experience is that typing questions is not too popular, nor is the notion of chopping text from an article and letting a search system find “more like this”. I have an installation of the Brainware trigram system, and it is useful–far more useful to me than asking “When did Columbus discover America?” if indeed he did. No NLP system can make much sense of a short query in the context of archaeological research about pre-Kit Columbus visitors to the North American landmass. Nope, that type of question answering will take a bit more lab work.
  2. NLP imposes considerable computational load on both the document indexing subsystem and the query processing subsystem. I saw an impressive set of PowerPoint slides at the 2007 Bear Stearns Internet conference, and I fiddled with the Wikipedia demonstration in 2008. What I have not seen is proof that Powerset’s amalgam of Xerox technology and its proprietary code scales. Without scaling, NLP is likely to remain interesting but of little use to me.
  3. Microsoft, like Yahoo, is now in the business of collecting search technologies. There are two “flavors” of SharePoint search. There is the Fast Search & Transfer technology. There is the whizzy new Live.com search. There is search in XP, in Vista, in SQL Server, and probably other search technologies I don’t know about. Toss in Powerset. What the collection resembles is a yard sale, not an exhibit of Etruscan tomb art at the British Museum. Search has to be more than a yard sale in its design, architecture, and technical framework. The cost of integrating this stuff is more than my check book can support.
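For readers who have not seen trigram matching, here is a minimal sketch of the general idea behind the “more like this” approach mentioned in the first observation. This is not Brainware’s code, which is proprietary and which I have not inspected. It is generic character-trigram overlap, and the function names are my own inventions for illustration.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of overlapping three-character sequences in text."""
    # Collapse whitespace and lowercase so formatting differences do not matter.
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two trigram sets, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def more_like_this(sample: str, corpus: dict[str, str]) -> list[tuple[str, float]]:
    """Rank corpus documents by trigram similarity to the sample text."""
    scored = [(doc_id, trigram_similarity(sample, body))
              for doc_id, body in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Note that nothing here “understands” the text. The matching is purely on character patterns, which is one reason this style of system can be cheaper to run than NLP.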

I appreciate the enthusiasm for Microsoft becoming more competitive. Let us not forget that Google has been doing pretty much the same thing–its one-trick pony show–for a decade. With Google holding two thirds of the market for Web search, Microsoft has some work to do to become a number two in search. Google continues to seep into the enterprise via osmosis. Let’s face facts. Customers have to buy from Google. Google is not very good at sales, customer support, or communicating what its gizmos can do. Microsoft is a good sales organization, but it is watching Google challenge its enterprise revenue the way spilled ink spreads on a white table cloth. And, Google has serious semantic technology which is a widget in a larger data management solution at Google.

Keep cheerleading for Microsoft. Just keep the challenges of NLP in mind. Agree? Disagree? Let me know so I can learn what I don’t now know.

Stephen Arnold, June 30, 2008

Microsoft Research Search Research: Not a Typo

June 29, 2008

I heard two earnest 20-somethings in the Starbucks on Lincoln and Greenview in Chicago arguing about Microsoft search. The two whiz kids wanted to locate information about Microsoft’s Web Data Management Group. Part of Microsoft’s multi-billion dollar research and development program, WDMG (sometimes abbreviated WSM) works to crack tough problems in Web search.

The problem with Web search is that content balloons with each tick of the hyper fast Internet clock. The problem boils down to several hundred megabytes every time slice. To make the problem more interesting, Web data changes. One example ignored by researchers is the facility with which a Web log author can change a posting. Some changes are omissions such as forgetting to assign a tag. Others are more Stalinesque. An author deletes, rewrites, or supplements an original chunk of a Web log. Today, I find more and more Web sites render pages in response to an action that I take. The example which may resonate with you is the operation of a meta search or federating system like Kayak.com. Until I set parameters for a trip, the system offers precious little content. Once I fill in the particulars of my trip, the rendered pages provide some useful information.

If you plan on indexing the Web, you have to figure out these dynamic pages, versions, updates, and new content. The problem has three characteristics. First, timeliness. When I do a query, I want current information. Speed, then, requires an efficient content identification and indexing system. If I lack the computing horsepower for brute force indexing, I have to use user cues such as indexing only the most frequently requested content. In effect, I am indexing less information in order to keep that index current.

Second, I have to be able to get dynamic content into my index. If I miss the information that becomes available only in response to a query, I am omitting a good chunk of the content. My tests show that more than half the sites in my test set are dynamic. The static HTML of the good old days makes up a smaller portion of the content that must be processed. Google’s work with Google Forms is that company’s first step into this type of data. Microsoft has its own approaches and some of this work is handled by the wizards at WSM or Web Search and Mining Group here.

Third, I also have to figure out how to deal with queries. When I talk about search, there are two sides to the coin. On one side is indexing. On the other side is converting the query to something that can be passed against the index. If a system purports to understand natural language as Hakia and Powerset assert, then the system has to figure out what the user means. Intent is not such a simple problem. In fact, deciphering a user’s query can be more difficult than indexing dynamic content. Human language is ambiguous. You would not understand my mother if you heard her say to me, “Quilling.” She means something quite specific, and the likelihood any system could figure out that this single word means, “Bring me my work basket” is close to zero unless the system in some way has considerable information about her specific use of language.
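The two sides of the coin can be shown in a few lines. What follows is a toy sketch of a keyword system, not a description of how Hakia, Powerset, Google, or Microsoft work. The stopword list and function names are assumptions I made up for illustration.

```python
from collections import defaultdict

# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"a", "an", "the", "of", "in", "did", "when", "what", "who", "is"}


def normalize(text: str) -> list[str]:
    """Lowercase the text, strip punctuation, and drop stopwords."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return [word for word in cleaned.split() if word not in STOPWORDS]


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Side one: indexing. Map each term to the documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, body in docs.items():
        for term in normalize(body):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Side two: convert the query to terms and AND them against the index."""
    terms = normalize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Index a document that says “Columbus reached America in 1492” and then ask “When did Columbus discover America?” The toy system returns nothing, because “discover” never appears in the document. Ask it “Quilling” and it will hand back pages about paper craft, not my mother’s work basket. Closing that gap between the words typed and the meaning intended is the hard part that the NLP vendors say they have solved.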

As you probably have surmised, natural language processing is complicated. NLP is resource intensive. I need a capable indexing system and I need a powerful, clever way to clear up ambiguities. Humans don’t type long queries, nor do professionals evidence much enthusiasm for crafting query strings that retrieve exactly what that professional needs. Users type 2.3 words and take what the system displays. Others prefer to browse an interface with training wheels; that is, Use For and See Also references and explore. The two approaches share one common element: a honking big computer with smart algorithms is needed to make search work.
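The timeliness trade-off I described earlier, indexing less in order to keep the index current, can be sketched as a simple budget problem. This is my own illustration under stated assumptions: the request log is just a list of URLs, and the budget is the number of pages I can afford to re-index per cycle. No vendor’s actual scheduler is being described.

```python
from collections import Counter


def pick_pages_to_refresh(requested_urls: list[str], budget: int) -> list[str]:
    """Return up to `budget` of the most requested URLs, most popular first."""
    counts = Counter(requested_urls)
    return [url for url, _ in counts.most_common(budget)]


def coverage(requested_urls: list[str], refreshed: list[str]) -> float:
    """Fraction of logged requests that land on a freshly indexed page."""
    if not requested_urls:
        return 0.0
    fresh = set(refreshed)
    return sum(1 for url in requested_urls if url in fresh) / len(requested_urls)
```

With a log of six requests spread across three pages as three, two, and one, refreshing only the top two pages keeps five of every six requests current. The cost is that the third page goes stale, which is exactly the compromise a system without brute force horsepower has to make.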

Web Search and Mining

This Microsoft group works on a number of interesting projects related to content processing, text mining, and search. The group’s Web page identifies data management, dynamic data indexing, and search quality as current topics of interest.

More detail about the group’s activities appears in the list of publicly available research papers. You can browse and download these. I want to comment about three aspects of the research identified on this Web site and then close with several observations about Microsoft research into search.

First, the sample papers date from 2004. I don’t know if the group has filtered its postings of papers, or if the group has been redirected.

Second, a number of papers discuss clustering. A representative paper is Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Analysis. The full paper is here. The paper explains a system that accepts a query and then outputs a result. Each row is a cluster. Microsoft’s researchers are parsing a query and retrieving images. The images are displayed in a clustered visual display. You will notice that the lead Microsoft researcher worked with a Yahoo researcher and a University of Chicago researcher. You can browse the other clustering papers.

Third, another group of papers touches upon the notion of “information manifolds”. In the 1990s, the phrase “information manifold” enjoyed some buzz. The notion is that a “space” contains indexes which can be queried. One Microsoft paper, “Learning an Image Manifold for Retrieval”, applies the notion to images. Other papers touch upon the topic as well. I found this interest suggestive. Google has some activity in this subject as well.

I want to pick up the thread of WSM and research into “manifolds”. I turned first to Search.Live.com, Microsoft’s own search system, and to Google.com’s Microsoft-centric search subsystem. You can find Microsoft’s search here and Google’s search subsystem here. You may want to stray into specialist Microsoft systems such as Libra here, a showcase for some new Microsoft technology. I tried several queries on the Microsoft Live.com search site and was able to locate the paper referenced above. One of the two hits I was able to track down returned a null set.

Read more

Google CFO Search

June 29, 2008

The Washington Post has an interesting essay by Joseph Weisenthal (PaidContent.org) here. The story asks and answers the question, “Google’s CFO Search: Why’d It Take So Long?” This is an important bit of thinking and I urge you to read the full write up.

The short answer is risk. Mr. Weisenthal writes:

In a free-wheeling culture like Google’s, it would be up the CFO to be the stern taskmaster – basically, the parent or teacher that nobody likes cause they actually enforce all the rules. Ribstein adds: “The problem under SOX is that a CFO has to worry about what he doesn’t know – that’s what Butler and I have called SOX’s “litigation time bomb.”

In my KMWorld column, which will appear in September 2008, I talk about Google’s transparency push. Google’s executives have been chatty Kathies in Israel, California, Washington, DC, and anywhere a journalist or three and an audience will listen.

Mr. Weisenthal’s discussion of snail-like CFO vetting clanks against the apparent transparency of Google. The last line of the essay nails the issue squarely:

You can see why it might not appeal to someone who from a typical CFO’s background, given the current regulatory environment.

The non-traditional approach of Google is working well. Will it work in the CFO’s office?

Stephen Arnold, June 29, 2008

Google Vulnerabilities

June 29, 2008

Seeking Alpha has an interesting discussion of chinks in Googzilla’s armor. The essay “Does Google Have a Weakness Microsoft Can Exploit?” is here. The analysis touches upon my listing of Google weaknesses which first appeared in The Google Legacy, which I updated in Google Version 2.0, 2005 and 2007 respectively.

The part of the analysis that I found interesting touches upon Microsoft’s cash back. The idea of buying market share is not new, and I think Microsoft may expand its efforts in this area. The question for me is, Is Microsoft able to see buying market share through to its logical end? To win may require sucking resources from other Microsoft initiatives. Such a shift could create a weaker Microsoft and one that is vulnerable not to Google but to other firms salivating at the idea of a weaker, distracted Microsoft.

Stephen Arnold, June 29, 2008

50 Niche Search Engines

June 28, 2008

Alisa Miller has compiled a list of 50 niche search engines. You can find the listing on Accredited Degrees here. Ms. Miller groups the search engines, which adds to the usefulness of her list. As I worked my way through the links, two of her finds struck me as useful:

  • Bookmatch provides search results from 3,300 sources with spam and silliness removed from Web log postings and news aggregators.
  • Congoo delivers results from news and other sources. The company claims a higher level of information. My test queries returned useful results.

A happy quack to Ms. Miller for her list.

Stephen Arnold, June 29, 2008

IDC’s Database Market Share Analysis

June 28, 2008

IDC’s Chris Kanaracus has summarized the relational database market size in “Oracle Maintains Lead in Database Market”. You can read the full round up here. The total market tallies an estimated $19 billion. For me, the most important data in the news story is:

Oracle once again took the top spot, capturing 44.3 percent of the market with revenue growth of 13.3 percent. IBM came in second with a 21 percent share, also logging a 13.3 percent revenue growth rate. It was followed by Microsoft, with 18.5 percent of the market and a 14 percent jump in revenue. Sybase and Teradata rounded out the top five, garnering market shares of 3.5 percent and 3.3 percent, respectively.

My question is, How long will the traditional database vendors remain in ascendancy? The volume of data choking enterprises is increasing. The traditional row-and-column data tables remain administrative headaches. Basic queries often require hours, days, or weeks to execute after data cubes are built, queries written and debugged, and end users given a chance to review the reports.

The traditional database vendors are not solving the data management problems their licensees face. The companies in the IDC Top Five are creating an appetite for a different approach. Who will emerge with a solution? The work I am doing points to some newcomers. I have written about Aster Data and other firms with different angles of attack on the growing database problem. One thing I have learned is that the incumbents think their market positions are unassailable. These companies, despite their grip on the market, are dead wrong. This is not an innovator’s dilemma. This is the ostrich response: put the head in the sand and the outside world can’t be seen.

Stephen Arnold, June 28, 2008

Silobreaker Rumor

June 28, 2008

With Powerset off the search chess board, Really Simple Sidi asks, Will Silobreaker be the next information access vendor to be acquired? The question did not spring from thin air. Silobreaker executives have spoken with a number of companies about its technology. I will track Silobreaker more closely. You can read an interview with one of the company’s founders here.

Stephen Arnold, June 28, 2008

The Whale and the Walrus: Two Views of Sergey and Larry

June 28, 2008

The purpose of this essay is to describe the life trajectory of two technology-centric companies. I don’t want to mention the firms by name, but you may be able to guess which company is the whale and which is the walrus.

The whale is a big creature, a whale of a company. Wherever the whale goes, it gets its way. More accurately, the whale used to get its way. Now the whale is lying on its side near the Seattle waterfront close to upscale boutiques and a Starbucks.

The second is a walrus, now quite old for a semi-leviathan. The walrus prefers to sit on a rock not far from Half Moon Bay, soak up the sun and snag whatever fish get too close. The walrus prefers to conserve its energy. Oh, the walrus will stretch and sometimes roar. Most of the time, the walrus half sits, half reclines, looking, well, disconnected from the world beyond the sand bar. The walrus has some new friends named Sergey and Larry.

Let’s look at three aspects of each creature and then think about the future of each powerful beastie.

The Whale

The whale is the largest mammal. Not surprisingly, the whale is never sure if a sucker fish is tagging along for a free ride. The whale is also not really aware of its surroundings. The whale sings and tries to find other whales, but whales get together only once in a while. Think of it as a Warren Buffett cocktail party with only whales allowed. Otherwise whales think whale thoughts, oblivious to their world.

Our whale knows that tiny creatures can annoy a whale, but tiny creatures rarely hurt a whale. This whale believes it is master of all the known universe. The trick is to stay away from tiny creatures with weapons that can make life difficult. Every once in a while, the whale can gobble a tasty morsel like Fast Search & Transfer. Life has been good, but the whale senses trouble in a restless ocean.

The Walrus

The walrus is tired. The old game of providing tips to lost dolphins and tuna is not working any more. So, the walrus kicks back and thinks about what might have been.

The walrus is old, and the new ways of finding young fish eager to learn the old ways are tiring. This walrus prefers to lie down, make some noise, and wait for the next meal. Think of this walrus living in an assisted-living facility. The real world is too unfamiliar. The walrus has two new friends, Sergey and Larry. Sergey and Larry bring the walrus fish once a day. Getting fish is better than catching fish. The walrus likes not working too hard. The rock is a fine place. The waves lapping the beach in Half Moon Bay soothe the walrus. The walrus changes position but does not move.

Interpreting the Two Stories

The whale is a company that is disconnected from the world beyond the ocean. The whale is, for the first time in its life, unnerved, maybe frightened. Sergey and Larry have a different business model. Customers use software and information and an advertiser pays the bill. The whale wants to swat Sergey and Larry with its tail. Sergey and Larry dance out of the way. The whale is frustrated and getting tired of carrying the old business model into every skirmish and chase.

The walrus is an old timer in the digital world. The spring and bounce have been weighted down by wild and crazy decisions. Walrus friends are leaving the walrus more and more alone. The walrus is isolated. The old ways have lost their zip. The walrus remembers reading about automobiles and buggy whip manufacturers. The walrus believes that he might become a wallet, maybe a pair of shoes. Change, however, is hard at the walrus’ age. The walrus stays where it is, moving to catch the rays of the setting sun. Sergey and Larry will bring another fish today.

The message is clear. The whale is going to fight to survive. The walrus has given up. Sergey and Larry have the ability to deal with both the whale and the walrus with equal aplomb.

Observations

Neither creature has many years left. You have to admire the fighting whale. Too bad its own weight and mass will sap its strength. Not much future unless the whale sheds some pounds like Subway’s Jared, the tuna eater. The walrus has found a new best friend and does not want to work too hard. The walrus will gladly do what Googzilla says. Those free fish are really tasty, thinks the walrus.

And what about Sergey and Larry in their “we’re just guys” outfit? Sergey and Larry want to outthink the whale. The walrus seems happy as long as he gets a couple of fish every day.

In the great theater of business, the whale and the walrus are sushi.

Stephen Arnold, June 28, 2008

More SharePoint Goodness

June 28, 2008

My posts about Microsoft SharePoint, the polymorphic content-collaboration-KoolAid system from Redmondians, are popular. A helpful but anonymous reader pointed us to useful information about how to figure out how much storage you will need for your SharePoint installation. Oh, you did not know that SharePoint storage needed special care and feeding? Well, once you get SharePoint up and running, you will become enlightened pretty darn quick.

Navigate to Sanjive Nair’s MSDN Web log here and download the text and the links. Believe me, trying to locate these on the Microsoft Web sites takes some serious work. Mr. Nair reviews the places where storage chokepoints typically occur; for example, your favorite database, SQL Server.

Mr. Nair addresses search as well. He writes:

Searching is extremely important for most portals. You would need to have a good understanding of your search requirements, while you are planning for capacity. Most importantly while planning for capacity you need to understand how much data will be indexed. While SharePoint can crawl and index Web Sites, Exchange folders, file folders BDC etc., the storage/search requirements may vary. For instance if you are indexing BDC content or large text files the index size may be larger compared to indexing Power Point files. By far in a SharePoint farm index server is the most processor intensive . So you need to provide enough processing power and memory to handle the indexing and crawling process. An Index server requires a Web Front Server which will serve the content while indexing. By default the Web Front End machines in your farm are set up to perform this task. However it may be beneficial to set up your index server as the Web Front End to perform this task, as it would avoid index server going over the network during the crawling process.

Mr. Nair provides a link to supplementary information which is a tough one to locate using Microsoft’s own search tools. The link to Microsoft’s search training videos is quite useful. I must admit that I did not watch any videos because my whiz bang Verizon wireless card, with its latency, prefers text to videos.

Kudos to Mr. Nair and a bowl of steaming burgoo when you come to Harrod’s Creek, high-technology center of Kentucky.

Stephen Arnold, June 28, 2008

SharePoint Placemat

June 28, 2008

Microsoft SharePoint and I got to know one another several years ago. Via referral, a Microsoft Gold Certified Partner wanted my team and me to run some tests on a SharePoint application. We got everything running, wrote our report, and the Gold Certified Partner was a quick pay.

After the project, one of my colleagues remarked, “SharePoint is really complex.” We put the idea aside until someone emailed us a SharePoint placemat. A copy of this remarkable diagram is available if you want to look at it. You can find it on SharePointSearch.com here.

Here is a thumbnail of the full diagram, but I strongly urge you to download the diagram. Do you think it is a joke of some type? My colleagues and I saw something similar from a Microsoft partner in New Zealand a year ago, but this placemat is a triumph of sorts. The company preparing the diagram is Impac Systems Engineering.

[Image: thumbnail of the SharePoint placemat]

The complexity of search in general and SharePoint in particular is an interesting topic. Search can be quite a challenge. One recent example is the inability of Internet Explorer to open a SharePoint document. You can read more here and download a fix here. Embedding search into a content and collaboration system with data management features may push the boundaries of software to their limits.

CleverWorkArounds.com has an essay called “Why Do SharePoint Projects Fail?” You can look at Part 5 here. I was unable to locate the other portions of this discussion, however. (Part 3 is here.) For me, there are three main points that address the issue of the almost-funny placemat diagram:

  1. The skills required to implement SharePoint include “IIS, Windows Server, TCP/IP & networks, SQL Server 2005 Advanced Administration, Firewalls, Proxies, Active Directory, Authentication, Security, IT Infrastructure Design, Hardware, Performance Monitoring, Capacity Planning, Workflow, IE, Firefox, Office Client tools, ASP.NET, HTML, JavaScript, AJAX, XSL, XSLT, Exchange/SMTP, Clustering, NLB, SANs, Backup Solutions, Single Sign on, Monitoring & Troubleshooting, Global Deployments, Dev, Test, Staging, Production – Staged deployments, ITIL, Virtualization.”
  2. “SharePoint is complex and the products it relies on are also complex. In the wrong infrastructure/architect hands, this can cause costly problems.”
  3. “… if there is not a certain degree of discipline around change management, configuration management, procedures, standards and guidelines to administrators, users, site owners and developers, bad things will happen.”

These points underscore the problem with “boil the ocean” systems. The fire needed to get water sufficiently hot to cook eggs can consume the pot, leading to a big mess.

Observations

I took another look at the placemat diagram and reread Part 3 and Part 5 of the essay “Why Do SharePoint Projects Fail?” Let me offer several observations from my dirt floor cabin in the hills of rural Kentucky:

First, SharePoint is a beast. Enterprise search is a monster. What will the progeny of these two behemoths be like? My opinion is that it will be tough to see through the red ink flooding some SharePoint projects. Toss in a hugely complex system such as Fast Search & Transfer’s Enterprise Search Platform, and you have a very interesting challenge to resolve.

Second, complexity is Miracle-Gro for consultants. SharePoint is complex, and it will probably only get more complicated. In my experience, Microsoft software becomes efflorescent quickly.

Finally, SharePoint attempts to deliver what may be a system that will be out of step with cloud-based services. SharePoint as a hosted or cloud-based service is generating some buzz. However, will the latency present in most on-premises installations be an issue when delivered as a service? My view is that latency, more than issues of security or data confidentiality, will bog down the SaaS implementation of SharePoint.

SharePoint is hugely successful. I heard that there are more than 65,000 licenses in North America alone. The SharePoint market is a tempting one for companies like Google to consider as one ripe for an alternative.

Stephen Arnold, June 27, 2008
