Microsoft Research Search Research: Not a Typo

June 29, 2008

I heard two earnest 20-somethings in the Starbucks on Lincoln and Greenview in Chicago arguing about Microsoft search. The two whiz kids wanted to locate information about Microsoft’s Web Data Management Group. Part of Microsoft’s multi-billion dollar research and development program, WDMG (sometimes abbreviated WSM) works to crack tough problems in Web search.

The problem with Web search is that content balloons with each tick of the hyper fast Internet clock. The problem boils down to several hundred megabytes every time slice. To make the problem more interesting, Web data changes. One example ignored by researchers is the facility with which a Web log author can change a posting. Some changes are omissions such as forgetting to assign a tag. Others are more Stalinesque. An author deletes, rewrites, or supplements an original chunk of a Web log. Today, I find more and more Web sites render pages in response to an action that I take. The example which may resonate with you is the operation of a meta search or federating system like Kayak.com. Until I set parameters for a trip, the system offers precious little content. Once I fill in the particulars of my trip, the rendered pages provide some useful information.

If you plan on indexing the Web, you have to figure out these dynamic pages, versions, updates, and new content. The problem has three characteristics. First, timeliness. When I do a query, I want current information. Speed, then, requires an efficient content identification and indexing system. If I lack the computing horsepower for brute force indexing, I have to use user cues such as indexing only the most frequently requested content. In effect, I am indexing less information in order to keep that index current.

Second, I have to be able to get dynamic content into my index. If I miss the information that becomes evident only in response to a query, I am omitting a good chunk of the content. My tests show that more than half the sites in my test set are dynamic. The static HTML of the good old days makes up a smaller portion of the content that must be processed. Google’s work with Google Forms is that company’s first step into this type of data. Microsoft has its own approaches and some of this work is handled by the wizards at WSM or Web Search and Mining Group here.

Third, I also have to figure out how to deal with queries. When I talk about search, there are two sides to the coin. On one side is indexing. On the other side is converting the query to something that can be passed against the index. If a system purports to understand natural language as Hakia and Powerset assert, then the system has to figure out what the user means. Intent is not such a simple problem. In fact, deciphering a user’s query can be more difficult than indexing dynamic content. Human language is ambiguous. You would not understand my mother if you heard her say to me, “Quilling.” She means something quite specific, and the likelihood any system could figure out that this single word means, “Bring me my work basket” is close to zero unless the system in some way has considerable information about her specific use of language.
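
The two sides of the coin can be shown in a few lines of Python. What follows is a toy sketch of my own, not anything Microsoft, Hakia, or Powerset has published: the function names (tokenize, build_index, search) and the three sample documents are assumptions made up for illustration. One side builds an inverted index; the other normalizes the query the same way and matches it against that index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing side: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Query side: normalize the query like the documents, then AND the terms."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, invented for this sketch.
docs = {
    1: "Quilling is a paper craft",
    2: "Bring me my work basket",
    3: "A work basket for quilling supplies",
}
index = build_index(docs)
print(sorted(search(index, "Quilling")))     # [1, 3]
print(sorted(search(index, "work basket")))  # [2, 3]
```

Notice that the query “Quilling” never retrieves document 2, the one that states what my mother actually wants. Matching words is the easy part; intent is not in the index.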

As you probably have surmised, natural language processing is complicated. NLP is resource intensive. I need a capable indexing system and I need a powerful, clever way to clear up ambiguities. Humans don’t type long queries, nor do professionals evidence much enthusiasm for crafting query strings that retrieve exactly what that professional needs. Users type 2.3 words and take what the system displays. Others prefer to browse an interface with training wheels; that is, Use For and See Also references and explore. Different as they are, the two approaches share one common element: a honking big computer with smart algorithms is needed to make search work.
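
When the honking big computer is not available, the trade-off I described under timeliness applies: index only the most frequently requested content so that the smaller index stays current. A minimal sketch of that idea, assuming a simple request log and a fixed refresh budget (the function name, the log format, and the example.com URLs are all hypothetical):

```python
from collections import Counter


def pick_pages_to_refresh(request_log, budget):
    """Return the `budget` most frequently requested URLs, most popular first."""
    counts = Counter(request_log)
    return [url for url, _ in counts.most_common(budget)]


# Hypothetical request log: one entry per user request.
request_log = [
    "example.com/news", "example.com/news", "example.com/news",
    "example.com/fares", "example.com/fares",
    "example.com/archive/1998",
]
print(pick_pages_to_refresh(request_log, budget=2))
# ['example.com/news', 'example.com/fares']
```

The rarely requested archive page drops out of the refresh cycle. That is the price of freshness on a budget: less content indexed, but what is indexed is current.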

Web Search and Mining

This Microsoft group works on a number of interesting projects related to content processing, text mining, and search. The group’s Web page identifies data management, dynamic data indexing, and search quality as current topics of interest.

More detail about the group’s activities appears in the list of publicly available research papers. You can browse and download these. I want to comment about three aspects of the research identified on this Web site and then close with several observations about Microsoft research into search.

First, the sample papers date from 2004. I don’t know if the group has filtered its postings of papers, or if the group has been redirected.

Second, a number of papers discuss clustering. A representative paper is Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Analysis. The full paper is here. The paper explains a system that accepts a query and then outputs a result. Each row is a cluster. Microsoft’s researchers are parsing a query and retrieving images. The images are displayed in a clustered visual display. You will notice that the lead Microsoft researcher worked with a Yahoo researcher and a University of Chicago researcher. You can browse the other clustering papers.

Third, another group of papers touches upon the notion of “information manifolds”. In the 1990s, the phrase “information manifold” enjoyed some buzz. The notion is that a “space” contains indexes which can be queried. One Microsoft paper, “Learning an Image Manifold for Retrieval”, applies the notion to images. Other papers touch upon the topic as well. I found this interest suggestive. Google has some activity in this subject as well.

I want to pick up the thread of WSM and research into “manifolds”. I turned first to Search.Live.com, Microsoft’s own search system, and to Google.com’s Microsoft-centric search subsystem. You can find Microsoft’s search here and Google’s search subsystem here. You may want to stray into specialist Microsoft systems such as Libra here, a showcase for some new Microsoft technology. I tried several queries on the Microsoft Live.com search site and was able to locate the paper referenced above. One of the two hits I was able to track down returned a null set.

I ran the same queries on Google’s Microsoft collection. The Google result set contained more than 3,000 hits. Many were off topic. A bit of fiddling such as NOTting out “Weston-super-Mare” and using the term dataspace for manifold generated a tight result set. One useful document on this Microsoft research is here. You will need to rip the PostScript file to a PDF, however. One of the documents (“MS Search: A Case Study of Research in MSR Asia” here) indexed by Google contained a snapshot of the team working on these projects along with information that the team would be adding staff. There is no indication how the acquisition of Fast Search & Transfer, the reorganization of Microsoft’s search efforts, and the likely purchase of Powerset will fit in with these research activities. I have found that many Microsoft Web pages carry no date and time stamp visible to the average Web user. The image shows some of the WSM team.

Observations

What does this quick look at one Microsoft research unit engaged in search suggest to me?

Microsoft information exists in a vacuum. There are few inbound links and few outbound links, no dates on the documents, and when dates appear these dates may or may not point to current information. The pages could be orphans or they could be significant signals about Microsoft’s research efforts. I was able to locate this important paper via Google but not via Microsoft.

Microsoft’s own search system does not index some Microsoft content that I would expect to find in the system. It was evident to me that a filter is operating. Another possibility is that either Microsoft does not index its own public facing servers as part of Live.com search or Microsoft has trimmed the number of URLs the system indexes for some reason.

The functions described in the WSM pages I examined did not appear to be available to me when I ran the test query “Pluto” on the Live.com image function. In fact, the feature that was evident created severe problems for my Verizon high speed wireless access card from the Starbucks where I overheard the conversation that led me to investigate this topic. The infinite scrolling is interesting but less useful than the image clustering discussed in the paper I referenced in this essay. You can experiment with this feature here.

Agree? Disagree? Use the comments section to get me back on track.

Stephen Arnold, June 29, 2008

Written by Stephen E. Arnold · Filed Under Database, Enterprise, Feature, Microsoft, Search, Semantic, Text processing

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.