Francisco Corella, Pomcor, an Exclusive Interview

February 11, 2009

Another speaker on the program at Infonortics’ Boston Search Engine Meeting agreed to be interviewed by Harry Collier, the founder of the premier search and content processing event. Francisco Corella is one of the senior managers of Pomcor. The company’s Noflail search system leverages open source and Yahoo’s BOSS (Build your Own Search Service). Navigate to the Infonortics.com Web site and sign up for the conference today. In Boston, you can meet Mr. Corella and other innovators in information retrieval.

The full text of the interview appears below:

Will you describe briefly your company and its search technology?

Pomcor is dedicated to Web technology innovation. In the area of search we have created Noflail Search, a search interface that runs on the Flex platform. Search results are currently obtained from the Yahoo BOSS API, but this may change in the future. Noflail Search helps the user solve tough search problems by prefetching the results of related queries, and supporting the simultaneous browsing of the result sets of multiple queries. It sounds complicated, but new users find the interface familiar and comfortable from the start. Noflail Search also lets users save useful queries—yes, queries, not results. This is akin to bookmarking the queries, but a lot more practical.

What are the three major challenges you see in search / content processing in 2009?

First challenge: what I call the indexable unit problem. A Web page is often not the desired indexable unit. If you want to cook sardines with triple sec (after reading Thurber) and issue a query [sardines “triple sec”] you will find pages that have a recipe with sardines and a recipe with triple sec. If there is a page with a recipe that uses both sardines and triple sec, it may be buried too deep for you to find. In this case the desired indexable unit is the recipe, not the page. Other indexable units: articles in a catalog, messages in an email archive, blog entries, news. There are ad-hoc solutions for blog entries and news, but no general-purpose solutions.
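The difference between indexing pages and indexing units can be illustrated with a small sketch. The data, the function names, and the naive substring matching below are all assumptions made for illustration; they do not describe how any real search engine works.

```python
# Hypothetical data: each page is a list of indexable units (here, recipes).
pages = {
    "page1": [
        "Grilled sardines with lemon and olive oil.",
        "Margarita: tequila, triple sec, lime juice.",
    ],
    "page2": [
        "Sardines glazed with triple sec and orange zest.",
    ],
}

def matches(text, terms):
    """True if every term occurs somewhere in the text (naive substring match)."""
    text = text.lower()
    return all(term in text for term in terms)

def search_pages(pages, terms):
    """Page-level matching: the whole page is the indexable unit."""
    return [pid for pid, units in pages.items()
            if matches(" ".join(units), terms)]

def search_units(pages, terms):
    """Unit-level matching: each recipe is its own indexable unit."""
    return [(pid, i) for pid, units in pages.items()
            for i, unit in enumerate(units) if matches(unit, terms)]

terms = ["sardines", "triple sec"]
page_hits = search_pages(pages, terms)   # includes page1, a false positive
unit_hits = search_units(pages, terms)   # only the recipe that uses both
```

In this toy example page-level matching returns both pages, while unit-level matching returns only the single recipe that actually combines the two ingredients.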

Second challenge: what I call the deep API problem. Several search engines offer public Web APIs that enable search mashups. Yahoo, in particular, encourages developers to reorder search results and merge results from different sources. But no search API provides more than the first 1000 results from any result set, and you cannot reorder a set if you only have a tiny subset of its elements. What’s needed is a deep API that lets you build your own index from crawler raw data or by combining multiple sources.
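A toy sketch of why a capped API limits reordering: if the caller's own ranking criterion differs from the engine's, the best documents under that criterion may never appear in the capped result list. The document set, the two scores, and the cap are synthetic assumptions made for illustration.

```python
# Synthetic corpus: 10,000 documents with an engine score and an
# independent caller-defined score (for example, recency).
TOTAL = 10000
docs = [{"id": i, "engine_score": i, "my_score": (i * 37) % TOTAL}
        for i in range(TOTAL)]

def api_results(all_docs, limit=1000):
    """Return only the engine's top `limit` documents, like a capped API."""
    ranked = sorted(all_docs, key=lambda d: d["engine_score"], reverse=True)
    return ranked[:limit]

def top_ids(candidates, k=10):
    """Ids of the k best documents according to the caller's own score."""
    ranked = sorted(candidates, key=lambda d: d["my_score"], reverse=True)
    return {d["id"] for d in ranked[:k]}

visible = api_results(docs)        # what the API lets the caller see
true_top = top_ids(docs)           # best 10 over the full result set
seen_top = top_ids(visible)        # best 10 the caller can actually compute
overlap = len(true_top & seen_top)
print(f"{overlap} of 10 truly best documents are reachable through the API")
```

Reordering the visible subset can only approximate the ranking the caller would get from the full result set, which is the gap a deep API would close.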

Third challenge: incorporate semantic technology into mainstream search engines.

With search processing decades old, what have been the principal barriers to resolving these challenges in the past?

The three challenges have not been resolved for different reasons. Indexable units require a new standard to specify the units within a page, and a restructuring of the search engines; hence a lot of inertia stands in the way of a solution. The need for a deep API is new and not widely recognized yet. And semantics are inherently difficult.

What is your approach to problem solving in search and content processing? Do you focus on smarter software, better content processing, improved interfaces, or some other specific area?

Noflail Search is a substantial improvement on the traditional search interface. Nothing more, nothing less. It may be surprising that such an improvement is coming now, after search engines have been in existence for so many years. Part of the reason for this may be that Google has a quasi-monopoly in Web search, and monopolies tend to stifle innovation. Our innovations are a direct result of the appearance of public Web APIs, which lower the barrier to entry and foster innovation.

With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search / content processing?

The crisis may have both negative and positive effects on search innovation. Financial pressure causes consolidation, which reduces innovation. But the urge to reduce cost could also lead to the development of an ecosystem where different players solve different pieces of the search puzzle. Some could specialize in crawler software, some in index construction, some in user interface improvements, some in various aspects of semantics, some in various vertical markets.

A technological ecosystem materialized in the 1980s for the PC industry, and resulted in amazing cost reduction. Will this happen again for search? Today we are seeing mixed signals. We see reasons for hope in the emergence of many alternative search engines, and the release by Microsoft of Live Search API 2.0 with support for revenue sharing. On the other hand, Amazon recently dropped Alexa, and Yahoo is now changing the rules of the game for Yahoo BOSS, reneging on its promise of free API access with revenue sharing.

Multi-core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar? Is performance a non-issue?

Noflail Search is computationally demanding. When the user issues a query, Noflail Search precomputes the result sets of up to seven related queries in addition to the result set of the original query, and prefetches the first page of each result set. If the query has no results (which may easily happen in a search restricted to a particular Web site), it determines the most specific subqueries (queries with fewer terms) that do produce results; this requires traversing the entire subgraph of subqueries with zero results and its boundary, computing the result set of each node. All this is perfectly feasible and actually takes very little real time.
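The zero-result analysis can be sketched as a level-by-level walk down the lattice of subqueries. This is a minimal, sequential illustration and not Pomcor's actual algorithm: the function name, the `has_results` callback, and the toy corpus are assumptions, and the sketch assumes that removing a term never removes results.

```python
def most_specific_subqueries(terms, has_results):
    """Return the largest subsets of `terms` for which has_results() is true.

    `has_results` is a hypothetical callback standing in for a search call.
    """
    full = frozenset(terms)
    if has_results(full):
        return [full]
    frontier = {full}      # zero-result subqueries at the current level
    found = []             # most specific subqueries that do have results
    while frontier:
        # Children of zero-result nodes: drop one term at a time.
        candidates = {node - {t} for node in frontier
                      if len(node) > 1 for t in node}
        # Skip children already covered by a more specific hit.
        candidates = {c for c in candidates if not any(c < f for f in found)}
        hits = [c for c in candidates if has_results(c)]
        frontier = candidates - set(hits)
        found.extend(hits)
    return found

# Toy corpus: each document is the set of terms it contains.
corpus = [{"a", "b"}, {"c"}]
lookup = lambda query: any(query <= doc for doc in corpus)
subqueries = most_specific_subqueries(["a", "b", "c"], lookup)
```

With the toy corpus, the query on a, b and c has no results; the sketch reports the subqueries {a, b} and {c} as the most specific ones that do.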

How do we do it?

Since Noflail Search is built on the Flex platform, the code runs on the Flash plug-in in the user’s computer and obtains search results directly from the Yahoo BOSS API. Furthermore, the code exploits the inherent parallelism of any Web API. Related queries are all run simultaneously. And the algorithm for traversing the zero-result subgraph is carefully designed to maximize concurrency.
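The idea of running related queries simultaneously can be sketched with a thread pool. Noflail Search itself runs on Flex, so this is only an analogy in Python: `fetch_first_page` is a hypothetical stub that stands in for a Web API request, and no real Yahoo BOSS endpoint or parameter is used.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_first_page(query):
    """Hypothetical stand-in for a Web API call returning the first result page."""
    return [f"result {i} for {query!r}" for i in range(1, 4)]

def prefetch(original, related, max_related=7):
    """Fetch the first page of the original query and its related queries in parallel."""
    queries = [original] + list(related)[:max_related]
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        # All requests are submitted at once; map() returns pages in query order.
        first_pages = list(pool.map(fetch_first_page, queries))
    return dict(zip(queries, first_pages))

prefetched = prefetch("sardines", ["sardines recipe", "sardines triple sec"])
```

Because each request spends most of its time waiting on the network, issuing them together makes the total wait close to that of the slowest single request rather than the sum of all of them.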

Yahoo, however, has just announced that they will be charging fees for API queries instead of sharing ad revenue. If we continue to use Yahoo BOSS, it may not be economically feasible to prefetch the results of related queries or analyze zero results as we do now. Thus, although performance is a non-issue technically, demands of computational power have financial implications.

As you look forward, what are some new features / issues that you think will become more important in 2009?

Obviously we think that the new user interface features in Noflail Search are important and hope they’ll become widely used in 2009. We have of course filed patent applications on the new features, but we are very willing to license the inventions to others. As for a breakthrough over the next 36 months, as a consumer of search, I very much hope that the indexable unit problem will be solved. This would increase search accuracy and make life easier for everybody.

Where can I find more information about your products, services, and research?

Noflail Search is available at http://noflail.com/, and white papers on the new features can be found in the Search Technology page (http://www.pomcor.com/search_technology.html) of the Pomcor Web site (http://www.pomcor.com/).

Harry Collier, Infonortics Ltd., February 11, 2009

Written by Stephen E. Arnold · Filed Under Enterprise, Interview, News, Open source, Search, Semantic, Technology, Yahoo



Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.