Semantics in Firefox

February 19, 2009

Now available: the wonders of semantic search, plugged right into your Mozilla Firefox browser. headup started in closed testing but is now a public beta downloadable from http://www.headup.com or http://addons.mozilla.org. You do have to register for it because Firefox lists it as “experimental,” but the reviews at https://addons.mozilla.org/en-US/firefox/reviews/display/10359 are glowing. A product of SemantiNet, this plugin is touted as enabling “true semantic capabilities” within any Web page for the first time. headup’s engine extracts information customized to its user and roots out additional data of interest from across the Web, including social media sites like Facebook and Twitter. This add-on looks like a step in the right direction toward bringing the Semantic Web down to earth. Check it out and let us know what you think.

Jessica Bratcher, February 19, 2009

Exclusive Interview with Kathleen Dahlgren, Cognition Technologies

February 18, 2009

Cognition Technologies’ Kathleen Dahlgren spoke with Harry Collier about her firm’s search and content processing system. Cognition’s core technology, Cognition’s Semantic NLP™, is the outgrowth of ideas and development work that began more than 23 years ago at IBM, where Cognition’s founder and CTO, Kathleen Dahlgren, Ph.D., led a research team to create the first prototype of a “natural language understanding system.” In 1990, Dr. Dahlgren left IBM and formed a new company called Intelligent Text Processing (ITP). ITP applied for and won an innovative research grant from the Small Business Administration. This funding enabled the company to develop a commercial prototype of what would become Cognition’s Semantic NLP. That work won a Small Business Innovation Research (SBIR) award for excellence in 1995. In 1998, ITP was awarded a patent on a component of the technology.

Dr. Dahlgren is one of the featured speakers at the Boston Search Engine Meeting. This conference is the world’s leading venue for substantive discussions about search, content processing, and semantic technology. Attendees have an opportunity to hear talks by recognized leaders in information retrieval and then speak with these individuals, ask questions, and engage in conversations with other attendees. You can get more information about the Boston Search Engine Meeting here.

The full text of Mr. Collier’s interview with Dr. Dahlgren, conducted on February 13, 2009, appears below:

Will you describe briefly your company and its search / content processing technology?
CognitionSearch uses linguistic science to analyze language and provide meaning-based search.  Cognition has built the largest semantic map of English, with morphology (word stems such as catch-caught, baby-babies, communication, intercommunication), word senses (“strike” meaning hit, “strike” as a state in baseball, etc.), synonymy (“strike” meaning hit, “beat” meaning hit, etc.), hyponymy (“vehicle”-“motor vehicle”-“car”-“Ford”), meaning contexts (“strike” means a game state in the context of “baseball”), and phrases (“bok-choy”).  The semantic map enables CognitionSearch to unravel the meaning of text and queries, with the result that search performs with over 90% precision and 90% recall.
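To make the idea of a semantic map concrete, here is a toy fragment in Python. The structure, entries, and the senses() helper are invented for illustration; Cognition’s actual representation is not public.

```python
# Toy fragment of a semantic map: surface forms map to word senses,
# and each sense carries synonyms, a usage context, and a hypernym chain.
semantic_map = {
    "strike": [
        {"sense": "hit", "synonyms": ["hit", "beat"], "hypernyms": ["act"]},
        {"sense": "baseball_strike", "context": "baseball", "hypernyms": ["game state"]},
    ],
    "Ford": [
        {"sense": "car_brand", "hypernyms": ["car", "motor vehicle", "vehicle"]},
    ],
    "caught": {"stem": "catch"},   # morphology: map inflected forms to their stems
    "babies": {"stem": "baby"},
}

def senses(word, context=None):
    """Return the senses of a word, preferring those that match the given context."""
    entry = semantic_map.get(word, [])
    if isinstance(entry, dict):            # follow a morphology entry to its stem
        return senses(entry["stem"], context)
    if context:
        in_context = [s for s in entry if s.get("context") == context]
        if in_context:
            return in_context
    return entry

print(senses("strike", context="baseball"))   # the baseball sense only
```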

What are the three major challenges you see in search / content processing in 2009?

That’s a good question. The three challenges in my opinion are:

  1. Too much irrelevant material retrieved – poor precision
  2. Too much relevant material missed – poor recall
  3. Getting users to adopt the new ways of searching that advanced search technologies make available.  NLP semantic search lets users state longer queries in plain English and get results, but users are accustomed to keywords, so taking advantage of the new technology will require some adaptation on their part.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

Poor precision and poor recall are due to the use of pattern-matching and statistical search software.  As long as meaning is not recovered, current search engines will produce mostly irrelevant material.  Statistics on popularity boost many of the relevant results to the top, but measured across all retrievals, precision is under 30%.  Poor recall means that sometimes there are no relevant hits, even though there may be many hits.  This is because the alternative ways of expressing the user’s intended meaning in the query are not understood by the search engine.  If search engines add synonyms without first determining meaning, recall can improve, but at the expense of extremely poor precision.  This is because all the synonyms of an ambiguous word, in all of its meanings, are used as search terms, and most of these are off target.  While the ambiguous words in a language are relatively few, they are among the most frequent words.  For example, the seventeen thousand most frequent words of English tend to be ambiguous.
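For readers who want the two measures pinned down, here is a minimal Python illustration of how precision and recall are computed for a single query. The document IDs and counts are toy values, not Cognition’s figures.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 10 documents retrieved, only 3 of them relevant,
# while 4 relevant documents exist in the collection.
retrieved = [f"doc{i}" for i in range(1, 11)]
relevant = ["doc2", "doc5", "doc9", "doc42"]    # doc42 was missed entirely
print(precision_recall(retrieved, relevant))    # (0.3, 0.75)
```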

What is your approach to problem solving in search and content processing?

Cognition focuses on improving search by improving the underlying software and making it mimic human linguistic reasoning in many respects.  CognitionSearch first determines the meanings of words in context and then searches on the particular meanings of search terms, their synonyms (also disambiguated), and hyponyms (more specific word meanings in a concept hierarchy or ontology).  For example, given a search for “mental disease in kids”, CognitionSearch first determines that “mental disease” is a phrase synonymous with an ontological node, that “kids” has the stem “kid”, and that it means “human child”, not a type of “goat”.  It then finds documents with sentences containing “mental disease” or “OCD” or “obsessive compulsive disorder” or “schizophrenia”, etc., and “kid” (meaning human child) or “child” (meaning human child) or “young person” or “toddler”, etc.
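Dr. Dahlgren’s example suggests a two-step pipeline: disambiguate each query term in context, then expand it with synonyms and hyponyms of that particular sense only. The Python sketch below is a deliberately crude stand-in; the ontology entries and the disambiguation rule are invented for illustration and are not Cognition’s code.

```python
# Toy sense-tagged ontology: each sense lists synonyms and narrower terms (hyponyms).
ontology = {
    "mental disease": {"synonyms": ["mental illness"],
                       "hyponyms": ["OCD", "obsessive compulsive disorder", "schizophrenia"]},
    "kid/human_child": {"synonyms": ["child", "young person", "toddler"], "hyponyms": []},
    "kid/young_goat":  {"synonyms": ["goatling"], "hyponyms": []},
}

def disambiguate(term, query_terms):
    # Crude stand-in for real word-sense disambiguation: "kid" next to a
    # disease phrase is taken to mean "human child", not "young goat".
    if term == "kid" and "mental disease" in query_terms:
        return "kid/human_child"
    return term

def expand(query_terms):
    """Expand each disambiguated term into an OR-group of itself, its synonyms,
    and its hyponyms; the final query is the AND of those groups."""
    groups = []
    for term in query_terms:
        sense = disambiguate(term, query_terms)
        entry = ontology.get(sense, {"synonyms": [], "hyponyms": []})
        surface = sense.split("/")[0]
        groups.append([surface] + entry["synonyms"] + entry["hyponyms"])
    return groups

print(expand(["mental disease", "kid"]))
# [['mental disease', 'mental illness', 'OCD', ...], ['kid', 'child', 'young person', 'toddler']]
```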

Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar?

Natural language processing systems have been notoriously challenged by scalability.  Recent massive upgrades in computer power have now made NLP a possibility in Web search.  CognitionSearch has sub-second response time and is fully distributed to as many processors as desired for both indexing and search.  Distribution is one solution to scalability.  Another, which CognitionSearch implements, is to compile all reasoning into the index, so that any delays caused by reasoning are not experienced by the end user.
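One way to read “compile all reasoning into the index” is that synonym, hyponym, and concept expansion happens at index time, so each posting list already carries the concept-level terms and no reasoning runs at query time. A minimal Python sketch of that idea, under that assumption (the concept table and documents are made up):

```python
from collections import defaultdict

# Assumed concept table: surface term -> concept identifiers it should be indexed under.
concepts = {
    "schizophrenia": ["mental_disease"],
    "toddler": ["human_child"],
}

def build_index(docs):
    """Build an inverted index that stores both the literal tokens and the
    concepts they map to, so no expansion or reasoning is needed at query time."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
            for concept in concepts.get(token, []):
                postings[concept].add(doc_id)
    return postings

docs = {1: "schizophrenia in a toddler", 2: "goat kid feeding schedule"}
idx = build_index(docs)
print(idx["mental_disease"] & idx["human_child"])   # {1} -- answered straight from the index
```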

Google has disrupted certain enterprise search markets with its appliance solution. The Google brand creates the idea in the minds of some procurement teams and purchasing agents that Google is the only or preferred search solution. What can a vendor do to adapt to this Google effect? Is Google a significant player in enterprise search, or is Google a minor player?

Google’s search appliance highlights the weakness of popularity-based searching.  On the Web, with Google’s vast history of searches, popularity is effective in positioning the more desired sites at the top of the relevance ranking.  Inside the enterprise, popularity is ineffective and Google performs as a plain pattern-matcher.  Competitive vendors need to explain this to clients, and even show them with head-to-head comparisons of search with Google and search with their software on the same data.  Google brand allegiance is a barrier to sales in enterprise search.

Information governance is gaining importance. Search / content processing is becoming part of eDiscovery or internal audit procedures. What’s your view of the role of search / content processing technology in these specialized sectors?

Intelligent search in eDiscovery can dig up the “smoking gun” of violations within an organization.  For example, in the recent mortgage crisis, buyers were lent money without proper proof of income.  Terms for this were “stated income only”, “liar loan”, “no-doc loan”, and “low-documentation loan”.  In eDiscovery, intelligent search such as CognitionSearch would find all mentions of that concept, regardless of the way it was expressed in documents and email.  Full exhaustiveness in search empowers lawyers analyzing discovery documents to find absolutely everything that is relevant or responsive.  Likewise, intelligent search empowers corporate oversight personnel, and corporate staff in general, to find the desired information without being inundated with irrelevant hits (retrievals).  Dedicated systems for eDiscovery and corporate search need only house the indices, not the original documents.  It should be possible to host a company-wide secure Web site for internal search at low cost.
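The “find every way the concept was expressed” step can be pictured as a concept dictionary that maps many surface phrases to one normalized label before matching. A toy Python sketch follows; the phrase list comes from the example above, while the concept label, messages, and matching logic are illustrative only.

```python
import re

# One concept, many surface expressions (from the mortgage-crisis example above).
CONCEPT = "undocumented_income_loan"
PHRASES = ["stated income only", "liar loan", "no-doc loan", "low-documentation loan"]
pattern = re.compile("|".join(re.escape(p) for p in PHRASES), re.IGNORECASE)

def mentions(documents):
    """Yield every document that mentions the concept, however it is phrased."""
    for doc_id, text in documents.items():
        if pattern.search(text):
            yield doc_id, CONCEPT

emails = {
    "msg-1": "Approve it as a no-doc loan and move on.",
    "msg-2": "Quarterly numbers attached.",
    "msg-3": "These liar loans are going to blow up on us.",
}
print(list(mentions(emails)))   # msg-1 and msg-3 surface, however the concept was worded
```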

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

Semantics and the semantic web have attracted a great deal of interest lately.  One type of semantic search involves tagging documents and Web sites and relating them to each other in a hierarchy expressed in the tags.  This type of semantic search enables taggers to control reasoning over the various documents or sites precisely, but it is labor-intensive.  Another type of semantic search runs on free text, is fully automatic, and uses semantically based software to automatically characterize the meaning of documents and sites, as with CognitionSearch.

Mobile search is emerging as an important branch of search / content processing. Mobile search, however, imposes some limitations on presentation and query submission. What are your views of mobile search’s impact on more traditional enterprise search / content processing?

Mobile search heightens the need for improved precision, because the devices don’t have space to display millions of results, most of which are irrelevant.

Where can I find more information about your products, services, and research?

http://www.cognition.com

Harry Collier, Infonortics, Ltd., February 18, 2009

Attensity’s Newest Partner

February 17, 2009

Attensity is out leveraging its text analytics software. The company just partnered up with enherent whose tagline is “Gather, manage and transform your data and content into timely, secure, actionable intelligence.” Now enherent will be using Attensity’s software to perform the analytics, manage risk, and review customer input on a larger scale. A press release said the idea is to take advantage of new “ideas in a time when business success depends on innovation.” While enherent gets the Attensity’s First Person Intelligence Platform with its vocabularies, analytics and subject matter expertise, Attensity gets exposure to a long list of customers and resources, more expertise in text analytics to advance its skill sets and a higher profile in the industry. Attensity looks like it’s making smart decisions for the future. Keep an eye on them.

Jessica W. Bratcher, February 17, 2009

Exclusive Interview with David Milward, CTO, Linguamatics

February 16, 2009

Stephen Arnold and Harry Collier interviewed David Milward, the chief technical officer of Linguamatics, on February 12, 2009. Mr. Milward will be one of the featured speakers at the April 2009 Boston Search Engine Meeting. You will find minimal search “fluff” at this important conference. The focus is upon search, information retrieval, and content processing. You will find no staffed trade show booths, no multi-track programs that distract, and no search engine optimization sessions. The Boston Search Engine Meeting is focused on substance from informed experts. More information about the premier search conference is here. Register now.

The full text of the interview with David Milward appears below:

Will you describe briefly your company and its search / content processing technology?

Linguamatics’ goal is to enable our customers to obtain intelligent answers from text – not just lists of documents.  We’ve developed agile natural language processing (NLP)-based technology that supports meaning-based querying of very large datasets. Results are delivered as relevant, structured facts and relationships about entities, concepts and sentiment.

Linguamatics’ main focus is solving knowledge discovery problems faced by pharma/biotech organizations. Decision-makers need answers to a diverse range of questions from text, both published literature and in-house sources. Our I2E semantic knowledge discovery platform effectively treats that unstructured and semi-structured text as a structured, context-specific database they can query to enable decision support.
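To make “treating unstructured text as a structured, context-specific database” concrete, picture the extraction step emitting subject-relation-object rows that can then be filtered like a table. The Python sketch below uses a deliberately crude regular expression as a stand-in for real NLP; the pattern, sentences, and field names are illustrative and are not Linguamatics’ I2E.

```python
import re

# Crude pattern standing in for NLP-based extraction: "<agent> inhibits/activates <target>".
FACT = re.compile(r"([\w-]+)\s+(inhibits|activates)\s+([\w-]+)", re.IGNORECASE)

def extract_facts(sentences):
    """Turn free-text sentences into (subject, relation, object, source) rows."""
    rows = []
    for i, sentence in enumerate(sentences):
        for subj, rel, obj in FACT.findall(sentence):
            rows.append({"subject": subj, "relation": rel.lower(),
                         "object": obj, "source": i})
    return rows

sentences = [
    "Imatinib inhibits BCR-ABL in chronic myeloid leukemia.",
    "The compound activates AMPK under stress conditions.",
]
facts = extract_facts(sentences)
# The text can now be queried like a structured table:
print([f for f in facts if f["relation"] == "inhibits"])
```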

Linguamatics was founded in 2001 and is headquartered in Cambridge, UK, with US operations in Boston, MA. The company is privately owned, profitable and growing, with I2E deployed at most top-10 pharmaceutical companies.

[Screenshot: Linguamatics splash page]

What are the three major challenges you see in search / content processing in 2009?

The obvious challenges I see include:

  • The ability to query across diverse high volume data sources, integrating external literature with in-house content. The latter content may be stored in collaborative environments such as SharePoint, and in a variety of formats including Word and PDF, as well as semi-structured XML.
  • The need for easy and affordable access to comprehensive content such as scientific publications, and being able to plug content into a single interface.
  • The demand by smaller companies for hosted solutions.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

People have traditionally been able to do simple querying across multiple data sources, but there has been an integration challenge in combining different data formats, and typically the rich structure of the text or document has been lost when moving between formats.

Publishers have tended to develop their own tools to support access to their proprietary data. There is now much more recognition of the need for flexibility to apply best of breed text mining to all available content.

Potential users were reluctant to trust hosted services when queries are business-sensitive. However, hosting is becoming more common, and a considerable amount of external search is already happening using Google and, in the case of life science researchers, PubMed.

What is your approach to problem solving in search and content processing?

Our approach encompasses all of the above. We want to bring the power of NLP-based text mining to users across the enterprise – not just the information specialists.  As such we’re bridging the divide between domain-specific, curated databases and search, by providing querying in context. You can query diverse unstructured and semi-structured content sources, and plug in terminologies and ontologies to give the context. The results of a query are not just documents, but structured relationships which can be used for further data mining and analysis.

Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar?

Our customers want scalability across the board – both in terms of the size of the document repositories that can be queried and also appropriate querying performance.  The hardware does need to be compatible with the task.  However, our software is designed to give valuable results even on relatively small machines.

People can have an insatiable demand for finding answers to questions – and we typically find that customers quickly want to scale to more documents, harder questions, and more users. So any text mining platform needs to be both flexible and scalable to support evolving discovery needs and maintain performance.  In terms of performance, raw CPU speed is sometimes less of an issue than network bandwidth especially at peak times in global organizations.

Information governance is gaining importance. Search / content processing is becoming part of eDiscovery or internal audit procedures. What’s your view of the role of search / content processing technology in these specialized sectors?

Implementing a proactive eDiscovery capability rather than reacting to issues when they arise is becoming a strategy to minimize potential legal costs. The forensic abilities of text mining are highly applicable to this area and have an increasing role to play in both eDiscovery and auditing. In particular, the ability to search for meaning and to detect even weak signals connecting information from different sources, along with provenance, is key.

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

Organizations are still challenged to maximize the value of what is already known – both in internal documents or in published literature, on blogs, and so on.  Even in global companies, text mining is not yet seen as a standard capability, though search engines are ubiquitous. This is changing and I expect text mining to be increasingly regarded as best practice for a wide range of decision support tasks. We also see increasing requirements for text mining to become more embedded in employees’ workflows, including integration with collaboration tools.

Graphical interfaces and portals (now called composite applications) are making a comeback. Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009? What semantic considerations do you bring to your product and research activities?

Customers recognize the value of linking entities and concepts via semantic identifiers. There’s effectively a semantic engine at the heart of I2E and so semantic knowledge discovery is core to what we do.  I2E is also often used for data-driven discovery of synonyms, and association of these with appropriate concept identifiers.

In the life science domain commonly used identifiers such as gene ids already exist.  However, a more comprehensive identification of all types of entities and relationships via semantic web style URIs could still be very valuable.

Where can I find more information about your products, services, and research?

Please contact Susan LeBeau (susan.lebeau@linguamatics.com, tel: +1 774 571 1117) and visit www.linguamatics.com.

Stephen Arnold (ArnoldIT.com) and Harry Collier (Infonortics, Ltd.), February 16, 2009

Evri Revs Up with a Dead Tree Client

February 14, 2009

I was surprised to receive a news release from an outfit that was off my radar. The company is called Evri and I think it is located in Seattle. The company’s top dog is a former Amazon wizard named Neil Roseman. The core of the news release was:

[The] washingtonpost.com will start adding the Evri content recommendation widget to all article pages beginning this week. The Evri Widget launches on article pages as they are posted on the site with the goal of helping washingtonpost.com’s readers discover related content from the newspaper and from across the Web that complement the articles they are reading.

To me, this means that Evri is a text and content processing company. Its system generates See Also and Use For references for stories on the Washington Post Web site. Links illustrating the technology are:

According to the news release:

Evri (www.evri.com) is a technology company developing products that change the way consumers discover and engage with content on the Web. Led by CEO Neil Roseman, Evri is a team of engineers with broad expertise in natural language processing and machine learning, who have delivered many successful Web products for great companies, including:  Amazon.com; DoubleClick; Microsoft; Real Networks; Sony, Yahoo and more. Neil joined Evri after more than 10 years at Amazon.com, most recently as Vice President of Software. Evri is based in Seattle, WA and is funded by Paul Allen’s Vulcan Capital.

Paul Allen was in the news at the goose pond this week. Charter Communications, another Paul Allen venture, filed for bankruptcy. You can read the details here. My list of more than 350 content processing companies now includes one more. More information as we receive it here in Harrod’s Creek.

The goslings and I congratulate the Washington Post on making the leap to automated systems. I wonder if the train has left the station. Newsweek seems to be limping and the last time I visited the Washington Post Web site about five minutes ago, performance seemed sluggish. Your mileage may vary. Network latency is a good way to explain architectural issues.

Stephen Arnold, February 14, 2009

Arnold Interviewed in Content Matters

February 12, 2009

A feather-preening item. Barry Graubart, who works for the hot Alacra and edits the Content Matters Web log, interviewed Stephen E. Arnold on February 11, 2009. The full text of the interview appears here. I read what I said and found it coherent, a radical change from most of my previous interview work for KDKA in Pittsburgh and a gig with a Charlotte 50,000 watt station. For me, the most interesting comment in the column was Mr. Graubart’s unexpected editorial opinion. Mr. Graubart graciously described me as the “author of the influential Beyond Search blog.” A happy quack for Barry Graubart.

Stephen Arnold, February 12, 2009

Francisco Corella, Pomcor, an Exclusive Interview

February 11, 2009

Another speaker on the program at Infonortics’ Boston Search Engine Meeting agreed to be interviewed by Harry Collier, the founder of the premier search and content processing event. Francisco Corella is one of the senior managers of Pomcor. The company’s Noflail search system leverages open source software and Yahoo’s BOSS (Build your Own Search Service). Navigate to the Infonortics.com Web site and sign up for the conference today. In Boston, you can meet Mr. Corella and other innovators in information retrieval.

The full text of the interview appears below:

Will you describe briefly your company and its search technology?

Pomcor is dedicated to Web technology innovation.  In the area of search we have created Noflail Search, a search interface that runs on the Flex platform.  Search results are currently obtained from the Yahoo BOSS API, but this may change in the future.   Noflail Search helps the user solve tough search problems by prefetching the results of related queries, and supporting the simultaneous browsing of the result sets of multiple queries.  It sounds complicated, but new users find the interface familiar and comfortable from the start.  Noflail Search also lets users save useful queries—yes, queries, not results.  This is akin to bookmarking the queries, but a lot more practical.

What are the three major challenges you see in search / content processing in 2009?

First challenge: what I call the indexable unit problem.  A Web page is often not the desired indexable unit.  If you want to cook sardines with triple sec (after reading Thurber) and issue a query [sardines “triple sec”] you will find pages that have a recipe with sardines and a recipe with triple sec.  If there is a page with a recipe that uses both sardines and triple sec, it may be buried too deep for you to find.  In this case the desired indexable unit is the recipe, not the page.  Other indexable units: articles in a catalog, messages in an email archive, blog entries, news.  There are ad-hoc solutions for blog entries and news, but no general-purpose solutions.
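One way to picture a fix for the indexable-unit problem is to index each declared unit of a page separately, so a query such as [sardines “triple sec”] only matches terms that co-occur inside one unit. The Python sketch below assumes the units are already marked up in the page; as Mr. Corella notes, a standard for that markup is exactly what is missing today.

```python
from collections import defaultdict

def index_units(pages):
    """Index each unit (e.g., each recipe) separately instead of the whole page."""
    postings = defaultdict(set)
    for url, units in pages.items():
        for unit_no, text in enumerate(units):
            unit_id = (url, unit_no)
            for token in text.lower().split():
                postings[token].add(unit_id)
    return postings

# A page holding two recipes: indexed page-wide it matches [sardines "triple sec"];
# indexed per unit it correctly does not.
pages = {"recipes.html": ["grilled sardines with lemon", "margarita with triple sec"]}
idx = index_units(pages)
print(idx["sardines"] & idx["triple"])   # set() -- no single recipe has both
```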

Second challenge: what I call the deep API problem.  Several search engines offer public Web APIs that enable search mashups.  Yahoo, in particular, encourages developers to reorder search results and merge results from different sources.  But no search API provides more than the first 1000 results from any result set, and you cannot reorder a set if you only have a tiny subset of its elements.  What’s needed is a deep API that lets you build your own index from crawler raw data or by combining multiple sources.

Third challenge: incorporate semantic technology into mainstream search engines.

With search processing decades old, what have been the principal  barriers to resolving these challenges in the past?

The three challenges have not been resolved for different reasons. Indexable units require a new standard to specify the units within a page, and a restructuring of the search engines; hence a lot of inertia stands in the way of a solution.  The need for a deep API is new and not widely recognized yet.  And semantics are inherently difficult.

What is your approach to problem solving in search and content processing? Do you focus on smarter software, better content processing, improved interfaces, or some other specific area?

Noflail Search is a substantial improvement on the traditional search interface.  Nothing more, nothing less.  It may be surprising that such an improvement is coming now, after search engines have been in existence for so many years.  Part of the reason for this may be that Google has a quasi-monopoly in Web search, and monopolies tend to stifle innovation.  Our innovations are a direct result of the appearance of public Web APIs, which lower the barrier to entry and foster innovation.

With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search / content processing?

The crisis may have both negative and positive effects on search innovation.  Financial pressure causes consolidation, which reduces innovation.  But the urge to reduce cost could also lead to the development of an ecosystem where different players solve different pieces of the search puzzle.  Some could specialize in crawler software, some in index construction, some in user interface improvements, some in various aspects of semantics, some in various vertical markets.

A technological ecosystem materialized in the 1980s for the PC industry and resulted in amazing cost reduction.  Will this happen again for search?  Today we are seeing mixed signals.  We see reasons for hope in the emergence of many alternative search engines, and the release by Microsoft of Live Search API 2.0 with support for revenue sharing. On the other hand, Amazon recently dropped Alexa, and Yahoo is now changing the rules of the game for Yahoo BOSS, reneging on its promise of free API access with revenue sharing.

Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar? Is performance a non issue?

Noflail Search is computationally demanding.  When the user issues a query, Noflail Search precomputes the result sets of up to seven related queries in addition to the result set of the original query, and prefetches the first page of each result set.  If the query has no results (which may easily happen in a search restricted to a particular Web site), it determines the most specific subqueries (queries formed from subsets of the original terms) that do produce results; this requires traversing the entire subgraph of subqueries with zero results and its boundary, computing the result set of each node.  All this is perfectly feasible and actually takes very little real time.

How do we do it? 

Since Noflail Search is built on the Flex platform, the code runs on the Flash plug-in in the user’s computer and obtains search results directly from the Yahoo BOSS API.  Furthermore, the code exploits the inherent parallelism of any Web API.  Related queries are all run simultaneously.  And the algorithm for traversing the zero-result subgraph is carefully designed to maximize concurrency.
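In Noflail Search this traversal runs as concurrent calls from the Flash client; the sequential Python sketch below shows only the graph walk itself. The result_count stub stands in for a call to the search API, and stopping at the first level that yields results is my simplification of the “most specific subqueries” idea.

```python
from itertools import combinations

def result_count(terms):
    # Stand-in for a hit count returned by a Web search API.
    # Pretend only queries without the term "triple" return results.
    return 0 if "triple" in terms else 10

def maximal_subqueries(terms):
    """If the full query returns nothing, walk down the subquery graph
    (each child drops one term) and keep the largest subqueries that
    do return results."""
    if result_count(terms) > 0:
        return [tuple(terms)]
    found, frontier, seen = [], [tuple(terms)], {tuple(terms)}
    while frontier and not found:          # stop at the first level that yields results
        next_frontier = []
        for node in frontier:
            for child in combinations(node, len(node) - 1):
                if not child or child in seen:
                    continue
                seen.add(child)
                (found if result_count(child) > 0 else next_frontier).append(child)
        frontier = next_frontier
    return found

print(maximal_subqueries(["sardines", "recipe", "triple", "sec"]))
# [('sardines', 'recipe', 'sec')] -- the largest term subset that still returns hits
```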

Yahoo, however, has just announced that it will be charging fees for API queries instead of sharing ad revenue.  If we continue to use Yahoo BOSS, it may not be economically feasible to prefetch the results of related queries or analyze zero results as we do now. Thus, although performance is a non-issue technically, demands on computational power have financial implications.

As you look forward, what are some new features / issues that you think will become more important in 2009?

Obviously we think that the new user interface features in Noflail Search are important and hope they’ll become widely used in 2009.  We have of course filed patent applications on the new features, but we are very willing to license the inventions to others. As for a breakthrough over the next 36 months, as a consumer of search, I very much hope that the indexable unit problem will be solved.  This would increase search accuracy and make life easier for everybody.

Where can I find more information about your products, services, and research?

Noflail Search is available at http://noflail.com/, and white papers on the new features can be found on the Search Technology page (http://www.pomcor.com/search_technology.html) of the Pomcor Web site (http://www.pomcor.com/).

Harry Collier, Infonortics Ltd., February 11, 2009

Patent Search from Perfect Search

February 10, 2009

The Perfect Search Corp. Google patent search attracted significant traffic on February 9, 2009. If you have not tried the system, navigate to http://arnoldit.perfectsearchcorp.com and then open a new window for your browser. You will want to point that new tab or window at http://www.google.com/patents. Now run the following query in each system: programmable search engine.

Here’s what Google displays to you:

[Screenshot: Google Patents results for the query “programmable search engine”]

The Google results force me to track down the full patent document to figure out if it is “sort of” what I want.

Run the same query, programmable search engine, on the ArnoldIT.com Perfect Search system and you get these results:

[Screenshot: ArnoldIT.com Perfect Search results for the same query]

The ArnoldIT.com and Perfect Search collection makes it easy to decide if a hit is germane. A single click delivers the PDF of the patent document. Everything is clear and within a single interface.

Try these services and run the same query. You will find that you can dig into Google’s technology quickly and without the silly extra steps that other services insist upon. The USPTO search system is here. How about that USPTO search interface? I don’t want to name the vendor who provides this system, and I don’t want to call attention to this company’s inability to make a user-accessible system. FreePatentsOnline.com is here. Do you understand the results and how you access the various parts of a patent document? I do, but it takes quite a bit of work.

Notice the differences. First, the abstract or summary of the patent is more useful because it contains substantive detail. Second, the key words in the query are in boldface, making it easy to spot the terms that were in your query. Third, notice the link to the PDF file. You don’t see fragments of the patent document. You get one-click access to the patent document plus the diagrams, if any. Fourth, because this collection includes only Google patent documents, you can easily explore the technical innovations that often “fly under the radar” of the Google watchers who deal with surface issues.

Information about the Perfect Search system is here. You can read an interview with one of the Perfect Search senior engineers, Ken Ebert, here. A happy quack to the Perfect Search team for contributing to this free Google patent document search and full text delivery service. Right now the system includes the Google patent documents that I have been able to identify in the course of my research into Google’s technical infrastructure. I cannot say with certainty that this collection has every Google patent application and granted patent. If you know of a Google patent document I have overlooked, please let me know. I am not an attorney, so take my advice: don’t use this system as your only source of Google patent information. It’s free and not the whiz-bang service that West and Lexis provide for a fee. A hefty fee, I might add.

Stephen Arnold, February 10, 2009

Semantic Engines’ Dmitri Soubbotin: Exclusive Interview

February 10, 2009

Semantics are booming. Daily I get spam from the trophy generation touting the latest and greatest in semantic technology. A couple of eager folks are organizing a semantic publishing system and gearing up for a semantic conference. I think these efforts are admirable, but I think that the trophy crowd confuses public relations with programming on occasion. Not Dmitri Soubbotin, one of the senior managers at Semantic Engines. Harry Collier and I were able to get the low-profile wizard to sit down and talk with us. Mr. Soubbotin’s interview with Harry Collier (Infonortics Ltd.) and me appears below.

Please keep in mind that Dmitri Soubbotin is one of the world-class experts in search, content processing, and semantic technologies who will be speaking at the April 2009 Boston Search Engine Meeting. Unlike fan-club conferences or SEO programs designed for marketers, the Boston Search Engine Meeting tackles substantive subjects in an informed way. The opportunity to talk with Mr. Soubbotin or any other speaker at this event is a worthwhile experience. The interview with Mr. Soubbotin makes clear the approach that the conference committee takes for the Boston Search Engine Meeting. Substance, not marketing hyperbole, is the focus of the two-day program. For more information and to register, click here.

Now the interview:

Will you describe briefly your company and its search / content processing technology?

Semantic Engines is mostly known for its search engine SenseBot (www.sensebot.net). The idea of it is to provide search results for a user’s query in the form of a multi-document summary of the most relevant Web sources, presented in a coherent order. Through text mining, the engine attempts to understand what the Web pages are about and extract key phrases to create a summary.

So instead of giving a collection of links to the user, we serve an answer in the form of a summary of multiple sources. For many informational queries, this obviates the need to drill down into individual sources and saves the user a lot of time. If the user still needs more detail, or likes a particular source, he may navigate to it right from the context of the summary.

Strictly speaking, this is going beyond information search and retrieval – to information synthesis. We believe that search engines can do a better service to the users by synthesizing informative answers, essays, reviews, etc., rather than just pointing to Web sites. This idea is part of our patent filing.

Other things that we do are Web services for B2B that extract semantic concepts from texts, generate text summaries from unstructured content, etc. We also have a new product for bloggers and publishers called LinkSensor. It performs in-text content discovery to engage the user in exploring more of the content through suggested relevant links.

What are the three major challenges you see in search / content processing in 2009?

There are many challenges. Let me highlight three that I think are interesting:

First,  Relevance: Users spend too much time searching and not always finding. The first page of results presumably contains the most relevant sources. But unless search engines really understand the query and the user intent, we cannot be sure that the user is satisfied. Matching words of the query to words on Web pages is far from an ideal solution.

Second, Volume: The number of results matching a user’s query may be well beyond human capacity to review them. Naturally, the majority of searchers never venture beyond the first page of results – exploring the next page is often seen as not worth the effort. That means that a truly relevant and useful piece of content that happens to be number 11 on the list may become effectively invisible to the user.

Third, Shallow content: Search engines use a formula to calculate page rank. SEO techniques allow a site to improve its ranking through the use of keywords, often propagating a rather shallow site up on the list. The user may not know if the site is really worth exploring until he clicks on its link.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

Not understanding the intent of the user’s query and matching words syntactically rather than by their sense – these are the key barriers preventing search engines from serving more relevant results. NLP and text mining techniques can be employed to understand the query and the Web pages’ content, and come up with an acceptable answer for the user. Analyzing Web page content on the fly can also help in distinguishing whether a page has value for the user or not.

Of course, the infrastructure requirements would be higher when semantic analysis is used, raising the cost of serving search results. This may have been another barrier to broader use of semantics by major search engines.

What is your approach to problem solving in search and content processing? Do you focus on smarter software, better content processing, improved interfaces, or some other specific area?

Smarter, more intelligent software. We use text mining to parse Web pages and pull out the most representative text extracts of them, relevant to the query. We drop the sources that are shallow on content, no matter how high they were ranked by other search engines. We then order the text extracts to create a summary that ideally serves as a useful answer to the user’s query. This type of result is a good fit for an informational query, where the user’s goal is to understand a concept or event, or to get an overview of a topic. The closer together the source documents are (e.g., in a vertical space), the higher the quality of the summary.
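A bare-bones illustration of that pipeline in Python follows: score sentences against the query, drop thin sources, and keep the best extracts. It is a toy, not SenseBot’s algorithm; the overlap scoring and the thresholds are my assumptions.

```python
def summarize(query, sources, max_sentences=3, min_words=40):
    """Pick the sentences most relevant to the query from sources that have
    enough substance, and return them as a short multi-document summary."""
    query_terms = set(query.lower().split())
    candidates = []
    for url, text in sources.items():
        if len(text.split()) < min_words:        # drop shallow sources
            continue
        for sentence in text.split(". "):
            overlap = len(query_terms & set(sentence.lower().split()))
            if overlap:
                candidates.append((overlap, sentence.strip(), url))
    candidates.sort(reverse=True)                # most query-relevant extracts first
    return [(sentence, url) for _, sentence, url in candidates[:max_sentences]]

sources = {
    "http://example.org/a": "Semantic search engines analyze meaning. " * 10,
    "http://example.org/b": "Too short to matter.",
}
print(summarize("semantic search meaning", sources))
```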

Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?

More and more, people expect to have the same features and user interface when they search at work as they get from home. The underlying difference is that behind the firewall the repositories and taxonomies are controlled, as opposed to the outside world. On one hand, this makes it easier for a search application within the enterprise, as it narrows its focus and the accuracy of search can get higher. On the other hand, additional features and expertise would be required compared to Web search. In general, I think the opportunities in the enterprise are growing for standalone search providers with unique value propositions.

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

I think the use of semantics and intelligent processing of content will become more ubiquitous in 2009 and beyond. For years, it has been making its way from academia to “alternative” search engines, occasionally showing up in the mainstream. I think we are going to see much higher adoption of semantics by major search engines, first of all Google. Things have definitely been in the works, showing as small improvements here and there, but I expect a critical mass of experimenting to accumulate and overflow into standard features at some point. This will be a tremendous shift in the way search is perceived by users and implemented by search engines. The impact on SEO techniques that are primarily keyword-based will be huge as well. I am not sure whether this will happen in 2009, but certainly within the next 36 months.

Graphical interfaces and portals (now called composite applications) are making a comeback. Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009? What semantic considerations do you bring to your product and research activities?

I expect to see higher proliferation of Semantic Web and linked data. Currently, the applications in this field mostly go after the content that is inherently structured although hidden within the text – contacts, names, dates. I would be interested to see more integration of linked data apps with text mining tools that can understand unstructured content. This would allow automated processing of large volumes of unstructured content, making it semantic web-ready.

Where can we find more information about your products, services, and research?

Our main sites are www.sensebot.net and www.semanticengines.com. LinkSensor, our tool for bloggers/publishers is at www.linksensor.com. A more detailed explanation of our approach with examples can be found in the following article:
http://www.altsearchengines.com/2008/Q7/22/alternative-search-results/.

Stephen Arnold (Harrod’s Creek, Kentucky) and Harry Collier (Tetbury, Glou.), February 10, 2009

Upgrades to ArnoldIT.com’s Google Patent Collection

February 9, 2009

I have made an effort to gather Google’s patent documents filed at the US Patent & Trademark Office. When The Google Legacy came out in 2005, I posted some of the Google patent documents referenced in that study. When Google Version 2.0 was published in 2007, I made available to those who purchased that study a link to the patent documents referenced in that monograph. These Google studies were based on my reading of Google’s open source technical information. Most of the Google books now available steer clear of Google’s systems and methods. My writings are intended for specialist readers, not the consumer audience.

You can now search the full text of Google’s patent documents from 1998 to 2008 by navigating to http://arnoldit.perfectsearchcorp.com. The Perfect Search engineers have indexed the XHTML versions of the documents which have been available on the ArnoldIT.com server. ArnoldIT.com has provided pointers so that a user can click on a link and access the PDF version of Google’s patent applications and patents. No more hunting for a specific patent document PDF using weird and arcane commands. Just click and you can view or download the PDF of a Google patent document. The service is free.

The ArnoldIT.com team has made an attempt to collect most Google patent documents, but there are a number of patent documents referenced in various Google documents that remain elusive. Keep in mind that the information is open source, and I am providing it as a partial step in a user’s journey to understand some aspects of Google. If you are an attorney, you should use the USPTO service or a commercial service from Westlaw or LexisNexis. Those organizations often assert comprehensiveness, accuracy, and a sensitivity to the nuances of legal documents. I am providing a collection that supports my research.

Google is now a decade old, and there is considerable confusion among those who use and analyze Google with regard to the company’s technology. Google provides a patent search service here. But I find it difficult to use, and in some cases, certain documents seem to be hard for me to find.

I hope that Googlers who are quick to tell me that I am writing about Google technology that Google does not possess will be able to use this collection to find Google’s own documents. I have learned that trophy generation Googlers don’t read some of their employer’s open source documents, government filings, and technical papers.

Perfect Search Corp. is the first company to step forward and agree to index these public domain documents. You will find that the Perfect Search system is very fast, and you can easily pinpoint certain Google patent documents in a fraction of the time required when you use Google’s own service or the USPTO’s sluggish and user hostile system.

“The Google Patent Demonstration illustrates the versatility of the Perfect Search system. Response time is fast, precision and recall are excellent, and access to Google’s inventions is painless,” Tim Stay, CEO of Perfect Search, said.

Perfect Search’s software uses semantic technology and allows clients to index and search massive data sets with near real-time incremental indexing at high speeds without latency. It is meant to augment the Google Search Appliance.

Perfect Search technology, explained in depth at http://www.perfectsearchcorp.com/technology-benefits, provides a very economical single-server solution for customers to index files and documents and can add the capability of indexing large amounts of database information as well.

Perfect Search is a software innovation company that specializes in the development of search solutions. A total of eight patents have been applied for around the developing technology. The suite of search products at http://www.perfectsearchcorp.com/our-products is available on multiple platforms, from small mobile devices, to single servers, to large server farms. For more information visit http://www.perfectsearchcorp.com/, call +1.801.437.1100 or e-mail info@perfectsearchcorp.com.

In the future, I would like to make this collection available to other search and content processing companies. The goal would be to allow users to be able to dig into some of Google’s inventions and learn about the various search systems. Head-to-head comparisons are very useful, but very few organizations in my experience take the time to prepare a test corpus and then use different systems to determine which is more appropriate for a particular application.

If you have suggestions for this service, use the comments section for this Web log.

Stephen Arnold, February 9, 2009
