Attivio’s Sid Probstein: An Exclusive Interview

February 25, 2009

I caught up with Sid Probstein, Attivio’s engaging chief technologist, on February 23, 2009. Attivio is a new breed of information company. The company combines a number of technologies to allow its licensees to extract more value from structured and unstructured information. Mr. Probstein is one of the speakers at the Boston Search Engine Meeting, a show that is now recognized as one of the most important venues for those serious about search, information retrieval, and content processing. You can register to attend this year’s conference here. Too many conferences feature confusing multi-track programs, cavernous exhibit halls, and annoyed attendees who find that the substance of the program does not match the marketing hyperbole. When you attend the Boston Search Engine Meeting, you have opportunities to talk directly to influential experts like Mr. Probstein. The full text of the interview appears below.

Will you describe briefly your company and its search / content processing technology? If you are not a company, please, describe your research in search / content processing.

Attivio’s Active Intelligence Engine (AIE) is powering today’s critical business solutions with a completely new approach to unifying information access. AIE supports querying with the precision of SQL and the fuzziness of full-text search. Our patent-applied-for query-side JOIN() operator allows relational data to be manipulated as a database would manipulate it, but in combination with full-text operations like fuzzy search, fielded search, Boolean search, and so on. Finally, our ability to save any query as an alert, and thereafter have new data trigger a workflow that may notify a user or update another system, brings a sorely needed “active” component to information access.
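Attivio does not spell out its query syntax in this interview, so the short Python sketch below is only a generic illustration of the idea behind a query-side join: filtering structured rows with relational precision while matching unstructured text fuzzily, in a single pass. The tables, fields, and matching rule are hypothetical, not AIE’s API.

```python
# Rough sketch of a query-side join between structured rows and fuzzy
# full-text matches. Illustrative only; not Attivio's AIE syntax or API.
from difflib import SequenceMatcher

# Hypothetical structured data (what a database table might hold).
customers = [
    {"customer_id": 1, "region": "EMEA"},
    {"customer_id": 2, "region": "APAC"},
]

# Hypothetical unstructured content keyed to the same customer_id.
support_notes = [
    {"customer_id": 1, "text": "Custmer reports the indexer crashes nightly"},
    {"customer_id": 2, "text": "Routine billing question, resolved"},
]

def fuzzy_hit(text, term, threshold=0.8):
    """True if any token in the text is close enough to the query term."""
    return any(SequenceMatcher(None, tok.lower(), term).ratio() >= threshold
               for tok in text.split())

# "Join" a relational filter (region = 'EMEA') with a fuzzy text match
# ("customer", tolerating the misspelling "Custmer") in one query.
results = [
    (c, n) for c in customers if c["region"] == "EMEA"
    for n in support_notes
    if n["customer_id"] == c["customer_id"] and fuzzy_hit(n["text"], "customer")
]
print(results)
```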

By extending enterprise search capabilities across documents, data and media, AIE brings deeper insight to business applications and Web sites. AIE’s flexible design enables business and technology leaders to speed innovation through rapid prototyping and deployment, which dramatically lowers risk – an important consideration in today’s economy. Systems integrators, independent software vendors, corporations and government agencies partner with Attivio to automate information-driven processes and gain competitive advantage.

What are the three major challenges you see in search / content processing in 2009?

May I offer three plus a bonus challenge?

First, understanding structured and unstructured data; currently most search engines don’t deal with structured data as it exists; they remove or require removal of the relationships. Retaining these relationships is the key challenge and a core value of information access.

Second, switching from the “pull” model in which end-users consume information, to the “push” model in which end-users and information systems are fed a stream of relevant information and analysis.

Third, being able to easily and rapidly construct information access applications. The year-long implementation cycle simply won’t cut it in the current climate; after all, that was the status quo for the past five years – long, challenging implementations, as search was still nascent. In 2009 what took months should take weeks. Also, the model has to change. Instead of trying to determine exactly how to build your information access strategy – the classic “aim, fire” approach, which often misses! – the new model is to “fire” and then “aim, aim, aim” – correct your course and learn as you go so that you ultimately produce an application you are delighted with.

I also want to mention supporting complex analysis and enrichment of many different forms of content: for example, identifying important fields from a search perspective, or detecting relationships between pieces of content, or between entire silos of content. This is key to breaking down silos – something leading analysts agree will be a major focus in enterprise IT starting in 2011.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

There are several hurdles. First, the inverted index structure has not traditionally been able to deal with relationships, just terms and documents. Second, the lack of tools to move data around, as opposed to simply obtaining content, has been a barrier for enterprise search in particular; there has not been an analog to “ETL” in the unstructured world. (The “connector” standard is about getting data, not moving it.) Finally, the lack of a truly dynamic architecture has meant having to re-index when changing configuration or adding new types of data to the index, and the lack of support for rapid updates has led to a proliferation of paired search engines and databases.
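For readers who have not looked inside an inverted index, here is a minimal sketch (toy documents, Python) of the classic structure Mr. Probstein refers to: it maps terms to document identifiers, so a relationship between rows in two database tables has nowhere to live unless it is flattened away.

```python
from collections import defaultdict

# Two toy "documents"; any join keys they once carried are now just text.
docs = {
    1: "order 1001 shipped to acme corp",
    2: "invoice for order 1001 is overdue",
}

# A classic inverted index: term -> set of document ids. Nothing in this
# structure records that doc 1 and doc 2 refer to the same order row.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(index["order"])   # {1, 2} -- term/document pairs only
print(index["1001"])    # {1, 2}
```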

With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search / content processing?

Information access is critically important during a recession. Every interaction with the customer has the potential to cause churn. Reducing churn is far less costly than acquiring new customers. Good service is one of the keys to retaining customers, and a typical cause of poor service is … poor information access. A real-life example: I recently rolled over my 401(k). I had 30 days to do it, and did so on the 28th day by phone. On the 29th day someone else from my financial services firm called back and asked me if I wanted to roll over my 401(k). This was quite surprising. When asked why the representative didn’t know I had done it the day before, they said, “I don’t have access to that information.” The cost of that information access problem was two phone calls: the second rollover call, and then another call back from me to verify that I had, in fact, rolled over my 401(k).

From the internal perspective of IT, demand to turn around information access solutions will be higher than ever. The need to show progress quickly has never been greater, so selecting tools that support rapid development via iteration and prototyping is critically important.

Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?

Search is an essential feature in almost every application used to create, manage or even analyze content. However, in this mode search is both a commodity and a de facto silo of data. Standalone search and content processing will still be important because it is the best way to build applications using data across these silos. A good example here is what we call the Agile Content Network (ACN). Every content management system (CMS) has at least minimal search facilities. But how can a content provider create new channels and micro-sites of content across many incompatible CMSs? Standalone information access that can cut across silos is the answer.

Google has disrupted certain enterprise search markets with its appliance solution. The Google brand creates the idea in the minds of some procurement teams and purchasing agents that Google is the only or preferred search solution. What can a vendor do to adapt to this Google effect?

It is certainly true that Google has a powerful brand. However, vendors must promote transparency and help educate buyers so that they realize, on their own, the fit or non-fit of the GSA. It is also important to explain how your product differs from what Google does and how those differences apply to the customer’s needs for accessing information. Buyers are smart, and the challenge for vendors is to be sure to communicate and educate about needs, goals and the most effective way to attain them.

A good example of the Google brand blinding customers to their own needs is detailed in the following blog entry: http://www.attivio.com/attivio/blog/317-report-from-gilbane-2008-our-take-on-open-source-search.html

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

I think that there continue to be no real standards around information access. We believe that older standards like SQL need to be updated with full-text capabilities. Legacy enterprise search vendors have traditionally focused on proprietary interfaces or driving their own standards. This will not be the case for the next wave of information access companies. Google and others are showing how powerful language modeling can be. I believe machine translation and various multi-word applications will all become part of the landscape in the next 36 months.

Mobile search is emerging as an important branch of search / content processing. Mobile search, however, imposes some limitations on presentation and query submission. What are your views of mobile search’s impact on more traditional enterprise search / content processing?

Mobile information access is definitely emerging in the enterprise. In the short term, it needs to become the instrument by which some updates are delivered – as alerts – and in other cases it is simply a notification that a more complex update – perhaps requiring a laptop – is available. In time mobile devices will be able to enrich results on their own. The iPhone, for example, could filter results using GPS location. The iPhone also shows that complex presentations are increasingly possible.
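To make the GPS point concrete, here is a minimal sketch of filtering search results by distance from a device’s reported position. The result set, coordinates, and 50 km radius are invented for illustration; this is not a description of any particular mobile search product.

```python
from math import radians, sin, cos, asin, sqrt

# Hypothetical search results, each tagged with a latitude/longitude.
results = [
    {"title": "Boston office wiki page", "lat": 42.3601, "lon": -71.0589},
    {"title": "London office wiki page", "lat": 51.5074, "lon": -0.1278},
]

def km_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Keep only results within 50 km of the device's reported GPS fix.
device_lat, device_lon = 42.36, -71.06   # assumed position near Boston
nearby = [r for r in results
          if km_between(device_lat, device_lon, r["lat"], r["lon"]) <= 50]
print([r["title"] for r in nearby])
```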

Ultimately, mobile devices, like the desktop, the call center, the digital home, and the brick-and-mortar store kiosk, are all access and delivery channels. Getting the information flow for each to work consistently while taking advantage of the intimacy of the medium (e.g. GPS information for mobile) is the future.

Where can I find more information about your products, services, and research?

The best place is our Web site: www.attivio.com.

Stephen Arnold, February 25, 2009

Exclusive Interview, Martin Baumgartel, From Library Automation to Search

February 23, 2009

For many years, Martin Baumgartel worked for a unit of T-Mobile. His experience spans traditional information retrieval and next-generation search. Stephen Arnold and Harry Collier interviewed Mr. Baumgartel on February 20, 2009. Mr. Baumgartel is one of the featured speakers at the premier search conference this spring, where you will be able to hear his lecture and meet with him in the networking and post-presentation breaks. The Boston Search Engine Meeting attracts the world’s brightest minds and most influential companies to an “all content” program. You can learn more about the conference, the tutorials, and the speakers at the Infonortics Ltd. Web site. Unlike other conferences, the Boston Search Engine Meeting limits attendance in order to facilitate conversations and networking. Register early for this year’s conference.

What’s your background in search?

When I entered the search arena in the 1990s, I came from library automation. Back then, it was all about indexing algorithms and relevance ranking, and I did research to develop a search engine. During eight years at T-Systems, we analyzed the situation in large enterprises in order to provide the right search solution. This included, increasingly, the integration of semantic technologies. Given the present hype about semantic technologies, it has been a focus in current projects to determine which approach or product can deliver in specific search scenarios. A related problem is to identify the underlying principles of user interface innovations in order to know what’s going to work (and what’s not).

What are the three major challenges you see in search / content processing in 2009?

Let me come at this in a non-technical way. There are plenty of challenges awaiting algorithmic solutions, but I see more important challenges here:

  1. Identifying the real objectives and fighting myths. Implementing internal search has not become any easier for an organization today. There are numerous internal stakeholders, paired with very high user expectations (they want the same quality as Internet search, only better, more tailored to their work situation and without advertising…). Keeping a sharp analysis becomes difficult in an orchestra of opinions, in particular when familiar brand names get involved (“Let’s just take Google internally; that will do.”)
  2. Avoiding false simplicity. Although many CIOs claim they have “cleaned up” their intranets, enterprise search remains complex, both technologically and in terms of successful management. Tackling the problem with a self-proclaimed simple solution (plug in, ready, go) will provide search, but perhaps not the search solution needed, and with hidden costs, especially over the long run. At the other extreme, a design that is too complex – with the purchase of dozens of connectors – is likely to burst your budget.
  3. Attention. Recently, I have heard a lot about how the financial crisis will affect search. In my view, the effects only reinforce the challenge of drawing enough management attention to search to make sure it is treated like other core assets. Some customers might slow down the purchase of some SAP add-on modules or postpone a migration to the next version of backup software, but the status of those solutions among CIOs will remain high and unquestioned.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

There’s no unique definition of the “enterprise search problem” as there would be for a math theorem. Therefore, you find somewhat amorphous definitions of what is to be solved. Take the scope of content to be searched: everything internal? And nothing external? Another obstacle is the widespread belief in shortcuts. A popular example: let’s just index the content in our internal content management system; the other content sources are irrelevant. That way, the concept of completeness in the search/result set is sacrificed. But search can be as grueling as a marathon: you need endurance, and there are no shortcuts. If you take a shortcut, you’ve failed.

What is your approach to problem solving in search and content processing?

Smarter software, definitely, because the challenges in search (and there are more than three) are attracting programmers and innovators to come up with new solutions. But, in general, my approach is “keep your cool”: assess the situation, analyze tools and environment, design the solution and explain it clearly. In the process, interfaces sometimes have to be improved in order to trim them down to fit the corporate intranet design.

With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search / content processing?

We’ll see how far a consolidation process will go. Perhaps we’ll see discontinued search products where we initially didn’t expect it. Also, the relationship raised in the following question might be affected: software companies are unlikely to cut back on core features of their products, but integrated search functions may well be identified for the scalpel.

Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?

I’ve seen it the other way around: customer support managers told me (the search person) that the built-in search tool is OK but that they would like to look up additional information from other internal applications. I don’t believe that built-in search will replace stand-alone search. The term “built-in” tells you that the main purpose of the application is something else. No surprise, then, that the user interface, for instance, was designed for this main purpose – and will, as a result, not address the typical needs of search.

Google has disrupted certain enterprise search markets with its appliance solution. What can a vendor do to adapt to this Google effect?

To address this Google effect, a vendor should point out where it differs from Google and why that matters.

But I see Google as a significant player in enterprise search, if only for the mindset of procurement teams you describe in your question.

As you look forward, what are some new features / issues that you think will become more important in 2009?

The issue of cloudsourcing will gain traction. As a consequence, small and medium-sized enterprises, and not only they, will discover that they need not invest in in-house content management and collaboration applications but can use a hosted service instead. This is when you need more than “behind the firewall” search, because content will be scattered across multiple clouds (a CRM cloud, an Office cloud). I’m not sure whether we will see a breakthrough there within 36 months, but the sooner the better.

Where can I find more information about your services and research?

http://www.linkedin.com/in/mbaumgartel

Stephen E. Arnold, www.arnoldit.com/sitemap.html and Harry Collier, www.infonortics.com

More Conference Woes

February 19, 2009

The landscape of conferences is like the hillside in the aftermath of Mount St. Helens. Erick Schonfeld has a useful write-up about DEMO, a tony conference that seems to be paddling upstream. You can read his article “DEMO Gets Desperate: Shipley Out, Marshall In” here. DEMO is just one of many conferences facing a tough market with an approach that strikes me as expensive and better suited to an economy past. I received an email this morning from a conference organizer who sent me a request to propose a paper. My colleague in Toronto and I proposed a paper based on new work we had done in content management and search. The conference organizer told us that there were too many papers on that type of subject but that we were welcome to pay the registration fee and come to hear other speakers. My colleague and I wondered, “First, the organizer asks us to talk, then baits and switches us into becoming paid attendees.” Our reaction was, “Not this time.” Here’s what I received in my email this morning, February 19, 2009:

Due to the current economy, I have decided to extend the Content Management ****/**** North America Conference Valentine’s Day discounted rate to March 2, 2009. This is a $200 discount for all Non-**** members. (**** members can register at anytime at a $300 discounted member rate.) This is meant for those of you needing additional time to get approval to attend the conference. I understand that with the current economy it is becoming harder to obtain funding for educational events. Hopefully by offering this type of discount I will be able to give you the extra support needed to get that final approval. [Emphasis added]

I have masked the specifics of this conference, but I read this with some skepticism.

Valentine’s Day is over. I surmise the traditional conference business is headed in that direction as well.

Telling me via an email that I need additional time to get approval to attend a conference is silly. I own my business. Furthermore, the organizer’s appeal makes me suspicious of not just this conference but others that have been around a long time and offer little in the way of information that exerts a magnetic pull on me.

Conferences that have lost their sizzle are like my mom’s burned roast after a couple of days in the trash can. Not too appealing. What’s the fix? Innovation and creative thinking. Conference organizers who “run the game plan” don’t meet my needs right now. Venture Beat type conferences do.

Stephen Arnold, February 19, 2009

Exclusive Interview with Kathleen Dahlgren, Cognition Technologies

February 18, 2009

Cognition Technologies’ Kathleen Dahlgren spoke with Harry Collier about her firm’s search and content processing system. Cognition’s core technology, Cognition’s Semantic NLP™, is the outgrowth of ideas and development work which began over 23 years ago at IBM, where Cognition’s founder and CTO, Kathleen Dahlgren, Ph.D., led a research team to create the first prototype of a “natural language understanding system.” In 1990, Dr. Dahlgren left IBM and formed a new company called Intelligent Text Processing (ITP). ITP applied for and won an innovative research grant from the Small Business Administration. This funding enabled the company to develop a commercial prototype of what would become Cognition’s Semantic NLP. That work won a Small Business Innovation Research (SBIR) award for excellence in 1995. In 1998, ITP was awarded a patent on a component of the technology.

Dr. Dahlgren is one of the featured speakers at the Boston Search Engine Meeting. This conference is the world’s leading venue for substantive discussions about search, content processing, and semantic technology. Attendees have an opportunity to hear talks by recognized leaders in information retrieval and then speak with these individuals, ask questions, and engage in conversations with other attendees. You can get more information about the Boston Search Engine Meeting here.

The full text of Mr. Collier’s interview with Dr. Dahlgren, conducted on February 13, 2009, appears below:

Will you describe briefly your company and its search / content processing technology?

CognitionSearch uses linguistic science to analyze language and provide meaning-based search. Cognition has built the largest semantic map of English, with morphology (word stems such as catch-caught, baby-babies, communication-intercommunication), word senses (“strike” meaning hit, “strike” as a state in baseball, etc.), synonymy (“strike” meaning hit, “beat” meaning hit, etc.), hyponymy (“vehicle”-“motor vehicle”-“car”-“Ford”), meaning contexts (“strike” means a game state in the context of “baseball”) and phrases (“bok-choy”). The semantic map enables CognitionSearch to unravel the meaning of text and queries, with the result that search performs with over 90% precision and 90% recall.
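Cognition’s semantic map is proprietary; the toy fragment below only illustrates the kinds of relations Dr. Dahlgren lists (morphology, senses, synonyms, hyponyms, contexts, phrases). All entries are invented and far smaller than the actual resource.

```python
# Toy fragment of a semantic map. Entries and relation names are invented
# for illustration; they do not reflect Cognition's internal representation.
semantic_map = {
    "morphology": {"caught": "catch", "babies": "baby"},
    "senses": {"strike": ["hit", "baseball_call"]},
    "synonyms": {"hit": ["strike#hit", "beat#hit"]},
    "hyponyms": {"vehicle": ["motor vehicle"], "motor vehicle": ["car"], "car": ["Ford"]},
    "contexts": {("strike", "baseball"): "baseball_call"},
    "phrases": ["bok choy", "mental disease"],
}

# Example lookups: normalize a surface form, then pick a sense in context.
stem = semantic_map["morphology"].get("caught", "caught")
sense = semantic_map["contexts"].get(("strike", "baseball"), "hit")
print(stem, sense)   # catch baseball_call
```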

What are the three major challenges you see in search / content processing in 2009?

That’s a good question. The three challenges in my opinion are:

  1. Too much irrelevant material retrieved – poor precision
  2. Too much relevant material missed – poor recall
  3. Getting users to adopt the new ways of searching that are available with advanced search technologies. NLP semantic search offers users the opportunity to state longer queries in plain English and get results, but users are currently accustomed to keywords, so some adaptation will be required to take advantage of the new technology.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

Poor precision and poor recall are due to the use of pattern-matching and statistical search software. As long as meaning is not recovered, current search engines will return mostly irrelevant material. Statistics on popularity boost many of the relevant results to the top, but measured across all retrievals, precision is under 30%. Poor recall means that sometimes there are no relevant hits, even though there may be many hits. This is because the alternative ways of expressing the user’s intended meaning in the query are not understood by the search engine. If synonyms are added without first determining meaning, recall can improve, but at the expense of extremely poor precision, because all the synonyms of an ambiguous word, in all of its meanings, are used as search terms. Most of these are off target. While the ambiguous words in a language are relatively few, they are among the most frequent words. For example, the seventeen thousand most frequent words of English tend to be ambiguous.
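For reference, the two measures under discussion have standard definitions: precision is the share of retrieved documents that are relevant, recall the share of relevant documents that are retrieved. The counts in this quick worked example are invented, not Cognition’s evaluation data.

```python
def precision_recall(retrieved, relevant):
    """Standard IR measures: precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

retrieved = {"d1", "d2", "d3", "d4"}        # what the engine returned
relevant = {"d1", "d4", "d7", "d9", "d10"}  # what the user actually needed
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.0%} recall={r:.0%}")  # precision=50% recall=40%
```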

What is your approach to problem solving in search and content processing?

Cognition focuses on improving search by improving the underlying software and making it mimic human linguistic reasoning in many respects. CognitionSearch first determines the meanings of words in context and then searches on the particular meanings of the search terms, their synonyms (also disambiguated) and hyponyms (more specific word meanings in a concept hierarchy or ontology). For example, given a search for “mental disease in kids”, CognitionSearch first determines that “mental disease” is a phrase synonymous with an ontological node, and that “kids” has the stem “kid” and means “human child”, not a type of “goat”. It then finds documents with sentences containing “mental disease” or “OCD” or “obsessive compulsive disorder” or “schizophrenia”, etc., and “kid” (meaning human child) or “child” (meaning human child) or “young person” or “toddler”, etc.
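A rough sketch of the query flow described above, disambiguate first and then expand the chosen sense with synonyms and hyponyms, might look like this. The tiny lexicon is hand-built for illustration and is not Cognition’s code.

```python
# Minimal word-sense disambiguation plus expansion. The lexicon is invented.
lexicon = {
    "kid": {"senses": {"human_child": ["child", "toddler", "young person"],
                       "young_goat": ["goatling"]}},
    "mental disease": {"senses": {"mental_illness": ["OCD", "schizophrenia"]}},
}

def expand(term, sense):
    """Return the term plus synonyms/hyponyms for one chosen sense."""
    return [term] + lexicon[term]["senses"][sense]

# "mental disease in kids": pick the human_child sense of "kid" (the phrase
# context makes the goat sense implausible), then expand both terms.
query_terms = [expand("mental disease", "mental_illness"),
               expand("kid", "human_child")]
print(query_terms)
# A document matches if it contains at least one expansion from each group.
```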

Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar?

Natural language processing systems have been notoriously challenged by scalability. Recent massive upgrades in computer power have now made NLP a possibility in Web search. CognitionSearch has sub-second response time and is fully distributed across as many processors as desired for both indexing and search. Distribution is one solution to scalability. Another, which CognitionSearch implements, is to compile all reasoning into the index, so that delays caused by reasoning are not experienced by the end user.
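The design choice of compiling reasoning into the index can be pictured with a small sketch: the expansion work happens once at indexing time, so the query path is a plain lookup. This is a generic illustration, not CognitionSearch internals; the expansion table is invented.

```python
from collections import defaultdict

# Invented expansion table: at indexing time, each token is posted under
# itself and under its broader concept, so queries need no reasoning step.
broader = {"schizophrenia": "mental disease", "toddler": "child"}

index = defaultdict(set)
docs = {1: "schizophrenia in a toddler", 2: "billing dispute"}
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)
        if token in broader:
            index[broader[token]].add(doc_id)   # reasoning baked into the index

# Query time is now a simple, fast lookup -- no expansion, no inference.
print(index["mental disease"])  # {1}
```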

Google has disrupted certain enterprise search markets with its appliance solution. The Google brand creates the idea in the minds of some procurement teams and purchasing agents that Google is the only or preferred search solution. What can a vendor do to adapt to this Google effect? Is Google a significant player in enterprise search, or is Google a minor player?

Google’s search appliance highlights the weakness of popularity-based searching. On the Web, with Google’s vast history of searches, popularity is effective in positioning the more desired sites at the top of the relevance ranking. Inside the enterprise, popularity is ineffective and Google performs as a plain pattern-matcher. Competitive vendors need to explain this to clients, and even show them with head-to-head comparisons of search with Google and search with their software on the same data. Google brand allegiance is a barrier to sales in enterprise search.

Information governance is gaining importance. Search / content processing is becoming part of eDiscovery or internal audit procedures. What’s your view of the role of search / content processing technology in these specialized sectors?

Intelligent search in eDiscovery can dig up the “smoking gun” of violations within an organization.  For example, in the recent mortgage crisis, buyers were lent money without proper proof of income.  Terms for this were “stated income only”, “liar loan”, “no-doc loan”, “low-documentation loan”.  In eDiscovery, intelligent search such as CognitionSearch would find all mentions of that concept, regardless of the way it was expressed in documents and email.  Full exhaustiveness in search empowers lawyers analyzing discovery documents to find absolutely everything that is relevant or responsive.  Likewise, intelligent search empowers corporate oversight personnel, and corporate staff in general, to find the desired information without being inundated with irrelevant hits (retrievals).  Dedicated systems for eDiscovery and corporate search  need only house the indices, not the original documents.  It should be possible to host a company-wide secure Web site for internal search at low cost.

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

Semantics and the semantic web have attracted a great deal of interest lately. One type of semantic search involves tagging documents and Web sites and relating them to each other in a hierarchy expressed in the tags. This type of semantic search enables taggers to control precisely the reasoning applied to the various documents or sites, but it is labor-intensive. Another type of semantic search runs on free text, is fully automatic, and uses semantically based software to automatically characterize the meaning of documents and sites, as with CognitionSearch.

Mobile search is emerging as an important branch of search / content processing. Mobile search, however, imposes some limitations on presentation and query submission. What are your views of mobile search’s impact on more traditional enterprise search / content processing?

Mobile search heightens the need for improved precision, because the devices don’t have space to display millions of results, most of which are irrelevant.

Where can I find more information about your products, services, and research?

http://www.cognition.com

Harry Collier, Infonortics, Ltd., February 18, 2009

Interview with Janus Boye: New Search Tutorial

February 18, 2009

In the last three years, Janus Boye, Managing Director of JBoye in Denmark, has been gaining influence as a conference organizer. In 2009, Mr. Boye is expanding to the United Kingdom and the United States. Kenny Toth, ArnoldIT.com, spoke with Mr. Boye on February 17, 2009. The full text of the interview appears below.

Why are you sponsoring a tutorial with the two enterprise search experts, Martin White and Stephen Arnold?

Personally, I’m very fascinated with search, as it is one of the complex challenges of the Web that remains essentially unsolved. For a while I’ve wanted to create a seminar on search that would cover technology, implementation and management to really assist our many community of practice members who tell me on a regular basis that search is broken. Some of them have invested heavily in software and found that even the most expensive software products do not deliver successful search on their own. Some have also seen their vendor either go bankrupt or be acquired. Beyond vendors, many members have underestimated the planning required to make search work. Martin White and Stephen Arnold have recently published a new report on Successful Enterprise Search Management, which the seminar is modeled after.


Janus Boye. http://www.jboye.com

What will the attendees learn in the tutorial?

My goal is that at the end of the seminar, attendees will understand the business and management issues that impact a successful implementation. The attendees will learn about how the marketplace is shifting, what skills you need in your team, what can go wrong and how you avoid it, and how you get the most out of your consultants and vendors.

Isn’t search a stale subject? What will be new and unusual about this tutorial?

Search is far from a stale subject. If you are among those who use SharePoint every day, you know that search still has a long way to go. Come to the seminar and learn about the larger trends driving the market as well as recent developments, such as the Microsoft FAST roadmap.

Will these be lectures or will there be interactivity between the experts and the audience?

The agenda for the seminar is designed so that there will be plenty of room for interactivity. The idea is that delegates can get answers to their burning questions. There will be room for Q & A, and some sessions are also divided into sub-groups so that delegates can discuss their challenges in smaller groups.

If I attend, what will be the three or four takeaways from this show?

There will be several takeaways at the seminar, in particular around themes such as content, procurement, implementation, security, social search, language and the vendor marketplace.

Where is the tutorial and what are the details?

The tutorial will be held in London, UK. See http://www.jboye.co.uk/events/workshop-successful-enterprise-search-management-q209/ for more.

Kenny Toth, February 18, 2009

Exclusive Interview with David Milward, CTO, Linguamatics

February 16, 2009

Stephen Arnold and Harry Collier interviewed David Milward, the chief technical officer of Linguamatics, on February 12, 2009. Mr. Milward will be one of the featured speakers at the April 2009 Boston Search Engine Meeting. You will find minimal search “fluff” at this important conference. The focus is upon search, information retrieval, and content processing. You will find no staffed trade show booths, no multi-track programs that distract, and no search engine optimization sessions. The Boston Search Engine Meeting is focused on substance from informed experts. More information about the premier search conference is here. Register now.

The full text of the interview with David Milward appears below:

Will you describe briefly your company and its search / content processing technology?

Linguamatics’ goal is to enable our customers to obtain intelligent answers from text – not just lists of documents. We’ve developed agile natural language processing (NLP)-based technology that supports meaning-based querying of very large datasets. Results are delivered as relevant, structured facts and relationships about entities, concepts and sentiment.

Linguamatics’ main focus is solving knowledge discovery problems faced by pharma/biotech organizations. Decision-makers need answers to a diverse range of questions from text, both published literature and in-house sources. Our I2E semantic knowledge discovery platform effectively treats that unstructured and semi-structured text as a structured, context-specific database they can query to enable decision support.

Linguamatics was founded in 2001, is headquartered in Cambridge, UK with US operations in Boston, MA. The company is privately owned, profitable and growing, with I2E deployed at most top-10 pharmaceutical companies.


What are the three major challenges you see in search / content processing in 2009?

The obvious challenges I see include:

  • The ability to query across diverse high volume data sources, integrating external literature with in-house content. The latter content may be stored in collaborative environments such as SharePoint, and in a variety of formats including Word and PDF, as well as semi-structured XML.
  • The need for easy and affordable access to comprehensive content such as scientific publications, and being able to plug content into a single interface.
  • The demand by smaller companies for hosted solutions.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

People have traditionally been able to do simple querying across multiple data sources, but there has been an integration challenge in combining different data formats, and typically the rich structure of the text or document has been lost when moving between formats.

Publishers have tended to develop their own tools to support access to their proprietary data. There is now much more recognition of the need for flexibility to apply best of breed text mining to all available content.

Potential users were reluctant to trust hosted services when queries are business-sensitive. However, hosting is becoming more common, and a considerable amount of external search is already happening using Google and, in the case of life science researchers, PubMed.

What is your approach to problem solving in search and content processing?

Our approach encompasses all of the above. We want to bring the power of NLP-based text mining to users across the enterprise – not just the information specialists.  As such we’re bridging the divide between domain-specific, curated databases and search, by providing querying in context. You can query diverse unstructured and semi-structured content sources, and plug in terminologies and ontologies to give the context. The results of a query are not just documents, but structured relationships which can be used for further data mining and analysis.

Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar?

Our customers want scalability across the board – both in terms of the size of the document repositories that can be queried and also appropriate querying performance.  The hardware does need to be compatible with the task.  However, our software is designed to give valuable results even on relatively small machines.

People can have an insatiable demand for finding answers to questions – and we typically find that customers quickly want to scale to more documents, harder questions, and more users. So any text mining platform needs to be both flexible and scalable to support evolving discovery needs and maintain performance.  In terms of performance, raw CPU speed is sometimes less of an issue than network bandwidth especially at peak times in global organizations.

Information governance is gaining importance. Search / content processing is becoming part of eDiscovery or internal audit procedures. What’s your view of the role of search / content processing technology in these specialized sectors?

Implementing a proactive eDiscovery capability, rather than reacting to issues when they arise, is becoming a strategy to minimize potential legal costs. The forensic abilities of text mining are highly applicable to this area and have an increasing role to play in both eDiscovery and auditing. In particular, the ability to search for meaning and to detect even weak signals connecting information from different sources, along with provenance, is key.

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

Organizations are still challenged to maximize the value of what is already known – whether in internal documents, in published literature, on blogs, and so on. Even in global companies, text mining is not yet seen as a standard capability, though search engines are ubiquitous. This is changing, and I expect text mining to be increasingly regarded as best practice for a wide range of decision support tasks. We also see increasing requirements for text mining to become more embedded in employees’ workflows, including integration with collaboration tools.

Graphical interfaces and portals (now called composite applications) are making a comeback. Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009? What semantic considerations do you bring to your product and research activities?

Customers recognize the value of linking entities and concepts via semantic identifiers. There’s effectively a semantic engine at the heart of I2E and so semantic knowledge discovery is core to what we do.  I2E is also often used for data-driven discovery of synonyms, and association of these with appropriate concept identifiers.

In the life science domain commonly used identifiers such as gene ids already exist.  However, a more comprehensive identification of all types of entities and relationships via semantic web style URIs could still be very valuable.

Where can I find more information about your products, services, and research?

Please contact Susan LeBeau (susan.lebeau@linguamatics.com, tel: +1 774 571 1117) and visit www.linguamatics.com.

Stephen Arnold (ArnoldIT.com) and Harry Collier (Infonortics, Ltd.), February 16, 2009

Semantic Engines’ Dmitri Soubbotin: An Exclusive Interview

February 10, 2009

Semantics are booming. Daily I get spam from the trophy generation touting the latest and greatest in semantic technology. A couple of eager folks are organizing a semantic publishing system and gearing up for a semantic conference. I think these efforts are admirable, but I think that the trophy crowd confuses public relations with programming on occasion. Not Dmitri Soubbotin, one of the senior managers at Semantic Engines. Harry Collier and I were able to get the low-profile wizard to sit down and talk with us. Mr. Soubbotin’s interview with Harry Collier (Infonortics Ltd.) and me appears below.

Please keep in mind that Dmitri Soubbotin is one of the world-class experts in search, content processing, and semantic technologies who will be speaking at the April 2009 Boston Search Engine Meeting. Unlike fan-club conferences or SEO programs designed for marketers, the Boston Search Engine Meeting tackles substantive subjects in an informed way. The opportunity to talk with Mr. Soubbotin or any other speaker at this event is a worthwhile experience. The interview with Mr. Soubbotin makes clear the approach that the conference committee takes for the Boston Search Engine Meeting. Substance, not marketing hyperbole, is the focus of the two-day program. For more information and to register, click here.

Now the interview:

Will you describe briefly your company and its search / content processing technology?

Semantic Engines is mostly known for its search engine SenseBot (www.sensebot.net). The idea of it is to provide search results for a user’s query in the form of a multi-document summary of the most relevant Web sources, presented in a coherent order. Through text mining, the engine attempts to understand what the Web pages are about and extract key phrases to create a summary.

So instead of giving a collection of links to the user, we serve an answer in the form of a summary of multiple sources. For many informational queries, this obviates the need to drill down into individual sources and saves the user a lot of time. If the user still needs more detail, or likes a particular source, he may navigate to it right from the context of the summary.
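SenseBot’s text mining is proprietary; the sketch below only shows the general shape of extractive multi-document summarization, that is, score sentences from several sources against the query and stitch the best ones together. The sources, query, and overlap scoring are invented for illustration.

```python
import re

# Hypothetical snippets from three Web sources returned for one query.
sources = {
    "site-a": "Solid state drives store data in flash memory. They have no moving parts.",
    "site-b": "Unlike hard disks, solid state drives have no moving parts and lower latency.",
    "site-c": "Our shop sells drives, cables, and cases at low prices.",
}
query = "how do solid state drives work"
query_terms = set(query.lower().split())

def sentences(text):
    """Crude sentence splitter for the toy example."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Score each sentence by query-term overlap, then keep the top sentences.
scored = []
for site, text in sources.items():
    for sent in sentences(text):
        overlap = len(query_terms & set(sent.lower().split()))
        scored.append((overlap, site, sent))

summary = [s for overlap, _, s in sorted(scored, reverse=True)[:2] if overlap > 0]
print(" ".join(summary))   # a two-sentence "answer" drawn from two sources
```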

Strictly speaking, this is going beyond information search and retrieval – to information synthesis. We believe that search engines can do a better service to the users by synthesizing informative answers, essays, reviews, etc., rather than just pointing to Web sites. This idea is part of our patent filing.

Other things that we do are Web services for B2B that extract semantic concepts from texts, generate text summaries from unstructured content, etc. We also have a new product for bloggers and publishers called LinkSensor. It performs in-text content discovery to engage the user in exploring more of the content through suggested relevant links.

What are the three major challenges you see in search / content processing in 2009?

There are many challenges. Let me highlight three that I think are interesting:

First,  Relevance: Users spend too much time searching and not always finding. The first page of results presumably contains the most relevant sources. But unless search engines really understand the query and the user intent, we cannot be sure that the user is satisfied. Matching words of the query to words on Web pages is far from an ideal solution.

Second, Volume: The number of results matching a user’s query may be well beyond human capacity to review them. Naturally, the majority of searchers never venture beyond the first page of results – exploring the next page is often seen as not worth the effort. That means that a truly relevant and useful piece of content that happens to be number 11 on the list may become effectively invisible to the user.

Third, Shallow content: Search engines use a formula to calculate page rank. SEO techniques allow a site to improve its ranking through the use of keywords, often propagating a rather shallow site up on the list. The user may not know if the site is really worth exploring until he clicks on its link.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

Not understanding the intent of the user’s query, and matching words syntactically rather than by their sense – these are the key barriers preventing search engines from serving more relevant results. NLP and text mining techniques can be employed to understand the query and the Web pages’ content, and come up with an acceptable answer for the user. Analyzing Web page content on the fly can also help in distinguishing whether a page has value for the user or not.

Of course, the infrastructure requirements are higher when semantic analysis is used, raising the cost of serving search results. This may have been another barrier to broader use of semantics by major search engines.

What is your approach to problem solving in search and content processing? Do you focus on smarter software, better content processing, improved interfaces, or some other specific area?

Smarter, more intelligent software. We use text mining to parse Web pages and pull out the most representative text extracts, relevant to the query. We drop the sources that are shallow on content, no matter how highly they were ranked by other search engines. We then order the text extracts to create a summary that ideally serves as a useful answer to the user’s query. This type of result is a good fit for an informational query, where the user’s goal is to understand a concept or event, or to get an overview of a topic. The closer together the source documents are (e.g., in a vertical space), the higher the quality of the summary.

Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?

More and more, people expect to have the same features and user interface when they search at work as they get at home. The underlying difference is that behind the firewall the repositories and taxonomies are controlled, as opposed to the outside world. On one hand, this makes it easier for a search application within the enterprise: it narrows the focus, and the accuracy of search can be higher. On the other hand, additional features and expertise are required compared to Web search. In general, I think the opportunities in the enterprise are growing for standalone search providers with unique value propositions.

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

I think the use of semantics and intelligent processing of content will become more ubiquitous in 2009 and beyond. For years, it has been making its way from academia to “alternative” search engines, occasionally showing up in the mainstream. I think we are going to see much higher adoption of semantics by major search engines, first of all Google. Things have definitely been in the works, showing up as small improvements here and there, but I expect a critical mass of experimenting to accumulate and overflow into standard features at some point. This will be a tremendous shift in the way search is perceived by users and implemented by search engines. The impact on SEO techniques that are primarily keyword-based will be huge as well. I am not sure whether this will happen in 2009, but certainly within the next 36 months.

Graphical interfaces and portals (now called composite applications) are making a comeback. Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009? What semantic considerations do you bring to your product and research activities?

I expect to see higher proliferation of Semantic Web and linked data. Currently, the applications in this field mostly go after the content that is inherently structured although hidden within the text – contacts, names, dates. I would be interested to see more integration of linked data apps with text mining tools that can understand unstructured content. This would allow automated processing of large volumes of unstructured content, making it semantic web-ready.

Where can we find more information about your products, services, and research?

Our main sites are www.sensebot.net and www.semanticengines.com. LinkSensor, our tool for bloggers/publishers is at www.linksensor.com. A more detailed explanation of our approach with examples can be found in the following article:
http://www.altsearchengines.com/2008/Q7/22/alternative-search-results/.

Stephen Arnold (Harrod’s Creek, Kentucky) and Harry Collier (Tetbury, Glou.), February 10, 2009

Lexalytics’ Jeff Caitlin on Sentiment and Semantics

February 3, 2009

Editor’s Note: Lexalytics is one of the companies that is closely identified with analyzing text for sentiment. When a flow of email contains a negative message, Lexalytics’ system can flag that email. In addition, the company can generate data that provides insight into how people “feel” about a company or product. I am simplifying, of course. Sentiment analysis has emerged as a key content processing function, and like other language-centric tasks, the methods are of increasing interest.

Jeff Caitlin will speak at what has emerged as the “must attend” search and content processing conference in 2009. The Infonortics Boston Search Engine Meeting features speakers who have an impact on sophisticated search, information processing, and text analytics. Other conferences respond to public relations; the Infonortics conference emphasizes substance.

If you want to attend, keep in mind that attendance at the Boston Search Engine Meeting is limited. To get more information about the program, visit the Infonortics Ltd. Web site at www.infonortics.com or click here.

The exclusive interview with Jeff Caitlin took place on February 2, 2009. Here is the text of the interview conducted by Harry Collier, managing director of Infonortics and the individual who created this content-centric conference more than a decade ago. Beyond Search has articles about Lexalytics here and here.

Will you describe briefly your company and its search / content processing technology?

Lexalytics is a Text Analytics company that is best known for our ability to measure the sentiment or tone of content. We plug in on the content processing side of the house, and take unstructured content and extract interesting and useful metadata that applications like Search Engines can use to improve the search experience. The types of metadata typically extracted include: Entities, Concepts, Sentiment, Summaries and Relationships (Person to Company for example).

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

The simple fact that machines aren’t smart like people and don’t actually “understand” the content they are processing… or at least they haven’t to date. The new generation of text processing systems has advanced grammatical parsers that are allowing us to tackle some of the nasty problems that have stymied us in the past. One such example is anaphora resolution, sometimes referred to as “pronominal reference”, which is a bunch of big, confusing-sounding words for the understanding of pronouns. Take the sentence, “John Smith is a great guy, so great that he’s my kids’ godfather and one of the nicest people I’ve ever met.” For people this is a pretty simple sentence to parse and understand, but for a machine it has given us fits for decades. Now, with grammatical parsers, we understand that “John Smith” and “he” are the same person, and we also understand who the speaker is and what the subject is in this sentence. This enhanced level of understanding is going to improve the accuracy of text parsing and allow for a much deeper analysis of the relationships in the mountains of data we create every day.
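Production anaphora resolution relies on full grammatical parses, as Mr. Caitlin notes. The toy heuristic below (link a pronoun to the most recently mentioned person) is not Lexalytics’ method; it only makes concrete what resolving “he” to “John Smith” means.

```python
import re

sentence = ("John Smith is a great guy, so great that he's my kids' godfather "
            "and one of the nicest people I've ever met.")

# Naive person detection: capitalized First Last pairs. Illustration only.
people = re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", sentence)

resolved = {}
for match in re.finditer(r"\b(he|she|his|her)\b", sentence, flags=re.IGNORECASE):
    # Toy rule: link the pronoun to the nearest person mentioned before it.
    prior = [p for p in people if sentence.find(p) < match.start()]
    if prior:
        resolved[match.group(0)] = prior[-1]

print(resolved)   # {'he': 'John Smith'}
```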

What is your approach to problem solving in search and content processing? Do you focus on smarter software, better content processing, improved interfaces, or some other specific area?

Lexalytics is definitely on the better content processing side of the house. Our belief is that you can only go so far by improving the search engine… eventually you have to make the data better to improve the search experience. This is 180 degrees apart from Google, which focuses exclusively on the search algorithms. That works well for Google in the Web search world, where you have billions of documents at your disposal, but it hasn’t worked as well in the corporate world, where finding information isn’t nearly as important as finding the right information and helping users understand why it’s important and who understands it. Our belief is that metadata extraction is one of the best ways to learn the “who” and “why” of content so that enterprise search applications can really improve the efficiency and understanding of their users.

With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search / content processing?

For Lexalytics, the adverse business climate has altered the mix of our customers but to date has not affected the growth in our business (Q1 2009 should be our best ever). What has clearly changed is the mix of customers investing in search and content processing: we typically run about two-thirds small companies and one-third large companies. In this environment we are seeing a significant uptick in large companies looking to invest as they seek to increase their productivity. At the same time, we’re seeing a significant drop in the number of smaller companies looking to spend on text analytics and search. The net-net of this is that, if anything, search appears to be one of the areas that will do well in this climate, because data volumes are going up and staff sizes are going down.

Microsoft acquired Fast Search & Transfer. SAS acquired Teragram. Autonomy acquired Interwoven and Zantaz. In your opinion, will this consolidation create opportunities or shut doors. What options are available to vendors / researchers in this merger-filled environment?

As one of the vendors that works closely with two of the three major enterprise search vendors, we see these acquisitions as a good thing. FAST, for example, seems to be a well-run organization under Microsoft, and they seem to be very clear on what they do and what they don’t do. This makes it much easier for both partners and smaller vendors to differentiate their products and services from those of the larger players. As an example, we are seeing a significant uptick in leads coming directly from the enterprise search vendors that are looking to us for help in providing sentiment/tone measurement for their customers. Though these mergers have been good for us, I suspect that won’t be the case for all vendors. We work with the enterprise search companies rather than against them; if you compete with them, this may make it even harder to be considered.

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

The biggest change is going to be the move away from entities that are explicitly stated within a document toward a more ‘fluffy’ approach. This encompasses directly stated relationships – “Joe works at Big Company Inc.” – but it also encompasses being able to infer the same information from a less direct statement: “Joe got in his car and drove, like he did every day, to his job at Big Company Inc.” It also covers things like processing reviews and understanding that sound quality is a feature of an iPod from the context of the document, rather than from a specific list. And it encompasses things of a more semantic nature, such as understanding that a document talking about Congress is also talking about government, even though government might not be explicitly stated.

Graphical interfaces and portals (now called composite applications) are making a comeback. Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009? What semantic considerations do you bring to your product and research activities?

One of the key uses of semantic understanding in the future will be in understanding what people are asking about or complaining about in content. It’s one thing to measure the sentiment for an item that you’re interested in (say a digital camera), but it’s quite another to understand the items that people are complaining about while reviewing a camera and noting that “the battery life sucks”. We believe that joining the subject of a discussion to the tone of that discussion will be one of the key advancements in semantic understanding over the next couple of years.
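A minimal sketch of tying sentiment to the thing being discussed (aspect-level sentiment) follows; the word lists and the proximity rule are invented and far cruder than the grammatical approach Lexalytics describes.

```python
# Toy aspect-level sentiment: pair each known product feature with the
# nearest opinion word in the same review. All lists are invented examples.
aspects = {"battery life", "sound quality", "screen"}
opinions = {"sucks": -1, "great": 1, "terrible": -1, "excellent": 1}

review = "The sound quality is great but the battery life sucks."
tokens = review.lower().replace(".", "").split()

findings = []
for aspect in aspects:
    words = aspect.split()
    for i in range(len(tokens) - len(words) + 1):
        if tokens[i:i + len(words)] == words:
            # Look a few tokens to the right of the aspect for an opinion word.
            window = tokens[i + len(words): i + len(words) + 3]
            for w in window:
                if w in opinions:
                    findings.append((aspect, "positive" if opinions[w] > 0 else "negative"))

print(sorted(findings))  # [('battery life', 'negative'), ('sound quality', 'positive')]
```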

Where can I find out more about your products, services and research?

Lexalytics can be found on the Web at www.lexalytics.com. Our Web log discusses our thoughts on the industry: www.lexalytics.com/lexablog. A downloadable trial is available here. We have also prepared a white paper, and you can get a copy here.

Harry Collier, February 3, 2009

Frank Bandach, Chief Scientist, eeggi on Semantics and Search

February 2, 2009

An Exclusive Interview by Infonortics Ltd. and Beyond Search

Harry Collier, managing director and founder of the influential Boston Search Engine Meeting, interviewed Frank Bandach, chief scientist of eeggi, a semantic search company, on January 27, 2009. eeggi has maintained a low profile. The interview with Mr. Bandach is among the first public descriptions of the company’s view of the fast-changing semantic search sector.

The full text of the interview appears below.

Will you describe briefly your company and its search technology?

We are a small new company implementing our very own new technology. Our technology is framed in a rather controversial theory of natural language, exploiting the idea that language itself is a predetermined structure, and that as we grow, we simply feed it new words to increase its capabilities and its significance. In other words, our brains did not learn to speak; we were rather destined to speak. Scientifically speaking, eeggi is a mathematical clustering structure which models natural language, and therefore some portions of rationality itself. Objectively speaking, eeggi is a linguistic reasoning and rationalizing analysis engine. As a linguistic reasoning engine, it is then only natural that we find ourselves cultivating search, but also other technological fields such as speech recognition, concept analysis, responding, irrelevance removal, and others.

What are the three major challenges you see in search in 2009?

The way I perceive this, many of the challenges facing search in 2009 (irrelevance, nonsense, and ambiguity) are the same ones that were faced in previous years. I think that our awareness and demands are simply increasing, and thus call for smarter and more accurate results. This is, after all, the history of evolution.

With search decades old, what have been the principal barriers to resolving these challenges in the past?

These problems (irrelevance, nonsense, and ambiguity) are currently being addressed through Artificial Intelligence. However, AI is branched into many areas and disciplines, and AI is also currently evolving and changing. Our approach is unique and follows a completely different attitude, or if I may say, spirit from that of current AI disciplines.

What is your approach to problem solving in search? Do you focus on smarter software, better content processing, improved interfaces, or some other specific area?

Our primary approach is machine intelligence focused on zero irrelevance, while allowing for synonyms, similarities, rational disambiguation of homonyms or multi-conceptual words, treatment of collocations as unit concepts, grammar, rationality, and finally information discovery.
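eeggi has not published its internals, so the sketch below only illustrates one item in that list: treating collocations as unit concepts, so that a multi-word expression is handled as a single token before any further analysis. The collocation list is invented for illustration and stands in for whatever structures a linguistic engine builds internally.

```python
# Illustrative sketch of treating collocations as unit concepts: known
# multi-word expressions are folded into single tokens before any further
# analysis. The collocation list is invented for illustration.

COLLOCATIONS = {("hot", "dog"), ("new", "york"), ("search", "engine")}

def tokenize_with_collocations(text: str) -> list[str]:
    """Greedily merge known two-word collocations into single tokens."""
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in COLLOCATIONS:
            tokens.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

if __name__ == "__main__":
    print(tokenize_with_collocations("The hot dog stand near the search engine meeting"))
    # ['the', 'hot_dog', 'stand', 'near', 'the', 'search_engine', 'meeting']
```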

With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search?

The immediate impact of a weak economy affects all industries, but before long the impact will be absorbed and will disappear. The future belongs to technology. This is indeed the principle that was ignited long ago with the industrial revolution. It is true, the world faces many challenges ahead, but technology is the reflection of progress, and technology is uniting us day by day, allowing, and at times forcing us, to understand, accept, and admit our differences. For example, unlike ever before, the United States and India are now becoming virtual neighbors thanks to the Internet.

Search systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search becoming increasingly integrated into enterprise applications? If yes, how will this shift affect the companies providing stand alone search / content processing solutions? If no, what do you see the role of standalone search / content processing applications becoming?

From our standpoint, search, translation, speech recognition, machine intelligence, … for all matters, language, all fall under a single umbrella which we identify through a Linguistic Reasoning and Rationalization Analysis engine we call eeggi.

Is that an acronym?

Yes. eeggi is shorthand for “engineered, encyclopedic, global and grammatical identities”.

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

I truly believe that users will become more and more critical of irrelevance and the quality of their results. New generations will be, and are, more aware and demanding of machine performance. For example, while in my youth two little bars and a square in the middle represented a tennis match, and it was an exciting experience, by today’s standards presenting the same scenario to a kid would become a laughing matter. As newer generations move in, foolish results will not form part of their minimum expectations.

Mobile search is emerging as an important branch of search. Mobile search, however, imposes some limitations on presentation and query submission. What are your views of mobile search’s impact on more traditional enterprise search / content processing?

This is a very interesting question… The functionalities and applications of several machines inevitably begin to merge the very instant that technology permits miniaturization, or when a single machine can efficiently evolve and support the applications of the others. Most of the time, it is the smallest machine that wins. It is going to be very interesting to see how cell phones move, more and more, into fields that were reserved exclusively for computers. It is true that cell phones by nature need to integrate small screens, but new folding screens and even projection technologies could make for much larger screens, and as Artificial Intelligence takes on challenges before only available through user performance, screens themselves may move into a secondary function. After all, you and I are now talking without using any visual aid or, for that matter, a screen.

Where can I find more information about your products, services, and research?

We are still a bit in stealth mode, but we have a Web site (eeggi.com) that displays and discusses some basic information. We hope that by November 2009 we will have built sufficient linguistic structures to allow eeggi to move into automatic learning of other languages with little, or possibly no, aid from natural speakers or human help.

Thank you.

Harry Collier, Managing Director, Infonortics Ltd.

A Shadow Falls on Search Related Conferences

January 27, 2009

I had a couple of conversations yesterday with conference organizers who told me that I was completely wrong in my opinion piece “Conference Spam or Conference Prime Rib” here. I enjoy a lively debate. I like intense discussions even more when I have no interest whatsoever in the trials and tribulations of conference organizers. My point remains valid; that is, in a lousy economy, conferences that don’t deliver value will be big losers. Forget the monetary side of the show.

With conferences struggling to survive and some big outfits allegedly cutting back in fuzzy-wuzzy business sectors like content management, the caption “When your best just isn’t good enough” is apt in my opinion. Image source: http://aviationweek.typepad.com/photos/uncategorized/2007/05/21/failure.gif

I received this thoughtful post as well:

I’m also curious how my own touting of the SIGIR Industry Track fares vis-à-vis your spam filter. I’m personally excited to be involved with an event that is not beholden to any vendor or analyst, but rather to the world’s most reputable organization in the area of information retrieval: the Association for Computing Machinery Special Interest Group on Information Retrieval (SIGIR).

More details about the event are forthcoming, but let me share an important one: none of the speakers or their employers are paying to be on the agenda. Rather, the agenda consists of invited talks and panels, vetted by the SIGIR Organizing Committee (http://www.sigir2009.org/about/organizers). The model for the event is last year’s CIKM Industry Event (http://www.cikm2008.org/industry_event.php), but I’ll be so bold as to say we’re stepping it up a notch.

This isn’t a vendor user conference like Endeca Discover or FASTForward, nor is it a “vendor-neutral” conference in name only, where vendors, analysts, and consultants are paying for air time. And, while it won’t be free, it is being run by a non-profit organization whose goal is to serve the community, not to line its pockets. I hope that you and others will support this welcome change.

When people run conferences that don’t have magnetism, the real losers are the attendees who spend money and invest time to hear lousy speakers or sales pitches advertised as original, substantive talks. The other losers are exhibitors who can spend $10,000 on a minimal exhibit and get zero sales leads. In fact, there are negatives to lousy shows; to wit:

  • Attendees don’t learn anything useful, or attendees hear speakers who simply don’t know of what they speak. That’s okay when the presenter is a luminary like Steve Ballmer or Werner Vogels. But when a session on “Tips to Reduce the Cost of Enterprise Search” turns out to be a rehash of how “easy” and “economical” a Google Search Appliance is, I leave the session. It’s baloney. Attendees don’t learn anything from these talks, but there is often desperation among the organizing committee to find someone who will show up and do a basic talk. The notion of “quality” is often secondary to thoughts about the speaker’s turning up on the podium.
  • Exhibitors don’t make sales. Enough said.
  • Exhibitors find themselves either [a] talking to competitors because there’s no traffic in the exhibit hall or [b] watching their employees talk with other companies’ senior management and maybe landing a new job. Either of these situations is one that will make a vendor pull out of a trade show.
  • Media who actually show up and attend a session don’t find a story. The trade show is, therefore, a media non-event. I can name one big, confused show in Boston that suffers this ignominy. What began as a show about microfilm now tries to embrace everything from photocopying to enterprise content management to business intelligence. Crazy. No one knows what the show is about, so the media avoid it. Heck, I avoid it.

What must conferences do to avoid this problem?

