Connectors: A Big Deal for Enterprise Search

June 22, 2008

In my travels on June 17 to June 20, 2008, I participated in three conversations about connectors. A connector is a program that converts a source document in one format to some other format. The idea is that a connector makes it possible for a content processing system to ingest content with a minimum of computational hassle.

For some reason, connectors are the topic of the moment in the circles in which I move. In 2005, I discovered a useful white paper on the Web site of Persistent. I had to dig the link out in order to send it to the people with whom I met last week, and I thought some readers of Beyond Search might find it useful as well.

The Persistent document is “Unified Connectors for Enterprise Search Softwares.” You can download the document here. An individual is not identified as the author. I found the write up useful.

Persistent is a firm providing software consulting, engineering, and outsourcing. Persistent makes additional information available at its Web site here.

Stephen Arnold, June 22, 2008

Written by Stephen E. Arnold · Filed Under Enterprise, News, Search

Comments

7 Responses to “Connectors: A Big Deal for Enterprise Search”

Andreas ringdal on June 22nd, 2008 1:03 am

“A connector is a program that converts a source document in one format to some other format.”
I disagree with this definition. I’d rather say it is a program that connects to a source system to retrieve the documents. Converting the actual documents is done by separate programs/libraries like Oracle’s Outside In.
In situations where the source system does not contain documents/files, the connector provides the text ready for the Indexer.

A file system connector merely retrieves the documents from the source server and leaves it to the import module to convert them.

Andreas
Stephen E. Arnold on June 22nd, 2008 9:00 am

Andreas, thank you. Let’s see what other inputs arrive. I will then update the definitions to reflect these contributions. Please, keep providing input. I want to balance my opinions with useful reference information.

Stephen Arnold, June 22, 2008
Otis Gospodnetic on June 22nd, 2008 9:51 pm

The author is Swapnil A. Paranjpe, no? The name is in the paper.

Andreas has a point. It is one thing to connect to different information/document stores (e.g. a mail server is a document store of sorts, and you communicate with it via POP3, IMAP, or some other protocol), and another to process what comes out of the data stores and convert it to a format suitable for, say, indexing. Such converters are also commonly called simply “parsers”, or “rich document parsers”. Chapter 7 in Lucene in Action (see http://www.amazon.com/Lucene-Action-Otis-Gospodnetic/dp/1932394281 ), for example, has a little framework for doing the parsing and subsequent indexing of “rich documents” (e.g. PDF, MS Word, RTF, XML, HTML…).
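To make the split discussed above concrete, here is a minimal Python sketch. It is not taken from the Persistent paper, from Lucene in Action, or from any product named in this thread. The names `FileSystemConnector`, `parse_html`, `parse_plain`, `PARSERS`, `ingest`, and `demo` are hypothetical, and the sketch assumes UTF-8 files on a local file system. The connector only fetches raw bytes; separate parsers turn those bytes into index-ready text.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class FileSystemConnector:
    """Hypothetical connector: retrieves raw bytes and does no conversion."""

    def __init__(self, root):
        self.root = Path(root)

    def fetch(self):
        # Yield (document id, raw bytes) for every file under the root.
        for path in sorted(self.root.rglob("*")):
            if path.is_file():
                yield str(path.relative_to(self.root)), path.read_bytes()


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_html(raw):
    # Parser step: strip markup and normalize whitespace.
    extractor = _TextExtractor()
    extractor.feed(raw.decode("utf-8", errors="replace"))
    extractor.close()
    return " ".join(" ".join(extractor.chunks).split())


def parse_plain(raw):
    # Parser step for plain text: decode and normalize whitespace.
    return " ".join(raw.decode("utf-8", errors="replace").split())


# Assumed mapping from file extension to parser; a real system would
# cover many more formats (PDF, MS Word, RTF, and so on).
PARSERS = {".html": parse_html, ".txt": parse_plain}


def ingest(connector):
    # Retrieval and conversion stay separate: the connector supplies bytes,
    # and the parser chosen by extension supplies index-ready text.
    index_ready = {}
    for doc_id, raw in connector.fetch():
        parser = PARSERS.get(Path(doc_id).suffix.lower())
        if parser is not None:  # formats without a parser are skipped
            index_ready[doc_id] = parser(raw)
    return index_ready


def demo():
    # Build a throwaway source "system" and run it through the pipeline.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.html").write_text(
            "<html><body><h1>Title</h1><p>Hello world</p></body></html>",
            encoding="utf-8",
        )
        Path(tmp, "b.txt").write_text("plain  text\nnote", encoding="utf-8")
        Path(tmp, "c.bin").write_bytes(b"\x00\x01")
        return ingest(FileSystemConnector(tmp))
```

Under this arrangement, swapping the file system for a mail server or a database would mean writing a new connector with the same `fetch` shape, while the parsers stay untouched.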
Stephen E. Arnold on June 23rd, 2008 12:57 pm

Otis, thanks for the link. Keep helping me and the readers stay on the right track.
Stephen Arnold, June 23, 2008
Charlie Hull on June 24th, 2008 8:07 am

There is a way of extracting text and structure data from files of many formats built in to most modern versions of Windows: the Microsoft IFilter interface standard (http://www.ifilter.org/). For database connectors, ODBC is useful. Of course, you also need a good way to map the fields in a document into metadata useful for searching (e.g. the document title or summary). Some third parties even provide free converters using this interface (e.g. Adobe make a freely downloadable PDF IFilter plugin).

There are also numerous open source text converters: for example AbiWord provides ways of converting a lot of word processor formats. On Unix systems these can often be the only non-proprietary way of extracting text for indexing.
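The field-to-metadata mapping mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FIELD_MAP` and `map_fields` are hypothetical names, the source field names are made up for the example, and none of this reflects the IFilter or ODBC interfaces themselves.

```python
# Assumed mapping from source field names (lowercased) to the metadata
# fields a search index might expose. The entries are illustrative only.
FIELD_MAP = {
    "subject": "title",
    "dc:title": "title",
    "title": "title",
    "description": "summary",
    "abstract": "summary",
    "author": "author",
    "creator": "author",
}


def map_fields(source_fields, field_map=FIELD_MAP):
    """Map raw source fields to search metadata; the first non-empty value wins."""
    metadata = {}
    for name, value in source_fields.items():
        target = field_map.get(name.strip().lower())
        if target is None or value is None:
            continue  # unmapped or missing fields are dropped
        text = str(value).strip()
        if text and target not in metadata:
            metadata[target] = text
    return metadata
```

For example, `map_fields({"Subject": "Q2 report", "Creator": " J. Doe ", "Pages": 12})` keeps the subject as the title and the creator as the author, and drops the page count because nothing maps it.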
Swapnil Paranjpe on June 30th, 2008 2:50 pm

Let me be very frank and clarify: (1) a lot has changed over the last 3 years in the data integration space in general and in enterprise content search in particular, so some sections of my white paper are probably overdue for updates; (2) my understanding of this space has also evolved over time, so that should be factored in as well.

In the context of enterprise search, we define ‘connectors’ (or adaptors, integration components, or extenders, as they are interchangeably called) as software components that extend the rich functionality of an enterprise search product (e.g. Google Search Appliance, Oracle Secure Enterprise Search, or Thunderstone) to search content that natively resides inside enterprise applications, ranging from pure-play content management systems to other applications like ERPs, CRMs, helpdesks, proprietary apps, etc.
In doing so, we do not restrict ourselves to unstructured content but also look at business-driven structured entities. The primary goal of this endeavor is to extend the reach of information and to support informed, faster decision making in enterprises.
Converting a file attachment from a proprietary or standard format to an ‘index’-ready format is only a small part of the exercise (other critical parts being managing enterprise security, data currency, optimizing data flow, run-time access, distinguishing content that is seen vs. indexed, etc.). The conversion is usually already done by the commercial products (with which we normally partner and integrate). For open source platforms like Lucene, there may be additional work to be done.
Stephen E. Arnold on July 2nd, 2008 8:58 am

Swapnil, thanks for the information. Why not update the paper and let me post it?

Stephen Arnold, July 2, 2008


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.