Connectors: Rounding Up Some Definitions

June 22, 2008

I received an email this morning (June 22, 2008). The writer asked, “Are connectors the same as filters?” As I walked the world’s most wonderful dog, I considered this question. This short essay is a summary of my thoughts. If you have other concepts and definitions to add, please, use the comments section to share them with me and the three other readers of this Web log. Ooops. There may be four readers. I sent a link to my father and he often looks at what I write.

Connectors

Let us look at what Google provides as a definition. Enter the query “define:connectors” and Google returns nine definitions. A quick scan of the links and the text snippets provides a useful starting point; specifically:

A growing collection of libraries that abstract the interfaces of specific hardware or enterprise integration methods. (More here)

Google offers a number of related phrases to assist me, but none of these seem to relate to enterprise search, content processing, or text analytics.

[Image: macro photograph of pyrite, or fool’s gold]

Is this gold or fool’s gold? Without a formal method for testing, even experienced rock hounds may not know what the substance is. Software can deliver valuable functions or deliver a lower value operation.

File Conversion

Language is tricky, and business English is a slippery type of language. I know, for example, that some companies provide file conversion tools. So, what is the meaning of file conversion? For a definition, I turn to one of the vendors offering file conversion software. I enter the query “stellent file conversion” into Google and get a pointer to Stellent’s “Dynamic Conversion Process”. This is a function that takes content from a Web page or other source and makes it Web viewable.

I recalled licensing software under the name “Outside In”, and my recollection was that Stellent bought this company and continued to sell the product. My recollection is that Stellent’s software components could take a file in one format such as XyWrite III+ and convert it to Microsoft Rich Text Format. Few people today need to convert XyWrite files, but I have heard that the US House of Representatives still has some XyWrite files kicking around.

I consulted my digital archive and located this explanation of the software. I am going to excerpt the description to pull out the key point: The technology gives developers the ability to “view, filter and convert more than 225 file formats without using native applications.”

After a bit of poking around I located a description of Outside In on the Oracle Web site here. Oracle purchased Stellent, and you can license the software from Oracle. The most recent version of the Outside In software performs a number of functions important to file conversion, which seems to be the main thrust of Oracle’s description of the Outside In technology; specifically Oracle says:

  • Clean Content—Identifies and scrubs risky hidden data from Microsoft Office documents
  • Content Access—Extracts text and metadata from more than 400 file types
  • File ID—Quickly and accurately identifies file types
  • HTML Export—Converts files into HTML rendering embedded graphics as a GIF, JPEG, or PNG
  • Image Export—Converts files into TIFF, JPEG, BMP, GIF, or PNG images
  • PDF Export—Converts files into PDF without native applications or 3rd party libraries
  • Search Export—Converts files into one of four formats designed specifically for search
  • Viewer—Renders high-fidelity views of files and allows printing, copy/paste, and annotations
  • XML Export—Converts and normalizes files into XML that defines properties, content, and structure

Oracle’s checklist provides a good round up of the bits and pieces that comprise file conversion functions. It seems that we have a definition of sorts. Note: Oracle provides a useful 2007 white paper to help you navigate through the sub concepts embedded in the Outside In system here.

File conversion–Software that performs a number of separate operations to change a file in one format to another format. The purpose of file conversion is to eliminate the need to open a native application such as XyWrite to export a file in a different format.
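To make the definition concrete, here is a minimal sketch of a file conversion in Python. It changes comma separated text into an HTML table without opening a spreadsheet program. The function name `csv_to_html_table` is my own invention for illustration; it is not part of Outside In or any vendor's product, and a commercial converter handles far more formats and edge cases.

```python
import csv
import html
import io


def csv_to_html_table(csv_text: str) -> str:
    """Convert CSV text to an HTML table; the first row is treated as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return "<table></table>"
    header, body = rows[0], rows[1:]
    parts = ["<table>"]
    # Escape each cell so characters such as & and < survive the trip to HTML.
    parts.append("<tr>" + "".join(f"<th>{html.escape(c)}</th>" for c in header) + "</tr>")
    for row in body:
        parts.append("<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)


print(csv_to_html_table("name,office\nSmith & Co,Louisville\n"))
```

Even this toy shows the "separate operations" in the definition: read the source format, keep the structure, escape the content, and write the target format.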

But what about information in a database like IBM’s DB2, Microsoft’s SQL Server, and Oracle’s database? Well, these systems are widely used in organizations, and it is easy for a database administrator to export a relational database table as a comma separated value file, the so-called CSV format. Also, many systems can “read” database files or database reports. But I have heard that these features do not work on certain types of information stored in a database; for example, the database contains row and column headings that are not plain English or the cells in the database are filled with numerical strings that are codes.

One work around is to write a report, query the database, save the answer to the query as HTML or XML and then process those HTML or XML files as individual documents. But that seems like a great deal of work. What happens to those cryptic row and column headings? What does the report do to make the values in the cells understandable to a human?
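The work around can be sketched in a few lines of Python. This is an illustration under loud assumptions: the table `acct`, the cryptic columns `cst_nm` and `st_cd`, and both lookup dictionaries are hypothetical, and I use an in-memory SQLite database in place of DB2, SQL Server, or Oracle. The point is that someone has to supply the mapping from cryptic headings and codes to words a human can read.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical lookup tables; a real project would get these from the database owner.
COLUMN_LABELS = {"cst_nm": "customer_name", "st_cd": "status"}
STATUS_CODES = {"01": "active", "02": "closed"}


def rows_to_xml_documents(conn, query):
    """Run a query and return one XML string per row, with readable element names."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    documents = []
    for row in cursor:
        record = ET.Element("record")
        for column, value in zip(columns, row):
            label = COLUMN_LABELS.get(column, column)
            text = str(value)
            if column == "st_cd":
                # Replace the numeric code with its plain English meaning when known.
                text = STATUS_CODES.get(text, text)
            ET.SubElement(record, label).text = text
        documents.append(ET.tostring(record, encoding="unicode"))
    return documents


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acct (cst_nm TEXT, st_cd TEXT)")
conn.execute("INSERT INTO acct VALUES ('Acme Corp', '01')")
for document in rows_to_xml_documents(conn, "SELECT cst_nm, st_cd FROM acct"):
    print(document)
```

Each row becomes its own small XML document that a search system could process. The lookup dictionaries are where the "great deal of work" hides.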

We don’t need file conversion. We need another process. What is it called?

Transformation

EMC, the company “where information lives,” offers what it calls “content transformation services.” The company defines its software and services in this way:

EMC Documentum Content Transformation Services make it easy to transform content into formats for multiple channels such as the web, print, mobile phones, and video broadcast. Each product within the content transformation services suite focuses on a specific set of content formats and uses a common, robust framework. When combined, they offer you the ability to transform common desktop documents and rich media formats. The result is more efficient and standardized content transformation and analysis for content across the entire organization.

Please, explore the EMC information at your leisure, but I did not think that the problem of transforming in the context of filtering is exactly what I need.

We have a database stuffed full of XML, and we need to create a version of that XML that makes sense to a search engine and to a human. We don’t want individual paragraphs indexed as if each was a separate document. We want to be able to view a paragraph without losing the connection to the complete document. Also, the row and column headings yield index terms. We also want to index the entire document in order to retrieve the name of a person or a specific phrase like “White House”.
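Here is a rough sketch of that requirement in Python. The element names `document` and `para`, the `id` attribute, and the record layout are my assumptions for illustration only; no particular search engine expects this shape. The sketch builds one record for the whole document and one record per paragraph, and each paragraph record keeps a pointer to its parent.

```python
import xml.etree.ElementTree as ET

# A made-up sample document; real content would come out of the database.
SAMPLE = """<document id="doc-1">
  <title>Briefing</title>
  <para>The White House issued a statement.</para>
  <para>Staff reviewed the filing.</para>
</document>"""


def build_index_records(xml_text):
    """Return one whole-document record plus one record per paragraph.

    Each paragraph record carries the parent document id, so a hit on a
    paragraph can still be traced back to the complete document.
    """
    root = ET.fromstring(xml_text)
    doc_id = root.get("id")
    paragraphs = [p.text or "" for p in root.iter("para")]
    records = [{"id": doc_id, "parent": None, "text": " ".join(paragraphs)}]
    for number, text in enumerate(paragraphs, start=1):
        records.append({"id": f"{doc_id}#p{number}", "parent": doc_id, "text": text})
    return records


for record in build_index_records(SAMPLE):
    print(record)
```

A query for “White House” can match the whole-document record or the first paragraph record, and the `parent` field leads from the paragraph back to the full document.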

What is the term for this type of conversion?

Back to Google with this query: “database transformation xml”. Google provided 337,000 links, so I did what most people do. I clicked on the first five or six items in the result set and hit upon a useful explanation of the concept and a software tool that would provide the function I needed; namely, an essay on O’Reilly’s XML.com Web site called “DB/XML Transform”. The essay is a product profile, and it provided this nugget:

DB/XML Transform 2.0 provides a powerful engine for bi-directional data transformation between XML, database and text formats in any combination. Through XML, transformation from or to other formats such as HTML, XHTML, WML, EDI can be easily achieved. It is best suited for retrieving data from your database and formatting them in XML format in whatever the way you want, and vice versa. It is ideal for building corporate portals, B2B applications, data exchange and database integration solutions.

You can find more detail on the DataMirror Corp. Web site here. What interested me is that this product is available from IBM. A little sleuthing revealed that IBM purchased DataMirror in 2007, a fact which I did not know.

We can now rough out a definition of data transformation:

Data transformation–A series of processes that allow structured data to be retrieved and formatted as XML.

I am not exactly clear on how file conversion differs from data transformation, but it is clear that IBM perceived the significance of DataMirror. IBM said in its 2007 news release:

Following completion of the acquisition, IBM intends to integrate DataMirror with IBM’s Information Management Software unit led by General Manager Ambuj Goyal; employ DataMirror software to support IBM Information Server, IBM’s first-of-a-kind information integration platform, making it easier for clients to apply real-time data integration techniques from a single platform across their businesses; [and] utilize DataMirror technology to bring heterogeneous real time change data capture to clients.

One key point is that IBM wants to handle conversion in real time. The idea that data are copied to a storage device and then processed in batches is too slow for some applications.

Filters

In my pass through my information about file conversion, I noted the word filters. A filter, according to notes I wrote to myself several years ago, is a software script that states the criteria used to determine which records are selected from a database for a report. A simpler definition might be:

Filter–Criteria for extracting specific data from a source.
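A tiny Python sketch shows the simpler definition at work. The function `apply_filter` and the sample records are hypothetical and of my own making. Note that this illustrates a filter as selection criteria; it does not illustrate the format-reading “filters” that file conversion vendors ship.

```python
def apply_filter(records, criteria):
    """Return the records whose fields equal every value named in criteria."""
    return [
        record for record in records
        if all(record.get(field) == wanted for field, wanted in criteria.items())
    ]


records = [
    {"author": "Smith", "format": "XyWrite"},
    {"author": "Jones", "format": "RTF"},
    {"author": "Smith", "format": "RTF"},
]

# Select only the RTF files written by Smith.
selected = apply_filter(records, {"author": "Smith", "format": "RTF"})
print(selected)
```

The criteria are data, not code, so the same function can serve many different reports.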

Observations

In reviewing the information I keep about processes required for enterprise search, it became clear to me that I did not include sufficient detail in the glossaries in either the first three editions of the Enterprise Search Report or the more recent Beyond Search study. My flaw was assuming that I knew about the nuances of taking a digital object in format A and reworking it to format B. I will work to fill in the gaps in my understanding because I don’t think these three terms are comprehensive. Other observations include:

  • Defining terms for enterprise search, content conversion, and text processing is tough. The meanings boil down to specific functions in software required by specific applications. The definitions, then, are controlled by the vendor. A definition may be the specifications and features for a specific and unique set of instructions. In short, there is no definition that conveniently wraps up the many meanings and makes explicit what is going on in the change procedures.
  • The procedure for making A into B consists of many sub processes. Because of this, explaining what is happening is difficult. If a change procedure does not work, making a fix may not be easy, quick, or economical. The software may have to be rewritten and there is no guarantee that the revised code will perform flawlessly. You can see this yourself by converting a FrameMaker Version 7.0 document first to RTF 1.3 and then to RTF 1.6 and observing the differences. The 1.3 conversion routine is more reliable than the newer one, and so far Adobe has not updated that filter on my copy of FrameMaker 7.0 and may never “fix” the issues I have.
  • The boundaries between each of the meanings I have tentatively offered are fuzzy. I cannot define precisely where filters differ from some data transformation functions and then from file conversion. I think each is a member of the connector family, but beyond that broad assertion, I am just not certain.

Why is this important to me? I think it underscores for me how difficult it is to communicate exactly what function is required to make an enterprise search or content processing system work as advertised. Agree? Disagree? Have other definitions to offer? Please, use the comment section of the Web log to offer them.

Stephen Arnold, June 23, 2008
