Search, Content Processing, and the Great Database Controversy

January 22, 2008

“The Typical Programmer” posted the article “Why Programmers Don’t Like Relational Databases,” and ignited a mini-bonfire on September 25, 2007. I missed the essay when it first appeared, but a kind soul forwarded it to me as part of an email criticizing my remarks about Google’s MapReduce.

I agreed with most of the statements in the article, and I enjoyed the comments by readers. When I worked in the Keystone Steel Mill’s machine shop in college, I learned two things: [a] don’t get killed by doing something stupid and [b] use the right tool for every job.

If you have experience with behind-the-firewall search systems and content processing systems, you know that there is no single right way to handle data management tasks in these systems. If you poke around the innards of some of the best-selling systems, you will find a wide range of data management and storage techniques. In my new study “Beyond Search,” I don’t focus on this particular issue because most licensees don’t think about data management until their systems run aground.

Let me highlight a handful of systems (without taking sides or mentioning names) and the range of data management techniques employed. I will conclude by making a few observations about one of the many crises that bedevil some behind-the-firewall search solutions available today.

The Repository Approach. IBM acquired iPhrase in 2005. The iPhrase approach to data management was similar to that used by Teratext. The history of Teratext is interesting, and the technology seems to have been folded back into the giant technical services firm SAIC. Both of these systems ingest text, store the source content in a transformed state, and create proprietary indexes that support query processing. When a document is required, it is pulled from the repository. When I asked both companies about the data management techniques used in their systems for the first edition of The Enterprise Search Report (2003-2004), I got very little information. What I recall from my research is that both systems used a combination of technologies integrated into a single system. The licensee was insulated from the mechanics under the hood. The key point is that two very large systems able to handle large amounts of data relied on data warehousing and proprietary indexes. I heard when IBM bought iPhrase that one reason for IBM’s interest was that iPhrase customers were buying hardware from IBM in prodigious amounts. The reason Teratext is unknown in most organizations is that it is one of the specialized tools purpose-built to handle CIA- and NSA-grade information chores.
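The repository pattern described above can be reduced to a toy sketch: ingest a document, store a transformed copy, update an index used for query processing, and pull the stored copy only when a hit is requested. All names and the trivial “transformation” here are illustrative assumptions, not the actual iPhrase or Teratext design.

```python
from collections import defaultdict

class Repository:
    """Toy repository-style store: a transformed copy of each source
    document plus an inverted index that supports query processing."""

    def __init__(self):
        self.store = {}                # doc_id -> transformed source
        self.index = defaultdict(set)  # term -> set of doc_ids

    def ingest(self, doc_id, text):
        transformed = text.lower()     # stand-in for real normalization
        self.store[doc_id] = transformed
        for term in transformed.split():
            self.index[term].add(doc_id)

    def query(self, term):
        # Queries run against the index; the document itself is pulled
        # from the repository only when a result is needed.
        return [self.store[d] for d in sorted(self.index.get(term.lower(), ()))]

repo = Repository()
repo.ingest("d1", "Enterprise search report")
repo.ingest("d2", "Search and content processing")
print(repo.query("search"))
```

The licensee, as noted above, never sees this machinery; the store and the index are a single sealed unit from the outside.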

The Proprietary Data Management Approach. One of the “Big Three” of enterprise search has created its own database technology, its own data management solution, and its own data platform. The reason is that this company was among the first to generate a significant amount of metadata from “intelligent” software. In order to reduce latency and cope with the large temporary files its iterative processing generated, the company looked for an off-the-shelf solution. Not finding what it needed, the company’s engineers solved the problem themselves, and even today the company “bakes in” its own data management, database, and manipulation components. When this system is licensed as an OEM (original equipment manufacturer) product, the company’s own “database” lives within the software built by the licensee. Few are aware of this doubling up of technology, but it works reasonably well. When a corporate customer of a content management system wants to upgrade the search system included in the CMS, the upgrade is a complete version of the search system. There is no easy way to get around the need to implement a complete, parallel solution.

The Fruit Salad Approach. A number of search and content processing companies deliver a fruit salad of data solutions in a single product. (I want to be vague because some readers will want to know who is delivering these hybrid systems, and I won’t reveal the information in a public forum. Period.) Poke around and you will find open source database components. MySQL is popular, but there are other RDBMS offerings available, and depending on the vendor’s requirements, the best open source solution will be selected. Next, the vendor’s engineers will have designed a proprietary index. In many cases, the structure and details of the index are closely guarded secrets. The reason is that the speed of query processing is often related to the cleverness of the index design. What I have found is that companies that start at the same time usually take similar approaches. I think this is because when the engineers were in university, the courses taught the received wisdom. The students then went on to their careers and tweaked what they learned in college. Despite the assertions of uniqueness, I find interesting coincidences based on this education factor. Finally, successful behind-the-firewall search and content processing companies license, buy, or are the beneficiaries of a helpful venture capital firm. The company ends up with different chunks of code, and in many cases, it is easier to use whatever is there than to figure out how to make the solution work with the pears, apricots, and apples in use elsewhere in the company.
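The fruit salad pattern can be sketched in a few lines: an open source RDBMS holds document metadata while a separate home-grown structure plays the role of the proprietary index. This is a hypothetical illustration; SQLite stands in for MySQL, and the in-memory dictionary stands in for a vendor’s secret index format.

```python
import sqlite3
from collections import defaultdict

# RDBMS side: document metadata in an ordinary relational table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, path TEXT)")

# Proprietary side: a separate term index, the "secret sauce".
index = defaultdict(set)  # term -> set of doc ids

def ingest(doc_id, title, path, body):
    db.execute("INSERT INTO docs VALUES (?, ?, ?)", (doc_id, title, path))
    for term in body.lower().split():
        index[term].add(doc_id)

def search(term):
    # Query the fast index first, then join back to the RDBMS for metadata.
    ids = index.get(term.lower(), set())
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    return db.execute(
        f"SELECT title FROM docs WHERE id IN ({marks})", tuple(ids)).fetchall()

ingest(1, "Q3 report", "/fs/q3.doc", "quarterly revenue figures")
ingest(2, "Memo", "/fs/memo.doc", "revenue forecast memo")
print(sorted(search("revenue")))
```

The seam between the two halves is exactly where the integration headaches described above live: two stores, two failure modes, one product.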

The Leap Frog. One company has designed a next-generation data management system. I talk about this technology in my column for one of Information Today’s tabloids, so I won’t repeat myself. This approach says, in effect: Today’s solutions are not right for the petabyte scale of some behind-the-firewall indexing tasks. The fix is to create something new, jumping over Dr. Codd, RDBMS, the costs of scaling, etc. When this data management technology becomes commercially available, there will be considerable pressure placed upon IBM, Microsoft, and Oracle; open source database and data management solutions; and companies asserting “a unique solution” while putting old wine in new bottles.

Let me hazard several observations:

First, today’s solutions must be matched to the particular search and content processing problem. The technology, while important, is secondary to getting what you need done within your time and budget parameters. Worrying about plumbing when the vendors won’t or can’t tell you what’s under the hood is not going to get your system up and running.

Second, regardless of the database, data management, or data transformation techniques used by a vendor, the key is reliability, stability, and ease of use from the point of view of the technical professional who has to keep the system up and running. You might want to have a homogeneous system, but you will be better off getting one that keeps your users engaged. When the data plumbing is flawed, look first at the resources available to the system. Any of today’s approaches work when properly resourced. Once you have vetted your organization, then turn your attention to the vendor.

Third, the leap frog solution is coming. I don’t know when, but there are researchers at universities in the U.S. and in other countries working on the problems of “databases” in our post-petabyte world. I appreciate the arguments from programmers, database administrators, vendors, and consultants. They are all generally correct. The problem, however, is that none of today’s solutions were designed to handle the types or volumes of information sloshing through networks today.

In closing, as the volume of information increases, today’s solutions — CICS, RDBMS, OODB and other approaches — are not the right tool for tomorrow’s job. As a pragmatist, I use what works for each engagement. I have given up trying to wrangle the “facts” from vendors. I don’t try to take sides in the technical religion wars. I do look forward to the solution to the endemic problems of big data. If you don’t believe me, try to find a specific version of a document. None of the approaches identified above can do this very well. No wonder users of behind-the-firewall search systems are generally annoyed most of the time. Today’s solutions are like the adult returning to college, finding a weird new world, and getting average marks with remarkable consistency.
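The version complaint above can be made concrete with a toy sketch (names and structure are mine, not any vendor’s): a store keeps every version of a document, but a typical engine indexes only the newest copy, so earlier wording becomes unsearchable even though it is still retrievable.

```python
versions = {}      # doc_id -> list of bodies; version n is index n-1
latest_index = {}  # term -> doc ids; only the newest body is indexed

def save(doc_id, body):
    versions.setdefault(doc_id, []).append(body)
    # Re-index: like most search engines, see only the current copy.
    for ids in latest_index.values():
        ids.discard(doc_id)
    for term in body.lower().split():
        latest_index.setdefault(term, set()).add(doc_id)

save("contract", "penalty clause draft")
save("contract", "final text, clause removed")

# The old wording no longer matches any query...
print(latest_index.get("penalty", set()))
# ...yet the old version is still there, if you already know where to look.
print(versions["contract"][0])
```

That gap between “stored somewhere” and “findable by search” is the day-to-day annoyance the closing paragraph points at.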

Stephen Arnold, January 22, 2008

Comments

2 Responses to “Search, Content Processing, and the Great Database Controversy”

  1. Beginning Level PHP Security Logging - WebProWorld on January 24th, 2008 11:05 am

    […] Re: Beginning Level PHP Security Logging Two related articles: Typical Programmer – Why Programmers Don’t Like Relational Databases Beyond Search » Search, Content Processing, and the Great Database Controversy __________________ Kjell Gunnar Bleivik:: Financial information at your fingertips Learn object oriented programming where it started […]

  2. Database Publishing: Federated Search And Content Aggregation Are The Future | Communication on January 30th, 2008 5:43 am

    […] Steven Arnold writes a thoughtful post on his Beyond Search blog about the inadequacy of traditional databases and search engines to deal with organizing and delivering content when the Web and many private content collections measure in petabytes and exabytes of information. […]
