An Interview with James Zubok
Brainware evokes two strong images in me. One is smart or intelligent software. The other is the idea that a company name might be too evocative. To find out, I made my way across the weird homogeneity of northern Virginia not far from Eero Saarinen's Dulles Airport, an architectural cake decoration surrounded by featureless buildings stuffed full of US government workers, contractors, consultants, and deli operators. Brainware occupies a new building in Ashburn, Virginia. For those in the know, there are some "hardened" data centers in this chunk of urban sprawl.
The tip-off is the concrete barriers and lack of windows. The black Suburbans with their tinted windows make the obvious point--you are in Fed high-tech territory.
After some small talk, Yegor Kuznetsov, Director, Analyst and Media Relations, walked me through the history of Brainware. (Brace yourselves for some serious circumlocutions and procedural jousting. Some of Brainware is not yet into the Spirit of '76, mom, and apple pie.)
In a nutshell, I learned about the company when I was in Germany several years ago. Then the company was a small cog in a large German content management and information services firm. This outfit--SER--poured significant resources into text processing research. Some SER managers bought out the unit and created Brainware. In the last 12 months, Brainware hit my radar when it landed a prestigious legal client. To make a long story short, I found myself sitting in the middle of Virginia's featureless "edge city" talking with Brainware's Chief Financial Officer, James Zubok, a lawyer by training. The CEO, Carl Mergele, was out of the country. The text of the interview with Mr. Zubok appears below.
Brainware is a very unusual name, and I have mixed feelings about it. What was the genesis of the name Brainware?
Our history began in approximately 1998, when a German software company called SER Systems AG spent over $100 million to invent an intelligent search and retrieval technology and an intelligent data capture technology. The R&D efforts were led by a group of PhD scientists who developed proprietary neural network and associative memory applications. These technologies would mimic human intelligence and learning for the categorization, search, and extraction of useful information from unstructured data like emails or depositions. In a brainstorming session, we came up with Brainware.
What's the background of the "smart software" that powers your search and content processing system?
The Brainware search technology, called Globalbrain, was introduced into the United States in 2001 by SER Systems AG's US-based subsidiary called SER Solutions, Inc. or SER.
In 2002, SER's senior management team executed a management buyout and purchased the stock of SER as well as all rights and intellectual property to the Brainware technology.
Following the MBO, the company operated the Brainware business as one of several business units. Then in 2003, Vista Equity, a private equity firm, invested in SER with a goal of repositioning the company's somewhat diverse business units into appropriate platforms.
This was a big undertaking, and we completed the deal and the restructuring in early 2006. Brainware then became a stand-alone legal entity. Other SER assets were sold.
We kept the senior management team and narrowed our focus to building the Brainware search business. We've been very fortunate. In less than two years we've experienced remarkable growth. Our sales have grown by more than 900 percent and we've doubled our sales force. We're in these larger Ashburn offices because we ran out of space in our previous facility.
What are Brainware's revenues?
We are a privately-held company, and we don't reveal that information. I can tell you that our revenues are healthy and we are profitable.
Your technology relies almost exclusively on pattern matching, a technique used to great effect by Autonomy and Recommind. What's the engineering principle behind this approach, which is quite different from techniques used by other search and content processing firms?
Most "full-text" search applications on the market today sequentially index each word in a document. In such search engines, when a user enters a search query composed of a single word (keyword) or a combination of words separated by Boolean operators (AND, OR, etc.), the entire text of the indexed document is searched to find exact matches to the words used in the query.
If an exact match cannot be found, such search engines typically use a dictionary or some other set of predefined rules to "look up" alternative spellings of words in the query. For example, if the indexed text states "it is necessary for all freshmen to register for at least sixteen credits," and the query includes the word "necesary", there will be no exact match because the word "necesary" was not indexed.
You might be able to find the document if the search engine's built-in dictionary recognizes that "necesary" is a misspelling of the indexed word "necessary." If not, you won't be able to locate the document unless you craft another query or resort to manual inspection of files.
Pattern matching techniques are intended to avoid these problems that are inherent in traditional keyword-based search engines.
In my research for the Enterprise Search Report, 1st, 2nd, and 3rd editions, and my new study Beyond Search, I documented a number of issues with pattern matching; for example, when the technique goes off track, users complain about relevancy. What are you doing to address this "pattern matching" issue?
You're right. There are a number of key limitations with this type of approach in most applications because they still rely on pre-defined rules or dictionaries. First, the process is time consuming, especially when every word is checked against a built-in dictionary or run through some set of linguistic rules. Second, this type of search has very little, if any, fault tolerance (or "fuzziness") for spelling errors, OCR errors, or the like, and any such tolerance is inherently limited to the pre-defined rules or dictionaries that are built into the application. Third, most such applications do not take word order into account, so users cannot successfully search for the content of entire phrases, sentences, or paragraphs of text.

Our search application avoids these issues by eliminating the need for such rules and dictionaries. We supplement our pattern matching functionality with contextual analysis techniques.
Our Brainware developers came up with, and patented, what we call an "Associative Memory" search approach. We think it eliminates many of the inherent limitations in traditional search engines.
What our system does is truly unique.
I hear "unique" quite a bit from vendors. Many search systems are quite similar "under the hood". What's the Brainware approach?
Instead of indexing each word, we index a binary vector representation of each "trigram" or three-letter string of characters in each word.
Can you give me an example?
Sure, when we index the word "BRAINWARE" we store a representation of the following trigrams: "BRA"; "RAI"; "AIN"; "INW"; etc. We create a similar trigram representation of all of the text in a search query. During a search, instead of trying to match up entire words, we match the trigrams, which allows our application to be incredibly fault tolerant. Even if some of the trigrams are not a match, our search yields relevant results without relying on any dictionaries or other pre-defined rules.
Consider a user typing the phrase "BRAIN-WARE" into a traditional search application that indexed the word "BRAINWARE." Such an engine would yield no results because the string of characters in "BRAINWARE" differs from the string in "BRAIN-WARE." A positive result would be returned only if a built-in dictionary or other pre-defined rule stated that the words BRAIN and WARE separated by a hyphen are equivalent to the single word BRAINWARE. A rule of that type is very unlikely to exist in any traditional search application.
By indexing the trigrams of words instead of just the entire words, our search application is fault tolerant without dictionaries. In this example, the combination of trigrams for the data string BRAIN-WARE would be very similar to the combination of trigrams for the word BRAINWARE. Accordingly, our search application would return a result with a confidence level of slightly less than 100 percent in this example.
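The trigram scheme Mr. Zubok describes can be sketched in a few lines of Python. This is a simplified illustration using Jaccard overlap of trigram sets, not Brainware's patented associative memory approach; the function names and scoring formula are my own.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of overlapping three-character substrings."""
    s = text.upper()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(query: str, indexed: str) -> float:
    """Jaccard overlap of trigram sets: 1.0 is an exact match, 0.0 no overlap."""
    q, d = trigrams(query), trigrams(indexed)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)

# "BRAIN-WARE" shares 5 of 10 distinct trigrams with "BRAINWARE", so it
# still scores well without any dictionary or pre-defined rule.
print(similarity("BRAIN-WARE", "BRAINWARE"))  # 0.5
print(similarity("necesary", "necessary"))    # 0.625
```

A production system would weight trigrams and combine them into binary vectors, but even this toy version matches misspellings and hyphenated variants that defeat exact keyword lookup.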
Your system incorporates a strong workflow component, particularly with regard to optical character recognition. How does your approach differ from that of ZyLab (Netherlands) and Arikus (Canada)?
First, the workflow. Our search application "crawls" all types of data repositories. When we ingest documents into a "knowledgebase," which is our version of an index, our software recognizes the document type and uses various filters to access the text and metadata of each document.
Also, we have workflow solutions for our intelligent data capture offerings (they have embedded search capabilities). We have two workflow applications: WF-distiller, our principal workflow component, which is used for creating and managing workflows of all levels of complexity; and A/P-WebDesk, a specialized workflow module built on WF-distiller but used specifically for Accounts Payable management. A/P-WebDesk (which includes A/P-WebDesk for SAP, a version built specifically for seamless integration with SAP) provides an easy-to-use interface to manage the entire invoice processing lifecycle.
ZyLab is a good product, but it is based on keyword type search technology. If the words are not an exact match, relevant documents will not be located. Although ZyLab now claims to support a "fuzzy" search, it is our understanding that the approach relies on pre-defined rules. For example, users have to tell it how many errors are allowed to retrieve the data. Also, it cannot search on context using big chunks of text as queries.
We did not have a chance to test Arikus. As far as we understand, its information retrieval platform is based on linguistic analysis (plain language) and statistical analysis. Arikus also uses pre-built taxonomies for enhanced information discovery. However, it still indexes words and, therefore, faces the limitations of all other vendors that do the same.
As I said, our solution is totally different--it doesn't need keywords, taxonomies, or linguistic algorithms. Therefore, we avoid all the limitations they impose.
I found that your system allowed me to run a key word query, but I didn't get the full set of exact matches. What can a licensee do to guide the system to recall specific terms, phrases, or unique strings? Is there an external knowledge base function, for example? How does a licensee implement it?
At query time, the user has the ability to configure a number of key parameters. I will try to hit the main points but I may omit one or two features. Okay?
The first parameter is what we call maximum results. The user can enter the maximum number of results that will be returned for a given query. Users can increase the number of results or even perform an unlimited search.

A second parameter is what we call minimum relevance. This allows the user to specify how fuzzy or confident the software needs to be about a resulting document. The closer the score is to 100 percent relevance, the more exact a match the result is to the query. Each variation of a word, missing words, word order, and other statistical measures "penalize" the result and lower the relevance ranking.
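The two knobs described here are easy to picture in code. Below is a hypothetical sketch, assuming results arrive as (relevance, document) pairs; the parameter names echo the interview but are not Brainware's actual API.

```python
def apply_query_params(results, max_results=10, min_relevance=0.7):
    """Filter and cap search hits.

    results: list of (relevance, doc_id) pairs, relevance in 0..1.
    Drops hits below min_relevance, sorts best-first, keeps max_results.
    """
    kept = [(r, d) for r, d in results if r >= min_relevance]
    kept.sort(reverse=True)
    return kept[:max_results]

hits = [(0.95, "a"), (0.62, "b"), (0.88, "c"), (0.71, "d")]
# Only scores >= 0.7 survive; the top two by relevance are returned.
print(apply_query_params(hits, max_results=2, min_relevance=0.7))
```

Raising `min_relevance` trades recall for precision; raising `max_results` (or removing the cap) does the opposite.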
My tests showed that your system was quite good when I was examining patents. I could, for example, copy claims and use those as a query and quickly find similar documents. Is this a typical use of your system? What have you done to make the system useful to attorneys?
Both our CEO Carl Mergele and I have quite a bit of legal experience. Both of us, for example, were practicing attorneys.
Lawyers make me nervous. Am I okay sitting here, or will I have to sign a document and take a lie detector test? Put up with a lot of procedural busywork?
No, but this conversation will cost you $750 per hour. (I'm just kidding.)
Anyway, we have the in-depth knowledge of the legal market. Law firms spend significant time and money for third party vendors to image and code legal documents into any number of litigation support applications. The purpose of coding documents is to allow attorneys to use the built-in search applications of litigation support software to search various "objective" fields such as the date, author, and title of the document. Not only is there a margin for error in this step - that is, each document is reviewed and coded manually by individuals who are unfamiliar with the cases - but for an average case, the process of coding related legal documents can take weeks to complete and is very expensive. After all that coding work is complete, at best attorneys will have a way of using keywords to search the handful of fields that were coded.
Ah, Bates numbers?
Yes, and other codes as well. But with Brainware, the need for document coding is significantly reduced and, in many cases, eliminated. The solution offers search capabilities on subject matters and context of the full text of each document. Using simple, natural language queries, attorneys can immediately find the pertinent details they need, regardless of the size of the document collection. Documents can also be quickly searched for specific issues using complete sentences and even paragraphs, rather than simply using keywords that are connected by Boolean operators.
In just about every lawsuit, attorneys know about at least one "hot" document and need to find other similar or related documents. With our search application, attorneys can use whole sections of such a document, or even the entire document, as the search query against all other documents in the case.
Our clients include such major legal firms as Fulbright & Jaworski, LLP, which is making the most of these capabilities and really helping its clients with their legal matters. We've heard that our system gives attorneys more time to work on analysis and less time flipping through files looking for a specific item.
When I tested your system on my test corpus, I was startled by its throughput rate.
We hear this quite a bit. Brainware is a quick install and a quick document processor.
Remember, however, you tested our desktop application, which is significantly slower than our enterprise version. The desktop application runs on a single CPU and does not distribute the workload. Our enterprise version splits the indexing function across any number of processors and, therefore, provides indexing times that are a small fraction of the indexing times you experienced with our desktop version.
The indexing time depends on the types of documents being processed. If a majority or all of the documents are image files the process would be slower.
As you know the OCR process can be CPU intensive and slow down the indexing. However, like all processes with Globalbrain Enterprise Edition, the indexing can be distributed across multiple servers and CPUs.
The search times typically seen are under one second. One of our customers has over two terabytes of data, and their Brainware system can process a chunk of text submitted as a query and return a response in less than two seconds.
What's the Brainware plumbing? By that I mean architecture and other platform components.
All Globalbrain Enterprise Edition executable components can run on Unix and Windows environments including Sun Solaris, Linux, Windows NT, Windows 2000, and Windows 2003.
The software and services are able to operate in a distributed environment and can easily scale to handle terabytes of data and thousands of users.
Brainware sells the Globalbrain Enterprise Edition directly to customers and partners. We do not offer the software as a service.
When I update indexes on some systems, I really have to reindex the corpus. What is the index update system used in Brainware? Can I maintain separate collections? Can I search across collections, which some people call federation?
Let's talk about indexing.
Globalbrain performs an incremental update in the background by looking for new, modified or deleted documents and updating the core index rather than having to re-index it. Users do not even know that an update is happening because there is no loss in functionality while an update is being performed.
And, yes, we support collections, but we call these "knowledgebases," which is our version of an index. Each knowledgebase is a container of separate data collections for searching. With Globalbrain Enterprise Edition, each of the knowledgebases can be maintained individually and have different security allowing different users to have limited access to parts of the knowledgebase or the entire knowledgebase.
As long as a user has the appropriate access rights, they can search across all knowledgebases or selected knowledgebases with one query using our intelligent search technology to search these different collections of data. Some other search applications also point to different data repositories stored in different applications but rely on a federated search methodology, i.e., the different built-in search functions of each separate application.
A federated approach simply gathers and presents the results of multiple other search applications into a single user interface. Globalbrain uses its own search functionality to search through any number of diverse data repositories.
Does Brainware seek to replace existing search systems, or do you complement the competitors' systems?
Our system is, actually, compatible with those of the providers you mentioned. Rather than competing with them and other players in this field, we are offering Globalbrain as a complementary, embedded solution that offers fuzzy, keyword-free capabilities the above vendors lack. We have a set of robust APIs that allow Globalbrain to seamlessly work behind the scenes. Users can make the most of it and still have the familiar look-and-feel.
What are some of the new features and functions that I can expect from Brainware in the next release?
That's a good question. Please, appreciate that I can only describe a few enhancements and in a general way.
One of the new functions will be support for traditional double-byte languages based on the CJK (Chinese, Japanese, and Korean) character sets. This new feature and our approach will support "fuzziness". Our system is able to support knowledgebases that contain multiple languages without any issues.
We are also adding what some people call "facets" or "assisted navigation". Because at indexing time Globalbrain also grabs document attribute information, users will now be able to quickly drill down or filter their search results based on some of these attributes.
Is this a Yahoo-style interface with categories?
Well, not exactly Yahoo. The interface will be more like those available from such vendors as Endeca, Fast Search & Transfer, and others.
We are also adding field weighting. At the knowledgebase level, an administrator can configure the system to weight a result based on the location of the "hot spot" text of a query result. An example of this would be within a library. If a user searches for "Civil War," the library may want to have a higher weighting for books that contain "Civil War" within the title.
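Field weighting of this sort is simple to sketch. The field names and weight values below are invented for illustration; Brainware's actual configuration options are not public.

```python
# Illustrative per-field boosts an administrator might configure at the
# knowledgebase level. The numbers here are made up for the example.
FIELD_WEIGHTS = {"title": 2.0, "body": 1.0}

def weighted_score(base_score: float, field: str) -> float:
    """Boost a hit's relevance by the weight of the field it occurred in."""
    return base_score * FIELD_WEIGHTS.get(field, 1.0)

# A book with "Civil War" in its title outranks one that mentions it
# only in the body, given equal base relevance.
assert weighted_score(0.8, "title") > weighted_score(0.8, "body")
```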
Are you modifying your index?
No. We have a flexible design, and we have created what we call an "Intelligent Inverted Index". We wanted to keep our unique fuzzy, content-based searching capability but also be able to use traditional search techniques for handling very common keyword-type queries. So we have introduced an intelligent inverted index within our underlying search technology that has not significantly increased our index size. This new approach determines at query time which search approach to use, either fuzzy/content or inverted index. The approaches can be combined for queries consisting of a phrase or more of text that contains one or more very common keywords. The algorithms identify common-keyword documents within the knowledgebase immediately through a phase zero inverted index, and then use the remaining results for fuzzy/content search.
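The routing idea is straightforward to sketch: try a cheap inverted-index lookup for short keyword queries, and fall back to trigram scoring for long or inexact ones. Everything below is a hypothetical toy, not Brainware's "Intelligent Inverted Index"; the class and method names are mine.

```python
from collections import defaultdict

def trigrams(text: str) -> set[str]:
    """Return the set of overlapping three-character substrings."""
    s = text.upper()
    return {s[i:i + 3] for i in range(len(s) - 2)}

class HybridIndex:
    """Toy index that routes queries between exact and fuzzy search."""

    def __init__(self) -> None:
        self.inverted = defaultdict(set)  # word -> set of doc ids
        self.docs = {}                    # doc id -> raw text

    def add(self, doc_id, text: str) -> None:
        self.docs[doc_id] = text
        for word in text.upper().split():
            self.inverted[word].add(doc_id)

    def search(self, query: str):
        words = query.upper().split()
        # "Phase zero": short keyword queries try the inverted index first.
        if len(words) <= 2:
            hits = set.intersection(*(self.inverted.get(w, set()) for w in words))
            if hits:
                return sorted(hits)
        # Fallback: score every document by trigrams shared with the query.
        q = trigrams(query)
        scored = [(len(q & trigrams(t)) / max(len(q), 1), d)
                  for d, t in self.docs.items()]
        return [d for score, d in sorted(scored, reverse=True) if score > 0]

idx = HybridIndex()
idx.add(1, "Brainware search technology")
idx.add(2, "keyword search engine")
print(idx.search("BRAINWARE"))                 # exact inverted-index path
print(idx.search("branware serch technolgy"))  # misspelled, fuzzy path
```

The design point is that the inverted index answers common one- or two-word queries cheaply, while the trigram path absorbs misspellings and long passage-style queries, which matches the hybrid behavior described above.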
The turmoil in the search and content processing sector is increasing. There's financial pressure; mergers are getting headlines; and companies like Entopia are going out of business. What do you see as the major trends in search and content processing in the last half of 2008 and the first half of 2009?
I think we are likely to see accelerated consolidation in this sector. Eventually, the market will be dominated by such major players as Google, IBM, and Microsoft. I think you call these superplatforms, right?
Independent enterprise search solutions providers will probably be absorbed. With this in mind, we offer our search solutions primarily as embedded and complementary add-ons to the leading enterprise search products on the market.
Brainware is profiled in my new study Beyond Search. I included this company because its technology allows users to discover information, not just search for words and phrases. The work flow component and Brainware's commitment to "playing well with others" positions the firm to remediate some of the egregious problems that exist with many search-and-retrieval systems. The system's ability to locate relevant documents in a patent collection is quite useful. ArnoldIT has added Brainware to its search tool kit. More information about the company's system can be found at the Brainware website.
Stephen E. Arnold, March 31, 2008