An Interview with Benno Nieswand
Exorbyte publishes high-performance search software for most databases and structured data formats. Exorbyte software offers a large array of phonetic and algorithmic fuzzy search methods while returning results in under 10 ms across many millions of records. Founded in 2000, the company has its headquarters in Konstanz, Germany.
The firm has a strong customer following in Europe (Mont Blanc, Deutsche Post, Billiger.de, among others), and it is gaining market share in the US. The firm's technical team was among the first to implement auto-suggestion features in its interface. You can read the firm's white paper "AJAX Incremental Search Revisited" on this subject and learn more about the company's AJAX innovations.
On a recent trip to Europe, I sat down and talked with Benno Nieswand, the company founder and chief technical officer, in the firm's offices in Konstanz, Germany. The full text of the interview appears below:
What was the trigger for Exorbyte?
The idea for founding Exorbyte came when I met Prof. Franz Guenthner at RealNames in California, an internet start-up dealing with internet keywords. We were two people from different worlds. Prof. Guenthner had been working for AltaVista and other large search companies and is the head of the Centrum for Computational Linguistics in Munich. I, Benno Nieswand, worked as a software engineer for 10 years in the area of optical character recognition, neural nets, and automated address interpretation for postal applications.
With breakthroughs in matching algorithms at hand, we realized the power of applying these to the internet. Building on Mr. Guenthner's linguistic background, Exorbyte started in 2001 by productizing the ideas and selling the technology to several search engines, among them Yahoo! Search Marketing (formerly Overture), Fast Search and Transfer, and Convera. They required exceptionally fast and highly configurable error tolerance. Over the years Exorbyte developed a fully managed search solution for structured data. The development was driven by the aim to provide a highly scalable and highly configurable search solution for structured and semi-structured data. With regard to the combination of speed and the quality of our error tolerance, we consider our solution an unmatched technology leader.
What are the market segments on which you focus?
As I said before, we provide MatchMaker, a search application for structured and semi-structured data (databases, XML, e-commerce catalogues, directories, CRM, etc.) and not for web content search or enterprise search. The market segments are therefore e-commerce, data quality, master-data management, and OCR matching. We don't see Endeca and Microsoft as our competitors (which I guess is reciprocal). Yes, they search databases as well, and some better than others, but we offer more advanced, more transparent, and easier-to-configure ranking (through our approximation capabilities) and more powerful string matching algorithms. And we even make these algorithms configurable, which they certainly don’t. That makes Exorbyte especially suited for applications that need to perform very intricate and specific lookups on vast amounts of structured data in very little time.
Enterprise search engines have specialized in developing connectors to all types of data (full-text, databases, and file formats) and they have developed business logic that pertains only to one type of application: making that data searchable, visible, and accessible to all sorts of users (each with their own rights and needs).
We have also developed such applications based on our core data matching platform (MatchMaker) but only for segments of the market where our expertise was really relevant. Sometimes that involves enhancing existing enterprise search platforms, but often it relates to very different business processes: e-commerce search, data quality management, fuzzy data detection (security and compliance), online advertising targeting, and enhancing existing search platforms' capabilities to handle structured data.
Will you describe a typical use case for your system?
Half of our implementations occur in back-office processes, where error-tolerance increases automation rates. One example: A healthcare claims processing center handles inbound documents (40,000 / day) and other processes (like electronic status inquiries) for over 120 health insurance plans. Matching claims with procedure codes, patient records, and other data types can be very difficult to fully automate. They saved 1 Million USD in two years by increasing the automation rates through our error-tolerant data matching with their central data repository.
The other half of our implementations are systems with user interaction, like e-commerce search for which we have developed leading search products such as the SearchNavigator, an incremental search AJAX framework.
A front-end example is the German finance ministry, which centralized the information for all German tax IDs. Up to 2,000 operators can now simultaneously access 125 million personal records with error tolerance. This allows operators to quickly identify similar candidates (especially useful for names with multiple possible German transcriptions, like Russian, Asian, or Arabic names). Various IT systems also query the central repository through MatchMaker. Each subsystem receives answers that are individually ranked and targeted for its special purpose.
Operational costs are optimized through the efficient handling of structured data (for example, 125 million person records fit in only 6.5 GB of RAM). Search engines that were designed for unstructured data typically require around 30 to 50 GB of RAM. The second cost cutter is high speed, despite excellent error tolerance on multiple fields. Some average query times: single-field access on 5 million records, 6 ms; multi-field access on 32 million records, 200 ms. This latter example is not fast for exact lookups, but very fast for lookups with a high error tolerance.
Finally, we are very quick in rolling out and configuring solutions for structured and semi-structured data. This is well illustrated with our AJAX framework SearchNavigator which allows integrating our system at the UI level without touching the underlying search or data infrastructure.
To optimize operational costs the system offers tools for advanced administration, multi-machine administration, configuration, benchmarking, and optimization.
Our technology supports low-level semantic features such as synonym lookups, taxonomies, stemming, and entity extraction. With our focus on structured data we strive to stay language independent. We even support error-tolerant search in many different languages, including languages written in multibyte character sets such as Chinese, Hindi, and Japanese.
Exorbyte advertises its presence in Search as well as a few other domains (data management, data quality, OCR). Can you explain what these various areas have in common that allows Exorbyte to re-use the same technology for apparently very different processes?
Each of these areas profits from Exorbyte's powerful error tolerance and from efficient, flexible search in structured data. Our search application MatchMaker was designed from the outset with scalability and flexibility in mind. All the above-mentioned applications build on MatchMaker as the underlying search and matching application. For our data quality solution we added a scalable, web-based workflow system.
Our OCR/database matching and validation system considers the specifics of that industry. Take the matching of OCR results against a customer database or healthcare patient records: the query may be like "?teue? Amol?" and then MatchMaker returns "Steven Arnold" and can even decide whether to route this result automatically through the system, or whether an operator should double check.
Each of the applications you mention requires different search and matching capabilities. In addition, every industry has its own requirements. No company has yet managed to build a universal AND automated data matching system, because each industry wants a system that automates domain-specific features. National security requires approximate matching to thwart criminals’ efforts to stay invisible through subtle name changes, while finance and insurance companies often require very reliable matching with little error tolerance and solid verification checks, because sending a check or patient records to the wrong person can have costly consequences. E-commerce companies want to offer as many matches as possible to a potential customer’s queries, ranked by popularity or profit margin; while online-directories… etc.
What they all have in common is the need to match error-prone input data as quickly as possible against a large amount of reference data. Exorbyte can serve all these customers with its unique toolbox of algorithms assembled into a complete, flexible solution.
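Editor's illustration: the OCR example in this answer (a query such as "?teue? Amol?" resolving to "Steven Arnold", followed by an automatic-or-operator routing decision) can be sketched roughly as below. This is not Exorbyte's implementation. It is a minimal sketch that assumes "?" marks a character the OCR engine could not read, uses a plain edit distance in which that wildcard matches any single character for free, and routes on two hypothetical thresholds (`auto_max` and `min_gap`) invented for the illustration.

```python
def ocr_distance(ocr_text: str, candidate: str, wildcard: str = "?") -> int:
    """Edit distance where a wildcard in the OCR text matches any one character."""
    q, c = ocr_text.casefold(), candidate.casefold()
    previous = list(range(len(c) + 1))
    for i, qc in enumerate(q, start=1):
        current = [i]
        for j, cc in enumerate(c, start=1):
            cost = 0 if qc == wildcard or qc == cc else 1
            current.append(min(previous[j] + 1,          # drop a query character
                               current[j - 1] + 1,       # add a candidate character
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def route(ocr_text, candidates, auto_max=3, min_gap=2):
    """Return (decision, best_candidate, distance).

    Hypothetical policy: accept automatically only if the best match is
    close enough AND clearly better than the runner-up; otherwise send
    the record to an operator for review.
    """
    ranked = sorted((ocr_distance(ocr_text, c), c) for c in candidates)
    best_distance, best = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else None
    unambiguous = runner_up is None or runner_up - best_distance >= min_gap
    decision = "auto" if best_distance <= auto_max and unambiguous else "review"
    return decision, best, best_distance


if __name__ == "__main__":
    records = ["Steven Arnold", "Maria Schmidt", "Stefan Arndt"]
    print(route("?teue? Amol?", records))
```

With a second, equally close record such as "Steven Arnolt" in the list, the same call falls back to operator review, because the two best candidates can no longer be told apart.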
What features have your customers found most useful?
Among the really outstanding features of MatchMaker is obviously the ultra-efficient implementation of the Levenshtein Edit Distance calculation for error-tolerant search which remains unrivalled over all these years. At the same time our implementation includes various adaptations of the Levenshtein and other more or less related algorithms. This allows for the highest conceivable degree of error tolerance.
Moreover, MatchMaker allows combining string comparison with geographic distance, timeline distance, and any kind of custom distance in a single application that, even if configured with the most complex algorithms, is still lightning fast and reliable.
Take for example Genialotel that shows an approximate multi-field AJAX incremental search on 32 Million addresses (SearchNavigator): nothing I know of globally comes even close to this at this point.
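Editor's illustration: for readers unfamiliar with the Levenshtein edit distance mentioned in this answer, the textbook dynamic-programming version is sketched below. It shows only the definition (the minimum number of single-character insertions, deletions, and substitutions), not Exorbyte's optimized implementation or its adaptations. The `fuzzy_lookup` helper and its `max_distance` parameter are hypothetical names used for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def fuzzy_lookup(query, records, max_distance=2):
    """Return (distance, record) pairs within max_distance, closest first."""
    hits = [(levenshtein(query.lower(), r.lower()), r) for r in records]
    return sorted(hit for hit in hits if hit[0] <= max_distance)


if __name__ == "__main__":
    print(fuzzy_lookup("Konstnz", ["Konstanz", "Koblenz", "Munich"]))
```

This brute-force lookup compares the query against every record, so it scales linearly with the data set; production engines rely on indexing to avoid that scan.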
What enhancements have you made to the core system in the last year?
Over the last year Exorbyte went 64-bit, which led to a significant increase in speed of our core algorithms. In addition we added improvements in navigation generation (faceted search) and entity extraction.
Besides the continuous improvement of our search engine, Exorbyte developed a data quality solution that flexibly handles data quality tasks. It can be applied to processing address enhancement, deduplication, dictionary collection, document processing and more. The underlying MatchMaker search empowers it to achieve excellent results for each of these tasks.
Our core algorithms were also enhanced with the capability to match things like “nieswandkonstnzbenno” with “Benno Nieswand, Konstanz”. This is called Block-Edit-Distance calculation, and it retains the same speed as regular Levenshtein calculation. This greatly improves single field entry support. We use this for CRM applications for instance.
Yes, and we greatly improved our search analytics and also implemented a service for the extraction of information from websites: Web Extraction Solution – WES.
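Editor's illustration: the single-field example in this answer ("nieswandkonstnzbenno" against "Benno Nieswand, Konstanz") can be approximated with a much simpler technique than a true block edit distance. The sketch below is a stand-in, not Exorbyte's algorithm: it splits the reference record into words and, for each word, finds the cheapest approximate occurrence anywhere in the run-together query (approximate substring matching, often attributed to Sellers). The function names are hypothetical, and the sketch ignores overlapping matches and leftover query characters.

```python
import re


def best_substring_cost(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text."""
    previous = [0] * (len(text) + 1)  # row 0 is all zeros: a match may start anywhere
    for i, pc in enumerate(pattern, start=1):
        current = [i]
        for j, tc in enumerate(text, start=1):
            cost = 0 if pc == tc else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return min(previous)  # a match may also end anywhere


def block_cost(query: str, record: str) -> int:
    """Sum of per-word costs; word order in the query does not matter."""
    text = query.casefold()
    words = re.findall(r"\w+", record.casefold())
    return sum(best_substring_cost(word, text) for word in words)


if __name__ == "__main__":
    query = "nieswandkonstnzbenno"
    for record in ["Benno Nieswand, Konstanz", "Steven Arnold, Portland"]:
        print(record, block_cost(query, record))
```

For the example query, the only cost comes from the missing "a" in "konstnz"; the reordering of the three blocks is free.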
What connectors do you provide with your system to reduce the information transformation challenges that other vendors' systems leave up to the licensees to resolve?
Deep integration of client applications is rarely necessary. With our focus on structured and semi-structured content we usually sit on top of the core system database. Standard ODBC or file interfaces are sufficient. Again, MatchMaker is not designed as an enterprise search system, which needs to cater to these requirements (database connectors and file format compatibility). For structured and semi-structured data we support many standard extraction schemes, including automatic keyword extraction, clustering, and transliteration for seamless integration of any alphabet. Asian languages especially have been of interest, and we now even offer the possibility of mixed-language data entry or indexed data, e.g. Japanese with Kanji, Katakana, and Romaji.
What types of hardware and operating systems does your system require?
MatchMaker uses standard hardware very efficiently and runs on Windows, Linux and Solaris. Although it keeps its index in memory in order to achieve its high performance, it significantly compresses the data (for example, 125 million addresses in 6 GB of RAM). We have installations that serve up to 600 queries per second on a standard dual-CPU server with millions of records indexed. More elaborate multi-field query matching naturally takes longer.
It scales automatically with the hardware provided and can parallelize as many queries as necessary to meet any customer’s latency requirements.
New content is updated on the fly in the index through incremental updates with no downtime of the system.
If the system is close to an overload, it can automatically reduce the search depth to increase query throughput. Or the administrator can assign additional CPU resources with a single mouse click. If the system sees no chance of handling a query in time, it immediately sends a busy message to the client application.
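Editor's illustration: the overload behaviour described in this answer (reduce the search depth under load, and answer "busy" when a query cannot finish in time) can be pictured with the small sketch below. Everything in it is an assumption made for illustration: the interview does not say how MatchMaker measures load or defines depth. Here "depth" is simply an index into a hypothetical list of estimated costs per search level, and `queue_ms` stands for the time a query is expected to wait before processing starts.

```python
from typing import Optional, Sequence


def choose_depth(queue_ms: float,
                 deadline_ms: float,
                 cost_ms_by_depth: Sequence[float]) -> Optional[int]:
    """Return the deepest search level that still fits the deadline.

    Returns None when even the shallowest level would miss the deadline,
    which the caller should translate into an immediate 'busy' reply.
    """
    budget = deadline_ms - queue_ms
    for depth in range(len(cost_ms_by_depth) - 1, -1, -1):
        if cost_ms_by_depth[depth] <= budget:
            return depth
    return None


if __name__ == "__main__":
    costs = [2.0, 10.0, 60.0]  # hypothetical cost of exact, moderate, and deep fuzzy search
    for waiting in (0.0, 50.0, 99.0):
        depth = choose_depth(waiting, deadline_ms=100.0, cost_ms_by_depth=costs)
        print(waiting, "busy" if depth is None else f"depth {depth}")
```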
What degree of customization does your system permit?
One of the core competencies of Exorbyte is the transparent relevance ranking and configuration of the matching algorithms. You can configure how much relevance, popularity, geo-location, word frequency, and many other variables factor, alone or combined, into the ranking of result sets. These rules can be applied through data-type-specific algorithmic modules and other settings to the boundary conditions of the search. For instance, rankings can take into account cases where a query’s words match one by one but in the wrong order, or where certain characters contribute to the match quality, or where only part of a word should be matched by the query (front, back, or middle). Certain words can contribute more or less to relevance ranking, or certain fields can yield a penalty based on their value, etc. The list of possible rule combinations is nearly endless. MatchMaker can be configured to use the best-of-breed algorithms in whatever way best addresses the current problem. In this context hit boosting is just one means of relevance tuning among others.
Integration of the system is usually trivial, since the mechanisms I described for feeding data and query formulation are simple. Customers then integrate their systems with MatchMaker through a simple functional API provided in any client programming language. Sample integrations are delivered together with the installer. An average software engineer should be able to connect to a MatchMaker server within an hour. Many of our customers use MatchMaker with Omniture analytics.
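Editor's illustration: the idea of letting relevance, popularity, and geo-location factor "alone or combined" into a ranking can be shown with a weighted sum of normalized signals. This is a generic sketch, not MatchMaker's configuration format or API. The `Hit` fields, the weight names, and the 50 km cut-off are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    text_match: float   # 0..1, quality of the string match
    popularity: float   # 0..1, e.g. normalized click or sales rank
    distance_km: float  # distance from the user


def score(hit: Hit, weights: dict, max_km: float = 50.0) -> float:
    """Weighted average of the configured signals (higher is better)."""
    signals = {
        "text_match": hit.text_match,
        "popularity": hit.popularity,
        "geo": max(0.0, 1.0 - hit.distance_km / max_km),  # nearer scores higher
    }
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * signals[name] for name, w in weights.items()) / total


def rank(hits, weights):
    """Order hits by descending score under the given weights."""
    return sorted(hits, key=lambda h: score(h, weights), reverse=True)


if __name__ == "__main__":
    hits = [Hit("A", 0.9, 0.2, 40.0), Hit("B", 0.7, 0.9, 5.0)]
    print([h.name for h in rank(hits, {"text_match": 1.0})])
    print([h.name for h in rank(hits, {"text_match": 1.0, "popularity": 1.0, "geo": 1.0})])
```

Changing only the weights reorders the same result set, which is the sense in which hit boosting is just one tuning knob among several.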
What are the new features and functions you want to explore for the next major release of the Exorbyte system?
After big improvements in spring 2009 from fully using the 64-bit architecture, the next step is the release of MatchMaker 5.0 in spring 2010, which is a major step forward. It will support “approximate Boolean logic”. It will also support new search and matching strategies, even quicker categorization and navigation for very large data sets, easier data modeling, and expanded analytics. All this was added with no loss of overall performance and speed.
When you look at the sprawl of Google, what is your take on that firm's push into structured data and ecommerce?
Google usually seeks to address the masses with standardized products and does not address customers with highly specialized requirements. Not our focus. Exorbyte developed extensions in the past that allow for searching in a similar way to Google COMBINED with its own unique features that Google cannot provide, such as the combination of multiple defective query portions and data-specific matching algorithms. In addition, Exorbyte offers services to adapt its platform to each customer and maintain search projects as they become more and more complex. Our clients benefit materially from the enormous configurability and error tolerance of our solution.
What is the business set up for your company?
Exorbyte is privately owned. Our headquarters is located in the beautiful south of Germany on Lake Constance, close to Zurich. There is a long history of technology development in this region, and many OCR companies originate from here. We have offices in Portland (Oregon, US) and Bristol (UK).
Yes, you have a great location. When you look forward to 2010, what are the three major search-related trends that you see as most important?
Yes, we see demand for search engines to power interactive applications. Our experience is that potential clients increasingly understand the value of specialized search applications after difficult experiences with generic applications which could not sufficiently accommodate their business models.
Also we see that search vendors who can quickly roll out and implement solutions are successful. As very specialized solutions can be implemented in ever shorter times, their benefits become interesting to a wider range of customers, in particular smaller customers.
This will leave space for emerging technology providers with high quality of service.
Exorbyte offers a number of innovative search systems. Yahoo advertisers use Exorbyte's system to determine which key words are useful to an ad campaign. The company's system can be configured to support eCommerce, database, and business intelligence systems. Organizations struggling with large data sets will want to test drive the Exorbyte system. Worth a close look.
Stephen E. Arnold, November 10, 2009