Exclusive Interview with MaxxCat
April 15, 2009
I spoke with Jim Jackson on April 14, 2009. Maxxcat is a search and content processing vendor delivering appliance solutions. The full text of the interview appears below:
Why another appliance to address a search and content processing problem?
At MaxxCat, we believe that from the performance and cost perspectives, appliance-based computing provides the best overall value. The GSA and Google Mini are the market leaders, but provide only moderate performance at an expensive price point. We believe that by continuously obsessing about performance in the three major dimensions of search (volume of data, speed of retrieval, and crawl/indexing times), our appliances will continue to improve. Software-only solutions cannot match the performance of our appliances. Nor can software-only or general-purpose hardware approaches provide the scaling, high availability or ease of use of a gray-box appliance. From an overall cost perspective, even free software such as Lucene may end up being more expensive than our drop-in-and-use appliance.
Jim Jackson, Maxxcat
A second factor that is growing more important is the ease of integration of the appliance. Many of our customers have found unique and unexpected uses for our appliances that would have been very difficult to implement with black box architectures like Google’s. Our entry level appliance can be set up in 3 minutes, comes with a quick start guide that is only 12 pages long, and can be administered from two simple browser pages. That’s it! Conversely, software such as Lucene has to be downloaded, configured, installed, understood, and matched with suitable hardware. This is typically followed by a steep learning curve and consulting fees for experts who are brought in to get a working solution, which sometimes doesn’t work, or won’t scale.
But just because the appliance features easy integration, this does not mean that complex tasks cannot be accomplished with it. To aid our customers in integrating our appliances with their computing environments, we expose most of the features of the appliance to a web API. The appliance can be started, stopped, backed up, queried, pointed at content, SNMP monitored, and reported upon by external applications. This greatly eases the burden on developers who wish to customize the output, crawl behavior and integration points with our appliance. Of course this level of customization is available with open source software solutions, but at what price? And most other hardware appliances do not expose the hardware and operating system to manipulation.
Throughput becomes an issue eventually. What are the scaling options you offer?
Throughput is our major concern. Even our entry level appliance offers impressive performance using, for the most part, general purpose hardware. We have developed a micro-kernel architecture that scales from our SB-250 appliance all the way through our 6 enterprise models. Our clustering technology has been built to deliver performance over a wide range of the three dimensions that I mentioned before. Some customers have huge volumes of data that are updated and queried relatively infrequently. Our EX-5700 appliance runs the MaxxCAT kernel in a horizontal, front-facing cluster mode sitting on top of our proprietary SAN; storage heavy, adequate performance for retrieval. Other customers may have very high search volumes on relatively smaller data sets (< 1 Exabyte). In this case, the MaxxCAT kernel runs the nodes in a stacked cluster for maximum parallelism of retrieval. Same operating system, same search hardware, same query language, same configuration files etc, but two very different applications. Both heavy usage cases, but heavy in different dimensions. So I guess the point I am trying to make is that you can say a system scales, but does it scale well in all dimensions, or can you just throw storage on it? The MaxxCAT is the only appliance that we know of that offers multiple clustering paradigms from a single kernel. And by the way, with the simple flick of a switch on one of the two administration screens I mentioned before, the clusters can be converted to H/A, with symmetric load balancing, automatic fault detection, recovery and fail over.
Where did the idea for the MaxxCat solution originate?
Maxxcat was inspired by the growing frustration with the intrinsic limitations of the GSA and Google Mini. We were hearing lamentations in the market place with respect to pricing, licensing, uptime, performance and integration. So…we seized the opportunity to build a very fast, inexpensive enterprise search capability that was much more open, and easier to integrate using the latest web technologies and general purpose hardware. Originally, we had conceived it as a single stand alone appliance, but as we moved from alpha development to beta we realized that our core search kernel and algorithms would scale to much more complex computing topologies. This is why we began work on the clustering, H/A and SAN interfaces that have resulted in the EX-5000 series of appliances.
What’s a use case for your system?
I am going to answer your question twice, for the same price. One of our customers had an application in which they had to continuously scan literally hundreds of millions of documents for certain phrases as part of a service that they were providing to their customers, and marry that data with a structured database. The solution they had in place before working with us was a cobbled-together mishmash of SQL databases, expensive server platforms and proprietary software. They were using MS SQL Server to do full text searching, which is a performance disaster. They had queries running on very high end Dell quad core servers, maxed out with memory, that were taking 22 hours to process. Our entry level enterprise appliance is now doing those same queries in under 10 minutes, but the excitement doesn’t stop there. Because our architecture is so open, they were able to structure the output of the MaxxCAT into SQL statements that were fed back into their application and eliminate 6 pieces of hardware and two databases. And now, for the free, second answer. We are working with a consortium of publishers who all have very large volumes of data, but in widely varying formats, locations and platforms. By using a MaxxCAT cluster, we are able to provide these customers, not customers from different divisions of the same company, but different companies, with unified access to their pooled data. So the benefits in both of these cases are performance, economy, time to market, and ease of implementation.
Where did the name “MaxxCat” come from?
There are three (at least) versions of the story, and I do not feel empowered to arbitrate between the factions. The acronym comes from Content Addressable Technology, an old CS/EE term. Most computer memories work by presenting the memory with an address, and the memory retrieves the content. Our system works in reverse: the system is presented with content, and the addresses are found. A rival group, consisting primarily of Guitar Hero players, claims that the name evokes a double x fast version of the Unix ‘cat’ command (wouldn’t MaxxGrep have been more appropriate?). And the final faction, consisting primarily of our low level programmers, claims that the name came from a very fast female cat named Max, who sometimes shows up at our offices. I would make as many friends as enemies if I were to reveal my own leanings. Meow.
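The "content in, addresses out" idea described above can be illustrated with a toy inverted index. This is only a minimal sketch of the general concept, not MaxxCAT's implementation; the function names and the sample documents are invented for the example.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def lookup(index, word):
    """Content in, addresses out: return the ids of documents containing the word."""
    return index.get(word.lower(), [])


# Hypothetical sample data, invented for this illustration.
DOCS = {
    1: "fast search appliance",
    2: "search kernel and cluster",
    3: "a very fast cat",
}
INDEX = build_index(DOCS)
```

Calling lookup(INDEX, "fast") returns the ids of the documents that contain the word, the reverse of handing a memory an address and getting content back.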
What’s the product line up today?
Our entry level appliance is the SB-250, and starts at a price point of $1,995. It can handle up to several million web pages or documents, depending upon size. None of our appliances have artificial license restrictions based upon silly things like document counts. We then have 6 models of our EX-5000 enterprise appliances that are configured in ever increasing numbers of nodes, storage, and throughput. We really try to understand a customer’s application before making a recommendation, and prefer to do proofs of concept with the customer’s actual data, because, as any good search practitioner can tell you, the devil is in the data.
8. What is the technical approach of your search and content processing system?
We are most concerned with performance, scalability and ease of use. First of all, we try to keep things as simple as possible, and if complexity is necessary, we try to bury it in the appliance, rather than making the customer deal with it. A note on performance: our approach has been to start with general purpose hardware and a basic Linux configuration. We then threw out most of Linux, and built our own operating system, which attempts to take advantage of every small detail we know about search. A general purpose Linux machine has been designed to run databases, run graphics applications, handle network routing and sharing, interface to a wide range of devices, and so forth. It is sort of good at all of them, but not built from the ground up for any one of them. This fact is part of the beauty of building a hardware appliance dedicated to one function: we can throw out most of the operating system that does things like network routing, process scheduling, user accounting and so forth, and make the hardware scream through only the things that are pertinent to search. We are also obsessive about what may seem to be picayune details to most other software developers. We have meetings where each line of code is reviewed and developers are berated for using one more byte or one more cycle than necessary. If you watch the picoseconds, the microseconds will take care of themselves.
A lot of our development methodology would be anathema to other software firms. We could not care less about portability or platform independence. Object oriented is a wonderful idea, unless it costs one extra byte or cycle. We literally have search algorithms that are so obscure, they take advantage of the endianness of the platform. When we want to do something fast, we go back to Knuth, Salton and Hartmanis, rather than reading about the latest greatest on the net. We are very focused on keeping things small, fast, and tight. If we have a choice between adding a feature or taking one out, it is nearly unanimous to take it out. We are all infected with the joy of making code fast and small. You might ask, “Isn’t that what optimizing compilers do?” You would be laughed out of our building. Optimizing compilers are not aware of the meta algorithms, the operating system threading, the file system structure and the like. We consider an assembler a high level programming tool, sort of. Unlike Microsoft operating systems, which keep getting bigger and slower, we are on a quest to make ours smaller and faster. We are not satisfied yet, and maybe we won’t ever get there. Hardware is changing really fast too, so the opportunities continue.
How has the Google Search Appliance affected the market for your firm’s appliance?
I think that the marketing and demand generation done by Google for the GSA is helping to create demand and awareness for enterprise search, which helps us. Usually, especially on the higher end of the spectrum, people who are considering a GSA will shop a little, or when they come back with the price tag, their boss will tell them “What??? Shop This!”. They are very happy when they find out about us. What we share with Google is a belief in box based search (they advocate a totally closed black box, we have a gray box philosophy where we hide what you don’t need to know about, but expose what you do). Both of our companies have realized the benefits of dedicating hardware to a special task using low cost, mass produced components to build a platform. Google offers massive brand awareness and a giant company (dare I say bureaucracy). We offer our customers a higher performing, lower cost, extensible platform that makes it very easy to do things that are very difficult with the Google Mini or GSA.
What hooks / services does your API offer?
Every function that is available from the browser based user interface is exported through the API. In fact, our front end runs on top of the API, so customers who are so inclined could rewrite or re-organize the management console. Using the API, detailed machine status can be obtained. Things such as core temperature, queries per minute, available disk space, current crawl stats, errors and console logs are all at the user’s fingertips. Furthermore, collections can be added, dropped, scheduled and downloaded through the API. Our configuration and query languages are simple, text based protocols, and users can use text editors or software to generate and manipulate the control structures. Don’t like how fast the MaxxCAT is crawling your intranet, or when? Control it with external scheduling software. We don’t want to build that and make you learn how to use it. Use Unix cron for that if that’s what you like and are used to. For security reasons, do you want to suspend query processing during non-business hours? No problem. Do it from a browser or do it from a mainframe.
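As a rough illustration of driving the appliance from external scheduling software, the sketch below decides whether queries should be served at a given moment and builds the URL a cron job might request. The base URL, the "queries" path and the "action" parameter are all assumptions made up for this example; the interview does not describe the actual API methods.

```python
from datetime import datetime, time
from urllib.parse import urlencode

# Hypothetical base URL and parameter names; the real API is not described in the interview.
API_BASE = "http://appliance.example/api"


def in_business_hours(now, start=time(8, 0), end=time(18, 0)):
    """True on Monday to Friday between start (inclusive) and end (exclusive)."""
    return now.weekday() < 5 and start <= now.time() < end


def query_control_url(now):
    """Build the URL an external scheduler would call to resume or suspend queries."""
    action = "resume" if in_business_hours(now) else "suspend"
    return f"{API_BASE}/queries?{urlencode({'action': action})}"
```

A script run hourly from cron could call query_control_url(datetime.now()) and request the resulting URL, which keeps the scheduling policy outside the appliance, as described above.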
We also offer a number of protocol connectors to talk to external systems — HTTP, HTTPS, NFS, FTP, ODBC. And we can import the most common document formats, and provide a mechanism for customers to integrate additional format connectors. We have licensed a very slick technology for indexing ODBC databases. A template can be created to create pages from the database and the template can be included in the MaxxCAT control file. When it is time to update say, the invoice collection, the MaxxCAT can talk directly to the legacy system and pull the required records (or those that have changed or any other SQL selectable parameters), and format them as actual documents prior to indexing. This takes a lot of work off of the integration team. Databases are traditionally tricky to index, but we really like this solution.
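The template idea can be sketched in a few lines: select the changed records, then render each row into a standalone text document before indexing. This is an illustration only. The template syntax, the table layout and the helper name are assumptions, and an in-memory SQLite database stands in for a legacy system reached over ODBC.

```python
import sqlite3
from string import Template

# Hypothetical template; the actual MaxxCAT template syntax is not shown in the interview.
INVOICE_PAGE = Template("Invoice $number\nCustomer: $customer\nTotal: $total\n")


def rows_to_documents(conn, sql, template, params=()):
    """Run a SELECT and render each row into a text document ready for indexing."""
    conn.row_factory = sqlite3.Row
    return [template.substitute(dict(row)) for row in conn.execute(sql, params)]


# In-memory SQLite stands in for the legacy database reached over ODBC.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (number TEXT, customer TEXT, total TEXT, changed INTEGER)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [("A-1", "Acme", "100.00", 0), ("A-2", "Globex", "250.50", 1)],
)

# Pull only the records flagged as changed, as in the invoice example above.
docs = rows_to_documents(
    conn,
    "SELECT number, customer, total FROM invoices WHERE changed = ? ORDER BY number",
    INVOICE_PAGE,
    (1,),
)
```

Here docs holds one rendered page per selected row, so the indexer sees ordinary documents rather than database records.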
With respect to customizing output, we emit a standard JSON object that contains the result and provide a simple templating language to format those results. If users want to integrate the results with SSIs or external systems, it is very straightforward to pass this data around, and to manipulate it. This is one area where we excel against Google, which only provides a very clunky XML output format that is server based, and hard to work with. Our appliance can literally become a sub-routine in somebody else’s system.
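To make the JSON-plus-template approach concrete, here is a minimal sketch of formatting a result object into HTML. The payload shape and field names ("results", "title", "url", "score") are invented for the example, and Python's string.Template stands in for the appliance's own templating language, which the interview does not show.

```python
import json
from html import escape
from string import Template

# Hypothetical result payload; the field names are invented for illustration.
RAW = (
    '{"query": "invoice", "results": ['
    '{"title": "Invoice A-2", "url": "http://intranet.example/a2", "score": 0.91}]}'
)

ROW = Template('<li><a href="$url">$title</a> ($score)</li>')


def render_results(raw_json, row_template):
    """Parse a JSON result object and format each hit with the row template."""
    payload = json.loads(raw_json)
    rows = []
    for hit in payload["results"]:
        # Escape every value so result text cannot inject markup into the page.
        safe = {key: escape(str(value)) for key, value in hit.items()}
        rows.append(row_template.substitute(safe))
    return "<ul>" + "".join(rows) + "</ul>"


page = render_results(RAW, ROW)
```

Because the result is plain JSON, the same payload could just as easily be passed to another system instead of being rendered as HTML.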
What are new features and functions added since the last point release of your product?
Our 3.2 OS (not yet released) will provide improved indexing performance, a handful of new API methods, and most exciting for us, a template based ODBC extractor that should make pulling data out of SQL databases a breeze for our customers. We also have scheduled toggle-switch H/A, but that may take a little more time to make it completely transparent to the users.
13. Consolidation and vendors going out of business like SurfRay seem to be a feature of the search sector. How will these business conditions affect your company?
Another strange thing about MaxxCAT, in addition to our iconoclastic development methods, is our capital structure. Unlike most technology companies, especially young ones, we live off of revenue, not equity infusions. And we carry no debt. So we are somewhat insulated from the current downturn in the capital markets, and intend to survive on customers, not investors. Our major focus is to make our appliances better and faster. Although we like to be involved in the evaluation process with our customers, in all but the most difficult of cases, we prefer to hand off the implementation to partners who are familiar with our capabilities and who can bring in-depth enterprise search know-how into the mix.
Where do I go to get more information?
Visit www.maxxcat.com or email sales@maxxcat.com
Stephen Arnold, April 15, 2009
Comments
2 Responses to “Exclusive Interview with MaxxCat”
Please note that the heading of item 13. is not correct and should be retracted. SurfRay is not out of business, it is however under new management. If you would like up-to-date information on the company please visit the website at http://www.surfray.com.
Josh Noble,
Your comment adapts the interview text. Please, pass your concern along to Maxxcat. I provide the questions; the answers are the interview subject’s. I have no way of knowing if Search Wizards offer accurate information.
Stephen Arnold, April 28, 2009