Kickfire: Crunching Data Quickly
April 15, 2009
Reuters published “Kickfire Launches First Analytic Appliance for the Data Warehousing Mass Market” here. On the surface, Kickfire seems to be another player in the data management enabling business. Digging into the company’s Web site here reveals another angle–Kickfire is consumerizing high-end methods. The Reuters article said:
Kickfire’s revolutionary appliance delivers the industry’s highest performance per dollar spent. Starting at just $32,000, it makes the high-end capabilities of commercial database systems available to every organization. Combining the industry’s first SQL chip… and an advanced column-store engine with full ACID compliance, it achieves the industry’s best price/performance based on rigorous benchmark tests and ensures complete data integrity.
In my opinion, the Kickfire approach blends several innovations. First, the company uses proprietary silicon packaged in an appliance. Cloud-based consumer business analytics are slowly gaining traction; the Kickfire appliance is a here-and-now solution. Second, the appliance eliminates some (not all) of the headaches associated with scaling industry standard number crunching methods. The performance of the Kickfire appliance becomes available without choking other enterprise systems. Finally, Kickfire implements data structure methods that breathe life into traditional Codd tables.
Kickfire, a privately-held firm, is backed by blue-chip venture capital firms: Accel Partners, Greylock Partners, The Mayfield Fund and Pinnacle Ventures.
Stephen Arnold, April 15, 2009
The Google: Scores a Big Win
April 15, 2009
The goslings and I have been quite busy at the goose pond today. A happy quack to the reader in the UK who alerted me to ZDNet.co.uk’s story “Virgin to Migrate Customers onto Google Mail.” You can read the story here.
Colin Barker wrote:
The company said the rollout will be one of the largest deployments to date of Google Partner Edition Apps, which lets businesses and individual customers use Google’s communication and collaboration applications under their own domain names.
I think this announcement is a big deal. First, Virgin is a high profile company, and its top executive gets quite a bit of attention in major companies. Second, this deal makes clear that it makes financial and technical sense for organizations to get out of the email business. Email has become complex and costly. Organizations like Virgin looked at the facts and made a decision to go with Googzilla. Smart choice. If litigation becomes necessary, the GOOG is in the archiving business too. The company doesn’t call much attention to its Postini-centric solution, but it is there and promises to slash the cost of some discovery actions.
What the Gmail deal means to this addled goose is that the Google Apps initiative is going to find increasingly attractive opportunities. Will Virgin stop at email? My hunch is that Virgin will be an interesting Google customer to watch. I give more detail about what can be done with the Google App Engine in my next column in KMWorld.
So, this is a big deal.
Stephen Arnold, April 15, 2009
Google: The Tapeworm
April 15, 2009
I enjoy the New York Times. I find the write ups a good mix of big city hipness and “we know more than you” information about major trends in the world. The editorials are fun too. Having worked at a daily paper and a big magazine publisher, I know how some of the ideas take shape. Outside contributions are useful as well. Getting a chunk of column inches can do wonders for a career or an idea.
I liked “Dinosaur at the Gate” here. The author is Maureen Dowd. She summarizes big media’s view of the GOOG. The image of “tapeworm” was fresh and amusing. I never thought of math as having tapeworm qualities, but I am an addled goose.
The bottom line is that this write up will probably have quite a bit of influence in shaping the dialog about Googzilla, a term I coined when on a panel with a Googler in 2005. The Googler laughed at my image of a smiling Googzilla as did the audience. I used the term affectionately in 2005. Then Googzilla was at the gate. Today Googzilla is in the city, kicking back at TGIF, sipping a mojito.
More about its influence within the core of the information world appears in Google: The Digital Gutenberg here. By the way, Google definitely has some characteristics of middleware, but it is more. Much, much more. I think Google is a positive in today’s information world, and I urge readers to consider “surfing on Google”. If this phrase doesn’t make sense, check out my two Google monographs, dating from 2005 here.
Stephen Arnold, April 15, 2009
More Yahoo Crazy Math: Microsoft Yahoo Analytics
April 15, 2009
Silicon Alley Insider’s “Yahoo Could Save $1+ Billion per Year Outsourcing Search to Microsoft” was one of those Web write ups that went into my “Classics” folder. The author (Dan Frommer) summarized one of the pundits who analyzes the heck out of Web outfits. That analysis provided the fodder for this stallion of a column. The gist of the argument is that Yahoo could save money by paying Microsoft to run its search system. Yahoo has had a go at this. Before the meltdown or financial missteps of Fast Search & Transfer, Yahoo was relying on Fast Search and its data centers to provide support to the Yahoo search wizards and wizardettes, according to my sources. The financial pundit seized upon a similar idea, swapped out Fast Search for Microsoft, and presto we have a hot new angle on Yahoo.
What this story triggered in my mind were these thoughts:
- Not much has changed at Yahoo in the search department. When a financial analyst realizes that Yahoo’s investment in search is sucking up its oxygen, it may be too late to resuscitate the purple beastie.
- Microsoft has plumbing, but I wonder if that plumbing can handle the demands of Yahoo’s spider which gobbles more of my Web site’s content than any other indexing system that hits it. With talk about chips as a solution to Microsoft’s performance problems, is this porting of Yahoo to the Microsoft infrastructure affordable, possible, practical, or even doable?
- As Google’s share of the Web search market creeps toward 70 or 80 percent, Microsoft and Yahoo have to do more than team up. The companies–on their own or in some sort of tie up–have to leapfrog over Googzilla. A direct clash is likely to leave both Microsoft and Yahoo battered and not much better off than each company is at the present time.
In fact, the word “time” is interesting. I think “time” for Microsoft and Yahoo with regards to Google is running out. Quickly. In Web search.
Stephen Arnold, April 15, 2009
Exclusive Interview with MaxxCat
April 15, 2009
I spoke with Jim Jackson on April 14, 2009. MaxxCat is a search and content processing vendor delivering appliance solutions. The full text of the interview appears below:
Why another appliance to address a search and content processing problem?
At MaxxCat, we believe that from the performance and cost perspectives, appliance based computing provides the best overall value. The GSA and Google Mini are the market leaders, but provide only moderate performance at a high price point. We believe that by continuously obsessing about performance in the three major dimensions of search (volume of data, speed of retrieval, and crawl/indexing times), our appliances will continue to improve. Software-only solutions cannot match the performance of our appliances. Nor can software-only or general purpose hardware approaches provide the scaling, high availability, or ease of use of a gray-box appliance. From an overall cost perspective, even free software such as Lucene may end up being more expensive than our drop-in-and-use appliance.
Jim Jackson, MaxxCat
A second factor that is growing more important is the ease of integration of the appliance. Many of our customers have found unique and unexpected uses for our appliances that would have been very difficult to implement with black box architectures like Google’s. Our entry level appliance can be set up in 3 minutes, comes with a quick start guide that is only 12 pages long, and can be administered from two simple browser pages. That’s it! Conversely, software such as Lucene has to be downloaded, configured, installed, understood, and matched with suitable hardware. This is typically followed by a steep learning curve and consulting fees from experts who are brought in to get a working solution, which sometimes doesn’t work, or won’t scale.
But just because the appliance features easy integration, this does not mean that complex tasks cannot be accomplished with it. To aid our customers in integrating our appliances with their computing environments, we expose most of the features of the appliance through a web API. The appliance can be started, stopped, backed up, queried, pointed at content, SNMP monitored, and reported upon by external applications. This greatly eases the burden on developers who wish to customize the output, crawl behavior, and integration points of our appliance. Of course this level of customization is available with open source software solutions, but at what price? And most other hardware appliances do not expose the hardware and operating system to manipulation.
Throughput becomes an issue eventually. What are the scaling options you offer?
Throughput is our major concern. Even our entry level appliance offers impressive performance using, for the most part, general purpose hardware. We have developed a micro-kernel architecture that scales from our SB-250 appliance all the way through our 6 enterprise models. Our clustering technology has been built to deliver performance over a wide range of the three dimensions that I mentioned before. Some customers have huge volumes of data that are updated and queried relatively infrequently. Our EX-5700 appliance runs the MaxxCAT kernel in a horizontal, front-facing cluster mode sitting on top of our proprietary SAN; storage heavy, adequate performance for retrieval. Other customers may have very high search volumes on relatively smaller data sets (< 1 Exabyte). In this case, the MaxxCAT kernel runs the nodes in a stacked cluster for maximum parallelism of retrieval. Same operating system, same search hardware, same query language, same configuration files, etc., but two very different applications. Both heavy usage cases, but heavy in different dimensions. So I guess the point I am trying to make is that you can say a system scales, but does it scale well in all dimensions, or can you just throw storage on it? The MaxxCAT is the only appliance that we know of that offers multiple clustering paradigms from a single kernel. And by the way, with the simple flick of a switch on one of the two administration screens I mentioned before, the clusters can be converted to H/A, with symmetric load balancing, automatic fault detection, recovery, and failover.
Where did the idea for the MaxxCat solution originate?
MaxxCat was inspired by the growing frustration with the intrinsic limitations of the GSA and Google Mini. We were hearing lamentations in the marketplace with respect to pricing, licensing, uptime, performance, and integration. So…we seized the opportunity to build a very fast, inexpensive enterprise search capability that was much more open and easier to integrate using the latest web technologies and general purpose hardware. Originally, we had conceived it as a single standalone appliance, but as we moved from alpha development to beta we realized that our core search kernel and algorithms would scale to much more complex computing topologies. This is why we began work on the clustering, H/A, and SAN interfaces that have resulted in the EX-5000 series of appliances.
What’s a use case for your system?
I am going to answer your question twice, for the same price. One of our customers had an application in which they had to continuously scan literally hundreds of millions of documents for certain phrases as part of a service that they were providing to their customers, and marry that data with a structured database. The solution they had in place before working with us was a cobbled together mishmash of SQL databases, expensive server platforms, and proprietary software. They were using MS SQL Server to do full text searching, which is a performance disaster. They had queries running on very high end Dell quad core servers maxed out with memory that were taking 22 hours to process. Our entry level enterprise appliance is now doing those same queries in under 10 minutes, but the excitement doesn’t stop there. Because our architecture is so open, they were able to structure the output of the MaxxCAT into SQL statements that were fed back into their application and eliminate 6 pieces of hardware and two databases. And now, for the free, second answer. We are working with a consortium of publishers who all have very large volumes of data, but in widely varying formats, locations, and platforms. By using a MaxxCAT cluster, we are able to provide these customers–not divisions of the same company, but different companies–with unified access to their pooled data. So the benefits in both of these cases are performance, economy, time to market, and ease of implementation.
Where did the name “MaxxCat” come from?
There are three (at least) versions of the story, and I do not feel empowered to arbitrate between the factions. The acronym comes from Content Addressable Technology, an old CS/EE term. Most computer memories work by presenting the memory with an address, and the memory retrieves the content. Our system works in reverse: the system is presented with content, and the addresses are found. A rival group, consisting primarily of Guitar Hero players, claims that the name evokes a double-x fast version of the Unix ‘cat’ command (wouldn’t MaxxGrep have been more appropriate?). And the final faction, consisting primarily of our low level programmers, claims that the name came from a very fast female cat, named Max, who sometimes shows up at our offices. I would make as many enemies as friends if I were to reveal my own leanings. Meow.
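For the technically inclined reader, here is a minimal sketch of the content-addressable idea Mr. Jackson describes, written in Python. A toy inverted index maps content back to addresses; this illustrates the general concept only and is not MaxxCAT code.

    from collections import defaultdict

    # Toy documents: address -> content, the way ordinary memory works.
    documents = {
        0: "the cat sat on the mat",
        1: "a fast cat named max",
        2: "grep the unix way",
    }

    # Build the reverse mapping, content -> addresses (an inverted index).
    index = defaultdict(set)
    for address, content in documents.items():
        for term in content.split():
            index[term].add(address)

    # Present content, get addresses back.
    print(sorted(index["cat"]))   # [0, 1]
    print(sorted(index["grep"]))  # [2]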
What’s the product line up today?
Our entry level appliance is the SB-250, and starts at a price point of $1,995. It can handle up to several million web pages or documents, depending upon size. None of our appliances have artificial license restrictions based upon silly things like document counts. We then have 6 models of our EX-5000 enterprise appliances that are configured in ever increasing numbers of nodes, storage, and throughput. We really try to understand a customer’s application before making a recommendation, and prefer to do proofs of concept with the customer’s actual data, because, as any good search practitioner can tell you, the devil is in the data.
What is the technical approach of your search and content processing system?
We are most concerned with performance, scalability, and ease of use. First of all, we try to keep things as simple as possible, and if complexity is necessary, we try to bury it in the appliance, rather than making the customer deal with it. A note on performance: our approach has been to start with general purpose hardware and a basic Linux configuration. We then threw out most of Linux and built our own operating system that attempts to take advantage of every small detail we know about search. A general purpose Linux machine has been designed to run databases, run graphics applications, handle network routing and sharing, and interface to a wide range of devices and so forth. It is sort of good at all of them, but not built from the ground up for any one of them. This fact is part of the beauty of building a hardware appliance dedicated to one function — we can throw out most of the operating system that does things like network routing, process scheduling, user accounting and so forth, and make the hardware scream through only the things that are pertinent to search. We are also obsessive about what may seem to be picayune details to most other software developers. We have meetings where each line of code is reviewed and developers are berated for using one more byte or one more cycle than necessary. If you watch the picoseconds, the microseconds will take care of themselves.
A lot of our development methodology would be anathema to other software firms. We could not care less about portability or platform independence. Object oriented is a wonderful idea, unless it costs one extra byte or cycle. We literally have search algorithms so obscure that they take advantage of the endianness of the platform. When we want to do something fast, we go back to Knuth, Salton, and Hartmanis, rather than reading about the latest greatest on the net. We are very focused on keeping things small, fast, and tight. If we have a choice between adding a feature or taking one out, it is nearly unanimous to take it out. We are all infected with the joy of making code fast and small. You might ask, “Isn’t that what optimizing compilers do?” You would be laughed out of our building. Optimizing compilers are not aware of the meta algorithms, the operating system threading, the file system structure, and the like. We consider an assembler a high level programming tool, sort of. Unlike Microsoft operating systems, which keep getting bigger and slower, we are on a quest to make ours smaller and faster. We are not satisfied yet, and maybe we won’t ever get there. Hardware is changing really fast too, so the opportunities continue.
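A minimal sketch of what exploiting byte order can mean, assuming a little-endian x86 machine: the raw bytes of an integer arrive low byte first, so code that compares or hashes raw buffers can skip byte swapping. This Python snippet illustrates the general idea only; it is not MaxxCAT’s actual code.

    import struct
    import sys

    value = 0x01020304
    raw = struct.pack("<I", value)  # force little-endian byte layout

    print(sys.byteorder)  # 'little' on x86 hardware
    print(raw.hex())      # '04030201' -- low byte first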
How has the Google Search Appliance affected the market for your firm’s appliance?
I think that the marketing and demand generation done by Google for the GSA is helping to create demand and awareness for enterprise search, which helps us. Usually, especially on the higher end of the spectrum, people who are considering a GSA will shop a little, or when they come back with the price tag, their boss will tell them “What??? Shop This!”. They are very happy when they find out about us. What we share with Google is a belief in box based search (they advocate a totally closed black box, we have a gray box philosophy where we hide what you don’t need to know about, but expose what you do). Both of our companies have realized the benefits of dedicating hardware to a special task using low cost, mass produced components to build a platform. Google offers massive brand awareness and a giant company (dare I say bureaucracy). We offer our customers a higher performing, lower cost, extensible platform that makes it very easy to do things that are very difficult with the Google Mini or GSA.
What hooks / services does your API offer?
Every function that is available from the browser based user interface is exported through the API. In fact, our front end runs on top of the API, so customers who are so inclined could rewrite or reorganize the management console. Using the API, detailed machine status can be obtained. Things such as core temperature, queries per minute, available disk space, current crawl stats, errors, and console logs are all at the user’s fingertips. Furthermore, collections can be added, dropped, scheduled, and downloaded through the API. Our configuration and query languages are simple, text based protocols, and users can use text editors or software to generate and manipulate the control structures. Don’t like how fast the MaxxCAT is crawling your intranet, or when? Control it with external scheduling software. We don’t want to build that and make you learn how to use it. Use Unix cron for that if that’s what you like and are used to. For security reasons, do you want to suspend query processing during non-business hours? No problem. Do it from a browser or do it from a mainframe.
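To make the shape of such an API concrete, here is a hedged Python sketch. The host name, endpoint paths, and field names below are invented for illustration; they are not documented MaxxCAT calls.

    import json
    import urllib.request

    APPLIANCE = "http://appliance.example.com"  # hypothetical address

    def get_status():
        # Fetch machine status: temperature, queries per minute, disk, etc.
        # The /api/status path is an assumption made for this sketch.
        with urllib.request.urlopen(APPLIANCE + "/api/status") as resp:
            return json.load(resp)

    def suspend_queries():
        # Suspend query processing, e.g. outside business hours.
        req = urllib.request.Request(APPLIANCE + "/api/queries/suspend",
                                     method="POST")
        urllib.request.urlopen(req)

    status = get_status()
    print(status.get("queries_per_minute"), status.get("disk_free"))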
We also offer a number of protocol connectors to talk to external systems — HTTP, HTTPS, NFS, FTP, ODBC. And we can import the most common document formats, and provide a mechanism for customers to integrate additional format connectors. We have licensed a very slick technology for indexing ODBC databases. A template can be created to generate pages from the database, and the template can be included in the MaxxCAT control file. When it is time to update, say, the invoice collection, the MaxxCAT can talk directly to the legacy system and pull the required records (or those that have changed, or any other SQL selectable parameters), and format them as actual documents prior to indexing. This takes a lot of work off of the integration team. Databases are traditionally tricky to index, but we really like this solution.
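The template idea can be sketched in a few lines of Python, using SQLite in place of an ODBC source. The table, columns, and template below are invented for illustration; MaxxCAT’s actual control file syntax is its own.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (id INTEGER, customer TEXT, total REAL)")
    conn.execute("INSERT INTO invoices VALUES (4411, 'Acme', 1250.00)")

    # A template turns each selected row into a document-like page.
    TEMPLATE = "Invoice {id}\nCustomer: {customer}\nTotal: ${total:.2f}\n"

    for id_, customer, total in conn.execute("SELECT * FROM invoices"):
        page = TEMPLATE.format(id=id_, customer=customer, total=total)
        print(page)  # in a real pipeline, this page would be indexed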
With respect to customizing output, we emit a standard JSON object that contains the result and provide a simple templating language to format those results. If users want to integrate the results with SSIs or external systems, it is very straightforward to pass this data around and to manipulate it. This is one area where we excel against Google, which provides only a very clunky XML output format that is server based and hard to work with. Our appliance can literally become a sub-routine in somebody else’s system.
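Here is a hedged sketch of what consuming such a JSON result object might look like. The field names are invented for illustration and may not match MaxxCAT’s actual schema.

    import json

    raw = """{
      "query": "invoice 2009",
      "results": [
        {"title": "Invoice 4411", "url": "http://intranet/inv/4411", "score": 0.91},
        {"title": "Invoice 4409", "url": "http://intranet/inv/4409", "score": 0.87}
      ]
    }"""

    data = json.loads(raw)

    # Downstream code treats the appliance like a subroutine: parse the
    # object, format it, or hand it to another system.
    for hit in data["results"]:
        print(f"{hit['score']:.2f}  {hit['title']}  ->  {hit['url']}")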
What are new features and functions added since the last point release of your product?
Our 3.2 OS (not yet released) will provide improved indexing performance, a handful of new API methods, and most exciting for us, a template based ODBC extractor that should make pulling data out of SQL databases a breeze for our customers. We also have scheduled toggle-switch H/A, but that may take a little more time to make it completely transparent to the users.
Consolidation and vendors going out of business like SurfRay seem to be a feature of the search sector. How will these business conditions affect your company?
Another strange thing about MaxxCAT, in addition to our iconoclastic development methods, is our capital structure. Unlike most technology companies, especially young ones, we live off of revenue, not equity infusions. And we carry no debt. So we are somewhat insulated from the current downturn in the capital markets, and intend to survive on customers, not investors. Our major focus is to make our appliances better and faster. Although we like to be involved in the evaluation process with our customers, in all but the most difficult of cases, we prefer to hand off the implementation to partners who are familiar with our capabilities and who can bring in-depth enterprise search know-how into the mix.
Where do I go to get more information?
Visit www.maxxcat.com or email sales@maxxcat.com.
Stephen Arnold, April 15, 2009
Google and Its Red Ink Geyser
April 15, 2009
Internet Evolution’s David Silversmith wrote “Google Losing up to $1.65M a Day on YouTube”. You can read it here. I would have slapped on the title “So You Want to Be a Video Search Service?” I am not sure if the numbers are spot on. Talk about the Google’s losing $400 million a year or more has been floating around for quite a while. The point is that it is expensive to acquire video, host it, index it, and serve it. Not even Googzilla can deal with these costs. Hence, the new love birds: Googzilla and Universal.
Stephen Arnold, April 15, 2009
The Data Management Mismatch
April 15, 2009
I used to play table tennis in tournaments. Because table tennis is not America’s game, I found myself in matches with folks from other countries. I recall one evening in FAR on the Chambana campus when I faced a fit Chinese fellow. We decided to hit a few and then play a match. In about 10 seconds, I realized that the fellow was a squash player, and he had zero chance against me. There are gross similarities between squash and table tennis, but the context of each game is very different.
That’s the problem with describing one game (ping pong) in terms of another (squash, mainland China style). The words look similar, and to the naive, the words may mean the same thing.
Now the data management mismatch. You can read a summary of a “controversial” report that pits the aging Codd database against the somewhat more modern MapReduce system. I describe the differences in my 2005 study The Google Legacy, and I won’t repeat them here.
Eric Lai’s “Researchers: Databases still beat Google’s MapReduce” here provides a good summary of this alleged face off. I am not going to dig into the entrails of this study nor the analysis by Mr. Lai. I do want to highlight this passage which caught my attention:
Databases “were significantly faster and required less code to implement each task, but took longer to tune and load the data,” the researchers write. Database clusters were between 3.1 and 6.5 times faster on a “variety of analytic tasks.” MapReduce also requires developers to write features or perform tasks manually that can be done automatically by most SQL databases, they wrote.
The paragraph makes clear that according to the wizards who ran this study, the Codd style database has some mileage left on its engine. I agree. In fact, I think some of the gurus at Google would agree as well.
What’s going on here is that the MapReduce system works really well for Google-scale, Google-type data operations for search and closely allied functions. When a Googler wants to crunch on a result set, the Googlers fire up a Codd database, for example MySQL, and do their thing.
Codd style databases can jump through hoops as well. But the point is that MapReduce makes certain types of large dataset tasks somewhat less costly to kit out.
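A toy example makes the study’s point about hand-rolled features. The grouped count below must be spelled out step by step in the MapReduce style (map, shuffle, reduce), while a SQL database does the same work from one declarative statement. This single-process Python sketch illustrates the pattern only; it is not Google’s MapReduce.

    from itertools import groupby
    from operator import itemgetter

    records = [("google.com", 1), ("yahoo.com", 1), ("google.com", 1)]

    # Map: the (key, value) pairs above. Shuffle: sort and group by key.
    mapped = sorted(records, key=itemgetter(0))

    # Reduce: sum the values in each group.
    counts = {key: sum(v for _, v in group)
              for key, group in groupby(mapped, key=itemgetter(0))}
    print(counts)  # {'google.com': 2, 'yahoo.com': 1}

    # The SQL equivalent, handled automatically by the database:
    #   SELECT domain, SUM(hits) FROM records GROUP BY domain;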
I don’t think this is an either-or. My research suggests that there is a growing interest in different types of data management systems. There are clever systems emerging from a number of companies. I have written about InfoBright, for instance.
I wrote a white paper with Sue Feldman which focused on a low profile Google project to tame dataspace. The notion is a step beyond Codd and MapReduce, yet dataspace has roots and shoots in both of these systems.
What we have is a mismatch. The capabilities of the systems are different. If I were to play the Chinese table tennis star in my basement, I would probably win. He would knock himself out on the hot water pipe that dips exactly where he steps to hit a forehand.
The context of the data management problem and the meaning of the words make a difference. Use the system that solves the problem.
Stephen Arnold, April 15, 2009
New Beyond Search Report about BA Insight
April 15, 2009
Whither Microsoft Enterprise Search?
Microsoft’s aggressive moves in the Enterprise Search space may be likened by some to a bull in a china shop, but to my associate Paul Korzeniowski, Microsoft’s moves are based on a coherent strategy. Mr. Korzeniowski has written a white paper designed to clear up some of the fog surrounding this strategy and to give users thinking of deploying Microsoft’s search products a clearer sense of the things to consider. You can download a copy here. You will need to register.
Stephen Arnold, April 15, 2009
Lou Rosenfeld on Content Architecture
April 15, 2009
Editor’s Note: The Boye 09 Conference in Philadelphia takes place the first week of May 2009, May 5 to May 7, to be precise. Attendees can choose from a number of special interest tracks. These include strategy and governance, Intranet, Web content management, SharePoint, user experience, and eHealth. You can get more information about this conference here. One of the featured speakers is Lou Rosenfeld. You can get more information here. Janus Boye spoke with Mr. Rosenfeld on April 14, 2009. The full text of the interview appears below.
Why is it so hard for organizations to get a grip on user experience design?
Because UX is an interdisciplinary pursuit. In most organizations, the people who need to work together to develop good experiences–designers, developers, content authors, customer service personnel, business analysts, product managers, and more–currently work in separate silos. Bad idea. Worse, these people already have a hard time working together because they don’t speak the same language.
Once you get them all in the same place and help them to communicate better, they’ll figure out the rest.
Why is web analytics relevant when talking about user experience?
Web sites exist to achieve goals of some sort. UX people, for various reasons, rely on qualitative research methods to ensure their designs meet those goals. Conversely, Web analytics people rely on quantitative methods. Both are incomplete without the other–one helps you figure out what’s going on, the other why. UX and WA folks are two more groups that need help communicating; I’m hoping my talk in some small way helps them see how they fit together.
Is your book “Information Architecture for the World Wide Web” still relevant 11 years later?
Nah, not the first edition from 1998. It was geared toward developing sites–and information architectures–from scratch. But the second edition, which came out in 2002, was almost a completely new book, much longer and geared toward tuning existing sites that were groaning under the weight of lots of content: good and bad, old and new. The third edition–which was more of a light update–came out in 2006. I don’t imagine information architecture will ever lose relevance as long as there’s content. In any case, O’Reilly has sold about 130,000 copies, so apparently they think our book is relevant.
Does Facebook actually offer a better user experience after the redesign?
I really don’t know. I used to find Facebook an excellent platform for playing Scrabble, but thanks to Hasbro’s legal department, the Facebook version of Scrabble has gone the way of all flesh. Actually, I think it’s back now, but I’ve gotten too busy to fall again to its temptation.
Sorry, that’s something of an underhanded swipe at Facebook. But now, as before, I find it too difficult to figure out. I have a hard time finding (and installing) applications that should be at my fingertips. I’m overwhelmed–and, sometimes, troubled–by all the notifications which seem to be at the core of the new design. I’d far prefer to keep up with people via Twitter (I’m @louisrosenfeld), which actually integrates quite elegantly with the other tools I already use to communicate, like my blog (http://louisrosenfeld.com) and email. But I’m the wrong person to ask. I’m not likely Facebook’s target audience. And frankly, my opinion here is worth what you paid for it. Much better to do even a lightweight user study to answer your question.
Why are you speaking at a Philadelphia web conference organized by a company based in Denmark?
Because they asked so nicely. And because I hope that someday they’ll bring me to their Danish event, so I can take my daughter to the original Legoland.
Janus Boye, April 15, 2009
MeFeedia: Video Your Way – Just Like a Burger
April 14, 2009
A happy quack to the reader who sent me a link to this news release wrapped in a Forbes.com package. The headline was “Multimedia Search Engine MeFeedia Brings Order to the Video Web” and you can read it here. The MeFeedia system provides these improvements:
- Layout
- Site performance. The story said, “The new site also loads three times as fast, due in part to its new tableless design and highly efficient multimedia search engine.”
- Navigation
The service provides access to video, TV shows, music, news, and movies.
My test queries returned some useful results. I did like the tag at the foot of each item in the results list that provided the source and other information about the video clip; for example, “Howcast – Most Recent Videos in Travel | howcast.com”.
The challenges any video search site faces are significant:
First, there’s the issue of deep pockets. It costs big piles of dollars and euros to pay for bandwidth and lawyers. Which consumes more money is up for grabs. I am not sure pre-roll advertising will do the job for any video site.
Second, there’s the problem of marketing in the shadows of YouTube.com and the distant second-place challenger Hulu.com. Even Google is opening a new video service with its pal Universal. More information about that deal is here.
The goslings and I want MeFeedia to succeed. Our query for geese returned this result, which is similar to the comments I get about my opinions expressed in this Web log by azure chipped consultants who are trying to earn a living as a “real” journalist.
Stephen Arnold, April 14, 2009