Autonomy IDOL Metrics

March 3, 2009

I was updating my files and noticed that Autonomy had added metrics to its IDOL write up. You can find the information here. Among the items I noted were these points:

Support over 470 million documents on 64-bit platforms
Accurately index in excess of 110 GB/hour with guaranteed index commit times (i.e., how fast an asset can be queried after it is indexed) of under 5 ms
Execute over 2,600 queries per second, with subsecond response times on a single machine with two CPUs when used against 70 million pieces of content, while querying the entire index for relevant information
Support hundreds of thousands of enterprise users, or millions of web users, accessing hundreds of terabytes of data
Save storage space with an overall footprint of less than 15% of the original file size.

These metrics are quite amazing. To buttress the argument, the company quotes a number of consultants. Happy customers include Satyam, a firm that has been in a bit of a swamp. The write up about Autonomy IDOL’s security support is equally remarkable. I did a calculation based on public data about Google. You can find that write up here. Notice that Autonomy’s system processes more queries per second than Google’s, if these data are accurate. If you have other metrics about Autonomy or any other search engine, feel free to post these data in the comments section of this Web log.
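
A rough way to put the quoted figures into other units is sketched below in Python. It uses only the numbers listed above, assumes decimal units (1 GB = 10^9 bytes, which the write up does not state), and treats the vendor's figures as claimed bounds rather than measurements. The function names are illustrative.

```python
# Back-of-the-envelope conversions of the figures quoted above.
# Assumption: decimal units (1 GB = 10**9 bytes); the source does not say.

GB = 10**9
MB = 10**6


def indexing_rate_mb_per_sec(gb_per_hour: float) -> float:
    """Convert an indexing rate from GB/hour to MB/second."""
    return gb_per_hour * GB / 3600 / MB


def queries_per_day(qps: float) -> float:
    """Queries served in 24 hours at a sustained rate of `qps`."""
    return qps * 86_400


def index_size_gb(source_gb: float, footprint: float = 0.15) -> float:
    """Index size if the footprint is the given fraction of the source size."""
    return source_gb * footprint


if __name__ == "__main__":
    # "In excess of 110 GB/hour" is a claimed lower bound.
    print(f"{indexing_rate_mb_per_sec(110):.1f} MB/s indexed")
    # "Over 2,600 queries per second" is a claimed lower bound.
    print(f"{queries_per_day(2600):,.0f} queries per day")
    # "Less than 15%" is a claimed upper bound.
    print(f"{index_size_gb(1000):.0f} GB of index per 1,000 GB of source")
```

None of this checks the claims; it only restates them per second, per day, and per 1,000 GB of source content.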

Stephen Arnold, March 3, 2009

Written by Stephen E. Arnold · Filed Under Enterprise, News, Online (general), Search, Technology, Text processing

Comments

15 Responses to “Autonomy IDOL Metrics”

MarkH on March 3rd, 2009 4:47 am

I’d be interested to know:
1) How query performance degrades as more content is added (i.e. how fast was the querying on the 470m index vs 70m vs 1m)
2) What queries are executed for the 2,600 qps benchmark and on what content? Using the same query for all requests is just hitting cache (Autonomy’s or the operating system’s disk cache) and therefore not a realistic test.
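
The caching concern in point 2 can be illustrated with a toy sketch. The `search` function below is a hypothetical stand-in for any engine with a result cache, not Autonomy's API, and the cache size and query strings are assumptions chosen for illustration.

```python
from functools import lru_cache


def make_cached_search(cache_size: int = 128):
    """Return a toy search function wrapped in an LRU cache.

    Hypothetical stand-in for an engine with a result or disk cache.
    """

    @lru_cache(maxsize=cache_size)
    def search(query: str) -> int:
        # Pretend this is the expensive index lookup.
        return sum(ord(ch) for ch in query)

    return search


def hit_rate(queries, cache_size: int = 128) -> float:
    """Fraction of queries answered from the cache instead of the 'index'."""
    search = make_cached_search(cache_size)
    for q in queries:
        search(q)
    info = search.cache_info()
    return info.hits / (info.hits + info.misses)


if __name__ == "__main__":
    same_query = ["enterprise search"] * 1000
    distinct_queries = [f"query {i}" for i in range(1000)]
    print(f"same query repeated: {hit_rate(same_query):.3f}")
    print(f"all queries distinct: {hit_rate(distinct_queries):.3f}")
```

With one repeated query, every request after the first is served from the cache; with all-distinct queries, none are. A queries-per-second figure means little unless the query mix is published alongside it.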
freddieMaize on March 3rd, 2009 7:30 am

@MarkH : I agree…

GSA handles about 25 per sec and Fast, hundreds per sec. Source: no evidence. I heard this.

Anyway, interesting blog. Thanks.
Charlie Hull on March 3rd, 2009 11:34 am

Of course, these metrics are pretty meaningless without independent testing.

It’s also interesting that Autonomy is limited to 470 million documents – where does that number come from? If the document ID were 32-bit, the limit would be 2 or 4 billion.
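
The 32-bit arithmetic can be checked in a few lines. This is a sketch of the arithmetic only; nothing here is known about how IDOL actually stores document IDs.

```python
# Illustrative arithmetic only; nothing here reflects IDOL's internals.
CLAIMED_DOCS = 470_000_000

signed_32_max = 2**31 - 1    # largest signed 32-bit value
unsigned_32_max = 2**32 - 1  # largest unsigned 32-bit value
bits_needed = CLAIMED_DOCS.bit_length()  # bits required to represent the count

print(f"signed 32-bit limit:   {signed_32_max:,}")
print(f"unsigned 32-bit limit: {unsigned_32_max:,}")
print(f"bits needed for {CLAIMED_DOCS:,}: {bits_needed}")
```

The claimed figure fits in 29 bits, so it sits well below either 32-bit limit and does not line up with an ID-width boundary.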
tzahi jakubovitz on March 3rd, 2009 6:23 pm

These numbers are very impressive, and totally meaningless.
Let’s take the 2,600 qps number.
Questions:
Exact system configuration: at least CPUs, cores, memory, disks.
What type of queries? How many parameters in the query? I can get any qps you want if all the queries are on a single unique key of the document.
Those 70,000,000 documents: what are their characteristics, size, variance, etc.?

All the other claims are just as meaningless.

Can Autonomy please give the full information?
Otherwise, it is a hoax.
Stephen E. Arnold on March 3rd, 2009 8:30 pm

Tzahi Jakubovitz,

You are a person who speaks directly to me. Feel free to comment and if you want to talk about these types of issues, write me any time. seaky2000 at yahoo dot com.

Stephen Arnold, March 3, 2009
Yahoo BOSS Queries per Second : Beyond Search on March 4th, 2009 9:28 am

[…] readers have commented via the blog feedback and by email about the Autonomy metrics I summarized here. To provide some baseline data, I dipped into my search archive and located an item that appeared in […]
Ben on March 4th, 2009 6:43 pm

I’ve had some exposure to Autonomy over the years and have mixed feelings. Anyone remember Kenjin? One way to get a feel for these metrics in practice would be for Autonomy to index Medline (and the full text of research papers where available) and let us all play with the result for a few weeks.
Daniel Tunkelang on March 4th, 2009 8:43 pm

These are interesting numbers, even if they are of questionable provenance. But I don’t see anything about the tasks being tested, or the effectiveness of the search engine at supporting the fulfillment of those tasks. Not to diminish the importance of indexing scale and query latency, but I’d think the first thing you would want to measure is effectiveness for users, or at least some reasonable proxy for it.

Of course, effectiveness is hard to measure, as witnessed by the challenges of doing so even in the academic community. The IR community generally likes TREC (which emphasizes mean average precision on benchmark test collections), but we HCIR folks keep reminding them that TREC assumes a highly unrealistic batch model of information retrieval. The information scientists usually call for user studies, but those are prohibitively expensive to run on any significant scale. Quite a conundrum.

Of course, we can give up on all that and just produce numbers. But I wonder who will actually care about them.
Stephen E. Arnold on March 5th, 2009 8:41 am

Daniel Tunkelang,

The numbers mean something to the financial mavens who have to throw hardware at a search system to make the thing run at what the customer thinks is an acceptable pace.

Stephen Arnold, March 5, 2009
freddieMaize on March 17th, 2009 1:23 am

Hi Stephen, the following news might interest you,

“MaxxCat enterprise search solutions has its sights set on Google with a new product line that is up to 16 times faster than Google’s search appliances. MaxxCat is aspiring to be the Google Search Appliance killer, boasting unlimited lifetime use, no artificial document size limitations, significantly lower price points, and a higher level of relevancy customization than any other search appliance on the market”

http://www.msnbc.msn.com/id/29640642/

Regards
New to search on March 26th, 2009 10:40 am

I work in the legal space, where searching potentially relevant files across an enterprise is critical for us. Can you unwrap the mystery (without too many IT terms) of IDOL vs. Lucene? Or is there a better alternative?
Stephen E. Arnold on March 26th, 2009 9:05 pm

Unsigned,

Yep, but I charge for this information. Free info is available at http://www.arnoldit.com and this Web log. For info about for-fee options, check out the About tab above.

Stephen Arnold, March 26, 2009
Confluence: Product Marketing on October 8th, 2009 8:04 am

Performance Promotion…

A program designed to build awareness of Digital Reef’s scale and performance profile. The target market includes consumers of eDiscovery solutions: Service Providers, Enterprise CIOs, Consultants and Law Firms…
Enterprise Search Solutions on July 7th, 2011 2:21 pm

Just to add on to a previous poster’s comment – MaxxCAT offers two solid competitors to both Google Mini and GSA. MaxxCAT search appliances can perform queries at up to 16x faster speeds, can perform third-party site crawls, cost thousands less and offer lifetime perpetual use. You can even see a side by side demonstration of MaxxCAT’s offerings with the Google competition at http://www.maxxcat.com/compare-enterprise-search-appliance.html.
Sabina Obenauer on August 26th, 2011 4:27 pm

Really enjoyed this post, is there any way I can receive an email whenever you make a new article?


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.