Speeding Up Search: The Challenge of Multiple Bottlenecks
March 29, 2018
I read “Search at Scale Shows ~30,000X Speed Up.” I have been down this asphalt road before, many times in fact. The problem with search and retrieval is that numerous bottlenecks exist; for example, dealing with exceptions (content which the content processing system cannot manipulate).
Those who want relevant information, and those who prefer superficial descriptions of search speed, focus on a nice, easy-to-grasp metric; for example, how quickly results display.
May I suggest you read the source document, work through the rat’s nest of acronyms, and swing your mental machete against the “metrics” in the write up?
Once you have taken these necessary steps, consider this statement from the write up:
These results suggest that we could use the high-quality matches of the RWMD to query — in sub-second time — at least 100 million documents using only a modest computational infrastructure.
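For those who do not want to machete through the acronyms: RWMD is the Relaxed Word Mover's Distance, a cheap lower bound on the Word Mover's Distance in which each query word simply travels to its nearest document word in embedding space. Here is a minimal sketch of the idea using toy two-dimensional vectors of my own invention; real systems use learned word embeddings:

```python
import numpy as np

def rwmd(query_vecs, doc_vecs, query_weights):
    """Relaxed Word Mover's Distance: each query word moves to its
    nearest document word; the weighted sum lower-bounds the full WMD."""
    # Pairwise Euclidean distances between query and document word vectors
    dists = np.linalg.norm(query_vecs[:, None, :] - doc_vecs[None, :, :], axis=2)
    nearest = dists.min(axis=1)  # cheapest move for each query word
    return float(np.dot(query_weights, nearest))

# Toy "embeddings" for illustration only
query = np.array([[0.0, 1.0], [1.0, 0.0]])
doc = np.array([[0.0, 1.0], [2.0, 0.0]])
weights = np.array([0.5, 0.5])
print(rwmd(query, doc, weights))  # 0.5
```

Dropping the flow constraints of the full optimal-transport problem is what makes the relaxed version fast enough to wave around in sub-second claims. It is a scoring trick, not a cure for the bottlenecks listed below.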
The path to responsive search and retrieval is littered with multiple speed bumps. Hit any one going too fast and you can break the search low rider.
I wish to list some of the speed bumps which the write up does not adequately address or, in some cases, acknowledge:
- Content flows are often in the terabit or petabit range for certain filtering and query operations. One hundred million documents won’t ring the bell.
- This is the transform in ETL operations. Normalizing content takes time, particularly when the historical on-disk content from multiple outputs and the real-time flows from systems such as Cisco Systems intercept devices are large. Please, think in terms of gigabytes per second and petabytes of archived data parked on servers in some countries’ government storage systems.
- Populating an index structure with new items also consumes time. If an object is not in an index of some sort, it is tough to find.
- Shaping the data set over time. Content has a weird property. It evolves. Lowly chat messages can contain a wide range of objects. Jump to today’s big light bulb, which illuminates some blockchains’ ability to house executables, videos, off color images, etc.
- Because IBM inevitably drags Watson to the party, keep in mind that Watson still requires humans to perform gorilla style grooming before it’s show time at the circus. Questions have to be considered. Content sources selected. The training wheels bolted to the bus. Then trials have to be launched. What good is a system which returns off point answers?
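The index population point deserves a concrete illustration. Even a toy inverted index shows where the time goes: the fast lookup everyone measures is only possible after the unglamorous work of posting every token of every document. A minimal sketch, not anyone's production system:

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: queries are fast only after paying the
    up-front cost of tokenizing and posting every document."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # The "populate" cost: each token must be posted before
        # the document becomes findable at all.
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, term):
        # Sub-second lookup, but only for objects already indexed.
        return self.postings.get(term.lower(), set())

idx = InvertedIndex()
idx.add(1, "search at scale")
idx.add(2, "scale up speed")
print(sorted(idx.search("scale")))  # [1, 2]
print(idx.search("warp"))           # set()
```

Quote the `search` timing and the system looks like a rocket. Quote the time to run `add` over petabytes of normalized content and the rocket sits on the launch pad.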
I think you get the idea.
The speed angle was floated by a company called Perfect Search. On certain speed bumps, Perfect Search shifted into warp speed. However, when we set up a test on my patent corpus, Perfect Search sucked up more time than my former benchmark system provided by Ian Davies, then president of ISYS Search Software. ISYS did take longer to display its results. But, and this is the key point, the other Perfect Search tasks took much longer to complete than ISYS’ system. As I recall, ISYS had the test corpus up and running in less than a week. Perfect Search required about five weeks.
Which was, therefore, faster?
I submit that the efforts of search experts who are trying to pull off an MTV “Pimp My Ride” type of rework for search and retrieval have not yet delivered. There are good reasons why “search” has lost its luster. One of them is that “good enough” search is available as open source software. Not surprisingly, poke search systems hard enough and one can hear the muted cry, “Lucene.”
Net net. Comprehensive progress in search and retrieval stalled when the Fast Search & Transfer operation was illuminated by a government investigation. Since that time, old school search cheerleading has been muted. Now I think it is back with spiffy new uniforms but sadly the same old cheers. Gimme an “s.” Gimme a “p.” Gimme an “e.” Gimme an “e.” Gimme a “d.” What’s it spell?
Speed.
But like a high school cheer, the outcome of the game played on a muddy field is almost always decided by factors outside the span of control of the cheerleaders.
Stephen E Arnold, March 29, 2018