Text Processing: Why Servers Choke

September 6, 2008

Resource Shelf posted a link to a Hewlett Packard Labs paper. Great find. You can download the HP write-up here (verified at 7 pm Eastern on September 5, 2008). The paper argues that an HP innovation can process text at the rate of 100 megabytes per second per processor core. That’s quite fast. The value of the paper for me was that the authors of “Extremely Fast Text Feature Extraction for Classification and Indexing” have done a thorough job of providing data about the performance of certain text processing systems. If you’ve been wondering how slow Lucene is, this paper gives you some metrics. The data seem to suggest that Lucene is a very slow horse in a slow race.

Another highlight of George Forman’s and Evan Kirshenbaum’s write-up was this statement:

Multiple disks or a 100 gigabit Ethernet feed from many client computers may certainly increase the input rate, but ultimately (multi-core) processing technology is getting faster faster than I/O bandwidth is getting faster. One potential avenue for future work is to push the general-purpose text feature extraction algorithm closer to the disk hardware. That is, for each file or block read, the disk controller itself could distill the bag-of-words representation and then transfer only this small amount of data to the general-purpose processor. This could enable much higher indexing or classification scanning rates than is currently feasible. Another potential avenue is to investigate varying the hash function to improve classification performance, e.g. to avoid a particularly unfortunate collision between an important, predictive feature and a more frequent word that masks it.
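
For readers who want a concrete picture of what “distill the bag-of-words representation” means, here is a minimal sketch in Java. It is not HP’s SpeedyFx algorithm from the paper, just an illustration of the general idea the authors describe: each token is folded into a hash value, the word itself is discarded, and a block of text shrinks to a small table of (hash, count) pairs that is cheap to hand off to an indexer or classifier.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of hashed bag-of-words feature extraction (not HP's SpeedyFx):
// each token is reduced to a 32-bit hash and only (hash, count) pairs are kept,
// so a large block of text becomes a small summary suitable for indexing or
// classification.
public class HashedBagOfWords {

    // Fold a token into a 32-bit hash; the word itself is thrown away.
    static int hashToken(String token) {
        int h = 0;
        for (int i = 0; i < token.length(); i++) {
            h = h * 31 + Character.toLowerCase(token.charAt(i));
        }
        return h;
    }

    // Distill a text block into a map of hash value -> occurrence count.
    static Map<Integer, Integer> extract(String text) {
        Map<Integer, Integer> bag = new HashMap<Integer, Integer>();
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            int h = hashToken(token);
            Integer count = bag.get(h);
            bag.put(h, count == null ? 1 : count + 1);
        }
        return bag;
    }

    public static void main(String[] args) {
        String block = "Multiple disks may increase the input rate, but processing "
                + "is getting faster faster than I/O bandwidth.";
        System.out.println(extract(block));
    }
}
```

The speed comes from never materializing a dictionary: the hash value stands in for the word, which is also where the collision question below comes from.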

When I read the authors’ statement, two thoughts came to mind:

  1. Search vendors counting on new multi-core CPUs to solve performance problems won’t get the speed-ups needed to make some systems process content more quickly. That’s bad news for one vendor whose system I just analyzed for a company convinced that performance is a strategic advantage. In short, slow loses.
  2. As more content is processed and shortcuts are taken, hash collisions can reduce the usefulness of the value-added processing: a query returns unexpected results. Much of the HP speed-up comes from a series of shortcuts, and shortcuts can undermine what matters most to the user, namely getting the information needed to meet a need. A toy sketch of the collision problem appears after this list.
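
The toy program below illustrates the collision worry in point two. It is not based on the HP paper’s hash function; it simply squeezes a dozen words into a deliberately tiny bucket count using Java’s built-in String.hashCode() and prints any buckets where distinct words become indistinguishable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of feature masking: with too few hash buckets, distinct
// words land on the same value, so a rare but predictive term can no longer
// be told apart from a frequent one.
public class CollisionDemo {
    public static void main(String[] args) {
        String[] vocabulary = {
            "text", "processing", "server", "index", "bandwidth", "classifier",
            "hash", "feature", "query", "lucene", "the", "and"
        };
        int buckets = 8; // absurdly small on purpose; real systems use 2^32 or 2^64 values

        Map<Integer, List<String>> table = new HashMap<Integer, List<String>>();
        for (String word : vocabulary) {
            int bucket = (word.hashCode() & 0x7fffffff) % buckets;
            if (!table.containsKey(bucket)) {
                table.put(bucket, new ArrayList<String>());
            }
            table.get(bucket).add(word);
        }
        for (Map.Entry<Integer, List<String>> entry : table.entrySet()) {
            if (entry.getValue().size() > 1) {
                System.out.println("Bucket " + entry.getKey() + " collides: " + entry.getValue());
            }
        }
    }
}
```

With a realistic 32- or 64-bit hash space the odds change dramatically, which is the point George Forman makes in the comments below.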

I urge you to read this paper. Quite a good piece of work. If you have other thoughts about this paper, please, share them.

Stephen Arnold, September 6, 2008

Comments

4 Responses to “Text Processing: Why Servers Choke”

  1. Text Processing: Why Servers Choke : Beyond Search on September 7th, 2008 7:35 am

    […] Text Processing: Why Servers Choke : Beyond Search If you’ve been wondering how slow Lucene is, this paper gives you some metrics. The data seem to […]

  2. Hadoop Cluster Live-CD on September 8th, 2008 12:42 am

    […] about Text Processing performance http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/ (Beyond Search) Response from Grant Ingersoll: […]

  3. George Forman on October 1st, 2008 11:25 pm

    A couple comments on your “two thoughts”:

    1. Agreed. Multi-core isn’t going to solve the I/O bandwidth bottleneck for indexing applications. (But once the indexing is done, we can leverage it to solve the I/O bandwidth problem for large-scale classification applications [http://www.hpl.hp.com/techreports/2008/HPL-2008-29R1.html].)

    2. If you’re concerned about random words colliding on the same hash value, then just increase the output hash width so collisions are arbitrarily rare. Google published their entire Web dictionary of word counts on a CD set [1TWeb], and it contains fewer than 14 million unique words that occur with any reasonable frequency. That many tokens uses only 0.3% of the values in a 2**32 hash space, so collisions are pretty rare, probably less frequent than the misspellings or synonyms that get in your way. But we can do better, if you like: since the average word length was 8.1 characters, we could use 8-byte hashes without using additional disk space. This gives 64-bit hashes, making the chance of any particular collision about 1 in 18.4 quintillion.

    3. The comparison with Lucene v2.2 just shows there’s room for improvement, not that Lucene v2.2 is particularly slow compared to other packages. If speed is paramount, then use SpeedyFx in concert with Lucene, i.e. just as another text analyzer.
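
As a quick check of the arithmetic in point 2 of the comment above (the 14 million word count and the 2**32 and 2**64 spaces are taken from that comment), the sketch below applies the standard birthday approximation n*(n-1)/(2m) for the expected number of colliding pairs.

```java
// Back-of-the-envelope check of the hash-width arithmetic: how much of a 32-bit
// or 64-bit space roughly 14 million distinct words occupy, and how many
// accidental collisions the birthday approximation n*(n-1)/(2m) predicts.
public class HashWidthArithmetic {
    public static void main(String[] args) {
        double words = 14e6;              // distinct words, per the 1TWeb figure cited above
        double space32 = Math.pow(2, 32); // 32-bit hash space
        double space64 = Math.pow(2, 64); // 64-bit hash space

        System.out.printf("Occupancy of the 2^32 space: %.2f%%%n", 100 * words / space32);
        System.out.printf("Expected colliding pairs at 32 bits: %.0f%n",
                words * (words - 1) / (2 * space32));
        System.out.printf("Expected colliding pairs at 64 bits: %.8f%n",
                words * (words - 1) / (2 * space64));
    }
}
```

At 32 bits the approximation predicts tens of thousands of colliding pairs among 14 million words; at 64 bits the expected number drops to a few millionths, which is what makes the 8-byte hash suggestion attractive.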

  4. Stephen E. Arnold on October 2nd, 2008 7:27 am

    George Forman,

    Knockout post. Keep helping me understand this world.

    Stephen Arnold, October 2, 2008
