Comments on: Text Processing: Why Servers Choke

By: Stephen E. Arnold

Stephen E. Arnold — Thu, 02 Oct 2008 12:27:44 +0000

George Forman,

Knockout post. Keep helping me understand this world.

Stephen Arnold, October 2, 2008

By: George Forman

George Forman — Thu, 02 Oct 2008 04:25:45 +0000

A couple comments on your “two thoughts”:

1. Agreed. Multi-core isn’t going to solve the I/O bandwidth bottleneck for indexing applications. (But once the indexing is done, we can leverage multi-core to solve the I/O bandwidth problem for large-scale classification applications [http://www.hpl.hp.com/techreports/2008/HPL-2008-29R1.html].)

2. If you’re concerned about random words colliding to the same hash value, then just increase the output hash width so collisions are arbitrarily rare. Google published their entire Web dictionary of word counts on a CD set [1TWeb], and it has fewer than 14 million unique words that occur with any reasonable frequency. That many tokens uses only about 0.3% of the values in a 2**32 hash space, so collisions are pretty rare, probably less frequent than the misspellings or synonyms that get in your way. But we can do better, if you like: since their average word length was 8.1 characters, we could use 8-byte hashes without using additional disk space. That gives 64-bit hashes, so any two given words collide with a chance of about 1 in 2**64, or roughly 1 in 18.4 quintillion.

3. The comparison with Lucene v2.2 just shows there’s room for improvement, not that Lucene v2.2 is particularly slow compared to other packages. If speed is paramount, then use SpeedyFx in concert with Lucene, i.e. just as another text analyzer.
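
To make the arithmetic in point 2 concrete, here is a minimal Python sketch. It assumes an ideal, uniformly distributed hash and uses the rounded figure of 14 million tokens; the function names are illustrative and are not part of SpeedyFx or Lucene.

```python
# Back-of-the-envelope collision estimates for hashed word tokens.
# Assumption: the hash behaves like a uniform random function.


def occupancy(n_tokens: int, bits: int) -> float:
    """Largest fraction of a 2**bits hash space that n_tokens can occupy."""
    return n_tokens / 2**bits


def expected_colliding_pairs(n_tokens: int, bits: int) -> float:
    """Expected number of token pairs that share a hash value."""
    pairs = n_tokens * (n_tokens - 1) // 2
    return pairs / 2**bits


if __name__ == "__main__":
    n = 14_000_000  # rounded-up vocabulary size quoted in point 2
    for bits in (32, 64):
        print(f"{bits}-bit: occupancy {occupancy(n, bits):.2e}, "
              f"expected colliding pairs {expected_colliding_pairs(n, bits):.2e}")
```

Under these assumptions, a given word has roughly a 0.3% chance of sharing its 32-bit hash with some other token, which still adds up to tens of thousands of colliding pairs across the whole vocabulary. At 64 bits the expected number of colliding pairs falls far below one.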

By: Hadoop Cluster Live-CD

Hadoop Cluster Live-CD — Mon, 08 Sep 2008 05:42:59 +0000

[…] about Text Processing performance http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/ (Beyond Search) Response from Grant Ingersoll: […]

By: Text Processing: Why Servers Choke : Beyond Search

Text Processing: Why Servers Choke : Beyond Search — Sun, 07 Sep 2008 12:35:31 +0000

[…] Processing: Why Servers Choke : Beyond Search If you’ve been wondering how slow Lucene is, this paper gives you some metrics. The data seem to […]