<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Text Processing: Why Servers Choke</title>
	<atom:link href="http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/feed/" rel="self" type="application/rss+xml" />
	<link>http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/</link>
	<description>by Stephen E. Arnold</description>
	<lastBuildDate>Sun, 12 Feb 2012 09:55:49 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Stephen E. Arnold</title>
		<link>http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/comment-page-1/#comment-24357</link>
		<dc:creator>Stephen E. Arnold</dc:creator>
		<pubDate>Thu, 02 Oct 2008 12:27:44 +0000</pubDate>
		<guid isPermaLink="false">http://arnoldit.com/wordpress/?p=1725#comment-24357</guid>
		<description>George Forman,

Knock out post. Keep helping me understand this world.

Stephen Arnold, October 2, 2008</description>
		<content:encoded><![CDATA[<p>George Forman,</p>
<p>Knock out post. Keep helping me understand this world.</p>
<p>Stephen Arnold, October 2, 2008</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: George Forman</title>
		<link>http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/comment-page-1/#comment-24341</link>
		<dc:creator>George Forman</dc:creator>
		<pubDate>Thu, 02 Oct 2008 04:25:45 +0000</pubDate>
		<guid isPermaLink="false">http://arnoldit.com/wordpress/?p=1725#comment-24341</guid>
		<description>A couple comments on your &quot;two thoughts&quot;:

1.  Agreed.   Multi-core isn&#039;t going to solve the I/O bandwidth bottleneck for indexing applications.  (But once the indexing is done, we can leverage it to solve the I/O bandwidth problem for large scale classification applications [ http://www.hpl.hp.com/techreports/2008/HPL-2008-29R1.html ])

2.  If you&#039;re concerned about random words colliding to the same hash value, then just increase the output hash width so collisions are arbitrarily rare. Google published their entire Web dictionary of word counts on a CD set [1TWeb] and it has fewer than 14 million unique words that occur with any reasonable frequency. This number of tokens uses only 0.3% of the numbers in a 2**32 hash space, so collisions are pretty rare, probably less frequent than misspellings or synonyms that get in your way. But we can do better, if you like: since their average word length was 8.1 characters, we could use 8-byte hashes without using additional disk space. This results in 64-bit hashes, so the chance that two given words collide is about 1 in 18.4 quintillion (1 in 2**64).

3. The comparison with Lucene v2.2 just shows there&#039;s room for improvement, not that Lucene v2.2 is particularly slow compared to other packages.  If speed is paramount, then use SpeedyFx in concert with Lucene, i.e. just as another text analyzer.</description>
		<content:encoded><![CDATA[<p>A couple comments on your &#8220;two thoughts&#8221;:</p>
<p>1.  Agreed.   Multi-core isn&#8217;t going to solve the I/O bandwidth bottleneck for indexing applications.  (But once the indexing is done, we can leverage it to solve the I/O bandwidth problem for large scale classification applications [ <a href="http://www.hpl.hp.com/techreports/2008/HPL-2008-29R1.html" rel="nofollow">http://www.hpl.hp.com/techreports/2008/HPL-2008-29R1.html</a> ])</p>
<p>2.  If you&#8217;re concerned about random words colliding to the same hash value, then just increase the output hash width so collisions are arbitrarily rare. Google published their entire Web dictionary of word counts on a CD set [1TWeb] and it has fewer than 14 million unique words that occur with any reasonable frequency. This number of tokens uses only 0.3% of the numbers in a 2**32 hash space, so collisions are pretty rare, probably less frequent than misspellings or synonyms that get in your way. But we can do better, if you like: since their average word length was 8.1 characters, we could use 8-byte hashes without using additional disk space. This results in 64-bit hashes, so the chance that two given words collide is about 1 in 18.4 quintillion (1 in 2**64).</p>
<p>3. The comparison with Lucene v2.2 just shows there&#8217;s room for improvement, not that Lucene v2.2 is particularly slow compared to other packages.  If speed is paramount, then use SpeedyFx in concert with Lucene, i.e. just as another text analyzer.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Hadoop Cluster Live-CD</title>
		<link>http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/comment-page-1/#comment-22433</link>
		<dc:creator>Hadoop Cluster Live-CD</dc:creator>
		<pubDate>Mon, 08 Sep 2008 05:42:59 +0000</pubDate>
		<guid isPermaLink="false">http://arnoldit.com/wordpress/?p=1725#comment-22433</guid>
		<description>[...] about Text Processing performance http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/ (Beyond Search) Response from Grant Ingersoll: [...]</description>
		<content:encoded><![CDATA[<p>[...] about Text Processing performance <a href="http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/" rel="nofollow">http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/</a> (Beyond Search) Response from Grant Ingersoll: [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Text Processing: Why Servers Choke : Beyond Search</title>
		<link>http://arnoldit.com/wordpress/2008/09/06/text-processing-why-servers-choke/comment-page-1/#comment-22368</link>
		<dc:creator>Text Processing: Why Servers Choke : Beyond Search</dc:creator>
		<pubDate>Sun, 07 Sep 2008 12:35:31 +0000</pubDate>
		<guid isPermaLink="false">http://arnoldit.com/wordpress/?p=1725#comment-22368</guid>
		<description>[...] Processing: Why Servers Choke : Beyond Search  Text Processing: Why Servers Choke : Beyond Search If you’ve been wondering how slow Lucene is, this paper gives you some metrics. The data seem to [...]</description>
		<content:encoded><![CDATA[<p>[...] Processing: Why Servers Choke : Beyond Search  Text Processing: Why Servers Choke : Beyond Search If you’ve been wondering how slow Lucene is, this paper gives you some metrics. The data seem to [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>

