Apache Lucene 4.0 Changes Revealed

August 30, 2011

We prepared a report for a search vendor last week and found that more than 12 percent of the organizations in our sample reported using open source software. Compared to three years ago, that’s a significant jump. Open source, despite the machinations of some large outfits, continues to make inroads in certain organizations. We also learned that the appetite for open source solutions correlates with having strong advocates of open source on staff, which gives an organization both access to expertise and an internal cheerleader.

Curious about the upcoming Apache Lucene 4.0? OStatic gives us this “Guest Post: Under the Hood in Apache Lucene 4.0,” in which Lucene insider Simon Willnauer details a few big changes.

The decision to let go of backward compatibility allows for significant advances. For one, the search engine library now represents indexed terms as UTF-8 bytes instead of text strings. This revision improves term dictionary loading, memory usage, and search speed. The change also allows for the much anticipated “flexible indexing.” Willnauer explains:

Optimized codecs can be loaded to suit the indexing of individual datasets or even individual fields. . . . New indexing codecs can be developed and existing ones updated without the need for hard-coding within Lucene. There is no longer any need for project-level compromise on the best general-purpose index formats and data structures.
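To make the move from text strings to UTF-8 bytes more concrete, here is a minimal Python sketch. It is not Lucene code: the term list, the size helpers, and the sorted list standing in for a term dictionary are all assumptions made for illustration. It assumes terms were previously held as UTF-16 strings (as Java strings are) and compares that footprint with UTF-8, then looks a term up by comparing raw bytes.

```python
import bisect

# Hypothetical sample terms; "caf\u00e9" includes one non-ASCII character.
terms = ["lucene", "index", "caf\u00e9", "search"]

def utf16_size(term: str) -> int:
    """Bytes needed to hold the term as UTF-16 code units."""
    return len(term.encode("utf-16-le"))

def utf8_size(term: str) -> int:
    """Bytes needed to hold the term as UTF-8."""
    return len(term.encode("utf-8"))

utf16_total = sum(utf16_size(t) for t in terms)
utf8_total = sum(utf8_size(t) for t in terms)

# A toy "term dictionary": terms stored as bytes and sorted in byte order.
term_dict = sorted(t.encode("utf-8") for t in terms)

def lookup(term: str) -> int:
    """Return the position of the term in the dictionary, or -1 if absent."""
    key = term.encode("utf-8")
    pos = bisect.bisect_left(term_dict, key)
    if pos < len(term_dict) and term_dict[pos] == key:
        return pos
    return -1

print(f"UTF-16: {utf16_total} bytes, UTF-8: {utf8_total} bytes")
print("position of 'index':", lookup("index"))
```

For mostly ASCII terms the byte form takes roughly half the space, and lookups compare bytes directly without decoding. Actual savings in Lucene depend on its own data structures, which this sketch does not model.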

Next, multiple threads will now be used for indexing. This shift makes better use of multi-core processing and input/output resources. Then there’s “concurrent flushing,” where each thread can flush its own memory buffer separately without interfering with other threads. Finally, a painstakingly revised Levenshtein automaton algorithm greatly improves fuzzy matching.
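The per-thread buffering and concurrent flushing described above can be sketched roughly as follows. This is a toy model in Python, not Lucene's implementation: the class name ThreadLocalIndexer, the FLUSH_THRESHOLD value, and the idea of a "segment" as a sorted list are all assumptions for illustration. The point is that each thread fills and flushes its own buffer, and only the brief step of publishing a finished segment touches shared state.

```python
import threading

FLUSH_THRESHOLD = 3  # hypothetical: documents buffered per thread before a flush

class ThreadLocalIndexer:
    """Toy indexer in which every thread owns a private in-memory buffer."""

    def __init__(self):
        self.segments = []                      # flushed segments, shared
        self._segments_lock = threading.Lock()  # guards only the shared list
        self._local = threading.local()         # per-thread buffer storage

    def add_document(self, doc: str) -> None:
        buf = getattr(self._local, "buffer", None)
        if buf is None:
            buf = self._local.buffer = []
        buf.append(doc)
        if len(buf) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self) -> None:
        """Flush the calling thread's buffer; other threads keep indexing."""
        buf = getattr(self._local, "buffer", [])
        if not buf:
            return
        segment = sorted(buf)        # stand-in for building a segment, done without the lock
        self._local.buffer = []
        with self._segments_lock:    # short critical section: publish the segment
            self.segments.append(segment)

def worker(indexer: ThreadLocalIndexer, docs: list) -> None:
    for doc in docs:
        indexer.add_document(doc)
    indexer.flush()                  # flush whatever remains in this thread's buffer

indexer = ThreadLocalIndexer()
batches = [[f"t{t}-doc{i}" for i in range(7)] for t in range(4)]
threads = [threading.Thread(target=worker, args=(indexer, b)) for b in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()

total_docs = sum(len(s) for s in indexer.segments)
print(len(indexer.segments), "segments holding", total_docs, "documents")
```

Because no thread ever waits on another thread's buffer, a slow flush in one thread does not stall the rest. Note that CPython's global interpreter lock limits true parallelism here, so the sketch shows the structure of the idea rather than the speedup.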

According to Willnauer, these tidbits are just the beginning. We agree, but the involvement of legal eagles could destabilize the open source bandwagon.

Cynthia Murrell August 30, 2011

Sponsored by Pandia.com
