Lucene: Merits a Test Drive and a Close Look

January 27, 2008

On Friday, I gave an invited lecture at the Speed School, the engineering and computer science department of the University of Louisville. After the talk on Google’s use of machine learning, one of the Ph.D. candidates asked me about Lucene. Lucene, as you may know, is the open source search engine authored by one of the Excite developers. If you want background on Lucene, the Wikipedia entry is a good place to start, and I didn’t spot any egregious errors when I scanned it earlier today. My interest is behind-the-firewall search and content processing. My comments, therefore, reflect my somewhat narrow view of Lucene and other retrieval systems. I told the student that I would offer some comments about Lucene and provide him with a few links.

Background

Lucene’s author is Doug Cutting, who worked at Xerox’s Palo Alto Research Center and eventually landed at Excite. After Excite was absorbed into Excite@Home, he needed to learn Java, and he wrote Lucene as an exercise. Lucene (his wife’s middle name) was contributed to the Apache Software Foundation, and you can download a copy, documentation, and sample code from the Apache Lucene Web site. An update, Lucene Java 2.3.0, became available on January 24, 2008.

What It Does

Lucene permits key word and fielded search. You can use Boolean AND, OR, and NOT to formulate complex queries. The system permits fuzzy search, which is useful when searching text created by optical character recognition. You can also set up the system to display similar results, roughly the equivalent of See Also references. You can set up the system to index documents without storing them. In that configuration, when a user requests a source document, that document must be retrieved over the local network. If you want to minimize the bandwidth hit, you can configure Lucene to store an archive of the processed documents. If the system processes structured content, you can search by the field tags, sort the results, and perform other manipulations. There is an administrative component, which is accessed via a command line.

In a nutshell, you can use Lucene as a search and retrieval system.
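
For readers who want to see the mechanics, here is a minimal sketch of fielded Boolean search and fuzzy matching over a toy inverted index. It is plain Python, not Lucene’s API. The sample documents, the function names (build_index, term, fuzzy), and the one-edit tolerance are all assumptions I made up for illustration; Lucene’s own implementation is far more sophisticated.

```python
from collections import defaultdict

# Toy documents with two fields; the text is invented for illustration.
DOCS = {
    1: {"title": "Lucene in practice", "body": "open source search library written in java"},
    2: {"title": "Enterprise search", "body": "commercial search systems and license fees"},
    3: {"title": "OCR pipeline", "body": "scanned text often contains serch typos"},
}

def build_index(docs):
    """Map (field, term) -> set of document ids: a minimal inverted index."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for word in text.lower().split():
                index[(field, word)].add(doc_id)
    return index

def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def term(index, field, word):
    """Exact fielded lookup: ids of documents whose field contains the word."""
    return set(index.get((field, word), set()))

def fuzzy(index, field, word, max_edits=1):
    """Fielded lookup that tolerates up to max_edits character edits."""
    hits = set()
    for (f, t), ids in index.items():
        if f == field and edit_distance(t, word) <= max_edits:
            hits |= ids
    return hits

INDEX = build_index(DOCS)

both = term(INDEX, "body", "search") & term(INDEX, "body", "java")      # AND
either = term(INDEX, "title", "lucene") | term(INDEX, "title", "ocr")   # OR
without = term(INDEX, "body", "search") - term(INDEX, "body", "java")   # NOT
near = fuzzy(INDEX, "body", "search")  # also catches the OCR-style typo "serch"

print(both, either, without, near)
```

Boolean operators reduce to set intersection, union, and difference on the posting sets, which is why this style of query is cheap once the index exists. The fuzzy lookup here scans every term, which is fine for a toy but not how a production system would do it.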

Selected Features

You will want to have adequate infrastructure to operate the system, serve queries, and process content. When the system is properly configured, you will be able to handle collections that number hundreds of millions of documents, and Lucene delivers good relevancy. Like a number of search and content processing systems, Lucene provides administrative tools that allow the search administrator to tweak the relevance engine. Among the knobs and dials you can twirl are document weights, so you can boost or suppress certain results. As you dig through the documentation, you will find guidance for run time term weights, length normalization, and field weights, among others. A bit of advice: run the system in the default mode on a test set of documents so you can experiment with various configuration and administrative settings. The current version improves the system’s ability to handle processes in parallel. Indexing speed and query response time, when the system is properly set up and resourced, are as good as or better than those of some commercial products.
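
To make the knobs and dials less abstract, here is a rough sketch of how field weights and length normalization can reorder results. The formula is a simplified TF-IDF variant that I am assuming for illustration; it is not Lucene’s actual scoring code. The documents, the score function, and the boost values are hypothetical.

```python
import math

# Two invented documents: "a" mentions the word in a short title,
# "b" repeats it in a long body.
DOCS = {
    "a": {"title": "lucene search", "body": "notes on tuning relevance for search"},
    "b": {"title": "budget memo", "body": "search search search license fees and more fees and more budget talk"},
}

def score(doc, word, docs, boosts):
    """Sum per-field scores: sqrt(tf) * idf * field boost * length norm."""
    n_docs = len(docs)
    total = 0.0
    for field, boost in boosts.items():
        terms = doc[field].split()
        freq = terms.count(word)
        if freq == 0:
            continue
        doc_freq = sum(1 for d in docs.values() if word in d[field].split())
        tf = math.sqrt(freq)                          # dampen repeated terms
        idf = 1.0 + math.log(n_docs / (doc_freq + 1)) # rarer terms count more
        length_norm = 1.0 / math.sqrt(len(terms))     # shorter fields count more
        total += tf * idf * boost * length_norm
    return total

def rank(word, docs, boosts):
    """Return document ids ordered from best to worst score."""
    return sorted(docs, key=lambda d: score(docs[d], word, docs, boosts), reverse=True)

title_heavy = rank("search", DOCS, {"title": 2.0, "body": 1.0})
body_only = rank("search", DOCS, {"title": 0.0, "body": 1.0})
print(title_heavy, body_only)
```

With the title boosted, document "a" comes first. With the title weight set to zero, document "b" wins on raw repetition. This is the reason to experiment on a test set before touching the weights on a production index.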

Strengths and Weaknesses

The Lucene Web site provides good insight into the strengths and weaknesses of a Lucene search system. The biggest plus is that you can download the system, install it on a Linux, UNIX, or Windows server, and provide a stable, functional key word and fielded search system. In the last three or four years, the system has improved significantly in processing speed, the size of the index footprint (now about 25 percent of the source documents’ size), incremental updates, support for index partitions, and other useful areas.

The downside of Lucene is that a non-programmer will not be able to figure out how to install, test, configure, and deploy the system. Open source programs are often quite good technically, but some lack the graphical interfaces and training wheels that are standard with some commercial search and content processing systems. You will be dependent on the Lucene community to help you resolve some issues. You may find that your request for support results in a Lucene aficionado suggesting that you use another open source tool to resolve a particular issue. You will also have to hunt around for some content filters, or you will be forced to code your own import filters. Lucene has not been engineered to deliver the type of security found in Oracle’s SES 11g system, so expect to spend some time making sure users can access only content at their clearance level.
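
One common way to approach the clearance problem is to store a security label with each document at index time and trim the result list before anything is displayed. The sketch below assumes that approach. The Hit class, the integer clearance field, and the trim_by_clearance function are hypothetical names of my own, not part of Lucene.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    doc_id: str
    score: float
    clearance: int  # hypothetical label stored with the document at index time

def trim_by_clearance(hits, user_clearance):
    """Drop hits above the user's clearance, keeping the original ranking."""
    return [h for h in hits if h.clearance <= user_clearance]

# Invented result list, already sorted by score.
results = [
    Hit("board-minutes", 0.92, clearance=3),
    Hit("hr-handbook", 0.71, clearance=1),
    Hit("salary-bands", 0.65, clearance=2),
]

visible = trim_by_clearance(results, user_clearance=2)
print([h.doc_id for h in visible])
```

Trimming after the search is simple, but it means result counts and any similar-results features must be computed on the trimmed list, or they will leak the existence of restricted documents.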

When to Use Lucene

If you have an interest in behind-the-firewall search, you should download, install, and test the system. Lucene provides an excellent learning experience. I would have no hesitation installing Lucene in an organization where money for a brand name search system was not available. The caveat is that I am confident in my ability to download, install, debug, configure, and deploy the system. If you lack the requisite skills, you can still use Lucene. In September 2007, I met the founders of Tesuji, a company with business offices in Rome, Italy, and technical operations in Budapest, Hungary. This company provides technical support for Lucene, customization services, and a build that the company has configured. Information about Tesuji is here. Another option is to download SOLR, a search server built on top of Lucene. SOLR provides a number of features, but the one that is quite useful is the Web-based administrative interface. When you poke under the SOLR hood, you will find tools to replicate indexes and perform other chores.

What surprises a number of people is the wide use of Lucene. IBM makes use of it. Siderean Software can include it in their system if the customer requires a search system as well as Siderean’s advanced content processing tools. Project Gutenberg uses it as well.

Some organizations have yet to embrace open source software. If you have the necessary expertise, give it a test drive. Compared to the $300,000 or higher first-year license fees some search and content processing vendors demand, Lucene is an outstanding value.

Stephen Arnold, January 27, 2008

Written by Stephen E. Arnold · Filed Under Enterprise, Search

Comments

8 Responses to “Lucene: Merits a Test Drive and a Close Look”

Beyond Search » OmniFind: IBM’s Search Work Horse on March 17th, 2008 6:06 pm

[…] According to my sources, OmniFind makes use of Lucene, an open source search system. I wrote about Lucene in this Web log in January […]
DBsight Search: Worth a Closer Look : Beyond Search on April 30th, 2008 11:45 am

[…] or updated records. The system also supports the Eden space strategy which is one way to make Lucene more […]
tss on May 25th, 2008 6:23 pm

SearchBlox, which also uses Lucene, has released its product for the Amazon Cloud Computing (EC2) platform. Fire up the SearchBlox AMI and you are ready to use the search. http://developer.amazonwebservices.com/connect/message.jspa?messageID=88258

http://www.searchblox.com/gettingstarted_amazon_ec2.html

Now anyone can index 1 million documents at $1 per CPU hour!!!
SAS: BI Giant Sends Mixed Signals : Beyond Search on May 26th, 2008 6:39 am

[…] which I describe in some detail here, is Apache open source search engine. Lucene forms the basis of the IBM Yahoo “free” […]
Simplexo: Another Open Source Enterprise Search Platform : Beyond Search on September 11th, 2008 12:01 am

[…] Consulting. I have also written about Tesuji here. You can also take a look at my write up about Lucene here. These open source options are selective, not […]
Not So Fast, Folks : Beyond Search on September 15th, 2011 4:00 pm

[…] Tesuji, and other information access companies–was using the open source search system Lucene as a base upon which to […]

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.