Lucene: Merits a Test Drive and a Close Look
January 27, 2008
On Friday, I gave an invited lecture at the Speed School, the engineering and computer science department of the University of Louisville. After the talk on Google’s use of machine learning, one of the Ph.D. candidates asked me about Lucene. Lucene, as you may know, is the open source search engine authored by one of the Excite developers. If you want background on Lucene, the Wikipedia entry is a good place to start, and I didn’t spot any egregious errors when I scanned it earlier today. My interest is behind-the-firewall search and content processing. My comments, therefore, reflect my somewhat narrow view of Lucene and other retrieval systems. I told the student that I would offer some comments about Lucene and provide him with a few links.
Background
Lucene’s author is Doug Cutting, who worked at Xerox’s Palo Alto Research Center and eventually landed at Excite. After Excite was absorbed into Excite@Home, he needed to learn Java. He wrote Lucene as an exercise. Lucene (his wife’s middle name) was contributed to the Apache project, and you can download a copy, documentation, and sample code here. An update — Java Version 2.3.0 — became available on January 24, 2008.
What It Does
Lucene permits key word and fielded search. You can use Boolean AND, OR, and NOT to formulate complex queries. The system permits fuzzy search, which is useful when searching text created by optical character recognition. You can also set up the system to display similar results, roughly the same as See Also references. You can set up the system to index documents without storing them; when a user requests a source document, that document must then be retrieved over the local network. If you want to minimize the bandwidth hit, you can configure Lucene to store an archive of the processed documents. If the system processes structured content, you can search by the field tags, sort these results, and perform other manipulations. There is an administrative component that is accessed via a command line.
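To make these query types concrete, here is a toy sketch in Python. It is not Lucene's API or its internal design; the class `TinyIndex`, its methods, and the sample documents are hypothetical names invented for illustration. It only shows, under those assumptions, how fielded terms, Boolean AND, OR, and NOT, and an edit-distance style of fuzzy matching can be expressed over a simple inverted index.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w]


def edit_distance(a, b):
    """Classic Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class TinyIndex:
    """Hypothetical toy inverted index: (field, term) -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_ids = set()

    def add(self, doc_id, fields):
        self.doc_ids.add(doc_id)
        for field, text in fields.items():
            for term in tokenize(text):
                self.postings[(field, term)].add(doc_id)

    def term(self, field, word):
        """Exact fielded lookup."""
        return set(self.postings.get((field, word.lower()), set()))

    def fuzzy(self, field, word, max_edits=1):
        """Match any indexed term within max_edits of the query word."""
        hits = set()
        for (f, t), ids in self.postings.items():
            if f == field and edit_distance(t, word.lower()) <= max_edits:
                hits |= ids
        return hits

    def not_(self, ids):
        """Complement of a result set against all indexed documents."""
        return self.doc_ids - ids


index = TinyIndex()
index.add(1, {"title": "Annual report", "body": "Revenue grew in Europe"})
index.add(2, {"title": "Travel policy", "body": "Revenue recognition rules"})
index.add(3, {"title": "Scanned memo", "body": "Revenne figures from the OCR batch"})

# Boolean operators map onto set operations over posting lists.
both = index.term("body", "revenue") & index.term("title", "report")      # AND
either = index.term("title", "report") | index.term("title", "policy")    # OR
excluded = index.term("body", "revenue") & index.not_(index.term("title", "policy"))  # NOT

# Fuzzy search also picks up the OCR misreading "Revenne" in document 3.
ocr = index.fuzzy("body", "revenue")
print(both, either, excluded, ocr)
```

The fuzzy lookup scans every indexed term, which is fine for a demonstration but would not scale; a production engine uses far more efficient structures.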
In a nutshell, you can use Lucene as a search and retrieval system.
Selected Features
You will want to have adequate infrastructure to operate the system, serve queries, and process content. When properly configured, the system can handle collections that number hundreds of millions of documents. Lucene delivers good relevancy when properly configured. As with a number of search and content processing systems, the administrative tools allow the search administrator to tweak the relevance engine. Among the knobs and dials you can twirl are document weights, so you can boost or suppress certain results. As you dig through the documentation, you will find guidance for run time term weights, length normalization, and field weights, among others. A bit of advice: run the system in the default mode on a test set of documents so you can experiment with various configuration and administrative settings. The current version improves on the system’s ability to handle processes in parallel. Indexing speed and query response time, when the system is properly set up and resourced, are as good as or better than some commercial products’ responsiveness.
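The effect of these knobs is easier to see with a small experiment. The sketch below is a deliberately simplified scoring function in Python, not Lucene's actual relevance formula. The functions `score` and `rank`, the sample documents, and the boost values are all assumptions made for illustration. It combines a damped term frequency, a length normalization that favors shorter fields, a per-field weight, and a per-document weight.

```python
import math


def score(query_terms, doc, field_boosts, doc_boost=1.0):
    """Sum boosted, length-normalized term frequencies across fields."""
    total = 0.0
    for field, text in doc.items():
        tokens = text.lower().split()
        if not tokens:
            continue
        length_norm = 1.0 / math.sqrt(len(tokens))  # shorter fields weigh more
        boost = field_boosts.get(field, 1.0)        # field weight, default 1.0
        for term in query_terms:
            tf = tokens.count(term.lower())
            total += math.sqrt(tf) * boost * length_norm
    return total * doc_boost                        # document weight


def rank(query_terms, docs, field_boosts, doc_boosts=None):
    """Return document names ordered from highest to lowest score."""
    doc_boosts = doc_boosts or {}
    scored = {
        name: score(query_terms, doc, field_boosts, doc_boosts.get(name, 1.0))
        for name, doc in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical test collection.
docs = {
    "memo": {"title": "annual budget memo",
             "body": "figures for next year are under review"},
    "faq": {"title": "office faq",
            "body": "budget questions and budget answers"},
}

default_order = rank(["budget"], docs, {})                # ['faq', 'memo']
title_weighted = rank(["budget"], docs, {"title": 3.0})   # ['memo', 'faq']
memo_suppressed = rank(["budget"], docs, {"title": 3.0},
                       {"memo": 0.2})                     # ['faq', 'memo']
print(default_order, title_weighted, memo_suppressed)
```

With no weights, the FAQ wins because the query term appears twice in its body. Tripling the title weight moves the memo to the top, and a low document weight pushes it back down. This is the kind of before-and-after comparison worth running on a test set before changing settings on a production index.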
Strengths and Weaknesses
The Lucene Web site provides good insight into the strengths and weaknesses of a Lucene search system. The biggest plus is that you can download the system, install it on a Linux, UNIX, or Windows server and provide a stable, functional key word and fielded search system. In the last three or four years, the system has made significant improvements in processing speed, reducing the size of the index footprint (now about 25 percent of the source documents’ size), incremental updates, support for index partitions, and other useful enhancements.
The downside of Lucene is that a non-programmer will not be able to figure out how to install, test, configure, and deploy the system. Open source programs are often quite good technically, but some lack the graphical interfaces and training wheels that are standard with some commercial search and content processing systems. You will be dependent on the Lucene community to help you resolve some issues. You may find that your request for support results in a Lucene aficionado suggesting that you use another open source tool to resolve a particular issue. You will also have to hunt around for some content filters, or you will be forced to code your own import filters. Lucene has not been engineered to deliver the type of security found in Oracle’s SES 11g system, so expect to spend some time making sure users can access only content at their clearance level.
When to Use Lucene
If you have an interest in behind-the-firewall search, you should download, install, and test the system. Lucene provides an excellent learning experience. I would have no hesitation installing Lucene in an organization where money for a brand name search system was not available. The caveat is that I am confident in my ability to download, install, debug, configure, and deploy the system. If you lack the requisite skills, you can still use Lucene. In September 2007, I met the founders of Tesuji, a company with business offices in Rome, Italy, and technical operations in Budapest, Hungary. This company provides technical support for Lucene, customization services, and a build that the company has configured. Information about Tesuji is here. Another option is to download SOLR, which is a wrapper for Lucene. SOLR provides a number of features, but the one that is quite useful is the Web-based administrative interface. When you poke under the SOLR hood, you will find tools to replicate indexes and perform other chores.
What surprises a number of people is the wide use of Lucene. IBM makes use of it. Siderean Software can include it in their system if the customer requires a search system as well as Siderean’s advanced content processing tools. Project Gutenberg uses it as well.
Some organizations have yet to embrace open source software. If you have the necessary expertise, give it a test drive. Compared to the $300,000 or higher first-year license fees some search and content processing vendors demand, Lucene is an outstanding value.
Stephen Arnold, January 27, 2008
Comments
8 Responses to “Lucene: Merits a Test Drive and a Close Look”
[…] According to my sources, OmniFind makes use of Lucene, an open source search system. I wrote about Lucene in this Web log in January […]
[…] or updated records. The system also supports the Eden space strategy which is one way to make Lucene more […]
SearchBlox, which also uses Lucene, has released their product for the Amazon Cloud Computing (EC2) platform. Fire up the SearchBlox AMI and you are ready to use the search. http://developer.amazonwebservices.com/connect/message.jspa?messageID=88258
http://www.searchblox.com/gettingstarted_amazon_ec2.html
Now anyone can index 1 million documents at $1 per CPU hour!!!
[…] which I describe in some detail here, is Apache open source search engine. Lucene forms the basis of the IBM Yahoo “free” […]
[…] Consulting. I have also written about Tesuji here. You can also take a look at my write up about Lucene here. These open source options are selective, not […]
[…] Tesuji, and other information access companies–was using the open source search system Lucene as a base upon which to […]