Embedding Lucene

January 31, 2010

The goslings and I participated in a search conference call last week. One of the topics du jour was Lucene. The open source search system continues to fascinate certain government procurement teams and those looking for a low-cost way to provide users with a search-and-retrieval system. The enthusiasm for Lucene and Solr goes up as the age of the information technology professionals decreases. Whatever universities are putting in the Red Bull sold in computer science departments seems to trigger a Lucene / Solr craving.

In the course of the conversation, I mentioned embedding Lucene in commercial software. The advantages ranged from low cost to sidestepping the blow-back from customers. The blow-back occurs when the users of software want a feature not in the OEM “stub” embedded in a system or gizmo. The usual fix is to buy the full version of the software. The “stub” is a good enough chunk of functionality, but it won’t do the fancy back flips some users want when looking for information.

[Image: Scribovox diagram. © Scribovox 2009]

Lucene can be extended as long as the outfit doing the embedding has some Lucene experts on staff or access to a consultant able to keep appointments, complete work on time and in budget, and write code that works. The example I gave was the Lucene within Scribovox.com.

Scribovox is software that performs such tricks as converting a podcast to text. You can get more information about the product at http://www.scribvox.com. The information I referenced came from a June 17, 2009 Scribovox design document called “Integration with Social Networks.” I found the information in this write-up quite useful, and you can download a copy of the paper from this link.

The author of the paper is Patrick Nicholas. He discusses some interesting ideas; for example:

  • Flow diagrams for processing real time content
  • A useful architecture diagram
  • A discussion of indexing and summarization
  • Some information about Amazon EC2, MapReduce and Hadoop.

If you are serious about open source, I would tuck this document in your bag of tricks. The time estimation puts search and semantics into perspective. Useful for the azure chip crowd since most don’t have too much, if any, oil under their fingers from removing the fuel injection unit from a search system.

Stephen E Arnold, January 31, 2010

A freebie. No one paid me to write this. I will report this charitable act to the boss at the National Cathedral on Wisconsin Avenue, in Washington, DC.

Comments

2 Responses to “Embedding Lucene”

  1. Eric Pugh on February 8th, 2010 8:53 pm

    Lucene and Solr, but especially Solr, are at the inflection point of becoming the dominant open source search engine solution. Any CIO/CTO looking to enhance search should have Solr on their list of search engines to evaluate. It’s got the simplicity of a well engineered open source solution (as well as a great price tag!), but also the sophisticated features to go head to head with the commercial guys like FAST and Endeca. Probably the biggest lack is GUI style management tools, and that is changing with the recent incubation of the Lucene Connector Framework. Of course, in the nature of full disclosure, I did drink the Red Bull, being one of the authors of the first book on Solr: Solr 1.4 Enterprise Search!

    What’s interesting is that Lucene and Solr look to be the gateway technology for more organizations to start applying semantic web and machine learning ideas with tools like Hadoop, Mahout, Droids, and the like!

  2. Stephen E. Arnold on February 9th, 2010 11:57 am

    Eric Pugh,

    Thanks for the comment.

    Stephen E Arnold, February 9, 2010
