Palantir Describes Lucene Searching with a Twist

January 27, 2010

If you do work in law enforcement, financial services, or intelligence (business or governmental), chances are high that you know about Palantir. The firm provides sophisticated data analysis and analytics tools for industrial-strength information jobs.

In August 2009 and October 2009, the company published a two-part discussion of its approach to search and retrieval. I had occasion to update my file about Palantir technology, and I reviewed these two write ups. Both appeared in the Palantir Web log, and I thought that the information was relevant to some of the issues I am working on in 2010.

The first article is “Palantir: Search with a Twist (Part One: Memory Efficiency).” In that write up, the company points out that it uses the “venerable Java search engine Lucene.” Ah, open source, I thought. Palantir’s engineers encountered some limitations in Lucene and needed to work around these. The article explains how Palantir addressed Lucene’s approach to accumulating search results, which relies on a priority queue: stream through the results, insert each one into the queue, and return the set of results held in the queue. The first article provides a useful summary of the Palantir method.
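
For readers who have not looked at this pattern before, here is a minimal sketch of accumulating the top results with a priority queue. It is plain Java and does not use Lucene's classes or reproduce Palantir's changes. The names TopKCollector, Hit, collect, and topHits are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: a bounded "top k" accumulator, not Lucene's implementation.
final class TopKCollector {
    /** A scored hit. Hypothetical type, not a Lucene class. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int k;
    // Min-heap ordered by score, so the weakest retained hit is at the head.
    private final PriorityQueue<Hit> queue;

    TopKCollector(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        this.k = k;
        this.queue = new PriorityQueue<>(k, (a, b) -> Float.compare(a.score, b.score));
    }

    /** Called once per matching document as results stream past. */
    void collect(int docId, float score) {
        if (queue.size() < k) {
            queue.add(new Hit(docId, score));
        } else if (score > queue.peek().score) {
            queue.poll();                      // evict the current weakest hit
            queue.add(new Hit(docId, score));
        }
    }

    /** Returns the retained hits, best score first. */
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }
}
```

The point of the bounded queue is that memory grows with the number of results requested (k), not with the number of documents that match the query.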

The second article is “Palantir: Search with a Twist (Part Two: Real-Time Indexing and Security).” This write up explains two approaches Palantir explored to deal with what the company calls “leaking information; namely that there’s data on this object that the user making the query is not privy to.” The write up says:

Given this problem, there are two approaches one can take: [1] Store all the information needed to decide which labels are visible to the user running the query and then use only the visible labels when calculating the relevance of a match. Note that this is a pretty expensive operation. [2] Don’t use the length of match to compute relevance. Note that skipping a relevance calculation is, obviously, a very cheap thing to do. Which do we do? Both.
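
To make the two approaches in the quotation concrete, here is a rough sketch in plain Java. It is not Palantir's code and does not use Lucene's scoring API. The Label type, the group-based visibility check, and the scoring formula (term length divided by label length) are all assumptions made for illustration. The sketch also assumes that whether an object may be returned at all is decided by a separate access check.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: hypothetical types and scoring, not Palantir's or Lucene's.
final class LabelRelevance {
    /** A label on an object plus the group allowed to read it (assumed model). */
    static final class Label {
        final String text;
        final String requiredGroup;

        Label(String text, String requiredGroup) {
            this.text = text;
            this.requiredGroup = requiredGroup;
        }
    }

    private static boolean matches(Label label, String term) {
        return label.text.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }

    /** Approach 1: length-based relevance computed from visible labels only. */
    static double visibleOnlyScore(List<Label> labels, Set<String> userGroups, String term) {
        if (term.isEmpty()) {
            return 0.0;
        }
        double best = 0.0;
        for (Label label : labels) {
            if (!userGroups.contains(label.requiredGroup)) {
                continue; // hidden label: must not influence the score
            }
            if (matches(label, term)) {
                // Shorter labels count as closer matches (assumed formula).
                best = Math.max(best, (double) term.length() / label.text.length());
            }
        }
        return best;
    }

    /** Approach 2: a constant score for any match, ignoring label length. */
    static double lengthBlindScore(List<Label> labels, String term) {
        if (term.isEmpty()) {
            return 0.0;
        }
        for (Label label : labels) {
            if (matches(label, term)) {
                return 1.0; // no length term, so hidden label lengths cannot leak
            }
        }
        return 0.0;
    }
}
```

The first method needs per-label visibility data at scoring time, which is the expense the quotation mentions. The second needs none, which is why it is cheap.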

I recommend that anyone wrestling with Lucene take a look at these two articles. A third installment has been promised, but I have not yet seen it.

Stephen E Arnold, January 27, 2010

A free search engine warrants a free post. No one paid me to write this. I will report this sad fact to the Department of Labor.

Written by Stephen E. Arnold · Filed Under Enterprise, News, Open source, Search, Technology, Text processing

Comments

One Response to “Palantir Describes Lucene Searching with a Twist”

Otis Gospodnetic on January 28th, 2010 12:56 pm

These two articles are good, because they exemplify one of the main advantages of using open source (search) technology – the ability to modify the code and customize it to your needs immediately (vs. hoping that the vendor will include the change in their next releases N months from now).

However, they are also examples of working in a vacuum. Both properties of Lucene that Palantir developers customized are now an integral part of Lucene. Unfortunately, the Palantir developers didn’t collaborate with the Lucene community, so now, only a couple of months after they blogged about their changes, they either have to keep using their forked version of Lucene, or they have to throw away their changes and just get the newer Lucene, or they have to try and merge the two. This is not the best use of open source.

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.