Palantir Describes Lucene Searching with a Twist
January 27, 2010
If you do work in law enforcement, financial services, or intelligence (business or governmental), chances are high that you know about Palantir. The firm provides sophisticated data analysis and analytics tools for industrial-strength information jobs.
In August 2009 and October 2009, the company published a two-part discussion of its approach to search and retrieval. I had occasion to update my file about Palantir technology, and I reviewed these two write ups. Both appeared in the Palantir Web log, and I thought that the information was relevant to some of the issues I am working on in 2010.
The first article is “Palantir: Search with a Twist (Part One: Memory Efficiency).” In that write up, the company points out that it uses the “venerable Java search engine Lucene.” Ah, open source, I thought. Palantir’s engineers encountered some limitations in Lucene and needed to work around them. The article explains how Palantir reworked Lucene’s approach to accumulating search results, in which Lucene streams through the matches, inserts each one into a priority queue, and returns the contents of the queue. The first article provides a useful summary of the Palantir method.
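For readers who have not looked inside a search engine, the general idea of collecting results with a priority queue can be sketched in a few lines of Java. This is my own minimal illustration, not Lucene's code and not Palantir's modification. The names `TopKSketch`, `Hit`, and `topK` are hypothetical, and the sketch assumes each match arrives with a score already computed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not taken from Lucene or Palantir.
final class TopKSketch {

    /** A scored match: a document id plus its relevance score. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Streams through the matches and keeps only the k highest-scoring ones. */
    static List<Hit> topK(Iterable<Hit> matches, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap: the weakest hit retained so far sits at the head of the queue.
        PriorityQueue<Hit> queue = new PriorityQueue<>(k, byScore);
        for (Hit hit : matches) {
            if (queue.size() < k) {
                queue.add(hit);
            } else if (hit.score > queue.peek().score) {
                queue.poll();   // evict the current weakest hit
                queue.add(hit); // and keep the stronger one
            }
        }
        // Return the contents of the queue, best score first.
        List<Hit> results = new ArrayList<>(queue);
        results.sort(byScore.reversed());
        return results;
    }
}
```

The point of the bounded queue is that memory for the result set grows with the number of results requested, not with the number of documents that match.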
The second article is “Palantir: Search with a Twist (Part Two: Real-Time Indexing and Security).” This write up explains two approaches Palantir explored to deal with what the company calls “leaking information; namely that there’s data on this object that the user making the query is not privy to.” The write up says:
Given this problem, there are two approaches one can take: [1] Store all the information needed to decide which labels are visible to the user running the query and then use only the visible labels when calculating the relevance of a match. Note that is a pretty expensive operation. [2] Don’t use the length of match to compute relevance. Note that skipping a relevance calculation is, obviously, a very cheap thing do. Which do we do? Both.
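To make the two options concrete, here is a rough sketch of my own. It is not Palantir's implementation. The class `VisibleLabelScoring`, the `Label` type with a single `requiredGroup` field, and the scoring formula (query length divided by label length) are all assumptions made for illustration; the Palantir write up does not spell out its formula or its access control model.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; the types and the formula are assumptions.
final class VisibleLabelScoring {

    /** A label on an object, readable only by members of one group (assumed model). */
    static final class Label {
        final String text;
        final String requiredGroup;

        Label(String text, String requiredGroup) {
            this.text = text;
            this.requiredGroup = requiredGroup;
        }
    }

    /**
     * Approach 1: score by length of match, but consult only the labels the
     * user may see. The per-label permission check is the expensive part.
     */
    static double scoreVisibleOnly(String term, List<Label> labels, Set<String> userGroups) {
        String needle = term.toLowerCase();
        if (needle.isEmpty()) {
            return 0.0;
        }
        double best = 0.0;
        for (Label label : labels) {
            if (!userGroups.contains(label.requiredGroup)) {
                continue; // hidden labels must not influence the score
            }
            String text = label.text.toLowerCase();
            if (text.contains(needle)) {
                // Assumed formula: a shorter label is a tighter match.
                best = Math.max(best, (double) needle.length() / text.length());
            }
        }
        return best;
    }

    /** Approach 2: ignore match length, so every match gets the same score. */
    static double scoreIgnoringLength(boolean matched) {
        return matched ? 1.0 : 0.0;
    }
}
```

In this sketch, a user who cannot see a short, exact label gets a score based only on the longer label that is visible, so the ranking gives no hint that the hidden label exists.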
I recommend that anyone wrestling with Lucene take a look at these two articles. A third installment has been promised, but I have not yet seen it.
Stephen E Arnold, January 27, 2010
A free search engine warrants a free post. No one paid me to write this. I will report this sad fact to the Department of Labor.
Comments
One Response to “Palantir Describes Lucene Searching with a Twist”
These two articles are good, because they exemplify one of the main advantages of using open source (search) technology – the ability to modify the code and customize it to your needs immediately (vs. hoping that the vendor will include the change in their next releases N months from now).
However, they are also examples of working in a vacuum. Both properties of Lucene that Palantir developers customized are now an integral part of Lucene. Unfortunately, the Palantir developers didn’t collaborate with the Lucene community, so now, only a couple of months after they blogged about their changes, they either have to keep using their forked version of Lucene, or they have to throw away their changes and just get the newer Lucene, or they have to try to merge the two. This is not the best use of open source.