PDF Search from Dieselpoint

March 14, 2012

We heard Dieselpoint offers a PDF search engine, so we decided to check it out. This company keeps a very low profile, but we find it is worth looking into.

Dieselpoint’s PDF Search is an enterprise product that can navigate large collections of PDFs, extracting both metadata and text for indexing. Metadata can be searched and used to build more sophisticated interfaces in conjunction with Dieselpoint’s Search platform.

Often, titles are left out of a document’s metadata, making searches more challenging; Dieselpoint has an innovative solution for that. The product overview states:

Quite often, authors of PDFs neglect to enter titles into the document’s metadata. This makes it difficult to display a good, descriptive title when a PDF appears on a search results page. Dieselpoint Search eliminates this problem by providing ‘Smart Titles’. The system analyzes each PDF looking for clues as what the title might be, and employs advanced heuristics to select one. Studies show that Dieselpoint’s algorithm selects a title which is the same as the one that a human would have selected over 90% of the time.
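Dieselpoint does not publish how Smart Titles works, so the following is only a rough sketch of what a title-guessing heuristic of this general kind might look like. Every name here (`guess_title`, `score_line`, the junk-title pattern, the scoring weights) is a hypothetical of ours, not Dieselpoint's algorithm: it falls back to the first-page text when the metadata title is missing or looks like a leftover file name.

```python
import re

# Hypothetical pattern for metadata titles that are really leftover file names.
JUNK_TITLE = re.compile(
    r"^(untitled|microsoft word - .*|.*\.(docx?|pdf|indd))$", re.IGNORECASE
)


def score_line(text, position):
    """Score a first-page line as a title candidate (illustrative weights only)."""
    words = text.split()
    if not 2 <= len(words) <= 15:
        return 0.0  # too short or too long to be a plausible title
    if not any(ch.isalpha() for ch in text):
        return 0.0  # page numbers, rules, and similar noise
    score = 1.0 / (1 + position)  # earlier lines are more likely to be titles
    if text.istitle() or text.isupper():
        score += 0.5  # title case or all caps suggests a heading
    if text.endswith("."):
        score -= 0.25  # a trailing period suggests body text
    return score


def guess_title(metadata_title, first_page_lines):
    """Return the metadata title if usable, else the best-scoring early line."""
    cleaned = (metadata_title or "").strip()
    if cleaned and not JUNK_TITLE.match(cleaned):
        return cleaned
    candidates = [line.strip() for line in first_page_lines[:15]]
    scored = [(score_line(text, i), text) for i, text in enumerate(candidates)]
    best_score, best_text = max(scored, default=(0.0, ""))
    return best_text if best_score > 0 else None
```

A production system would presumably also weigh font size, position on the page, and other layout clues that plain text lines do not carry.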

This tool also takes advantage of XMP data, which resides in an XML packet embedded within a PDF file. This data can contain information such as authors, digital rights, categories, and keywords.
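To make the XMP idea concrete, here is a minimal sketch of pulling a few Dublin Core fields out of a PDF's embedded XMP packet using only the Python standard library. It assumes the packet is stored uncompressed and wrapped in an `x:xmpmeta` element, which is common but not guaranteed; the function name `extract_xmp` and the choice of fields are ours, and this is not a description of how Dieselpoint reads XMP.

```python
import re
import xml.etree.ElementTree as ET

# Standard namespace URIs used by XMP for Dublin Core and RDF.
DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumption: the packet is uncompressed and wrapped in <x:xmpmeta>.
XMP_PATTERN = re.compile(rb"<x:xmpmeta.*?</x:xmpmeta>", re.DOTALL)


def extract_xmp(pdf_bytes):
    """Return a dict of selected Dublin Core fields from the first XMP packet."""
    match = XMP_PATTERN.search(pdf_bytes)
    if match is None:
        return {}
    root = ET.fromstring(match.group(0))
    fields = {}
    for name in ("title", "creator", "subject", "rights"):
        values = []
        for element in root.iter(f"{{{DC}}}{name}"):
            # Values live in rdf:li items inside an rdf:Seq, rdf:Bag, or rdf:Alt.
            values.extend(li.text for li in element.iter(f"{{{RDF}}}li") if li.text)
        if values:
            fields[name] = values
    return fields
```

In use, `extract_xmp(open(path, "rb").read())` returns something like a mapping from `creator` to a list of author names and from `subject` to a list of keywords, which could then be fed to an indexer as searchable metadata. A real extractor would use a proper PDF parser rather than a byte scan.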

Dieselpoint began developing the core indexing algorithms behind its search engine in 1999, and released version 1.0 the next year. The product was originally meant for use with engineered industrial goods, and the product (and company) name reflects these origins.

Cynthia Murrell, March 14, 2012

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.