NARA Criticized for Searchability Limits

November 20, 2011

The US government has a love-hate relationship with search, content processing, and predictive analytics. On one hand, agencies have to make information available to citizens. On the other hand, agencies have to be careful about what information to release, when, how, and in what form.

When there is a news story about search, my view is that somewhere, somehow a bureaucrat has tried to run a query and discovered that the system behaves in an interesting manner.

Now there’s been a development at the National Archives and Records Administration that we find very interesting. “NARA Officials Defend Searchability of Electronic Archive,” reports Federal Computer Week. I noted the word “defend”. When this word appears in a headline of a widely read government trade publication, I have a hunch that “interesting” veers toward the “concern” side of the connotative spectrum.

It seems a federal auditor has criticized the organization’s new Electronic Records Archive because most of it is only keyword-searchable, not text-searchable. Yes, that would be important because I like to run queries using what the publisher of this blog calls “free text.” The idea is that I can use my own terms and assume that the system will perform synonym expansion, deduplication, and relevance ranking. I will, therefore, see results with high precision (the items are germane to my query) and high recall (the system does not leave out important items).
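For readers who want the two measures pinned down, here is a minimal Python sketch of how precision and recall are usually computed for a single query. The function name `precision_recall` and the document IDs are hypothetical, made up for illustration; nothing here reflects how NARA's system or any vendor's product actually scores results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the search system returned (hypothetical data).
    relevant:  IDs a human judged germane to the query (hypothetical data).
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set  # returned AND germane

    # Precision: what share of the returned items are germane?
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: what share of the germane items were returned?
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Illustrative values only: four results returned, three documents truly relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case half of the returned items are germane, and one relevant document ("doc7") never surfaces. That second kind of miss is what one would expect when a relevant document exists only as a scanned image with no searchable text.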

NARA official David Lake contends that the agency is doing its best with what it has. Most of the documents have been scanned in, which of course poses problems for content searches. NARA, according to the write-up, is working on the system. Besides, according to the article, over time the problem will shrink on its own:

Over the next 10 years, as agencies deliver more material to the e-archive, the born-electronic documents in the archive will increase in number, making a larger portion of the e-archive searchable by text, even while scanned historic documents also are coming in, Lake added.

Big help if you’re researching World War I. The search application is being supplied by Vivisimo, which inhabits the “information optimization” space. It seems that for $430 million, the contractors should be able to deliver what I think of as “commodity search” without too many Dancing with the Stars twirls. The issue is worth monitoring; it was a big contract and the federal government is running a deficit. Maybe the US government will be able to deliver a basic search system that supports free text, unless the document set has not been converted to a searchable file format. Ah, that pesky file transformation issue. Shouldn’t someone have budgeted for that, or hired contractors whose systems can handle different file types as part of the standard content processing subsystem?

Interesting information optimization issue.

Cynthia Murrell, November 20, 2011

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.