Digital Dodos: Fed Web Site Archives

April 13, 2008

Computerworld’s Heather Havenstein wrote a story on April 11, 2008, “Agency Under Fire for Decision Not to Save Federal Web Content”. Please read it before it goes into the digital never-never land of Computerworld stories and becomes almost impossible to find without real sleuthing.

For me, the key point in the story was:

NARA, which until this year had collected a “harvest” of federal Web sites at the end of presidential and congressional terms, said in a recent memo that it would discontinue the practice at the end of George W. Bush’s presidency.

NARA, for the acronym-challenged, is the National Archives and Records Administration. This Federal entity is supposed to keep a copy of government information. Now, government information is slippery, and it is very difficult to put it in one location.

In 2000, I was one of the lucky dweebs involved in the US Federal government’s citizen-facing portal, now called USA.gov. As part of that project, Inktomi indexed more than 20,000 public-facing Web servers and made the information searchable. I thought indexing Federal Web sites would be a piece of cake. Boy, was I wrong.

A Search Puzzle with Hundreds of Pieces

Just take a gander at the Government Printing Office catalog and then do a bit of poking into the Web sites of the Department of Energy, and you won’t find much overlap for big printed reports and studies. For even more government fun, run a query on the DOE site for “ECCS”. You will get zero results. Now run the query on www.usa.gov, and you get hits to a nuclear power plant’s “nuclear core cooling system”. Related information is not in a single place, and different agencies’ Web sites apply different filters. In short, the job of NARA is to gather the information in one place for researchers or crazed attorneys. There are overlapping jurisdictions, of course. It’s murky water. Few know who is responsible for what information at what point in time.

The same wacky situation plagues the Library of Congress, the library in the US Senate, and the two dozen executive branch agencies. I don’t even want to think about figuring out the information on the public and not-so-public Web sites operated by various intelligence, military, and quasi-government entities. (Remember, I struggled with this information landscape until I threw in the digital towel in 2006.)

You will have to form your own opinion about what information should be gathered by whom. I only know that trying to figure out which agency has what information is no trivial job. With NARA seemingly giving up and other Federal entities grabbing different parts of the information elephant, there may be no solution. Alexa and the Internet Archive have tried / are trying to do the work, but over the years, I have become less and less confident in those efforts.

Microsoft indexes some Federal content as part of its contract for USA.gov with Vivisimo, but that’s a hit-and-miss index based on my tests. Microsoft asserts that it has more than 30 billion Web pages in its index, but my tests don’t back up that claim. Microsoft is struggling to make resources available for its various initiatives, and I think the index of Federal government content is not at the top of that list. Google indexes a cartload of government information and does a decent job with a number of states’ content.

Let Google Do It

I’m all for letting the GOOG index the Federal government, store the data in the Googleplex, and call it a day. At least I would know where to look for my “emergency core cooling system” documents and the report I did in 1991 about Japan’s investments in high-speed network technology. Under the present system, the information is essentially unfindable with public-facing systems.

Even if you know a specific item exists, it can be almost impossible to find it on any public index. In my experience, you have to be able to log in to the agency’s network and go data spelunking, find a version of the document, and then gather up the different instances of the document to figure out which is the “official” one. Just when you think you have what you need, someone asks, “Did you check the Lotus Notes repository? I think there are some modifications in those files too.” So, it’s back to the old data cave for more exploration in the dark. My miner’s light burned out, and I won’t go into the dark any more.

Stephen Arnold, April 13, 2008
