Harvard and a Web Archive Tool
May 18, 2023
Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.
The Library of Congress has dropped the ball, and the Internet Archive may soon be shut down. So it is Harvard to the rescue. At least until people sue the institution. The university’s Library Innovation Lab describes its efforts in “Witnessing the Web is Hard: Why and How We Built the Scoop Web Archiving Capture Engine.”
“Our decade of experience running Perma.cc has given our team a vantage point to identify emerging challenges in witnessing the web that we believe extend well beyond our core mission of preserving citations in the legal record. In an effort to expand the utility of our own service and contribute to the wider array of core tools in the web archiving community, we’ve been working on a handful of Perma Tools. In this blog post, we’ll go over the driving principles and architectural decisions we’ve made while designing the first major release from this series: Scoop, a high-fidelity, browser-based, single-page web archiving capture engine for witnessing the web. As with many of these tools, Scoop is built for general use but represents our particular stance, cultivated while working with legal scholars, US courts, and journalists to preserve their citations. Namely, we prioritize their needs for specificity, accuracy, and security. These are qualities we believe are important to a wide range of people interested in standing up their own web archiving system. As such, Scoop is an open-source project which can be deployed as a standalone building block, hopefully lowering a barrier to entry for web archiving.”
At Scoop’s core is its “no-alteration principle,” which, as the name implies, is a commitment to recording HTTP exchanges exactly as they occur, with no modifications. The write-up gives some technical details on how the capture engine achieves that standard. Aside from that bedrock doctrine, though, Scoop lets users customize it to meet their own web-witnessing needs. Attachments are optional, and users can configure each element of the capture process, like time or size limits. Two other important features are the built-in provenance summary, which includes preservation of SSL certificates, and authenticity assertion through support for the Web Archive Collection Zipped (WACZ) file format and the WACZ Signing and Verification specification. Interested readers should see the article for details on how to start using Scoop. You might want to hurry, before publishers jump in with their inevitable litigation push.
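For readers wondering what a WACZ file looks like in practice, here is a minimal Python sketch. It is not part of Scoop and not taken from the Harvard write-up. It assumes that a WACZ file is an ordinary ZIP archive with a datapackage.json at its root, and that each entry under "resources" carries a "path" and a "hash" written as "algorithm:hexdigest". The function name check_wacz_resources is made up for illustration. The sketch only recomputes the listed hashes. It does not check the cryptographic signature described in the Signing and Verification specification.

```python
import hashlib
import json
import zipfile


def check_wacz_resources(wacz_file):
    """Return {resource path: True/False} for each entry in datapackage.json.

    `wacz_file` may be a filesystem path or a binary file-like object.
    Hypothetical helper for illustration; not part of Scoop.
    """
    results = {}
    with zipfile.ZipFile(wacz_file) as archive:
        package = json.loads(archive.read("datapackage.json"))
        for resource in package.get("resources", []):
            # Assumed hash layout: "<algorithm>:<hex digest>", e.g. "sha256:ab12..."
            algorithm, _, expected = resource["hash"].partition(":")
            digest = hashlib.new(algorithm)
            # Stream the member in chunks so large WARC files are not
            # loaded into memory all at once.
            with archive.open(resource["path"]) as member:
                for chunk in iter(lambda: member.read(65536), b""):
                    digest.update(chunk)
            results[resource["path"]] = digest.hexdigest() == expected
    return results
```

Calling check_wacz_resources("capture.wacz"), where capture.wacz is a placeholder file name, would return a dictionary that maps each listed resource to True if its contents still match the recorded hash and False if they do not.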
Cynthia Murrell, May 18, 2023