The Internet as a Library and Archive? Ho Ho Ho
March 8, 2024
This essay is the work of a dumb dinobaby. No smart software required.
I know that I find certain Internet-related items a knee slapper. Here’s an example: “Millions of Research Papers at Risk of Disappearing from the Internet.” The number of individuals — young at heart and allegedly-informed seniors — think the “Internet” is a library or better yet an archive like the Library of Congress’ collection of “every” book.
A person deleting data with some degree of fierceness. Yep, thanks MSFT Copilot. After three tries, this is the best of the lot for a prompt asking for an illustration of data being deleted from a personal computer. Not even good enough but I like the weird orange coloration.
Here are some basics of how “Internet” services work:
- Every year costs go up of storage for old and usually never or rarely accessed data. A bean counter calls a meeting and asks, “Do we need to keep paying for ping, power, and pipes?” Some one points out, “Usage of X percent of the data described as “old” is 0.0003 percent or whatever number the bright young sprout has guess-timated. The decision is, as you might guess, dump the old files and reduce other costs immediately.
- Doing “data” or “online” is expensive, and the costs associated with each are very difficult, if not impossible to control. Neither government agencies, non-governmental outfits, the United Nations, a library in Cleveland or the estimable Harvard University have sufficient money to make available or keep at hand information. Thus, stuff disappears.
- Well-intentioned outfits like the Internet Archive or Project Gutenberg are in the same accountant ink pot. Not every Web site is indexed and archived comprehensively. Not every book that can be digitized and converted to a format someone thinks will be “forever.” As a result, one has a better chance of discovering new information browsing through donated manuscripts at the Vatican Library than running an online query.
- If something unique is online “somewhere,” that item may be unfindable. Hey, what about Duke University’s collection of “old” books from the 17th century? Who knew?
- Will a government agency archive digital content in a comprehensive manner? Nope.
The article about “risks of disappearing” is a hoot. Notice this passage:
“Our entire epistemology of science and research relies on the chain of footnotes,” explains author Martin Eve, a researcher in literature, technology and publishing at Birkbeck, University of London. “If you can’t verify what someone else has said at some other point, you’re just trusting to blind faith for artefacts that you can no longer read yourself.”
I like that word “epistemology.” Just one small problem: Trust. Didn’t the president of Stanford University have an opportunity to find his future elsewhere due to some data wonkery? Google wants to earn trust. Other outfits don’t fool around with trust; these folks gather data, exploit it, and resell it. Archiving and making it findable to a researcher or law enforcement? Not without friction, lots and lots of friction. Why verify? Estimates of non-reproducible research range from 15 percent to 40 percent of scientific, technical, and medical peer reviewed content. Trust? Hello, it’s time to wake up.
Many estimate how much new data are generated each year. I would suggest that data falling off the back end of online systems has been an active process. The first time an accountant hears the IT people say, “We can just roll off the old data and hold storage stable” is right up there with avoiding an IRS audit, finding a life partner, and billing an old person for much more than the accounting work is worth.
After 25 years, there is “risk.” Wow.
Stephen E Arnold, March 8, 2024