Suddenly: Worrying about Content Preservation
August 19, 2024
This essay is the work of a dumb dinobaby. No smart software required.
Digital preservation may be becoming a hot topic for those who rarely think about finding today’s information tomorrow or even later today. Two write ups provide some hooks on which thoughts about finding information could be hung.
The young scholar faces some interesting knowledge hurdles. Traditional institutions are not much help. Thanks, MSFT Copilot. Is Outlook still crashing?
The first concerns PDFs. The essay and how-to is “Classifying All of the PDFs on the Internet.” A happy quack to the individual who pursued this project, presented findings, and provided links to the data sets. Several items struck me as important in this project research report:
- Tracking down PDF files on the “open” Web is not something that can be done with a general Web search engine. The takeaway for me is that PDFs, like PowerPoint files, are either skipped or not crawled. The author had to resort to other, programmatic methods to find these file types. If an item cannot be “found,” it ceases to exist. How about that for an assertion, archivists?
- The distribution of document “source” across the author’s prediction classes splits out mathematics, engineering, science, and technology. Considering these separate categories as one makes clear that the PDF universe is about 25 percent of the content pool. Since technology is a big deal for innovators and money types, losing or not being able to access these data suggests a knowledge hurdle today and tomorrow in my opinion. An entity capturing these PDFs and making them available might have a knowledge advantage.
- Entities like national libraries and individualized efforts like the Internet Archive are not capturing the full sweep of PDFs based on my experience.
My reading of the essay made me recognize that access to content on the open Web is perceived to be easy and comprehensive. It is not. Your mileage may vary, of course, but this write up illustrates a large, multi-terabyte problem.
The second story about knowledge comes from the Epstein-enthralled institution’s magazine. This article is “The Race to Save Our Online Lives from a Digital Dark Age.” To make the urgency of the issue more compelling and better for the Google crawling and indexing system, this subtitle adds some lemon zest to the dish of doom:
We’re making more data than ever. What can—and should—we save for future generations? And will they be able to understand it?
The write up states:
For many archivists, alarm bells are ringing. Across the world, they are scraping up defunct websites or at-risk data collections to save as much of our digital lives as possible. Others are working on ways to store that data in formats that will last hundreds, perhaps even thousands, of years.
The article notes:
Human knowledge doesn’t always disappear with a dramatic flourish like GeoCities; sometimes it is erased gradually. You don’t know something’s gone until you go back to check it. One example of this is “link rot,” where hyperlinks on the web no longer direct you to the right target, leaving you with broken pages and dead ends. A Pew Research Center study from May 2024 found that 23% of web pages that were around in 2013 are no longer accessible.
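The link rot described in the passage above can be measured with a few lines of code. What follows is a minimal sketch, not the Pew Research Center's method: the helper names (fetch_status, is_rotten, rot_rate) are hypothetical, and the rule that a failed request or any status of 400 or higher counts as "rotten" is an assumption made for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 410.
        return err.code
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return None


def is_rotten(status: Optional[int]) -> bool:
    """Assumption: no response, or any 4xx/5xx status, counts as a dead link."""
    return status is None or status >= 400


def rot_rate(urls: Iterable[str],
             fetch: Callable[[str], Optional[int]] = fetch_status) -> float:
    """Return the fraction of urls that look dead under the rule above."""
    url_list = list(urls)
    if not url_list:
        return 0.0
    dead = sum(1 for url in url_list if is_rotten(fetch(url)))
    return dead / len(url_list)
```

Passing the fetcher in as a parameter lets one try the arithmetic on canned status codes before pointing the function at a real list of saved links. A page that still answers with a 200 but now shows different content would slip past this check, which is one reason the problem is harder than it looks.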
Well, the MIT story has a fix:
One way to mitigate this problem is to transfer important data to the latest medium on a regular basis, before the programs required to read it are lost forever. At the Internet Archive and other libraries, the way information is stored is refreshed every few years. But for data that is not being actively looked after, it may be only a few years before the hardware required to access it is no longer available. Think about once ubiquitous storage mediums like Zip drives or CompactFlash.
To recap, one individual made clear that PDF content is a slippery fish. The other write up says the digital content itself across the open Web is a lot of slippery fish.
The fix remains elusive. The hurdles are money, copyright litigation, and technical constraints like storage and indexing resources.
Net net: If you want to preserve an item of information, print it out on some of the fancy Japanese archival paper. An outfit can say it archives, but in reality the information on the shelves is a tiny fraction of what’s “out there”.
Stephen E Arnold, August 19, 2024
Some Fun with Synthetic Data: Includes a T Shirt
August 12, 2024
This essay is the work of a dumb dinobaby. No smart software required.
Academics and researchers often produce bogus results, fiddle images (remember the former president of Stanford University), or just make up stuff. Despite my misgivings, I want to highlight what appear to be semi-interesting assertions about synthetic data. For those not following the nuances of using real data, doing some mathematical cartwheels, and producing made-up data which are just as good as “real” data, synthetic data for me is associated with Dr. Chris Ré, the Stanford Artificial Intelligence Laboratory (remember the ex president of Stanford U., please). The term or code word for this approach to information suitable for training smart software is Snorkel. Snorkel became a company. Google embraced Snorkel. The looming litigation and big dollar settlements may make synthetic data a semi big thing in a tech dust devil called artificial intelligence. The T Shirt should read, “Synthetic data are write” like this:
I asked an AI system provided by the global leaders in computer security (yep, that’s Microsoft) to produce a T shirt for a synthetic data team. Great work and clever spelling to boot.
The “research” report appeared in Live Science. “AI Models Trained on Synthetic Data Could Break Down and Regurgitate Unintelligible Nonsense, Scientists Warn” asserts:
If left unchecked, “model collapse” could make AI systems less useful, and fill the internet with incomprehensible babble.
The unchecked term is a nice way of saying that synthetic data are cheap and less likely to become a target for copyright cops.
The article continues:
AI models such as GPT-4, which powers ChatGPT, or Claude 3 Opus rely on the many trillions of words shared online to get smarter, but as they gradually colonize the internet with their own output they may create self-damaging feedback loops. The end result, called “model collapse” by a team of researchers that investigated the phenomenon, could leave the internet filled with unintelligible gibberish if left unchecked.
People who think alike and create synthetic data will prove that “fake” data are as good as or better than “real” data. Why would anyone doubt such glib, well-educated people? Not me! Thanks, MSFT Copilot. Have you noticed similar outputs from your multitudinous AI systems?
In my opinion, the Internet when compared to commercial databases produced with actual editorial policies has been filled with “unintelligible gibberish” from the days I showed up at conferences to lecture about how hypertext was different from Gopher and Archie. When Mosaic sort of worked, I included that and left my NeXT computer at the office.
The write up continues:
As the generations of self-produced content accumulated, the researchers watched their model’s responses degrade into delirious ramblings.
After the data were fed into the system a number of times, the output presented was like this example from the researchers’ tests:
“architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.”
The output might be helpful to those interested in church architecture.
Here’s the wrap up to the research report:
This doesn’t mean doing away with synthetic data entirely, Shumailov said, but it does mean it will need to be better designed if models built on it are to work as intended. [Note: Ilia Shumailov, a computer scientist at the University of Oxford, worked on this study.]
I must admit that the write up does not make clear what data were “real” and what data were “synthetic.” I am not sure how the test moved from Wikipedia to synthetic data. I have no idea where the headline originated. Was it synthetic?
Nevertheless, I think one might conclude that using fancy math to make up data that’s as good as real life data might produce some interesting outputs.
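The feedback loop the article describes can be illustrated with a toy that has nothing to do with language models. This is a minimal sketch and not the researchers' experiment: it assumes the "model" is just a bell curve fitted to a handful of numbers, and that each generation is trained only on made-up samples from the previous generation's fit. The function name and parameters are hypothetical.

```python
import random
import statistics
from typing import List


def simulate_collapse(generations: int = 200,
                      sample_size: int = 10,
                      seed: int = 0) -> List[float]:
    """Refit a Gaussian to its own samples and record the spread each round."""
    rng = random.Random(seed)
    # Generation 0 stands in for "real" data: draws from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = [statistics.pstdev(data)]
    for _ in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # Replace the data with synthetic samples drawn from the fitted model.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        spreads.append(statistics.pstdev(data))
    return spreads


if __name__ == "__main__":
    history = simulate_collapse()
    print(f"spread at generation 0:   {history[0]:.4f}")
    print(f"spread at generation 200: {history[-1]:.3e}")
```

With small samples, the estimated spread tends to shrink from one generation to the next, so the later "data" lose the variety that was in the original draws. That is only a cartoon of "model collapse," but it shows why a diet of one's own output is not a neutral choice.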
Stephen E Arnold, August 12, 2024
The Only Dataset Search Tool: What Does That Tell Us about Google?
April 11, 2024
This essay is the work of a dumb dinobaby. No smart software required.
If you like semi-jazzy, academic write ups, you will revel in “Discovering Datasets on the Web Scale: Challenges and Recommendations for Google Dataset Search.” The write up appears in a publication associated with Jeffrey Epstein’s favorite university. It may be worth noting that MIT and Google have teamed to offer a free course in Artificial Intelligence. That is the next big thing which does hallucinate at times while creating considerable marketing angst among the techno-giants jousting to emerge as the go-to source of the technology.
Back to the write up. Google created a search tool to allow a user to locate datasets accessible via the Internet. There are more than 700 data brokers in the US. These outfits will sell data to most people who can pony up the cash. Examples range from six figure fees for the Twitter stream to a few hundred bucks for boat license holders in states without much water.
The write up says:
Our team at Google developed Dataset Search, which differs from existing dataset search tools because of its scope and openness: potentially any dataset on the web is in scope.
A very large, money oriented creature enjoins a worker to gather data. If someone asks, “Why?”, the monster says, “Make up something.” Thanks MSFT Copilot. How is your security today? Oh, that’s too bad.
The write up does the academic thing of citing articles which talk about data on the Web. There is even a table which organizes the types of data discovery tools. The categorization of general and specific is brilliant. Who would have thought there were two categories of a vertical search engine focused on Web-accessible data? I thought there was just one category; namely, gettable. The idea is that if the data are exposed, take them. Asking permission just costs time and money. The idea is that one can apologize and keep the data.
The article includes a Googley graphic. The French portal, the Italian “special” portal, and the Harvard “dataverse” are identified. Were there other Web accessible collections? My hunch is that Google’s spiders suck down, as one famous Googler said, “all” the world’s information. I will leave it to your imagination to fill in other sources for the dataset pages. (I want to point out that Google has some interesting technology related to converting data sets into normalized data structures. If you are curious about the patents, just write benkent2020 at yahoo dot com, and one of my researchers will send along a couple of US patent numbers. Impressive system and method.)
The section “Making Sense of Heterogeneous Datasets” is peculiar. First, the Googlers discovered the basic fact of data from different sources: The data structures vary. Think in terms of grapes and deer droppings. Second, the data cannot be “trusted.” There is no fix to this issue for the team writing the paper. Third, the authors appear to be unaware of the patents I mentioned, particularly the useful example about gathering and normalizing data about digital cameras. The method applies to other types of processed data as well.
I want to jump to the “beyond metadata” idea. This is the mental equivalent of “popping” up a perceptual level. Metadata are quite important and useful. (Isn’t it odd that Google strips high value metadata from its search results; for example, time and date?) The authors of the paper work hard to explain that the Google approach to data set search adds value by grouping, sorting, and tagging with information not in any one data set. This is common sense, but the Googley spin on this is to build “trust.” Remember: This is an alleged monopolist engaged in online advertising and co-opting certain Web services.
Several observations:
- This is another of Google’s high-class PR moves. Hooking up with MIT and delivering razz-ma-tazz about identifying spiderable content collections in the name of greater good is part of the 2024 Code Red playbook it seems. From humble brag about smart software to crazy assertions like quantum supremacy, today’s Google is a remarkable entity.
- The work on this “project” is divorced from time. I checked my file of Google-related information, and I found no information about the start date of a vertical search engine project focused on spidering and indexing data sets. My hunch is that it has been in the works for a while, although I can pinpoint 2006 as a year in which Google’s technology wizards began to talk about building master data sets. Why no time specifics?
- I found the absence of AI talk notable. Perhaps Google does not think a reader will ask, “What’s with the use of these data? I can’t use this tool, so why spend the time, effort, and money to index information from a country like France which is not one of Google’s biggest fans?” (Paris was, however, the roll out choice for the answer to Microsoft and ChatGPT’s smart software announcement. Plus that presentation featured incorrect information as I recall.)
Net net: I think this write up with its quasi-academic blessing is a bit of advance information to use in the coming wave of litigation about Google’s use of content to train its AI systems. This is just a hunch, but there are too many weirdnesses in the academic write up to write off as intern work or careless research writing which is more difficult in the wake of the stochastic monkey dust up.
Stephen E Arnold, April 11, 2024
French Building and Structure Geo-Info
February 23, 2024
This essay is the work of a dumb dinobaby. No smart software required.
OSINT professionals may want to take a look at a French building and structure database with geo-functions. The information is gathered and made available by the Observatoire National des Bâtiments. Registration is required. A user can search by city and address. The data compiled up to 2022 cover France’s metropolitan areas and include geo services. The data include address, the built and unbuilt property, the plot, the municipality, dimensions, and some technical data. The data represent a significant effort, involving the government, commercial and non-governmental entities, and citizens. The dataset includes more than 20 million addresses. Some records include up to 250 fields.
Source: https://www.urbs.fr/onb/
To access the service, navigate to https://www.urbs.fr/onb/. One is invited to register or use the online version. My team recommends registering. Note that the site is in French. Copying some text and data and shoving it into a free online translation service like Google’s may not be particularly helpful. French is one of the languages that Google usually handles with reasonable facility. For this site, Google Translate comes up with tortured and off-base translations.
Stephen E Arnold, February 23, 2024
Meta Never Met a Kid Data Set It Did Not Find Useful
January 5, 2024
This essay is the work of a dumb dinobaby. No smart software required.
Adults are ripe targets for data exploitation in modern capitalism. While adults fight for their online privacy, most have rolled over and accepted the inevitable consumer Big Brother. When big tech companies go after monetizing kids, however, that’s when adults fight back like rabid bears. Engadget writes about how Meta is fighting against the federal government about kids’ data: “Meta Sues FTC To Block New Restrictions On Monetizing Kids’ Data.”
Meta is taking the FTC to court to prevent it from reopening a landmark 2020 $5 billion privacy case and to allow the company to monetize kids’ data on its apps. Meta is suing because a federal judge ruled that the FTC can expand the settlement with new, more stringent rules about how Meta is allowed to conduct business.
Meta claims the FTC is out for a power grab and is acting unconstitutionally, while the FTC reports the claimant consistently violates the 2020 settlement and the Children’s Online Privacy Protection Act. The FTC wants its new rules to limit Meta’s facial recognition usage and initiate a moratorium on new products and services until a third party audits them for privacy compliance.
Meta is not a huge fan of the US Federal Trade Commission:
“The FTC has been a consistent thorn in Meta’s side, as the agency tried to stop the company’s acquisition of VR software developer Within on the grounds that the deal would deter ‘future innovation and competitive rivalry.’ The agency dropped this bid after a series of legal setbacks. It also opened up an investigation into the company’s VR arm, accusing Meta of anti-competitive behavior.”
The FTC is doing what government agencies are supposed to do: protect its citizens from greedy and harmful practices like those from big business. The FTC can enforce laws and force big businesses to pay fines, put leaders in jail, or even shut them down. But regulators have spent decades ramping up to take meaningful action. The result? The thrashing over kiddie data.
Whitney Grace, January 5, 2024
Data Mesh: An Innovation or a Catchphrase?
October 18, 2023
Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.
Have you ever heard of data mesh? It’s a concept that has been around the tech industry for a while but is gaining more traction through media outlets. Most of the hubbub comes from press releases, such as TechCrunch’s: “Nextdata Is Building Data Mesh For Enterprise.”
Data mesh can be construed as a data platform architecture that allows users to access information where it is. No transferring of the information to a data lake or data warehouse is required. A data lake is a centralized, scaled data storage repository, while a data warehouse is a traditional enterprise system that analyzes data from different sources which may be local or remote.
Nextdata is a data mesh startup founded by Zhamak Dehghani. Nextdata is a “data-mesh-native” platform to design, share, create, and apply data products for analytics. Nextdata is directly inspired by Dehghani’s work at Thoughtworks. Instead of building, storing, and using data/metadata in a single container, Dehghani built a mesh system. How does the Nextdata system work?
“Every Nextdata data product container has data governance policies ‘embedded as code.’ These controls are applied from build to run time, Dehghani says, and at every point at which the data product is stored, accessed or read. ‘Nextdata does for data what containers and web APIs do for software,’ she added. ‘The platform provides APIs to give organizations an open standard to access data products across technologies and trust boundaries to run analytical and machine-learning workloads ‘distributedly.’ (sic) Instead of requiring data consumers to copy data for reprocessing, Nextdata APIs bring processing to data, cutting down on busy work and reducing data bloat.’’
Nextdata received $12 million in seed investment to develop her system’s tooling and hire more people for the product, engineering, and marketing teams. Congratulations on the funding. It is not clear at this time whether the approach will add latency to operations or present security issues related to disparate users’ security levels.
Whitney Grace, October 18, 2023
Scinapse Is A Free Academic-Centric Database
July 11, 2023
Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.
Quality academic worthy databases are difficult to locate outside of libraries and schools. Google Scholar attempted to qualify as an alternative to paywalled databases, but it returns repetitive and inaccurate results. Thanks to AI algorithms, free databases improved, such as Scinapse.
Scinapse is designed by Pluto and it is advertised as the “researcher’s favorite search engine.” Scinapse delivers accurate and updated research materials in each search. Many free databases pull their results from old citations and fail to include recent publications. Pluto promises Scinapse delivers high-performing results due to its original algorithm optimized for research.
The algorithm returns research materials based on when they were published, how many times they were cited, and how impactful a paper was in notable journals. Scinapse consistently delivers results that are better than Google Scholar. Each search item includes a complete citation for quick reference. The customized filters offer the typical ways to narrow or broaden results, including journal, field of study, conference, author, publication year, and more.
People can also create an account to organize their research in reading lists, share with other scholars, or export as a citation list. Perhaps the most innovative feature is the paper recommendations where Scinapse sends paper citations that align with research. Scinapse aggregates over 48,000 journals. There are users in 196 countries and 1,130 reputable affiliations. Scinapse’s data sources include Microsoft Research, PubMed, Semantic Scholar, and Springer Nature.
Whitney Grace, July 11, 2023
Google and Its Use of the Word “Public”: A Clever and Revenue-Generating Policy Edit
July 6, 2023
Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.
If one has the cash, one can purchase user-generated data from more than 500 data publishers in the US. Some of these outfits are unknown. When a liberal Wall Street Journal reporter learns about Venntel or one of these outfits, outrage ensues. I am not going to explain how data from a user finds its way into the hands of a commercial data aggregator or database publisher. Why not Google it? Let me know how helpful that research will be.
Why are these outfits important? The reasons include:
- Direct from app information obtained when a clueless mobile user accepts the Terms of Use. Do you hear the slurping sounds?
- Organizations with financial data and savvy data wranglers who cross correlate data from multiple sources.
- Outfits which assemble real-time or near-real-time user location data. How useful are those data in identifying military locations with a population of individuals who exercise wearing helpful heart and step monitoring devices?
Navigate to “Google’s Updated Privacy Policy States It Can Use Public Data to Train its AI Models.” The write up does not make clear what “public data” are. My hunch is that the Google is not exceptionally helpful with its definitions of important “obvious” concepts. The disconnect is the point of the policy change. Public data or third-party data can be purchased, licensed, used on a cloud service like an Oracle-like BlueKai clone, or obtained as part of a commercial deal with everyone’s favorite online service LexisNexis or one of its units.
A big advertiser demonstrates joy after reading about Google’s detailed prospect targeting reports. Dossiers of big buck buyers are available to those relying on Google for online text and video sales and marketing. The image of this happy media buyer is from the elves at MidJourney.
The write up states with typical Silicon Valley “real” news flair:
By updating its policy, it’s letting people know and making it clear that anything they publicly post online could be used to train Bard, its future versions and any other generative AI product Google develops.
Okay. “the weekend” mentioned in the write up is the 4th of July weekend. Is this a hot news or a slow news time? If you picked “hot”, you are respectfully wrong.
Now back to “public.” Think in terms of Google’s licensing third-party data, cross correlating those data with its log data generated by users, and any proprietary data obtained by Google’s Android or Chrome software, Gmail, its office apps, and any other data which a user clicking one of those “Agree” boxes cheerfully mouses through.
The idea, if the information in Google patent US7774328 B2. What’s interesting is that this granted patent does not include a quite helpful figure from the patent application US2007 0198481. Here’s the 16 year old figure. The subject is Michael Jackson. The text is difficult to read (write your Congressman or Senator to complain). The output is a machine generated dossier about the pop star. Note that it includes aliases. Other useful data are in the report. The granted patent presents more vanilla versions of the dossier generator, however.
The use of “public” data may enhance the type of dossier or other meaty report about a person. How about a map showing the travels of a person prior to providing a geo-fence about an individual’s location on a specific day and time. Useful for some applications? If these “inventions” are real, then the potential use cases are interesting. Advertisers will probably be interested? Can you think of other use cases? I can.
The cited article focuses on AI. I think that more substantive use cases fit nicely with the shift in “policy” for public data. Have you asked yourself, “What will Mandiant professionals find interesting in cross correlated data?”
Stephen E Arnold, July 6, 2023
Synthetic Data: Yes, They Are a Thing
March 13, 2023
“Real” data (that is, data generated by humans) are expensive to capture, normalize, and manipulate. But, those “real” data are important. Unfortunately some companies have sucked up real data and integrated those items into products and services. Now regulators are awakening from a decades-long slumber and taking a look into the actions of certain data companies. More importantly, a few big data outfits are aware of [a] the costs and [b] the risks of real data.
Enter synthetic data.
If you are unfamiliar with the idea, navigate to “What is Synthetic Data? The Good, the Bad, and the Ugly.” The article states:
The privacy engineering community can help practitioners and stakeholders identify the use cases where synthetic data can be used safely, perhaps even in a semi-automated way. At the very least, the research community can provide actionable guidelines to understand the distributions, types of data, tasks, etc. where we could achieve reasonable privacy-utility tradeoffs via synthetic data produced by generative models.
Helpful, correct?
The article does not point out two things which I find of interest.
First, the amount of money a company can earn by operating efficient synthetic data factories is likely to be substantial. Like other digital products, the upside can be profitable and give the “owner” of the synthetic data market an IBM-type of old-school lock in.
Second, synthetic data can be weaponized intentionally via either data poisoning or algorithm shaping.
I just wanted to point out that a useful essay does not explore what may be two important attributes of synthetic data. Will regulators rise to the occasion? Unlikely.
Stephen E Arnold, March 13, 2023
Amazon Data Sets
February 21, 2023
Do you want to obtain data sets for analysis or making smart software even more crafty? Navigate to the AWS Marketplace. This Web page makes it easy to search through the more than 350 data products on offer. There is a Pricing Model check box. Click it if you want to see the no-cost data sets. There are some interesting options in the left side Refine Results area. For example, there are 366 open data licenses available. I find this interesting because when I examined the page, there were 362 data products. What are the missing four? I noted that there are 2,340 “standard data subscription agreements.” Again the difference between the 366 on offer and the 2,340 is interesting. A more comprehensive listing of data sources appears in the PrivacyRights’ listing. With some sleuthing, you may be able to identify other, lower profile ways to obtain data too. I am not willing to add some color about these sources in this free blog post.
Stephen E Arnold, February 21, 2023