Facing an Information Drought Tech Feudalists Will Innovate

December 18, 2023

This essay is the work of a dumb dinobaby. No smart software required.

The Exponential View (Azeem Azhar) tucked an item in his “blog.” The item is important, but I am not familiar with the cited source of the information in “LLMs May Soon Exhaust All Available High Quality Language Data for Training.” The main point is that the go-to method for smart software requires information in volume to [a] be accurate, [b] remain up to date, and [c] sufficiently useful to pay for the digital plumbing.

Oh, oh. The water cooler is broken. Will the Pilates’ teacher ask the students to quench their thirst with synthetic water? Another option is for those seeking refreshment to rejuvenate tired muscles with more efficient metabolic processes. The students are not impressed with these ideas? Thanks, MSFT Copilot. Two tries and close enough.

One datum indicates / suggests that the Big Dogs of AI will run out of content to feed into their systems in either 2024 or 2025. The date is less important than the idea of a hard stop.

What will the AI companies do? The essay asserts:

OpenAI has shown that it’s willing to pay eight figures annually for historical and ongoing access to data — I find it difficult to imagine that open-source builders will…. here are ways other than proprietary data to improve models, namely synthetic data, data efficiency, and algorithmic improvements – yet it looks like proprietary data is a moat open-source cannot cross.

Several observations:

New methods of “information” collection will be developed and deployed. Some of these will be “off the radar” of users by design. One possibility is mining the changes to draft content is certain systems. Changes or deltas can be useful to some analysts.
The synthetic data angle will become a go-to method using data sources which, by themselves, are not particularly interesting. However, when cross correlated with other information, “new” data emerge. The new data can be aggregated and fed into other smart software.
Rogue organizations will acquire proprietary data and “bitwash” the information. Like money laundering systems, the origin of the data are fuzzified or obscured, making figuring out what happened expensive and time consuming.
Techno feudal organizations will explore new non commercial entities to collect certain data; for example, the non governmental organizations in a niche could be approached for certain data provided by supporters of the entity.

Net net: Running out of data is likely to produce one high probability event: Certain companies will begin taking more aggressive steps to make sure their digital water cooler is filled and working for their purposes.

Stephen E Arnold, December 18, 2023

Written by Stephen E. Arnold · Filed Under AI, Business strategy, News

Comments

Comments are closed.

Search the site
Subscribe to Beyond Search
Feature archive
News archive

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.