Content Mastication: A Controversial Business Tactic

January 25, 2024

This essay is the work of a dumb dinobaby. No smart software required.

In the midst of the unfolding copyright issues, I found this post quite interesting. Torrent Freak published a story titled “Meta Admits Use of ‘Pirated’ Book Dataset to Train AI.” Is the story spot on? I sure don’t know. Nevertheless, the headline is a magnetic one. The story reports:

The cases allege that tech companies, including Meta and OpenAI, used the controversial Books3 dataset to train their models. The Books3 dataset has a clear piracy angle. It was created by AI researcher Shawn Presser in 2020, who scraped the library of ‘pirate’ site Bibliotik. This book archive was publicly hosted by digital archiving collective ‘The Eye‘ at the time, alongside various other data sources.

A combination of old-fashioned content collection and smart systems move information from Point A (a copyright owner’s night table) to a smart software system. MSFT’s second class Copilot Bing thing created this cartoon. Sigh. Not even good enough now in my opinion.

What was in the Books3 data collection? The TF story elucidates:

The general vision was that the plaintext collection of more than 195,000 books, which is nearly 37GB…

What did Meta allegedly do to make its Llama smarter than the average member of the Camelidae family? Let’s roll the TF quote:

Responding to a lawsuit from writer/comedian Sarah Silverman, author Richard Kadrey, and other rights holders, the tech giant admits that “portions of Books3” were used to train the Llama AI model before its public release. “Meta admits that it used portions of the Books3 dataset, among many other materials, to train Llama 1 and Llama 2,” Meta writes in its answer [to a court].

The article does not include any statements like “Thank you for the question” or “I don’t know. My team will provide the answer at the earliest possible moment.” Nope. Just an alleged admission.

How will the Meta and parallel copyright legal matter evolve? Beyond Search has zero clue. The US judicial system has deep and mysterious logic. One thing is certain: Senior executives do not like uncertainty and risk. The copyright litigation seems tailored to cause some techno feudalists to imagine a world in which laws, annoying regulators, and people yapping about intellectual property were nudged into a different line of work. One example which comes to mind is building secure bunkers or taking care of the lawn.

Stephen E Arnold, January 25, 2024

Written by Stephen E. Arnold · Filed Under AI, Copyright, News

Comments

One Response to “Content Mastication: A Controversial Business Tactic”

Another Small Victory for OpenAI Against Authors : Stephen E. Arnold @ Beyond Search on March 12th, 2024 5:13 am

[…] those following the fight between human content creators and AI firms, score one for the algorithm engineers. […]

Search the site
Subscribe to Beyond Search
Feature archive
News archive

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.