Hey, Creatives, You Are Marginalized. Embrace It

June 20, 2025

Considerations of right and wrong or legality are outdated, apparently. Now, it is about what is practical and expedient. The Times of London reports, “Nick Clegg: Artists’ Demands Over Copyright are Unworkable.” Clegg is both a former British deputy prime minister and former Meta executive. He spoke as the UK’s parliament voted down measures that would have allowed copyright holders to see when their work had been used and by whom (or what). But even that failed initiative falls short of artists’ demands. Writer Lucy Bannerman tells us:

“Leading figures across the creative industries, including Sir Elton John and Sir Paul McCartney, have urged the government not to ‘give our work away’ at the behest of big tech, warning that the plans risk destroying the livelihoods of 2.5 million people who work in the UK’s creative sector. However, Clegg said that their demands to make technology companies ask permission before using copyrighted work were unworkable and ‘implausible’ because AI systems are already training on vast amounts of data. He said: ‘It’s out there already.’”

How convenient. Clegg did say artists should be able to opt out of AI being trained on their works, but insists making that the default option is just too onerous. Naturally, that outweighs the interests of a mere 2.5 million UK creatives. Just how should artists go about tracking down each AI model that might be training on their work and ask them to please not? Clegg does not address that little detail. He does state:

“‘I just don’t know how you go around, asking everyone first. I just don’t see how that would work. And by the way if you did it in Britain and no one else did it, you would basically kill the AI industry in this country overnight. … I think expecting the industry, technologically or otherwise, to preemptively ask before they even start training — I just don’t see. I’m afraid that just collides with the physics of the technology itself.’”

The large technology outfits with the DNA of Silicon Valley has carried the day. So output and be quiet. (And don’t think any can use Mickey Mouse art. Different rules are okay.)

Cynthia Murrell, June 20, 2025

Written by Stephen E. Arnold · Filed Under Business strategy, Copyright, Legal matters, News | Leave a Comment

India: Fair Use Can Squeeze YouTubers

June 5, 2025

Asian News International (ANI) seems to be leveraging the vagueness of India’s fair-use definition with YouTube’s draconian policies to hold content creators over a barrel. The Reporters’ Collective declares, “ANI Finds Business Niche in Copyright Claims Against YouTubers.” Writer Ayushi Kar recounts the story of Sumit, a content creator ANI accused of copyright infringement. The news agency reported more than three violations at once, a move that triggered an automatic takedown of those videos. Worse, it gave Sumit just a week to make good with ANI or lose his channel for good. To save his livelihood, he forked over between 1,500,000 and 1,800,000 rupees (about $17,600 – $21,100) for a one-year access license. We learn:

“Sumit isn’t the lone guy facing the aggressive copyright claims of ANI, which has adopted a new strategy to punitively leverage YouTube’s copyright policies in India to generate revenue. Using the death clause in YouTube policy and India’s vague provisions for fair use of copyrighted material, ANI is effectively forcing YouTube creators to buy expensive year-long licenses. The agency’s approach is to negotiate pricey licensing deals with YouTubers, including several who are strong critics of the BJP, even as YouTube holds a sword over the content producer’s channel for multiple claims of copyright violation.”

See the write-up for more examples of content creators who went through an ANI shake down. Kar continues:

“While ANI might be following a business it understands to be legal and fair, the episode has raised larger concern about copyright laws and the fair use rights in India by content producers who are worried about being squeezed out of their livelihoods – sometimes wiping out years of labor to build a community – between YouTube’s policies and copyright owners willingness to play hardball.”

What a cute tactic. Will it come to the US? Is it already here? YouTubers, feel free to comment. There is something special about India’s laws, though, that might make this scheme especially profitable there. Kar tells us:

“India’s Copyright Act 1957 allows … use of copyrighted material without the copyright owner’s permission for purposes such as criticism, comment, news, reporting and many more. In practice, there is a severe lack of specificity in law and regulations about how fair use doctrine is to be practiced.”

That means the courts decide what fair use means case by case. Bringing one’s case to court is, of course, expensive and time consuming, and victory is far from assured. It is no wonder content creators feel they must pay up. It would be a shame if something happened to that channel.

Cynthia Murrell, June 5, 2025

Written by Stephen E. Arnold · Filed Under Copyright, Legal matters, News | Leave a Comment

Meta and Torrents: True, False, or Rationalization?

February 26, 2025

AIs gobble datasets for training. It is another fact that many LLMs and datasets contain biased information, are incomplete, or plain stink. One ethical but cumbersome way to train algorithms would be to notify people that their data, creative content, or other information will be used to train AI. Offering to pay for the right to use the data would be a useful step some argue.

Will this happen? Obviously not.

Why?

Because it’s sometimes easier to take instead of asking. According to Toms Hardware, “Meta Staff Torrented Nearly 82TB Of Pirated Books For AI Training-Court Records Reveal Copyright Violations.” The article explains that Meta pirated 81.7 TB of books from the shadow libraries Anna’s Archive, Z-Library, and LibGen. These books were then used to train AI models. Meta is now facing a class action lawsuit about using content from the shadow libraries.

The allegations arise from Meta employees’ written communications. Some of these messages provide insight into employees’ concern about tapping pirated materials. The employees were getting frown lines, but then some staffers’ views rotated when they concluded smart software helped people access information.

Here’s a passage from the cited article I found interesting:

“Then, in January 2023, Mark Zuckerberg himself attended a meeting where he said, “We need to move this stuff forward… we need to find a way to unblock all this.” Some three months later, a Meta employee sent a message to another one saying they were concerned about Meta IP addresses being used “to load through pirate content.” They also added, “torrenting from a corporate laptop doesn’t feel right,” followed by laughing out loud emoji. Aside from those messages, documents also revealed that the company took steps so that its infrastructure wasn’t used in these downloading and seeding operations so that the activity wouldn’t be traced back to Meta. The court documents say that this constitutes evidence of Meta’s unlawful activity, which seems like it’s taking deliberate steps to circumvent copyright laws.”

If true, the approach smacks of that suave Silicon Valley style. If false, my faith in a yacht owner with gold chains might be restored.

Whitney Grace, February 26, 2025

Written by Stephen E. Arnold · Filed Under AI, Business strategy, Copyright, News | Comments Off on Meta and Torrents: True, False, or Rationalization?

Paywalls: New Angles for Bad Actors

January 2, 2025

Information literacy is more important now than ever, especially as people become more polarized in their views. This is due to multiple factors such as the news media chasing profits, bad actors purposefully spreading ignorance, and algorithms that feed people confirmation biased information. Thankfully there are people like Isabella Bruno, who leads the Smithsonian’s Learning and Community department, part of the Office of Digital Transformation. She’s dedicated to learning and on her Notion she brags…er…shares that she has access to journals and resources that are otherwise locked behind paywalls.

For her job, Bruno uses a lot of academic resources, but she knows that everyone doesn’t have the same access as her. She wrote the following resource to help her fellow learning enthusiasts and researchers: How Can I Access Research Resources When Not Attached To An Academic Institution?

Bruno shares a flow chart that explains how to locate resources. If the item is a book, she recommends using LibGen, Z-Library, and BookSC. She forgets to try the Internet Archive and inter-library loans. If the source is a book, she points towards OA.mg and trying PaperPanda. It is a Chrome extension that accesses papers. She also suggests Unpaywall, another Chrome extension, that searches for the desired paper.

When in further doubt, Bruno recommends Sci-Hub or the subreddit /r/Scholar, where users exchange papers. Her best advice is directly emailing the author, but

“Sometimes you might not get a response. This is because early-career researchers (who do most of the hard work) are the most likely to reply, but the corresponding author (i.e. the author with the email address on the paper) is most likely faculty and their inboxes will often be far too full to respond to these requests. The sad reality is that you’re probably not going to get a response if you’re emailing a senior academic. 100% agree. Also, unless the paper just dropped, there’s no guarantee that any of the authors are still at that institution. Academic job security is a fantasy and researchers change institutions often, so a lot of those emails are going off into the aether.”

Bruno needs to tell people to go to their local university or visit a public library! They know how to legally get around copyright.

Whitney Grace, January 2, 2025

Written by Stephen E. Arnold · Filed Under Copyright, Financial, News | Comments Off on Paywalls: New Angles for Bad Actors

The Future of Copyright: AI + Bots = Surprise. Disappeared Mario Content.

October 4, 2024

This essay is the work of a dumb dinobaby. No smart software required.

Did famously litigious Nintendo hire “brand protection” firm Tracer to find and eliminate AI-made Mario mimics? According to The Verge, “An AI-Powered Copyright Tool Is Taking Down AI-Generated Mario Pictures.” We learn the tool went on a rampage through X, filing takedown notices for dozens of images featuring the beloved Nintendo character. Many of the images were generated by xAI’s Grok AI tool, which is remarkably cavalier about infringing (or offensive) content. But some seem to have been old-school fan art. (Whether noncommercial fan art is fair use or copyright violation continues to be debated.) Verge writer and editor Wes Davis reports:

“The company apparently used AI to identify the images and serve takedown notices on behalf of Nintendo, hitting AI-generated images as well as some fan art. The Verge’s Tom Warren received an X notice that some content from his account was removed following a Digital Millennium Copyright Act (DMCA) complaint issued by a ‘customer success manager’ at Tracer. Tracer offers AI-powered services to companies, purporting to identify trademark and copyright violations online. The image in question, shown above, was a Grok-generated picture of Mario smoking a cigarette and drinking an oddly steaming beer.”

Navigate to the post to see the referenced image, where the beer does indeed smoke but the ash-laden cigarette does not. Davis notes the rest of the posts are, of course, no longer available to analyze. However, some users have complained their original fan art was caught in the sweep. We learn:

“One of the accounts that was listed in the DMCA request, OtakuRockU, posted that they were warned their account could be terminated over ‘a drawing of Mario,’ while another, PoyoSilly, posted an edited version of a drawing they said was identified in a notice. (The new one had a picture of a vaguely Mario-resembling doll inserted over a part of the image, obscuring the original part containing Mario.)”

Since neither Nintendo nor Tracer responded to Davis’ request for comment, he could not confirm Tracer was acting at the game company’s request. He is not, however, ready to let the matter go: The post closes with a request for readers to contact him if they had a Mario image taken down, whether AI-generated or not. See the post for that contact information, if applicable.

Cynthia Murrell, October 4, 2024

Written by Stephen E. Arnold · Filed Under Business process, Copyright, News | Comments Off on The Future of Copyright: AI + Bots = Surprise. Disappeared Mario Content.

Stop Indexing! And Pay Up!

July 17, 2024

This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.

I read “Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI.” The write up appears in two online publications, presumably to make an already contentious subject more clicky. The assertion in the title is the equivalent of someone in Salem, Massachusetts, pointing at a widower and saying, “She’s a witch.” Those willing to take the statement at face value would take action. The “trials” held in colonial Massachusetts. My high school history teacher was a witchcraft trial buff. (I think his name was Elmer Skaggs.) I thought about his descriptions of the events. I recall his graphic depictions and analysis of what I recall as “dunking.” The idea was that if a person was a witch, then that person could be immersed one or more times. I think the idea had been popular in medieval Europe, but it was not a New World innovation. Me-too is a core way to create novelty. The witch could survive being immersed for a period of time. With proof, hanging or burning were the next step. The accused who died was obviously not a witch. That’s Boolean logic in a pure form in my opinion.

The Library in Alexandria burns in front of people who wanted to look up information, learn, and create more information. Tough. Once the cultural institution is gone, just figure out the square root of two yourself. Thanks, MSFT Copilot. Good enough.

The accusations and evidence in the article depict companies building large language models as candidates for a test to prove that they have engaged in an improper act. The crime is processing content available on a public network, indexing it, and using the data to create outputs. Since the late 1960s, digitizing information and making it more easily accessible was perceived as an important and necessary activity. The US government supported indexing and searching of technical information. Other fields of endeavor recognized that as the volume of information expanded, the traditional methods of sitting at a table, reading a book or journal article, making notes, analyzing the information, and then conducting additional research or writing a technical report was simply not fast enough. What worked in a medieval library was not a method suited to put a satellite in orbit or perform other knowledge-value tasks.

Thus, online became a thing. Remember, we are talking punched cards, mainframes, and clunky line printers one day there was the Internet. The interest in broader access to online information grew and by 1985, people recognized that online access was useful for many tasks, not just looking up information about nuclear power technologies, a project I worked on in the 1970s. Flash forward 50 years, and we are upon the moment one can read about the “fact” that Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI.

The write up says:

AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission. Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce.

I understand the surprise some experience when they learn that a software script visits a Web site, processes its content, and generates an index (a buzzy term today is large language model, but I prefer the simpler word index.)

I want to point out that for decades those engaged in making information findable and accessible online have processed content so that a user can enter a query and get a list of indexed items which match that user’s query. In the old days, one used Boolean logic which we met a few moments ago. Today a user’s query (the jazzy term is prompt now) is expanded, interpreted, matched to the user’s “preferences”, and a result generated. I like lists of items like the entries I used to make on a notecard when I was a high school debate team member. Others want little essays suitable for a class assignment on the Salem witchcraft trials in Mr. Skaggs’s class. Today another system can pass a query, get outputs, and then take another action. This is described by the in-crowd as workflow orchestration. Others call it, “taking a human’s job.”

My point is that for decades, the index and searching process has been without much innovation. Sure, software scripts can know when to enter a user name and password or capture information from Web pages that are transitory, disappearing in the blink of an eye. But it is still indexing over a network. The object remains to find information of utility to the user or another system.

The write up reports:

Proof News contributor Alex Reisner obtained a copy of Books3, another Pile dataset and last year published a piece in The Atlantic reporting his finding that more than 180,000 books, including those written by Margaret Atwood, Michael Pollan, and Zadie Smith, had been lifted. Many authors have since sued AI companies for the unauthorized use of their work and alleged copyright violations. Similar cases have since snowballed, and the platform hosting Books3 has taken it down. In response to the suits, defendants such as Meta, OpenAI, and Bloomberg have argued their actions constitute fair use. A case against EleutherAI, which originally scraped the books and made them public, was voluntarily dismissed by the plaintiffs. Litigation in remaining cases remains in the early stages, leaving the questions surrounding permission and payment unresolved. The Pile has since been removed from its official download site, but it’s still available on file sharing services.

The passage does a good job of making clear that most people are not aware of what indexing does, how it works, and why the process has become a fundamental component of many, many modern knowledge-centric systems. The idea is to find information of value to a person with a question, present relevant content, and enable the user to think new thoughts or write another essay about dead witches being innocent.

The challenge today is that anyone who has written anything wants money. The way online works is that for any single user’s query, the useful information constitutes a tiny, miniscule fraction of the information in the index. The cost of indexing and responding to the query is high, and those costs are difficult to control.

But everyone has to be paid for the information that individual “created.” I understand the idea, but the reality is that the reason indexing, search, and retrieval was invented, refined, and given numerous life extensions was to perform a core function: Answer a question or enable learning.

The write up makes it clear that “AI companies” are witches. The US legal system is going to determine who is a witch just like the process in colonial Salem. Several observations are warranted:

Modifying what is a fundamental mechanism for information retrieval may be difficult to replace or re-invent in a quick, cost-efficient, and satisfactory manner. Digital information is loosey goosey; that is, it moves, slips, and slides either by individual’s actions or a mindless system’s.
Slapping fines and big price tags on what remains an access service will take time to have an impact. As the implications of the impact become more well known to those who are aggrieved, they may find that their own information is altered in a fundamental way. How many research papers are “original”? How many journalists recycle as a basic work task? How many children’s lives are lost when the medical reference system does not have the data needed to treat the kid’s problem?
Accusing companies of behaving improperly is definitely easy to do. Many companies do ignore rules, regulations, and cultural norms. Engineering Index’s publisher leaned that bootleg copies of printed Compendex indexes were available in China. What was Engineering Index going to do when I learned this almost 50 years ago? The answer was give speeches, complain to those who knew what the heck a Compendex was, and talk to lawyers. What happened to the Chinese content pirates? Not much.

I do understand the anger the essay expresses toward large companies doing indexing. These outfits are to some witches. However, if the indexing of content is derailed, I would suggest there are downstream consequences. Some of those consequences will make zero difference to anyone. A government worker at a national lab won’t be able to find details of an alloy used in a nuclear device. Who cares? Make some phone calls? Ask around. Yeah, that will work until the information is needed immediately.

A student accustomed to looking up information on a mobile phone won’t be able to find something. The document is a 404 or the information returned is an ad for a Temu product. So what? The kid will have to go the library, which one hopes will be funded, have printed material or commercial online databases, and a librarian on duty. (Good luck, traditional researchers.) A marketing team eager to get information about the number of Telegram users in Ukraine won’t be able to find it. The fix is to hire a consultant and hope those bright men and women have a way to get a number, a single number, good, bad, or indifferent.)

My concern is that as the intensity of the objections about a standard procedure for building an index escalate, the entire knowledge environment is put at risk. I have worked in online since 1962. That’s a long time. It is amazing to me that the plumbing of an information economy has been ignored for a long time. What happens when the companies doing the indexing go away? What happens when those producing the government reports, the blog posts, or the “real” news cannot find the information needed to create information? And once some information is created, how is another person going to find it. Ask an eighth grader how to use an online catalog to find a fungible book. Let me know what you learn? Better yet, do you know how to use a Remac card retrieval system?

The present concern about information access troubles me. There are mechanisms to deal with online. But the reason content is digitized is to find it, to enable understanding, and to create new information. Digital information is like gerbils. Start with a couple of journal articles, and one ends up with more journal articles. Kill this access and you get what you wanted. You know exactly who is the Salem witch.

Stephen E Arnold, July 17, 2024

Written by Stephen E. Arnold · Filed Under AI, Business process, Copyright, Indexing, News, Online (general) | Comments Off on Stop Indexing! And Pay Up!

Google: Another Unfair Allegation and You Are Probably Sorry

July 10, 2024

Just as some thought Google was finally playing nice with content rightsholders, a group of textbook publishers begs to differ—in court. TorrentFreak reports, “Google ‘Profits from Pirated Textbooks’ Publishers’ Lawsuit Claims.” The claimants accuse Google of not only ignoring textbook pirates in search results, but of actively promoting them to line its own coffers. Writer Andy Maxwell quotes the complaint:

“’Of course, Google’s Shopping Ads for Infringing Works … do not use photos of the pirates’ products; rather, they use unauthorized photos of the Publishers’ own textbooks, many of which display the Marks. Thus, with Infringing Shopping Ads, this “strong sense of the product” that Google is giving is a bait-and-switch,’ the complaint alleges.”

The complaint emphasizes Google actively creates, ranks, and targets ads for pirated products. It also assesses the quality of advertised sites. It is fishy, then, that infringing works often rank before or near ads for the originals.

In case one is still willing to give Google the benefit of the doubt, the complaint lists several reasons the company should know better. There are the sketchy site names like “Cheapbok,” and “Biz Ninjas.” Then there are the unrealistically low prices. A semester’s worth of textbooks should break the bank; that is just part of the college experience. Perhaps even more damning is Google’s own assertion it verifies sellers’ identities. The write-up continues:

“[The publishers] claim that verification means Google has the ability to communicate with sellers via email or verified phone numbers. In cases where Google was advised that a seller was offering pirated content and Google users were still able to place orders after clicking an ad, ‘Google had the ability to stop the direct infringement entirely.’ In the majority of cases where pirate sellers predominantly or exclusively use Google Ads to reach their customer base, terminating their accounts would’ve had a significant impact on future sales.”

No doubt. Publishers have tried to address the issue through Google’s stated process of takedown notices to no avail. In fact, they allege, the company is downright hostile to any that push the issue. We learn:

“When the publishers sent follow-up notices for matters previously reported but not handled to their satisfaction, ‘Google threatened on multiple occasions to stop reviewing all the Publishers’ notices for up to six months,’ the complaint alleges. Google’s response was due to duplicate requests; the company warned that if that happened three or more times on the same request, it would ‘consider that particular request to be manifestly unfounded’ which could lead the company to ‘temporarily stop reviewing your requests for a period of up to 180 days.’”

Ah, corporate logic. Will Google’s pirate booty be worth the legal headaches? The textbook publishers bringing suit include Cengage Learning, Macmillan Learning, Macmillan Holdings, LLC; Elsevier Inc., Elsevier B.V., and McGraw Hill LLC. The complaint was filed in the US District Court for the Southern District of New York.

Cynthia Murrell, July 10, 2024

Written by Stephen E. Arnold · Filed Under Copyright, Google, News | Comments Off on Google: Another Unfair Allegation and You Are Probably Sorry

Prediction: Next Target Up — Public Libraries

June 26, 2024

This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.

The publishers (in spirit at least) have kneecapped the Internet Archive. If you don’t know what the online service does or did, it does not matter. I learned from the estimable ShowBiz411.com site, a cultural treasure is gone. Forget digital books, the article “Paramount Erases Archives of MTV Website, Wipes Music, Culture History After 30 Plus Years” says:

Parent company Paramount, formerly Viacom, has tossed twenty plus years of news archives. All that’s left is a placeholder site for reality shows. The M in MTV – music — is gone, and so is all the reporting and all the journalism performed by music and political writers ever written. It’s as if MTV never existed. (It’s the same for VH1.com, all gone.)

Why? The write up couches the savvy business decision of the Paramount leadership this way:

There’s no precedent for this, and no valid reason. Just cheapness and stupidity.

Tibby, my floppy ear Frenchie, is listening to music from the Internet Archive. He knows the publishers removed 500,000 books. Will he lose access to his beloved early 20th century hill music? Will he ever be able to watch reruns of the rock the casbah music video? No. He is a risk. A threat. A despicable knowledge seeker. Thanks to myself for this nifty picture.

My knowledge of MTV and VH1 is limited. I do recall telling my children, “Would you turn that down, please?” What a waste of energy. Future students of American culture will have a void. I assume some artifacts of the music videos will remain. But the motherlode is gone. Is this a loss? On one hand, no. Thank goodness I will not have to glimpse performs rocking the casbah. On the other hand, yes. Archaeologists study bits of stone, trying to figure out how those who left them built Machu Pichu did it. The value of lost information to those in the future is tough to discuss. But knowledge products may be like mine tailings. At some point, a bright person can figure out how to extract trace elements in quantity.

I have a slightly different view of these two recent cultural milestones. I have a hunch that the publishers want to protect their intellectual property. Internet Archive rolled over because its senior executives learned from their lawyers that lawsuits about copyright violations would be tough to win. The informed approach was to delete 500,000 books. Imagine an online service like the Internet Archive trying to be a library.

That brings me to what I think is going on. Copyright litigation will make quite a lot of digital information disappear. That means that increasing fees to public libraries for digital copies of books to “loan” to patrons must go up. Libraries who don’t play ball may find that those institutions will be faced with other publisher punishments: No American Library Association after parties, no consortia discounts, and at some point no free books.

Yes, libraries will have to charge a patron to check out a physical book and then the “publishers” will get a percentage.

The Andrew Carnegie “free” thing is wrong. Libraries rip off the publishers. Authors may be mentioned, but what publisher cares about 99 percent of its authors? (I hear crickets.)

Several thoughts struck me as I was walking my floppy ear Frenchie:

The loss of information (some of which may have knowledge value) is no big deal in a social structure which does not value education. If people cannot read, who cares about books? Publishers and the wretches who write them. Period.
The video copyright timebomb of the Paramount video content has been defused. Let’s keep those lawyers at bay, please. Who will care? Nostalgia buffs and the parents of the “stars”?
The Internet Archive has music; libraries have music. Those are targets not on Paramount’s back. Who will shoot at these targets? Copyright litigators. Go go go.

Net net: My prediction is that libraries must change to a pay-to-loan model or get shut down. Who wants informed people running around disagreeing with lawyers, accountants, and art history majors?

Stephen E Arnold, June 26, 2024

Written by Stephen E. Arnold · Filed Under Business strategy, Copyright, Financial, News, Publishing | Comments Off on Prediction: Next Target Up — Public Libraries

Another Small Victory for OpenAI Against Authors

March 12, 2024

This essay is the work of a dumb dinobaby. No smart software required.

For those following the fight between human content creators and AI firms, score one for the algorithm engineers. TorrentFreak reports, “Court Dismisses Authors’ Copyright Infringement Claims Against OpenAI.” At issue is generative AI’s practice of feeding on humans’ work, without compensation, in order to mimic it. Multiple suits have been filed by record labels, writers, and visual artists. Reporter Ernesto Van der Sar writes:

“Several of the lawsuits filed by book authors include a piracy component. The cases allege that tech companies, including Meta and OpenAI, used the controversial Books3 dataset to train their models. The Books3 dataset was created by AI researcher Shawn Presser in 2020, who scraped the library of ‘pirate’ site Bibliotik. The general vision was that the plaintext collection of more than 195,000 books, which is nearly 37GB in size, could help AI enthusiasts build better models. The vision wasn’t wrong; large text archives are great training material for Large Language Models, but many authors disapprove of their works being used in this manner, without permission or compensation.”

A large group of rights holders have a football team. Those big folks are chasing the small but feisty opponent down the field. Which team will score? Thanks, MSFT Copilot. Keep up the good enough work.

Is that so unreasonable? Maybe not, but existing copyright law did not foresee this situation. We learn:

“After reviewing input from both sides, California District Judge Araceli Martínez-Olguín ruled on the matter. In her order, she largely sides with OpenAI. The vicarious copyright infringement claim fails because the court doesn’t agree that all output produced by OpenAI’s models can be seen as a derivative work. To survive, the infringement claim has to be more concrete.”

The plaintiffs are not out of moves, however. They can still file an amended complaint. But unless updated legislation is passed in the meantime, they may just be rebuffed again. So all they need is for Congress to act quickly to protect artists from tech firms. Any day now.

Cynthia Murrell, March 12, 2024

Written by Stephen E. Arnold · Filed Under Copyright, Legal matters, News | 1 Comment

Content Mastication: A Controversial Business Tactic

January 25, 2024

This essay is the work of a dumb dinobaby. No smart software required.

In the midst of the unfolding copyright issues, I found this post quite interesting. Torrent Freak published a story titled “Meta Admits Use of ‘Pirated’ Book Dataset to Train AI.” Is the story spot on? I sure don’t know. Nevertheless, the headline is a magnetic one. The story reports:

The cases allege that tech companies, including Meta and OpenAI, used the controversial Books3 dataset to train their models. The Books3 dataset has a clear piracy angle. It was created by AI researcher Shawn Presser in 2020, who scraped the library of ‘pirate’ site Bibliotik. This book archive was publicly hosted by digital archiving collective ‘The Eye‘ at the time, alongside various other data sources.

A combination of old-fashioned content collection and smart systems move information from Point A (a copyright owner’s night table) to a smart software system. MSFT’s second class Copilot Bing thing created this cartoon. Sigh. Not even good enough now in my opinion.

What was in the Books3 data collection? The TF story elucidates:

The general vision was that the plaintext collection of more than 195,000 books, which is nearly 37GB…

What did Meta allegedly do to make its Llama smarter than the average member of the Camelidae family? Let’s roll the TF quote:

Responding to a lawsuit from writer/comedian Sarah Silverman, author Richard Kadrey, and other rights holders, the tech giant admits that “portions of Books3” were used to train the Llama AI model before its public release. “Meta admits that it used portions of the Books3 dataset, among many other materials, to train Llama 1 and Llama 2,” Meta writes in its answer [to a court].

The article does not include any statements like “Thank you for the question” or “I don’t know. My team will provide the answer at the earliest possible moment.” Nope. Just an alleged admission.

How will the Meta and parallel copyright legal matter evolve? Beyond Search has zero clue. The US judicial system has deep and mysterious logic. One thing is certain: Senior executives do not like uncertainty and risk. The copyright litigation seems tailored to cause some techno feudalists to imagine a world in which laws, annoying regulators, and people yapping about intellectual property were nudged into a different line of work. One example which comes to mind is building secure bunkers or taking care of the lawn.

Stephen E Arnold, January 25, 2024

Written by Stephen E. Arnold · Filed Under AI, Copyright, News | 1 Comment

Search the site
Subscribe to Beyond Search
Feature archive
News archive

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.