Another Small Victory for OpenAI Against Authors

March 12, 2024

This essay is the work of a dumb dinobaby. No smart software required.

For those following the fight between human content creators and AI firms, score one for the algorithm engineers. TorrentFreak reports, “Court Dismisses Authors’ Copyright Infringement Claims Against OpenAI.” At issue is generative AI’s practice of feeding on humans’ work, without compensation, in order to mimic it. Multiple suits have been filed by record labels, writers, and visual artists. Reporter Ernesto Van der Sar writes:

“Several of the lawsuits filed by book authors include a piracy component. The cases allege that tech companies, including Meta and OpenAI, used the controversial Books3 dataset to train their models. The Books3 dataset was created by AI researcher Shawn Presser in 2020, who scraped the library of ‘pirate’ site Bibliotik. The general vision was that the plaintext collection of more than 195,000 books, which is nearly 37GB in size, could help AI enthusiasts build better models. The vision wasn’t wrong; large text archives are great training material for Large Language Models, but many authors disapprove of their works being used in this manner, without permission or compensation.”

A large group of rights holders have a football team. Those big folks are chasing the small but feisty opponent down the field. Which team will score? Thanks, MSFT Copilot. Keep up the good enough work.

Is that so unreasonable? Maybe not, but existing copyright law did not foresee this situation. We learn:

“After reviewing input from both sides, California District Judge Araceli Martínez-Olguín ruled on the matter. In her order, she largely sides with OpenAI. The vicarious copyright infringement claim fails because the court doesn’t agree that all output produced by OpenAI’s models can be seen as a derivative work. To survive, the infringement claim has to be more concrete.”

The plaintiffs are not out of moves, however. They can still file an amended complaint. But unless updated legislation is passed in the meantime, they may just be rebuffed again. So all they need is for Congress to act quickly to protect artists from tech firms. Any day now.

Cynthia Murrell, March 12, 2024

Content Mastication: A Controversial Business Tactic

January 25, 2024

This essay is the work of a dumb dinobaby. No smart software required.

In the midst of the unfolding copyright issues, I found this post quite interesting. TorrentFreak published a story titled “Meta Admits Use of ‘Pirated’ Book Dataset to Train AI.” Is the story spot on? I sure don’t know. Nevertheless, the headline is a magnetic one. The story reports:

The cases allege that tech companies, including Meta and OpenAI, used the controversial Books3 dataset to train their models. The Books3 dataset has a clear piracy angle. It was created by AI researcher Shawn Presser in 2020, who scraped the library of ‘pirate’ site Bibliotik. This book archive was publicly hosted by digital archiving collective ‘The Eye’ at the time, alongside various other data sources.

A combination of old-fashioned content collection and smart systems moves information from Point A (a copyright owner’s night table) to a smart software system. MSFT’s second-class Copilot Bing thing created this cartoon. Sigh. Not even good enough now, in my opinion.

What was in the Books3 data collection? The TF story elucidates:

The general vision was that the plaintext collection of more than 195,000 books, which is nearly 37GB…

What did Meta allegedly do to make its Llama smarter than the average member of the Camelidae family? Let’s roll the TF quote:

Responding to a lawsuit from writer/comedian Sarah Silverman, author Richard Kadrey, and other rights holders, the tech giant admits that “portions of Books3” were used to train the Llama AI model before its public release. “Meta admits that it used portions of the Books3 dataset, among many other materials, to train Llama 1 and Llama 2,” Meta writes in its answer [to a court].

The article does not include any statements like “Thank you for the question” or “I don’t know. My team will provide the answer at the earliest possible moment.” Nope. Just an alleged admission.

How will the Meta and parallel copyright legal matter evolve? Beyond Search has zero clue. The US judicial system has deep and mysterious logic. One thing is certain: Senior executives do not like uncertainty and risk. The copyright litigation seems tailored to cause some techno feudalists to imagine a world in which laws, annoying regulators, and people yapping about intellectual property were nudged into a different line of work. One example which comes to mind is building secure bunkers or taking care of the lawn.

Stephen E Arnold, January 25, 2024

PicRights in the News: Happy Holidays

November 28, 2023

This essay is the work of a dumb dinobaby. No smart software required.

With the legal eagles cackling in their nests about artificial intelligence software using content without permission, the notion of rights enforcement is picking up steam. One niche in rights enforcement is the business of using image search tools to locate pictures and drawings which appear in blogs or informational Web pages.

StackOverflow hosts a thread by a developer who linked to or used an image more than a decade ago. On November 23, 2023, the individual queried those reading the Q&A section about a problem: “Am I Liable for Re-Sharing Image Links Provided via the Stack Exchange API?”

The legal eagle jabs with his beak at the person who used an image, assuming it was open source. The legal eagle wants justice to matter. Thanks, MSFT Copilot. A couple of tries, still not on point, but good enough.

The explanation of the situation is interesting to me for three reasons: [a] The alleged infraction took place in 2010; [b] Stack Exchange is a community learning and sharing site which manifests some open sourciness; and [c] information about specific rights, content ownership, and data reached via links is not front and center.

Ignorance of the law, at least in some circles, is no excuse for a violation. The cited post reveals that an outfit doing business as PicRights wants money for the unlicensed use of an image or art in 2010 (at least that’s how I read the situation).

What’s interesting is the related data provided by those who are responding to the request for information; for example:

  • A law firm identified as Higbee & Asso. is the apparent pointy end of the spear pointed at the alleged offender’s wallet
  • A link to an article titled “Is PicRights a Scam? Are Higbee & Associates Emails a Scam?”
  • A marketing type of write up called “How To Properly Deal With A PicRights Copyright Unlicensed Image Letter”.

How did the story end? Here’s what the person accused of infringing did:

According to Law I am liable. I have therefore decided to remove FAQoverflow completely, all 90k+ pages of it, and will negotiate with PicRights to pay them something less than the AU$970 that they are demanding.

What are the downsides to the loss of the FAQoverflow content? I don’t know. But I assume that the legal eagles, after gobbling one snack, are aloft and watching the AI companies. That’s where the big bucks will be. Legal eagles have a fondness for big bucks I believe.

Net net: Justice is served by some interesting birds, eagles and whatnot.

Stephen E Arnold, November 28, 2023

Microsoft, the Techno-Lord: Avoid My Galloping Steed, Please

November 27, 2023

This essay is the work of a dumb dinobaby. No smart software required.

The Merriam-Webster.com online site defines “responsibility” this way:

re·spon·si·bil·i·ty

1 : the quality or state of being responsible: such as
: moral, legal, or mental accountability
: RELIABILITY, TRUSTWORTHINESS
: something for which one is responsible

The online sector has a clever spin on responsibility; that is, in my opinion, the companies have none. Google wants people who use its online tools and post content created with those tools to make sure that what the Google system outputs does not violate any applicable rules, regulations, or laws.

In a traditional fox hunt, the hunters had the “right” to pursue the animal. If a farmer’s daughter were in the way, it was the farmer’s responsibility to keep the silly girl out of the horse’s path. That will teach them to respect their betters, I assume. Thanks, MSFT Copilot. I know you would not put me in legal jeopardy, would you? Now what are the laws pertaining to copyright for a cartoon in Armenia? Darn, I have to know that, don’t I.

Such a crafty way of defining itself as the mere creator of software machines has inspired Microsoft to follow a similar path. The idea is that anyone using Microsoft products, solutions, and services is “responsible” to comply with applicable rules, regulations, and laws.

Tidy. Logical. Complete. Just like a nifty algebra identity.

“Microsoft Wants YOU to Be Sued for Copyright Infringement, Washes Its Hands of AI Copyright Misuse and Says Users Should Be Liable for Copyright Infringement” explains:

Microsoft believes they have no liability if an AI, like Copilot, is used to infringe on copyrighted material.

The write up includes this passage:

So this all comes down to, according to Microsoft, that it is providing a tool, and it is up to users to use that tool within the law. Microsoft says that it is taking steps to prevent the infringement of copyright by Copilot and its other AI products, however, Microsoft doesn’t believe it should be held legally responsible for the actions of end users.

The write up (with no Jimmy Kimmel spin) includes this statement, allegedly from someone at Microsoft:

Microsoft is willing to work with artists, authors, and other content creators to understand concerns and explore possible solutions. We have adopted and will continue to adopt various tools, policies, and filters designed to mitigate the risk of infringing outputs, often in direct response to the feedback of creators. This impact may be independent of whether copyrighted works were used to train a model, or the outputs are similar to existing works. We are also open to exploring ways to support the creative community to ensure that the arts remain vibrant in the future.

From my drafty office in rural Kentucky, the refusal to accept responsibility for its business actions, its products, its policies to push tools and services on users, and the outputs of its cloudy system is quite clever. Exactly how will a user of products pushed at users like Edge and its smart features prevent a user from acquiring from a smart Microsoft system something that violates an applicable rule, regulation, or law?

But legal and business cleverness is the norm for the techno-feudalists. Let the serfs deal with the body of the child killed when the barons chase a fox through a small leasehold. I can hear the brave royals saying, “It’s your fault. Your daughter was in the way. No, I don’t care that she was using the free Microsoft training materials to learn how to use our smart software.”

Yep, responsible. The death of the hypothetical child frees up another space in the training course.

Stephen E Arnold, November 27, 2023

Copyright Trolls: An Explanation Which Identifies Some Creatures

November 14, 2023

This essay is the work of a dumb humanoid. No smart software required.

If you are not familiar with firms which pursue those who intentionally or unintentionally use another person’s work in their writings, you may not know what a “copyright troll” is. I want to point you to an interesting post from IntoTheMinds.com. The write up “PicRights + AFP: Une Opération de Copyright Trolling Bien Rodée” appeared in 2021, and it was updated in June 2023. The original essay is in French, but you may want to give Google Translate a whirl if your high school French is but a memoire dou dou.

A copyright troll is looking in the window of a blog writer. The troll is waiting for the writer to use content covered by copyright and for which a fee must be paid. The troll is patient. The blog writer is clueless. Thanks, Microsoft Bing. Nice troll. Do you perhaps know one?

The write up does a good job of explaining trollism with particular reference to an estimable outfit called PicRights and the even more estimable Agence France-Presse. It also does a bit of critical review of the PicRights operation, including the use of language to alleged copyright violators about how their lives will take a nosedive if money is not paid promptly for the alleged transgression. There are some thoughts about what to do if and when a copyright troll, like the one pictured courtesy of Microsoft Bing’s art generator, comes calling, along with some comments about the rules and regulations regarding trollism. The author includes a few observations about the rights of creators. And a few suggested readings are included. Of particular note is the discussion of an estimable legal eagle outfit doing business as Higbee and Associates. You can find that document at this link.

If you are interested in copyright trolling in general and PicRights in particular, I suggest you download the document. I am not sure how long it will remain online.

Stephen E Arnold, November 14, 2023

Getty and Its Licensed Smart Software Art

September 26, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid. (Yep, the dinobaby is back from France. Thanks to those who made the trip professionally and personally enjoyable.)

The illustration shows a very, very happy image rights troll. The cloud of uncertainty from AI generated images has passed. Now the rights software bots, controlled by cheerful copyright trolls, can scour the Web for unauthorized image use. Forget the humanoids. The action will be from tireless AI generators and equally robust bots designed to charge a fee for the image created by zeros and ones. Yes!

A quite joyful copyright troll displays his killer moves. Thanks, MidJourney. The gradient descent continues, right into the legal eagles’ nests.

“Getty Made an AI Generator That Only Trained on Its Licensed Images” reports:

Generative AI by Getty Images (yes, it’s an unwieldy name) is trained only on the vast Getty Images library, including premium content, giving users full copyright indemnification. This means anyone using the tool and publishing the image it created commercially will be legally protected, promises Getty. Getty worked with Nvidia to use its Edify model, available on Nvidia’s generative AI model library Picasso.

This is exciting. Will the images include a tough-to-discern watermark? Will the images include a license plate, a social security number, or just a nifty string of harmless digits?

The article does reveal the money angle:

The company said any photos created with the tool will not be included in the Getty Images and iStock content libraries. Getty will pay creators if it uses their AI-generated image to train the current and future versions of the model. It will share revenues generated from the tool, “allocating both a pro rata share in respect of every file and a share based on traditional licensing revenue.”

Who will be happy? Getty, the trolls, or the designers who have a way to be more productive with a helping hand from the Getty robot? I think the world will be happier because monetization, smart software, and lawyers are a business model with legs… or claws.

Stephen E Arnold, September 26, 2023

Can Smart Software Get Copyright? Wrong?

September 15, 2023

It is official: copyrights are for humans, not machines. JD Supra brings us up to date on AI and official copyright guidelines in, “Using AI to Create a Work – Copyright Protection and Infringement.” The basic principle goes both ways. Creators cannot copyright AI-generated material unless they have manipulated it enough to render it a creative work. On the other hand, it is a violation to publish AI-generated content that resembles a copyright-protected work. As for feeding algorithms a diet of human-made media, that is not officially against the rules. Yet. We learn:

“To obtain copyright protection for a work containing AI-generated material, the work must have sufficient human input, such as sufficient modification of the AI output or the human selection or arrangement of the AI content. However, copyright protection would be limited to those ‘human-made’ elements. Past, pending, and future copyright applications need to identify explicitly the human element and disclaim the AI-created content if it is more than minor. For existing registrations, a supplementary registration may be necessary. Works created using AI are subject to the same copyright infringement analysis applicable to any work. The issue with using AI to create works is that the sources of the original works may not be identified, so an infringement analysis cannot be conducted until the cease-and-desist letter is received. No court has yet adopted the theory that merely using an AI database means the resulting work is automatically an infringing derivative work if it is not substantially similar to the protectable elements in the copyrighted work.”

The article cites the Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence, 88 Fed. Reg. 16,190 (March 16, 2023). It notes those guidelines were informed by a decision handed down in February on Zarya of the Dawn, a comic book with AI-generated content. The Copyright Office sliced and diced elements, specifying:

“… The selection and arrangement of the images and the text were the result of human authorship and thus copyrightable, but the AI-generated images resulting from human prompts were not. The prompts ‘influenced,’ but did not ‘dictate,’ the resulting image, so the applicant was not the ‘mastermind’ and therefore not the author of the images. Further, the applicant’s edits to the images were too minor to be deemed copyrightable.”

Ah, the fine art of splitting hairs. As for training databases packed with protected content, the article points to pending lawsuits by artists against Stability AI, MidJourney, and Deviant Art. We are told those cases may be dismissed on technical grounds, but are advised to watch for similar cases in the future. Stay tuned.

Cynthia Murrell, September 15, 2023

Rights Issues: How Can Money Be Extracted from Content?

March 20, 2023

I don’t have a dog in this fight. I gave up on “real” publishers when the outfits with which I was working in Sweden and the UK went to the big printing press in the multiverse. Yep, failure. I am mindful about image rights too, but that doesn’t mean my Craiyon.com images are in the clear, or the clip art I have in my files from the years of CD-ROMs with illustrations that were “free to use.” Ho ho ho on that marketing blather.

I want to call attention to two news items and then offer a comment or two not presented by other dinobabies watching the wide, wild, wonderful world of digital information.

The first item is the Italian government’s conclusion that the illustration by Leonardo da Vinci is not in the public domain. I used to have a T shirt I bought in Florence with the image on the overpriced, made-in-China garment. I wonder if that shop on the bridge near the secret passage some big wheel used in the 16th century is still selling them. I would assume that the Italian government has hoovered these and converted them to recycling fodder. You can read about this in the article “Italy Decides That Leonardo da Vinci’s 500 Year Old Works Are Not In The Public Domain.” The subtitle of the write up is “from the locking-up-in-the-public-domain department.” The story reports:

According to the Italian Cultural Heritage Code and relevant case law, faithful digital reproductions of works of cultural heritage — including works in the Public Domain — can only be used for commercial purposes against authorization and payment of a fee. Importantly though, the decision to require authorization and claim payment is left to the discretion of each cultural institution (see articles 107 and 108). In practice, this means that cultural institutions have the option to allow users to reproduce and reuse faithful digital reproductions of Public Domain works for free, including for commercial uses. This flexibility is fundamental for institutions to support open access to cultural heritage.

The operative word is “fee.”

The second item is about Internet Archive, a controversial outfit from the point of view of some publishers. The idea is that Internet Archive offers electronic books for free. Free, not fee, is an important concept. Publishers, writers, agents, book cover artists, and probably a French bulldog or two want to get a piece of the money generated by charging for electronic books. Look, Amazon does it, and publishers are not thrilled. But there is some money paid out which is going the right direction.

The report I read is “The Internet Archive Is a Library.” Libraries and publishers have a long history. On one hand, publishers love to sell books to libraries. On the other hand, libraries are not turning cartwheels because libraries loan eBooks and other digital artifacts to patrons. As long as the money streams flow, publishers and rights holders are semi-happy, a bit like a black sheep of the family getting a few bucks when Uncle Tom goes to the big printing shop in the sky where my defunct publishers hopefully work setting type by hand.

The article says:

Despite its incredible library collections, which serve the needs of millions of people, Hachette Book Group, HarperCollins Publishers, John Wiley & Sons Inc., and Penguin Random House assert that the Internet Archive is not a real library.

If one is not a real library, that institution must pay for books. That seems clear to the publishers. I have wondered why the US Library of Congress was not moving in the same direction as the Internet Archive. Oh, well. What about the Special Library Association? Yeah, oh, well. And the American Library Association in concert with Harvard or Stanford? Oh, well.

So the Internet Archive is in jeopardy.

Several observations:

  1. Entities which could have assumed this job in concert with Internet Archive could have been more proactive. They weren’t, so here we are.
  2. Publishers are hungry for revenue; almost any type of revenue stream will do. Why not extract money from an outfit trying to perform a useful library-type function? “Sorry, we want money, and people can buy information from us” summarizes the position of some publishers on earth and possibly in the big printing facility amidst the stars.
  3. Legal eagles love books. Plus those folks sometimes buy books to decorate their offices in the event a meeting is required in a suitably classy environment. Do lawyers read these books? Maybe, but I think professional publishers sell online content to them. Thus, in today’s world it makes sense for lawyers to determine what is a library and what is not, what content is free and which is not. I think I understand, but I am not going to call my attorney because I have to pay in 15 minute increments.

Net net: Libraries are, for many, negative spaces. Some books present information which is bad; therefore, ban or burn the books. Now we can defund regular libraries and shut down the online outfits. Publishers may be thrilled. Others may not care. I like libraries, but dinobabies don’t have influence. I am glad I am old.

Stephen E Arnold, March 20, 2023

RightHub: Will It Supercharge IP Protection and Violation Trolls?

March 16, 2023

Yahoo, believe it or not, displayed an article I found interesting. The title was “Copy That: RightHub Wants To Be the Command Center for Intellectual Property Management.” The story originated on a Silicon Valley “real news” site called TechCrunch.

The write up explains that managing patent, trademark, and copyright information is a hassle. RightHub is, according to the story:

…something akin to what GoDaddy promises in the world of website creation, insofar as GoDaddy allows anyone to search, register, and renew domain names, with additional tools for building and hosting websites.

I am not sure that a domain-name type of model is going to have the professional, high-brow machinery that rights-sensitive outfits expect. I am not sure that many people understand that the domain-name model is fraught with manipulated expiry dates, wheeling and dealing, and possibly good old-fashioned fraud.

The idea of using a database and scripts to keep track of intellectual property is interesting. Tools are available to automate many of the discrete steps required to file, follow up, renew, and remember who did what and when.

But domain name processes as a touchstone?

Sorry. I think that the service will embrace a number of sub functions which may be of interest to some people; for example, enforcement trolls. Many are using manual or outmoded tools like decades old image recognition technology and partial Web content scanning methods. If RightHub offers a robust system, IP protection may become easier. Some trolls will be among the first to seek inspiration and possibly opportunities to be more troll-like.

Stephen E Arnold, March 16, 2023

How to Make Chinese Artificial Intelligence Professionals Hop Like Happy Bunnies

January 23, 2023

Happy New Year! It is the Year of the Rabbit, and the write up “Is Copyright Eating AI?” may make some celebrants happier than the contents of a red envelope. The article explains that the US legal system may derail some of the more interesting, publicly accessible applications of smart software. Why? US legal eagles and the thicket of guard rails which comprise copyright.

The article states:

… neural network developers, get ready for the lawyers, because they are coming to get you.

That means the interesting applications on the “look what’s new on the Internet” news service Product Hunt will disappear. Only big outfits can afford to bring and fight some litigation. When I worked as an expert witness, I learned that money is not an issue of concern for some of the parties to a lawsuit. Those working as a robot repair technician for a fast food chain will want to avoid engaging in a legal dispute.

The write up also says:

If the AI industry is to survive, we need a clear legal rule that neural networks, and the outputs they produce, are not presumed to be copies of the data used to train them. Otherwise, the entire industry will be plagued with lawsuits that will stifle innovation and only enrich plaintiff’s lawyers.

I liked the word “survive.” Yep, continue to exist. That’s an interesting idea. Let’s assume that the US legal process brings AI development to a halt. Who benefits? I am a dinobaby living in rural Kentucky. Nevertheless, it seems to me that a country will just keep on working with smart software informed by content. Some of the content may be a US citizen’s intellectual property, possibly a hard drive with data from Los Alamos National Laboratory, or a document produced by a scientific and technical publisher.

It seems to me that smart software companies and research groups in a country with zero interest in US laws can:

  1. Continue to acquire content by purchase, crawling, or enlisting the assistance of third parties
  2. Use these data to update and refine their models
  3. Develop innovations not available to smart software developers in the US.

Interesting, and with the present efficiency of some legal and regulatory systems, my hunch is that bunnies in China are looking forward to 2023. Will an innovator use enhanced AI for information warfare or other weapons? Sure.

Stephen E Arnold, January 23, 2023
