Stop Indexing! And Pay Up!

July 17, 2024

This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.

I read “Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI.” The write up appears in two online publications, presumably to make an already contentious subject more clicky. The assertion in the title is the equivalent of someone in Salem, Massachusetts, pointing at a widow and saying, “She’s a witch.” Those willing to take the statement at face value would take action. Consider the “trials” held in colonial Massachusetts. My high school history teacher was a witchcraft trial buff. (I think his name was Elmer Skaggs.) I thought about his descriptions of the events. I recall his graphic depictions and analysis of what I recall as “dunking.” The idea was that if a person was a witch, then that person could be immersed one or more times. I think the idea had been popular in medieval Europe, but it was not a New World innovation. Me-too is a core way to create novelty. The witch could survive being immersed for a period of time. With proof, hanging or burning was the next step. The accused who died was obviously not a witch. That’s Boolean logic in a pure form in my opinion.

The Library in Alexandria burns in front of people who wanted to look up information, learn, and create more information. Tough. Once the cultural institution is gone, just figure out the square root of two yourself. Thanks, MSFT Copilot. Good enough.

The accusations and evidence in the article depict companies building large language models as candidates for a test to prove that they have engaged in an improper act. The crime is processing content available on a public network, indexing it, and using the data to create outputs. Since the late 1960s, digitizing information and making it more easily accessible was perceived as an important and necessary activity. The US government supported indexing and searching of technical information. Other fields of endeavor recognized that as the volume of information expanded, the traditional methods of sitting at a table, reading a book or journal article, making notes, analyzing the information, and then conducting additional research or writing a technical report were simply not fast enough. What worked in a medieval library was not a method suited to put a satellite in orbit or perform other knowledge-value tasks.

Thus, online became a thing. Remember, we are talking punched cards, mainframes, and clunky line printers. Then one day there was the Internet. The interest in broader access to online information grew and by 1985, people recognized that online access was useful for many tasks, not just looking up information about nuclear power technologies, a project I worked on in the 1970s. Flash forward 50 years, and we are upon the moment one can read about the “fact” that Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI.

The write up says:

AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission. Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce.

I understand the surprise some experience when they learn that a software script visits a Web site, processes its content, and generates an index (a buzzy term today is large language model, but I prefer the simpler word index).

I want to point out that for decades those engaged in making information findable and accessible online have processed content so that a user can enter a query and get a list of indexed items which match that user’s query. In the old days, one used Boolean logic which we met a few moments ago. Today a user’s query (the jazzy term is prompt now) is expanded, interpreted, matched to the user’s “preferences”, and a result generated. I like lists of items like the entries I used to make on a notecard when I was a high school debate team member. Others want little essays suitable for a class assignment on the Salem witchcraft trials in Mr. Skaggs’s class. Today another system can pass a query, get outputs, and then take another action. This is described by the in-crowd as workflow orchestration. Others call it, “taking a human’s job.”
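For anyone who has not watched an index get built, here is a minimal sketch in Python of the old-days approach: an inverted index plus Boolean matching. It assumes a toy collection of three made-up documents, and the function names (build_index, boolean_and, boolean_or, boolean_not) are hypothetical. No real search system is this simple.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # Strip common punctuation so "trials." and "trials" match.
            index[word.strip(".,;:!?\"'()")].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Ids of documents containing every term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index, *terms):
    """Ids of documents containing at least one term."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))


def boolean_not(index, all_ids, term):
    """Ids of documents that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())


# Made-up sample documents, for illustration only.
docs = {
    1: "Salem witch trials in colonial Massachusetts",
    2: "Indexing technical reports about nuclear power",
    3: "Witch trials and Boolean logic",
}
index = build_index(docs)

print(sorted(boolean_and(index, "witch", "trials")))   # [1, 3]
print(sorted(boolean_or(index, "nuclear", "boolean")))  # [2, 3]
print(sorted(boolean_not(index, docs, "witch")))        # [2]
```

The output is the kind of list I like: document identifiers that match, nothing more. The expansion, interpretation, and preference matching of a modern system sit on top of a lookup much like this one.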

My point is that for decades, the index and searching process has been without much innovation. Sure, software scripts can know when to enter a user name and password or capture information from Web pages that are transitory, disappearing in the blink of an eye. But it is still indexing over a network. The object remains to find information of utility to the user or another system.

The write up reports:

Proof News contributor Alex Reisner obtained a copy of Books3, another Pile dataset and last year published a piece in The Atlantic reporting his finding that more than 180,000 books, including those written by Margaret Atwood, Michael Pollan, and Zadie Smith, had been lifted. Many authors have since sued AI companies for the unauthorized use of their work and alleged copyright violations. Similar cases have since snowballed, and the platform hosting Books3 has taken it down. In response to the suits, defendants such as Meta, OpenAI, and Bloomberg have argued their actions constitute fair use. A case against EleutherAI, which originally scraped the books and made them public, was voluntarily dismissed by the plaintiffs. Litigation in remaining cases remains in the early stages, leaving the questions surrounding permission and payment unresolved. The Pile has since been removed from its official download site, but it’s still available on file sharing services.

The passage does a good job of making clear that most people are not aware of what indexing does, how it works, and why the process has become a fundamental component of many, many modern knowledge-centric systems. The idea is to find information of value to a person with a question, present relevant content, and enable the user to think new thoughts or write another essay about dead witches being innocent.

The challenge today is that anyone who has written anything wants money. The way online works is that for any single user’s query, the useful information constitutes a tiny, minuscule fraction of the information in the index. The cost of indexing and responding to the query is high, and those costs are difficult to control.

But everyone has to be paid for the information that individual “created.” I understand the idea, but the reality is that the reason indexing, search, and retrieval were invented, refined, and given numerous life extensions was to perform a core function: Answer a question or enable learning.

The write up makes it clear that “AI companies” are witches. The US legal system is going to determine who is a witch just like the process in colonial Salem. Several observations are warranted:

A fundamental mechanism for information retrieval may be difficult to replace or re-invent in a quick, cost-efficient, and satisfactory manner. Digital information is loosey goosey; that is, it moves, slips, and slides either by an individual’s actions or a mindless system’s.
Slapping fines and big price tags on what remains an access service will take time to have an impact. As the implications of the impact become more well known to those who are aggrieved, they may find that their own information is altered in a fundamental way. How many research papers are “original”? How many journalists recycle as a basic work task? How many children’s lives are lost when the medical reference system does not have the data needed to treat the kid’s problem?
Accusing companies of behaving improperly is definitely easy to do. Many companies do ignore rules, regulations, and cultural norms. Engineering Index’s publisher learned that bootleg copies of printed Compendex indexes were available in China. What was Engineering Index going to do when I learned this almost 50 years ago? The answer was give speeches, complain to those who knew what the heck a Compendex was, and talk to lawyers. What happened to the Chinese content pirates? Not much.

I do understand the anger the essay expresses toward large companies doing indexing. To some, these outfits are witches. However, if the indexing of content is derailed, I would suggest there are downstream consequences. Some of those consequences will make zero difference to anyone. A government worker at a national lab won’t be able to find details of an alloy used in a nuclear device. Who cares? Make some phone calls? Ask around. Yeah, that will work until the information is needed immediately.

A student accustomed to looking up information on a mobile phone won’t be able to find something. The document is a 404 or the information returned is an ad for a Temu product. So what? The kid will have to go to the library, which one hopes will be funded, have printed material or commercial online databases, and a librarian on duty. (Good luck, traditional researchers.) A marketing team eager to get information about the number of Telegram users in Ukraine won’t be able to find it. The fix is to hire a consultant and hope those bright men and women have a way to get a number, a single number, good, bad, or indifferent.

My concern is that as the intensity of the objections about a standard procedure for building an index escalates, the entire knowledge environment is put at risk. I have worked in online since 1962. That’s a long time. It is amazing to me that the plumbing of an information economy has been ignored for a long time. What happens when the companies doing the indexing go away? What happens when those producing the government reports, the blog posts, or the “real” news cannot find the information needed to create information? And once some information is created, how is another person going to find it? Ask an eighth grader how to use an online catalog to find a fungible book. Let me know what you learn. Better yet, do you know how to use a Remac card retrieval system?

The present concern about information access troubles me. There are mechanisms to deal with online. But the reason content is digitized is to find it, to enable understanding, and to create new information. Digital information is like gerbils. Start with a couple of journal articles, and one ends up with more journal articles. Kill this access and you get what you wanted. You know exactly who is the Salem witch.

Stephen E Arnold, July 17, 2024

Written by Stephen E. Arnold · Filed Under AI, Business process, Copyright, Indexing, News, Online (general)

Google: Another Unfair Allegation and You Are Probably Sorry

July 10, 2024

Just as some thought Google was finally playing nice with content rightsholders, a group of textbook publishers begs to differ—in court. TorrentFreak reports, “Google ‘Profits from Pirated Textbooks’ Publishers’ Lawsuit Claims.” The claimants accuse Google of not only ignoring textbook pirates in search results, but of actively promoting them to line its own coffers. Writer Andy Maxwell quotes the complaint:

“’Of course, Google’s Shopping Ads for Infringing Works … do not use photos of the pirates’ products; rather, they use unauthorized photos of the Publishers’ own textbooks, many of which display the Marks. Thus, with Infringing Shopping Ads, this “strong sense of the product” that Google is giving is a bait-and-switch,’ the complaint alleges.”

The complaint emphasizes Google actively creates, ranks, and targets ads for pirated products. It also assesses the quality of advertised sites. It is fishy, then, that infringing works often rank before or near ads for the originals.

In case one is still willing to give Google the benefit of the doubt, the complaint lists several reasons the company should know better. There are the sketchy site names like “Cheapbok,” and “Biz Ninjas.” Then there are the unrealistically low prices. A semester’s worth of textbooks should break the bank; that is just part of the college experience. Perhaps even more damning is Google’s own assertion it verifies sellers’ identities. The write-up continues:

“[The publishers] claim that verification means Google has the ability to communicate with sellers via email or verified phone numbers. In cases where Google was advised that a seller was offering pirated content and Google users were still able to place orders after clicking an ad, ‘Google had the ability to stop the direct infringement entirely.’ In the majority of cases where pirate sellers predominantly or exclusively use Google Ads to reach their customer base, terminating their accounts would’ve had a significant impact on future sales.”

No doubt. Publishers have tried to address the issue through Google’s stated process of takedown notices to no avail. In fact, they allege, the company is downright hostile to any that push the issue. We learn:

“When the publishers sent follow-up notices for matters previously reported but not handled to their satisfaction, ‘Google threatened on multiple occasions to stop reviewing all the Publishers’ notices for up to six months,’ the complaint alleges. Google’s response was due to duplicate requests; the company warned that if that happened three or more times on the same request, it would ‘consider that particular request to be manifestly unfounded’ which could lead the company to ‘temporarily stop reviewing your requests for a period of up to 180 days.’”

Ah, corporate logic. Will Google’s pirate booty be worth the legal headaches? The textbook publishers bringing suit include Cengage Learning, Macmillan Learning, Macmillan Holdings, LLC; Elsevier Inc., Elsevier B.V., and McGraw Hill LLC. The complaint was filed in the US District Court for the Southern District of New York.

Cynthia Murrell, July 10, 2024

Written by Stephen E. Arnold · Filed Under Copyright, Google, News

Prediction: Next Target Up — Public Libraries

June 26, 2024

This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.

The publishers (in spirit at least) have kneecapped the Internet Archive. If you don’t know what the online service does or did, it does not matter. I learned from the estimable ShowBiz411.com site that a cultural treasure is gone. Forget digital books, the article “Paramount Erases Archives of MTV Website, Wipes Music, Culture History After 30 Plus Years” says:

Parent company Paramount, formerly Viacom, has tossed twenty plus years of news archives. All that’s left is a placeholder site for reality shows. The M in MTV – music — is gone, and so is all the reporting and all the journalism performed by music and political writers ever written. It’s as if MTV never existed. (It’s the same for VH1.com, all gone.)

Why? The write up couches the savvy business decision of the Paramount leadership this way:

There’s no precedent for this, and no valid reason. Just cheapness and stupidity.

Tibby, my floppy ear Frenchie, is listening to music from the Internet Archive. He knows the publishers removed 500,000 books. Will he lose access to his beloved early 20th century hill music? Will he ever be able to watch reruns of the rock the casbah music video? No. He is a risk. A threat. A despicable knowledge seeker. Thanks to myself for this nifty picture.

My knowledge of MTV and VH1 is limited. I do recall telling my children, “Would you turn that down, please?” What a waste of energy. Future students of American culture will have a void. I assume some artifacts of the music videos will remain. But the motherlode is gone. Is this a loss? On one hand, no. Thank goodness I will not have to glimpse performers rocking the casbah. On the other hand, yes. Archaeologists study bits of stone, trying to figure out how those who built Machu Picchu did it. The value of lost information to those in the future is tough to discuss. But knowledge products may be like mine tailings. At some point, a bright person can figure out how to extract trace elements in quantity.

I have a slightly different view of these two recent cultural milestones. I have a hunch that the publishers want to protect their intellectual property. Internet Archive rolled over because its senior executives learned from their lawyers that lawsuits about copyright violations would be tough to win. The informed approach was to delete 500,000 books. Imagine an online service like the Internet Archive trying to be a library.

That brings me to what I think is going on. Copyright litigation will make quite a lot of digital information disappear. That means that fees to public libraries for digital copies of books to “loan” to patrons must go up. Libraries that don’t play ball may find themselves faced with other publisher punishments: No American Library Association after parties, no consortia discounts, and at some point no free books.

Yes, libraries will have to charge a patron to check out a physical book and then the “publishers” will get a percentage.

The Andrew Carnegie “free” thing is wrong. Libraries rip off the publishers. Authors may be mentioned, but what publisher cares about 99 percent of its authors? (I hear crickets.)

Several thoughts struck me as I was walking my floppy ear Frenchie:

The loss of information (some of which may have knowledge value) is no big deal in a social structure which does not value education. If people cannot read, who cares about books? Publishers and the wretches who write them. Period.
The video copyright timebomb of the Paramount video content has been defused. Let’s keep those lawyers at bay, please. Who will care? Nostalgia buffs and the parents of the “stars”?
The Internet Archive has music; libraries have music. Those are targets not on Paramount’s back. Who will shoot at these targets? Copyright litigators. Go go go.

Net net: My prediction is that libraries must change to a pay-to-loan model or get shut down. Who wants informed people running around disagreeing with lawyers, accountants, and art history majors?

Stephen E Arnold, June 26, 2024

Written by Stephen E. Arnold · Filed Under Business strategy, Copyright, Financial, News, Publishing

Another Small Victory for OpenAI Against Authors

March 12, 2024

This essay is the work of a dumb dinobaby. No smart software required.

For those following the fight between human content creators and AI firms, score one for the algorithm engineers. TorrentFreak reports, “Court Dismisses Authors’ Copyright Infringement Claims Against OpenAI.” At issue is generative AI’s practice of feeding on humans’ work, without compensation, in order to mimic it. Multiple suits have been filed by record labels, writers, and visual artists. Reporter Ernesto Van der Sar writes:

“Several of the lawsuits filed by book authors include a piracy component. The cases allege that tech companies, including Meta and OpenAI, used the controversial Books3 dataset to train their models. The Books3 dataset was created by AI researcher Shawn Presser in 2020, who scraped the library of ‘pirate’ site Bibliotik. The general vision was that the plaintext collection of more than 195,000 books, which is nearly 37GB in size, could help AI enthusiasts build better models. The vision wasn’t wrong; large text archives are great training material for Large Language Models, but many authors disapprove of their works being used in this manner, without permission or compensation.”

A large group of rights holders have a football team. Those big folks are chasing the small but feisty opponent down the field. Which team will score? Thanks, MSFT Copilot. Keep up the good enough work.

Is that so unreasonable? Maybe not, but existing copyright law did not foresee this situation. We learn:

“After reviewing input from both sides, California District Judge Araceli Martínez-Olguín ruled on the matter. In her order, she largely sides with OpenAI. The vicarious copyright infringement claim fails because the court doesn’t agree that all output produced by OpenAI’s models can be seen as a derivative work. To survive, the infringement claim has to be more concrete.”

The plaintiffs are not out of moves, however. They can still file an amended complaint. But unless updated legislation is passed in the meantime, they may just be rebuffed again. So all they need is for Congress to act quickly to protect artists from tech firms. Any day now.

Cynthia Murrell, March 12, 2024

Written by Stephen E. Arnold · Filed Under Copyright, Legal matters, News

Content Mastication: A Controversial Business Tactic

January 25, 2024

This essay is the work of a dumb dinobaby. No smart software required.

In the midst of the unfolding copyright issues, I found this post quite interesting. Torrent Freak published a story titled “Meta Admits Use of ‘Pirated’ Book Dataset to Train AI.” Is the story spot on? I sure don’t know. Nevertheless, the headline is a magnetic one. The story reports:

The cases allege that tech companies, including Meta and OpenAI, used the controversial Books3 dataset to train their models. The Books3 dataset has a clear piracy angle. It was created by AI researcher Shawn Presser in 2020, who scraped the library of ‘pirate’ site Bibliotik. This book archive was publicly hosted by digital archiving collective ‘The Eye‘ at the time, alongside various other data sources.

A combination of old-fashioned content collection and smart systems moves information from Point A (a copyright owner’s night table) to a smart software system. MSFT’s second class Copilot Bing thing created this cartoon. Sigh. Not even good enough now in my opinion.

What was in the Books3 data collection? The TF story elucidates:

The general vision was that the plaintext collection of more than 195,000 books, which is nearly 37GB…

What did Meta allegedly do to make its Llama smarter than the average member of the Camelidae family? Let’s roll the TF quote:

Responding to a lawsuit from writer/comedian Sarah Silverman, author Richard Kadrey, and other rights holders, the tech giant admits that “portions of Books3” were used to train the Llama AI model before its public release. “Meta admits that it used portions of the Books3 dataset, among many other materials, to train Llama 1 and Llama 2,” Meta writes in its answer [to a court].

The article does not include any statements like “Thank you for the question” or “I don’t know. My team will provide the answer at the earliest possible moment.” Nope. Just an alleged admission.

How will the Meta and parallel copyright legal matter evolve? Beyond Search has zero clue. The US judicial system has deep and mysterious logic. One thing is certain: Senior executives do not like uncertainty and risk. The copyright litigation seems tailored to cause some techno feudalists to imagine a world in which laws, annoying regulators, and people yapping about intellectual property were nudged into a different line of work. One example which comes to mind is building secure bunkers or taking care of the lawn.

Stephen E Arnold, January 25, 2024

Written by Stephen E. Arnold · Filed Under AI, Copyright, News

PicRights in the News: Happy Holidays

November 28, 2023

This essay is the work of a dumb dinobaby. No smart software required.

With the legal eagles cackling in their nests about artificial intelligence software using content without permission, the notion of rights enforcement is picking up steam. One niche in rights enforcement is the business of using image search tools to locate pictures and drawings which appear in blogs or informational Web pages.

StackOverflow hosts a thread by a developer who linked to or used an image more than a decade ago. On November 23, 2023, the individual queried those reading the Q&A section about a problem: “Am I Liable for Re-Sharing Image Links Provided via the Stack Exchange API?”

The legal eagle jabs with his beak at the person who used an image, assuming it was open source. The legal eagle wants justice to matter. Thanks, MSFT Copilot. A couple of tries, still not on point, but good enough.

The explanation of the situation is interesting to me for three reasons: [a] The alleged infraction took place in 2010; [b] Stack Exchange is a community learning and sharing site which manifests some open sourciness; and [c] information about specific rights, content ownership, and data reached via links is not front and center.

Ignorance of the law, at least in some circles, is no excuse for a violation. The cited post reveals that an outfit doing business as PicRights wants money for the unlicensed use of an image or art in 2010 (at least that’s how I read the situation).

What’s interesting is the related data provided by those who are responding to the request for information; for example:

A law firm identified as Higbee & Asso. is the apparent pointy end of the spear pointed at the alleged offender’s wallet
A link to an article titled “Is PicRights a Scam? Are Higbee & Associates Emails a Scam?”
A marketing type of write up called “How To Properly Deal With A PicRights Copyright Unlicensed Image Letter”.

How did the story end? Here’s what the person accused of infringing did:

According to Law I am liable. I have therefore decided to remove FAQoverflow completely, all 90k+ pages of it, and will negotiate with PicRights to pay them something less than the AU$970 that they are demanding.

What are the downsides to the loss of the FAQoverflow content? I don’t know. But I assume that the legal eagles, after gobbling one snack, are aloft and watching the AI companies. That’s where the big bucks will be. Legal eagles have a fondness for big bucks I believe.

Net net: Justice is served by some interesting birds, eagles and whatnot.

Stephen E Arnold, November 28, 2023

Written by Stephen E. Arnold · Filed Under Copyright, Legal matters, News

Microsoft, the Techno-Lord: Avoid My Galloping Steed, Please

November 27, 2023

This essay is the work of a dumb dinobaby. No smart software required.

The Merriam-Webster.com online site defines “responsibility” this way:

re·spon·si·bil·i·ty

1 : the quality or state of being responsible: such as
: moral, legal, or mental accountability
: RELIABILITY, TRUSTWORTHINESS
: something for which one is responsible

The online sector has a clever spin on responsibility; that is, in my opinion, the companies have none. Google wants people who use its online tools and post content created with those tools to make sure that what the Google system outputs does not violate any applicable rules, regulations, or laws.

In a traditional fox hunt, the hunters had the “right” to pursue the animal. If a farmer’s daughter were in the way, it was the farmer’s responsibility to keep the silly girl out of the horse’s path. That will teach them to respect their betters I assume. Thanks, MSFT Copilot. I know you would not put me in a legal jeopardy, would you? Now what are the laws pertaining to copyright for a cartoon in Armenia? Darn, I have to know that, don’t I.

Such a crafty way of defining itself as the mere creator of software machines has inspired Microsoft to follow a similar path. The idea is that anyone using Microsoft products, solutions, and services is “responsible” to comply with applicable rules, regulations, and laws.

Tidy. Logical. Complete. Just like a nifty algebra identity.

“Microsoft Wants YOU to Be Sued for Copyright Infringement, Washes Its Hands of AI Copyright Misuse and Says Users Should Be Liable for Copyright Infringement” explains:

Microsoft believes they have no liability if an AI, like Copilot, is used to infringe on copyrighted material.

The write up includes this passage:

So this all comes down to, according to Microsoft, that it is providing a tool, and it is up to users to use that tool within the law. Microsoft says that it is taking steps to prevent the infringement of copyright by Copilot and its other AI products, however, Microsoft doesn’t believe it should be held legally responsible for the actions of end users.

The write up (with no Jimmy Kimmel spin) includes this statement, allegedly from someone at Microsoft:

Microsoft is willing to work with artists, authors, and other content creators to understand concerns and explore possible solutions. We have adopted and will continue to adopt various tools, policies, and filters designed to mitigate the risk of infringing outputs, often in direct response to the feedback of creators. This impact may be independent of whether copyrighted works were used to train a model, or the outputs are similar to existing works. We are also open to exploring ways to support the creative community to ensure that the arts remain vibrant in the future.

From my drafty office in rural Kentucky, the refusal to accept responsibility for its business actions, its products, its policies to push tools and services on users, and the outputs of its cloudy system is quite clever. Exactly how will a user of products pushed at users, like Edge and its smart features, avoid acquiring from a smart Microsoft system something that violates an applicable rule, regulation, or law?

But legal and business cleverness is the norm for the techno-feudalists. Let the serfs deal with the body of the child killed when the barons chase a fox through a small leasehold. I can hear the brave royals saying, “It’s your fault. Your daughter was in the way. No, I don’t care that she was using the free Microsoft training materials to learn how to use our smart software.”

Yep, responsible. The death of the hypothetical child frees up another space in the training course.

Stephen E Arnold, November 27, 2023

Written by Stephen E. Arnold · Filed Under AI, Copyright, Legal matters, News

Copyright Trolls: An Explanation Which Identifies Some Creatures

November 14, 2023

This essay is the work of a dumb humanoid. No smart software required.

If you are not familiar with firms which pursue those who intentionally or unintentionally use another person’s work in their writings, you may not know what a “copyright troll” is. I want to point you to an interesting post from IntoTheMinds.com. The write up “PicRights + AFP: Une Opération de Copyright Trolling Bien Rodée” appeared in 2021, and it was updated in June 2023. The original essay is in French, but you may want to give Google Translate a whirl if your high school French is but a memoire dou dou.

A copyright troll is looking in the window of a blog writer. The troll is waiting for the writer to use content covered by copyright and for which a fee must be paid. The troll is patient. The blog writer is clueless. Thanks, Microsoft Bing. Nice troll. Do you perhaps know one?

The write up does a good job of explaining trollism with particular reference to an estimable outfit called PicRights and the even more estimable Agence France-Presse. It also does a bit of critical review of the PicRights’ operation, including the use of language to alleged copyright violators about how their lives will take a nosedive if money is not paid promptly for the alleged transgression. There are some thoughts about what to do if and when a copyright troll like the one pictured courtesy of Microsoft Bing’s art generator shows up. There are some comments about the rules and regulations regarding trollism. The author includes a few observations about the rights of creators. And a few suggested readings are included. Of particular note is the discussion of an estimable legal eagle outfit doing business as Higbee and Associates. You can find that document at this link.

If you are interested in copyright trolling in general and PicRights in particular, I suggest you download the document. I am not sure how long it will remain online.

Stephen E Arnold, November 14, 2023

Written by Stephen E. Arnold · Filed Under Copyright, Legal matters, News | Comments Off on Copyright Trolls: An Explanation Which Identifies Some Creatures

Getty and Its Licensed Smart Software Art

September 26, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid. (Yep, the dinobaby is back from France. Thanks to those who made the trip professionally and personally enjoyable.)

The illustration shows a very, very happy image rights troll. The cloud of uncertainty from AI generated images has passed. Now the rights software bots, controlled by cheerful copyright trolls, can scour the Web for unauthorized image use. Forget the humanoids. The action will be from tireless AI generators and equally robust bots designed to charge a fee for the image created by zeros and ones. Yes!

A quite joyful copyright troll displays his killer moves. Thanks, MidJourney. The gradient descent continues, right into the legal eagles’ nests.

“Getty Made an AI Generator That Only Trained on Its Licensed Images” reports:

Generative AI by Getty Images (yes, it’s an unwieldy name) is trained only on the vast Getty Images library, including premium content, giving users full copyright indemnification. This means anyone using the tool and publishing the image it created commercially will be legally protected, promises Getty. Getty worked with Nvidia to use its Edify model, available on Nvidia’s generative AI model library Picasso.

This is exciting. Will the images include a tough-to-discern watermark? Will the images include a license plate, a social security number, or just a nifty string of harmless digits?

The article does reveal the money angle:

The company said any photos created with the tool will not be included in the Getty Images and iStock content libraries. Getty will pay creators if it uses their AI-generated image to train the current and future versions of the model. It will share revenues generated from the tool, “allocating both a pro rata share in respect of every file and a share based on traditional licensing revenue.”

Who will be happy? Getty, the trolls, or the designers who have a way to be more productive with a helping hand from the Getty robot? I think the world will be happier because monetization, smart software, and lawyers are a business model with legs… or claws.

Stephen E Arnold, September 26, 2023

Written by Stephen E. Arnold · Filed Under AI, Copyright, Legal matters, News | Comments Off on Getty and Its Licensed Smart Software Art

Can Smart Software Get Copyright? Wrong?

September 15, 2023

It is official: copyrights are for humans, not machines. JD Supra brings us up to date on AI and official copyright guidelines in “Using AI to Create a Work – Copyright Protection and Infringement.” The basic principle goes both ways. Creators cannot copyright AI-generated material unless they have manipulated it enough to render it a creative work. On the other hand, it is a violation to publish AI-generated content that resembles a copyright-protected work. As for feeding algorithms a diet of human-made media, that is not officially against the rules. Yet. We learn:

“To obtain copyright protection for a work containing AI-generated material, the work must have sufficient human input, such as sufficient modification of the AI output or the human selection or arrangement of the AI content. However, copyright protection would be limited to those ‘human-made’ elements. Past, pending, and future copyright applications need to identify explicitly the human element and disclaim the AI-created content if it is more than minor. For existing registrations, a supplementary registration may be necessary. Works created using AI are subject to the same copyright infringement analysis applicable to any work. The issue with using AI to create works is that the sources of the original works may not be identified, so an infringement analysis cannot be conducted until the cease-and-desist letter is received. No court has yet adopted the theory that merely using an AI database means the resulting work is automatically an infringing derivative work if it is not substantially similar to the protectable elements in the copyrighted work.”

The article cites the Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence, 88 Fed. Reg. 16,190 (March 16, 2023). It notes those guidelines were informed by a decision handed down in February concerning Zarya of the Dawn, a comic book with AI-generated content. The Copyright Office sliced and diced elements, specifying:

“… The selection and arrangement of the images and the text were the result of human authorship and thus copyrightable, but the AI-generated images resulting from human prompts were not. The prompts ‘influenced,’ but did not ‘dictate,’ the resulting image, so the applicant was not the ‘mastermind’ and therefore not the author of the images. Further, the applicant’s edits to the images were too minor to be deemed copyrightable.”

Ah, the fine art of splitting hairs. As for training databases packed with protected content, the article points to pending lawsuits by artists against Stability AI, MidJourney, and DeviantArt. We are told those cases may be dismissed on technical grounds, but are advised to watch for similar cases in the future. Stay tuned.

Cynthia Murrell, September 15, 2023

Written by Stephen E. Arnold · Filed Under Copyright, Government, Legal matters, News | Comments Off on Can Smart Software Get Copyright? Wrong?


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.