How Does One Train Smart Software?

June 8, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

It is awesome when geekery collides with the real world, such as the development of AI. These geekery hints prove that fans are everywhere and the influence of fictional worlds leave a lasting impact. Usually these hints are naming a new discovery after a favorite character or franchise, but it might not be good for copyrighted books beloved by geeks everywhere. The New Scientist reports that “ChatGPT Seems To Be Trained On Copyrighted Books Like Harry Potter.”

In order to train AI models, AI developers need large language models or datasets. Datasets can range from information on social media platforms to shopping databases like Amazon. The problem with ChatGPT is that it appears its developers at OpenAI used copyrighted books as language models. If OpenAI used copyrighted materials it brings into question if the datasets were legality created.

Associate Professor David Bamman of the University of California, Berkley campus, and his team studied ChatGPT. They hypothesized that OpenAI used copyrighted material. Using 600 fiction books from 1924-2020, Bamman and his team selected 100 passages from each book that ha a single, named character. The name was blanked out of the passages, then ChatGPT was asked to fill them. ChatGPT had a 98% accuracy rate with books ranging from J.K. Rowling, Ray Bradbury, Lewis Carroll, and George R.R. Martin.

If ChatGPT is only being trained from these books, does it violate copyright?

“ ‘The legal issues are a bit complicated,’ says Andres Guadamuz at the University of Sussex, UK. ‘OpenAI is training GPT with online works that can include large numbers of legitimate quotes from all over the internet, as well as possible pirated copies.’ But these AIs don’t produce an exact duplicate of a text in the same way as a photocopier, which is a clearer example of copyright infringement. ‘ChatGPT can recite parts of a book because it has seen it thousands of times,’ says Guadamuz. ‘The model consists of statistical frequency of words. It’s not reproduction in the copyright sense.’”

Individual countries will need to determine dataset rules, but it is preferential to notify authors their material is being used. Fiascos are already happening with stolen AI generated art.

ChatGPT was mostly trained on science fiction novels, while it did not read fiction from minority authors like Toni Morrison. Bamman said ChatGPT is lacking representation. That his one way to describe the datasets, but it more likely pertains to the human AI developers reading tastes. I assume there was little interest in books about ethics, moral behavior, and the old-fashioned William James’s view of right and wrong. I think I assume correctly.

Whitney Grace, June 8, 2023

Written by Stephen E. Arnold · Filed Under AI, Business strategy, News

Comments

Comments are closed.

Search the site
Subscribe to Beyond Search
Feature archive
News archive

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.