Original Research: Not-So-Original Assertions about Content Appropriation
April 2, 2025
No AI. Just a dinobaby sharing an observation about younger managers and their innocence.
The Social Science Research Council published the 30-plus page report “Beyond Public Access in LLM Pre-Training Data.” The subtitle reveals the principal finding: “Non-Public Book Content in OpenAI’s Models.”
The write up states:
Using a legally obtained dataset of 34 copyrighted O’Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI’s large language models were trained on copyrighted content without consent. Our AUROC scores show that GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content (AUROC = 82%), compared to OpenAI’s earlier model GPT-3.5 Turbo. In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples. GPT-4o Mini, as a much smaller model, shows no knowledge of public or non-public O’Reilly Media content when tested (AUROC ≈ 50%). Testing multiple models, with the same cutoff date, helps us account for potential language shifts over time that might bias our findings. These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training.
I want to mention that the DE-COP method provides one way to determine whether a specific data record was part of the training dataset for a machine learning model. The result of the SSRC’s research suggests that as OpenAI enhanced its models, the OpenAI organization appears to have used non-public content. Smaller (less capable OpenAI) models seemed to know less about the test set; the bigger models knew more.
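For readers who want a feel for what an AUROC number like 82% or 50% means, here is a minimal Python sketch. It is not the SSRC team's code and not the DE-COP method itself; it only shows how an AUROC can be computed once each book has some "recognition" score. The scores, the variable names, and the split into two groups are invented for illustration. An AUROC near 0.5 means the model cannot tell the two groups apart; a value near 1.0 means it almost always scores the suspected training material higher.

```python
def auroc(member_scores, non_member_scores):
    """Chance that a randomly picked 'member' score beats a randomly
    picked 'non-member' score. Ties count as half a win."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical recognition scores, invented for this example only.
# Higher means the model looked more familiar with the passage.
suspected_in_training = [0.9, 0.8, 0.7, 0.4]
published_after_cutoff = [0.6, 0.3, 0.2, 0.1]

result = auroc(suspected_in_training, published_after_cutoff)
print(f"AUROC = {result:.4f}")
```

With these made-up numbers, 15 of the 16 possible pairings favor the suspected training material, so the score lands well above the coin-flip level of 0.5.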
The write up concludes that developers of smart software are sucking down data without obtaining permission or entering into licensing deals with publishers, copyright holders, or entities working on content not yet available as public content like this dinobaby blog post.
To sum up: A not-so-original assertion that certain organizations just take content and ignore rights is apparently accurate in this set of research. OpenAI accelerates. Speed is a positive. Breaking speed limits is believed to be no big deal.
A bit of a flashback: Over 40 years ago, I participated in a series of meetings to license the right to create abstracts of original articles from more than 1000 different publications. Our efforts were successful. Most publishers of academic business and management publications knew zero about online information services. The idea of summarizing an eight page article about executive compensation was unfamiliar to most of the people with whom our organization interacted over a period of years. Nevertheless, the owner of the database unit of the Courier Journal & Louisville Times Company insisted that we get signed agreements. The process was time consuming and expensive. Some outfits had representatives. So a negotiation went through layers of bureaucracy. Other publishers wanted a sit down meeting and a dog-and-pony show to explain why we wanted to boil down a 600 word article to about 150 words. There were worries about the impact of online access. There were worries about control of editorial content. There were worries about signing away rights. We couldn’t assuage worries, but we chugged along. The reputations of the Courier Journal and Barry Bingham Jr. helped but our small database operation had to follow the rules.
If the data in the cited document are accurate, the attitude of Barry Bingham Jr. would not have gained much traction at OpenAI. Instead of my getting a pat on the head for setting up the licensing deals with the able help of other Courier Journal professionals, I would have gotten kicked on the shin or worse.
Several observations:
- Observing content ownership remains important despite the acceleration toward illegal behaviors in the online sector.
- Using show-and-tells and explaining the benefits of participating in digital innovations remains important. Knowledge minimizes anger and other negative reactions to pre-emptive behavior.
- Managing the process is as important as demonstrating an unwillingness to conform to industry, social, and legal norms.
Let’s assume that the findings in the SSRC report are off by 50 percent. Does that change what appears to be a strong possibility of some fancy dancing at the OpenAI disco?
My conclusion is that taking short cuts is part of the Silicon Valley approach of doing whatever and just saying, “Wow, I am sorry.”
I asked ChatGPT 4 via You.com this question:
Should I create an online product and just use content from the Harvard Business Review without permission?
Here’s the response:
No, you should not create an online product using content from the Harvard Business Review (HBR) without permission. Doing so would likely violate copyright laws and HBR’s usage policies. Here’s why:
HBR’s Copyright and Usage Policies
- HBR Material is Copyrighted: Harvard Business Review explicitly states that its content is protected by copyright, and no part of its publications may be reproduced without permission.
- Usage Restrictions: HBR has strict policies regarding how its content can be used. For example, even in academic settings, HBR articles cannot be freely distributed unless done through specific channels like creating course packs via "Harvard Business Publishing for Educators."
- Permission is Required: If you want to use HBR content, you must obtain explicit permission. This includes reproducing, distributing, or using their material in any form.
Legal and Ethical Implications
- Copyright Infringement: Using HBR content without permission would likely constitute copyright infringement, which can result in legal action, fines, or other penalties.
- Ethical Concerns: Beyond legal issues, using someone else’s intellectual property without authorization is unethical and could harm your reputation or business.
What You Should Do Instead
- Seek Permission: Contact Harvard Business Review to request permission to use their content. They may have licensing options available for your intended use.
- Create Original Content: Instead of using HBR’s material, consider creating your own original content inspired by their ideas (but without copying their text or proprietary materials).
- Use Public Domain or Licensed Content: Look for content that is in the public domain or available under a license (e.g., Creative Commons) that allows commercial use.
Conclusion
Using HBR content without permission is not only illegal but also against their explicit policies. To avoid legal and ethical issues, either seek proper licensing or create your own original material.
In this case, OpenAI’s AI may be more intelligent than some people engaged in content acquisition at the firm. But I am a dinobaby. I try to follow the social, legal, and ethical rules as my parents who had only high school educations taught me. I am a dinobaby. A sad dinobaby.
Stephen E Arnold, April 2, 2025