Original Research: Not-So-Original Assertions about Content Appropriation
April 2, 2025
No AI. Just a dinobaby sharing an observation about younger managers and their innocence.
The Social Science Research Council published the 30-plus page report “Beyond Public Access in LLM Pre-Training Data.” The subtitle reveals the principal finding: “Non-Public Book Content in OpenAI’s Models.”
The write up states:
Using a legally obtained dataset of 34 copyrighted O’Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI’s large language models were trained on copyrighted content without consent. Our AUROC scores show that GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content (AUROC = 82%), compared to OpenAI’s earlier model GPT-3.5 Turbo. In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples. GPT-4o Mini, as a much smaller model, shows no knowledge of public or non-public O’Reilly Media content when tested (AUROC ≈ 50%). Testing multiple models, with the same cutoff date, helps us account for potential language shifts over time that might bias our findings. These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training.
I want to mention that the DE-COP method provides one way to determine whether a specific data record was part of the training dataset for a machine learning model. The SSRC’s results suggest that as OpenAI enhanced its models, the organization appears to have used non-public content. Smaller (less capable) OpenAI models seemed to know less about the test set; the bigger models knew more.
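To make the method concrete, here is a minimal sketch of the scoring step behind a DE-COP-style membership test. This is not the SSRC team’s code; the per-passage scores and labels are hypothetical, and scikit-learn’s roc_auc_score simply computes the AUROC metric the report cites.

```python
# Minimal sketch of the scoring step in a DE-COP-style membership test.
# Not the SSRC team's code; all scores and labels below are hypothetical.
# DE-COP frames detection as a multiple-choice quiz: the model sees a
# verbatim passage alongside paraphrases of it. A model that reliably
# picks the verbatim option probably saw that text during training.
from sklearn.metrics import roc_auc_score

# Hypothetical per-passage scores: the fraction of quiz rounds in which
# the model preferred the verbatim passage over its paraphrases.
scores = [0.90, 0.45, 0.85, 0.30, 0.55, 0.40]

# Labels: 1 = passage drawn from paywalled (non-public) book content,
#         0 = passage drawn from a publicly accessible sample.
labels = [1, 1, 1, 0, 0, 0]

# AUROC near 0.5 means the model cannot separate the two groups (chance
# level); values well above 0.5, like the 82% reported for GPT-4o,
# suggest the paywalled text was present in the training data.
print(f"AUROC = {roc_auc_score(labels, scores):.2f}")
```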
The write up concludes that developers of smart software are sucking down data without obtaining permission or entering into licensing deals with publishers, copyright holders, or entities working on content not yet publicly available, like this dinobaby blog post.
To sum up: A not-so-original assertion that certain organizations just take content and ignore rights appears to be accurate in this research. OpenAI accelerates. Speed is a positive. Breaking speed limits is believed to be no big deal.
A bit of a flashback: Over 40 years ago, I participated in a series of meetings to license the right to create abstracts of original articles from more than 1000 different publications. Our efforts were successful. Most publishers of academic business and management publications knew zero about online information services. The idea of summarizing an eight page article about executive compensation was unfamiliar to most of the people with whom our organization interacted over a period of years. Nevertheless, the owner of the database unit of the Courier Journal & Louisville Times Company insisted that we get signed agreements. The process was time consuming and expensive. Some outfits had representatives. So a negotiation went through layers of bureaucracy. Other publishers wanted a sit down meeting and a dog-and-pony show to explain why we wanted to boil down a 600 word article to about 150 words. There were worries about the impact of online access. There were worries about control of editorial content. There were worries about signing away rights. We couldn’t assuage worries, but we chugged along. The reputations of the Courier Journal and Barry Bingham Jr. helped but our small database operation had to follow the rules.
If the data in the cited document are accurate, the attitude of Barry Bingham Jr. would not have gained much traction at OpenAI. Instead of my getting a pat on the head for setting up the licensing deals with the able help of other Courier Journal professionals, I would have gotten kicked on the shin or worse.
Several observations:
- Respecting content ownership remains important despite the acceleration toward illegal behaviors in the online sector
- Using show-and-tells and explaining the benefits of participating in digital innovations remains important. Knowledge minimizes anger and other negative reactions to pre-emptive behavior
- Managing the process properly matters more than demonstrating an unwillingness to conform to industry, social, and legal norms.
Let’s assume that the findings in the SSRC report are off by 50 percent. Does that change what appears to be a strong possibility of some fancy dancing at the OpenAI disco?
My conclusion is that taking short cuts is part of the Silicon Valley approach of doing whatever and just saying, “Wow, I am sorry.”
I asked ChatGPT 4 via You.com this question:
Should I create an online product and just use content from the Harvard Business Review without permission?
Here’s the response:
No, you should not create an online product using content from the Harvard Business Review (HBR) without permission. Doing so would likely violate copyright laws and HBR’s usage policies. Here’s why:
HBR’s Copyright and Usage Policies
- HBR Material is Copyrighted: Harvard Business Review explicitly states that its content is protected by copyright, and no part of its publications may be reproduced without permission.
- Usage Restrictions: HBR has strict policies regarding how its content can be used. For example, even in academic settings, HBR articles cannot be freely distributed unless done through specific channels like creating course packs via "Harvard Business Publishing for Educators."
- Permission is Required: If you want to use HBR content, you must obtain explicit permission. This includes reproducing, distributing, or using their material in any form.
Legal and Ethical Implications
- Copyright Infringement: Using HBR content without permission would likely constitute copyright infringement, which can result in legal action, fines, or other penalties.
- Ethical Concerns: Beyond legal issues, using someone else’s intellectual property without authorization is unethical and could harm your reputation or business.
What You Should Do Instead
- Seek Permission: Contact Harvard Business Review to request permission to use their content. They may have licensing options available for your intended use.
- Create Original Content: Instead of using HBR’s material, consider creating your own original content inspired by their ideas (but without copying their text or proprietary materials).
- Use Public Domain or Licensed Content: Look for content that is in the public domain or available under a license (e.g., Creative Commons) that allows commercial use.
Conclusion
Using HBR content without permission is not only illegal but also against their explicit policies. To avoid legal and ethical issues, either seek proper licensing or create your own original material.
In this case, OpenAI’s AI may be more intelligent than some people engaged in content acquisition at the firm. But I am a dinobaby. I try to follow the social, legal, and ethical rules my parents, who had only high school educations, taught me. I am a dinobaby. A sad dinobaby.
Stephen E Arnold, April 2, 2025
A Reminder That Academic Papers Can Be Science Fiction
April 2, 2025
Dinobaby says, “No smart software involved. That’s for ‘real’ journalists and pundits.”
For many years, I have been skeptical about peer reviewed papers. There are two reasons.
First, I did some consulting work for an outfit in the UK. It was a crown operation. That means its outputs carried considerable influence. One of the projects in which I was engaged involved peer review of research under the purview of the “institute.” Wow, did I learn why there was a publishing backlog. Wow, did I learn about the cronyism in reviewing “real” research. Wow, did I learn about the wheeling and dealing of peer reviewers among their research assistants. Wowzah. That was an education.
Second, for a short time I was a peer reviewer for a British journal. Let me tell you that my firsthand exposure to the mechanics and politics of peer reviewing did not prepare me for the reviewing task. A typical submission contained text edited by several hands. None of these was doing fine needlework. A stab here and a stab there summed up the submitted documents. The data and the charts? I had a couple of my team help me figure out if the charts were semi-accurate. Working through a five- or six-page article sent to me for review took me and two people a week to process. In most cases, we gave the paper a D and sent it back to the editor in chief, who had to tell the author and his legion of busy bees that the paper sucked. I bailed after six months. Too much work to fix up stuff that was truly terrible.
Today I read “Sometimes Papers Contain Obvious Lies.” That’s a good title, but my thought would be to include the phrase “and Really Crappy.” But I am a dinobaby, and I live in rural Kentucky. The author Cremieux Recueil is much classier than I.
I noted this passage:
The authors of scientific papers often say one thing and find another; they concoct a story around a set of findings that they might not have even made, or which they might have actually even contradicted. This happens surprisingly often, and it’s a very serious issue…
No kidding. The president of Stanford University resigned due to some allegations of fancy dancing. The — note the “the” — Harvard University experienced a bit of excitement in its ethics department. Is that an oxymoron? An ethics professor violated “ethics” in some research cartwheels.
I liked this sentence because it is closer to my method of communicating concern:
Lying in scientific papers happens all the time.
Hey, not just in scientific papers. I encounter lying 24×7. If someone is not articulating a fabrication, the person may be living a lie. I hear the roar of a 20-something’s hypercar at the gym. Do you?
The essay focuses on a single paper with some razzle dazzle related to crime data. The author’s analysis is accurate. However, the focus on one example does not put the scale of the “crime data” problem in perspective.
Let me give you an example and you can test this for validity yourself. Go to your bank. Ask the “personal banker” to tell you about the bank’s experience with cyber crime. Then ask, “How many fraudulent transactions occur at this bank location each year?” Listen to the answer.
Crime data, like health care data, are slippery fish. Numbers do not correlate with reality when scrutinized. Verifiable, statistically valid data are expensive to generate. We live in a “good enough” world and trust whatever black box (human or mechanical) spits out data.
I do disagree with this statement in the essay:
scientists often lie with far more brazenness.
No. Fabrication is now the business of information and the information of business.
Stephen E Arnold, April 2, 2025
The AI Market: The Less-Educated
April 2, 2025
Writing is an essential function of education and communication. Writing is an innate skill as well as one that can be cultivated through dedicated practice. Digital writing tools, from spelling and grammar checkers to AI systems like Grammarly and ChatGPT, have influenced how people write. Stanford University studied how AI writing tools have affected writing in professional settings. The researchers discovered that less-educated parts of the US rely heavily on AI. Ars Technica reviews the study in “Researchers Surprised To Find Less-Educated Areas Adopting AI Writing Tools Faster.”
Stanford’s AI study tracked LLM adoption from January 2022 to September 2024 with a dataset that included US Consumer Financial Protection Bureau consumer complaints, corporate press releases, job postings, and UN press releases. The researchers used a statistical detection system that tracked word usage patterns. The system found that 14-24% of these communications showed AI assistance. The study also found an interesting pattern:
“The study also found that while urban areas showed higher adoption overall (18.2 percent versus 10.9 percent in rural areas), regions with lower educational attainment used AI writing tools more frequently (19.9 percent compared to 17.4 percent in higher-education areas). The researchers note that this contradicts typical technology adoption patterns where more educated populations adopt new tools fastest.”
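For readers who want a feel for how a word-usage detector can work at the population level, here is a rough sketch. It illustrates the general technique only, not the Stanford team’s implementation; the marker word and every rate below are made-up numbers.

```python
# Rough sketch of a population-level, word-frequency AI-text detector.
# This illustrates the general technique only, not the Stanford system;
# the marker word and all rates below are hypothetical.

def estimate_ai_fraction(observed_rate: float,
                         human_rate: float,
                         ai_rate: float) -> float:
    """Model the corpus as a mix of human and AI text and solve
    observed = a * ai_rate + (1 - a) * human_rate
    for the AI share a, clamped to [0, 1]."""
    a = (observed_rate - human_rate) / (ai_rate - human_rate)
    return max(0.0, min(1.0, a))

# Hypothetical occurrences per 1,000 words of a marker word (say, "delve"):
human_rate = 0.2     # rate in known pre-LLM human writing
ai_rate = 2.5        # rate in known LLM-generated writing
observed_rate = 0.6  # rate measured in the target corpus

share = estimate_ai_fraction(observed_rate, human_rate, ai_rate)
print(f"Estimated AI-assisted share: {share:.0%}")  # about 17%
```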
The researchers theorize that AI writing tools serve as an equalizing measure for less-educated individuals. They also noted that adoption may be leveling off, either because the market is saturated or because the LLMs have become so advanced that it is difficult to distinguish between human- and machine-written text. They predict negative outcomes from this:
“ ‘The growing reliance on AI-generated content may introduce challenges in communication,’ the researchers write. ‘In sensitive categories, over-reliance on AI could result in messages that fail to address concerns or overall release less credible information externally. Over-reliance on AI could also introduce public mistrust in the authenticity of messages sent by firms.’”
It’s not good to blindly trust AI, especially with the current state of datasets. Can you imagine the critical thinking skills these future leaders and entrepreneurs will develop? On that thought, what will happen to imagination?
Whitney Grace, April 2, 2025
The First AI-Written Paper To Pass Peer Review
April 2, 2025
Cheating. I am not going to bring this topic up.
Humans have taken one small step toward obsolescence when it comes to writing papers. Sakana AI reports that “The AI Scientist Generates Its First Peer-Reviewed Scientific Publication.” This is the first known fully AI-generated paper to pass the same review process to which human scientists submit their papers. Here’s how the paper was written:
"The paper was generated by an improved version of the original AI Scientist, called The AI Scientist-v2. We will be sharing the full details of The AI Scientist-v2 in an upcoming release. This paper was submitted to an ICLR 2025 workshop that agreed to work with our team to conduct an experiment to double-blind review AI-generated manuscripts. We selected this workshop because of its broader scope, challenging researchers (and our AI Scientist) to tackle diverse research topics that address practical limitations of deep learning. The workshop is hosted at ICLR, one of three premier conferences in machine learning and artificial intelligence research, along with NeurIPS and ICML.3
The ICLR leadership and organizers were involved with the project. The paper was submitted to the ICLR review team double-blind, although reviewers were told that they might be reviewing AI-generated papers.
The AI algorithm was told to research and write about a broad topic. When the process was done, three papers were selected for submission so the review board wouldn’t be overburdened. Here are the results:
“We looked at the generated papers and submitted those we thought were the top 3 (factoring in diversity and quality—We conducted our own detailed analysis of the 3 papers, please read on in our analysis section). Of the 3 papers submitted, two papers did not meet the bar for acceptance. One paper received an average score of 6.33, ranking approximately 45% of all submissions. These scores are higher than many other accepted human-written papers at the workshop, placing the paper above the average acceptance threshold. Specifically, the scores were:
• Rating: 6: Marginally above acceptance threshold
• Rating: 7: Good paper, accept
• Rating: 6: Marginally above acceptance threshold”
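For reference, the 6.33 average follows directly from the three ratings: (6 + 7 + 6) / 3 = 19 / 3 ≈ 6.33.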
Sakana AI says it conducted the experiment out of pure scientific curiosity, to measure how current AI algorithms compare to human intellect. No problem.
Whitney Grace, April 2, 2025
FOGINT: Targets Draw Attention. Signal Is a Target
April 1, 2025
Dinobaby says, “No smart software involved. That’s for ‘real’ journalists and pundits.”
We have been plugging away on the “Telegram Overview: Notes for Analysts and Investigators.” We have not exactly ignored Signal or the dozens of other super secret, encrypted beyond belief messaging applications. We did compile a table of those we came across, and Signal was on that list.
I read “NSA Warned of Vulnerabilities in Signal App a Month Before Houthi Strike Chat.” I am not interested in the political facets of this incident. The important point for me is this statement:
The National Security Agency sent out an operational security special bulletin to its employees in February 2025 warning them of vulnerabilities in using the encrypted messaging application Signal.
One of the big time cyber security companies spoke with me, and I mentioned that Signal might not be the cat’s pajamas. To the credit of that company and the former police chief with whom I spoke, the firm shifted to an end-to-end encrypted messaging app we had identified as slightly less wonky. Good for that company, and a pat on the back for the police chief who listened to me.
In my experience, operational bulletins are worth reading. When the bulletin is “special,” re-reading the message is generally helpful.
Signal, of course, defends itself vigorously. The coach who loses a basketball game says, “Our players put out a great effort. It just wasn’t enough.”
Presenting oneself as a super secret messaging app immediately makes that messaging app a target. I know firsthand that some whiz kid entrepreneurs believe that their E2EE solution is the best one ever. In fact, a year ago, such an entrepreneur told me, “We have developed a method that only a government agency can compromise.”
Yeah, that’s the point of the NSA bulletin.
Let me ask you a question: “How many computer science students in countries outside the United States are looking at E2EE messaging apps and trying to figure out how to compromise the data?” Years ago, I gave some lectures in Tallinn, Estonia. I visited a university computer science class. I asked the students about the projects each had selected. Several of them told me that they were trying to compromise messaging systems. A favorite target was Telegram, but Signal came up.
I know the wizards who cook up E2EE messaging apps and use the latest and greatest methods for delivering security with bells on are fooling themselves. Here are the reasons:
- Systems relying on open source methods are well documented. Exploits exist, and we have noticed some crime-as-a-service (CaaS) offers to compromise these messages. Now the methods may be illegal in many countries, but they exist. (I won’t provide a checklist in a free blog post. Sorry.)
- Techniques to prevent compromise of secure messaging systems involve some patented systems and methods. Yes, the patents are publicly available, but the methods are simply not replicable unless one has considerable resources for software, hardware, and deployment.
- A number of organizations turn E2EE messaging systems into happy eunuchs taking care of the sultan’s harem. I have poked fun at the blunders of the NSO Group and its Pegasus approach, and I have pointed out that the goodies of the Hacking Team escaped into the wild a long time ago. The point is that once the procedures for performing certain types of compromise are no longer secret, other humans can and will create a facsimile and use those emulations to suck down private messages, the metadata, and probably the pictures on the device too. Toss in some AI jazziness, and the process goes faster than my old 1962 Studebaker Lark.
Let me wrap up by reiterating that I am not addressing the incident involving Signal. I want to point out that I am not in the “information wants to be free” camp. Certain information is best managed when it is secret. Outfits like Signal and the dozens of other E2EE messaging apps are targets. Targets get hit. Why put neon lights on oneself and try to hide the fact that those young computer science students or their future employers will find a way to compromise the information?
Technical stealth, network fiddling, human bumbling — compromises will continue to occur. There were good reasons to enforce security. That’s why stringent procedures and hardened systems were developed. Today it’s marketing, and the possibility is that non-open source, non-American methods may no longer deliver what the 23-year-old art history major with a job in marketing says the systems actually deliver.
Stephen E Arnold, April 1, 2025
Free AI Sites (Well, Mostly Free Sort of)
April 1, 2025
Dinobaby says, “No smart software involved. That’s for ‘real’ journalists and pundits.”
One of my team generated images of French bulldogs. After months of effort, he presented me with a picture of our French bulldog complete with one floppy ear. The image was not free. I pay for the service because free image generation systems work and then degrade because of the costs associated with doing smart software without oodles of cash.
Another person proudly emailed everyone a link to Best AI Websites and the page “Free AI Tools.” The interfaces, functionality, and outputs vary. The linked Web page is a directory presented with some of that mobile interface zip.
There are more than 30 tools anyone can try.
The first click displays BestFreeAIWebsites’ write up for each “service” or “tool.” Then a direct link to the free AI site is displayed. There is a “submit” button to allow those with a free AI tool to add theirs to the listing. The “add” function is a common feature of Telegram bot and Channel listings.
Here is a selection of the “free” services that are available as of March 28, 2025, in alphabetical order:
- HUUK.ai, a trip planner
- Metavoice at https://studio.themetavoice.xyz/, a “one click voice changer”
- Presentpicker.ai, a service to help a user choose a gift
- Remaker.ai, a face swap tool
- Yomii.app, a real estate investing assistant
ChatGPT appears numerous times in the list of “free” AI tools. Google shows up a couple of times with Bard and Gemini. The majority of the services “wrap” functionality around the big dogs in the LLM space.
Are these services “free”? Our view is that the “free” is a way to get people to give the services a try. If the experience is positive, upgrades are available.
As one of my team worked through the listings, he said, “Most of these services have been available as Telegram bots from other developers.” If he is correct, perhaps Telegram’s AI functions should be included in the listing?
Stephen E Arnold, April 1, 2025
Amazon: So Many Great Ideas
April 1, 2025
AWS puts its customers first. Well, those who pay for the premium support plan, anyway. A thread on Reddit complains, "AWS Blocking Troubleshooting Docs Behind Paid Premium Support Plan." Redditor Certain_Dog1960 writes:
"When did AWS decide that troubleshooting docs/articles require you to have a paid premium support plan….like seriously who thought this was a good idea?"
Good question. The comments and the screenshot of Amazon’s message make clear that the company’s idea of customer support differs from actual customers’ expectations. However, Certain_Dog posted an encouraging update:
"The paywall has been taken down!!! :)"
Apparently customer outrage still makes a difference. Occasionally.
Cynthia Murrell, March 31, 2025