Data Thirst? Guess Who Can Help?

April 17, 2024

As large language models approach the limit of freely available data on the Internet, companies are eyeing sources supposedly protected by copyrights and user agreements. PCMag reports, “Google Let OpenAI Scrape YouTube Data Because Google Was Doing It Too.” It seems Google would rather double down on violations than be hypocritical. Writer Emily Price tells us:

“OpenAI made headlines recently after its CTO couldn’t say definitively whether the company had trained its Sora video generator on YouTube data, but it looks like most of the tech giants—OpenAI, Google, and Meta—have dabbled in potentially unauthorized data scraping, or at least seriously considered it. As the New York Times reports, OpenAI transcribed more than a million hours of YouTube videos using its Whisper technology in order to train its GPT-4 AI model. But Google, which owns YouTube, did the same, potentially violating its creators’ copyrights, so it didn’t go after OpenAI. In an interview with Bloomberg this week, YouTube CEO Neal Mohan said the company’s terms of service ‘does not allow for things like transcripts or video bits to be downloaded, and that is a clear violation of our terms of service.’ But when pressed on whether YouTube data was scraped by OpenAI, Mohan was evasive. ‘I have seen reports that it may or may not have been used. I have no information myself,’ he said.”

How silly to think the CEO would have any information. Besides stealing from YouTube content creators, companies are exploring other ways to tap new sources of data. According to the Times article cited above, Meta considered buying Simon & Schuster to unlock all its published works. We are sure authors would have been thrilled. Meta executives also considered scraping any protected data they could find and hoping no one would notice. If they were caught, we suspect they would consider any fees a small price to pay.

The same article notes Google changed its terms of service so it could train its AI on Google Maps reviews and public Google Docs. See, the company can play by the rules, as long as it remembers to change them first. Preferably, as it did here, over a holiday weekend.

Cynthia Murrell, April 17, 2024

