No Llama 3 for EU
July 31, 2024
Frustrated with European regulators, Meta is ready to take its AI ball and go home. Axios reveals, “Scoop: Meta Won’t Offer Future Multimodal AI Models in EU.” Reporter Ina Fried writes:
“Meta will withhold its next multimodal AI model — and future ones — from customers in the European Union because of what it says is a lack of clarity from regulators there, Axios has learned. Why it matters: The move sets up a showdown between Meta and EU regulators and highlights a growing willingness among U.S. tech giants to withhold products from European customers. State of play: ’We will release a multimodal Llama model over the coming months, but not in the EU due to the unpredictable nature of the European regulatory environment,’ Meta said in a statement to Axios.”
So there. And Meta is not the only firm turning petulant in the face of privacy regulations. Apple recently made a similar declaration. So governments may not be able to regulate AI, but AI outfits can try to regulate governments. Seems legit. The EU’s stance is that Llama 3 may not feed on European users’ Facebook and Instagram posts. Does Meta hope FOMO will make the EU back down? We learn:
“Meta plans to incorporate the new multimodal models, which are able to reason across video, audio, images and text, in a wide range of products, including smartphones and its Meta Ray-Ban smart glasses. Meta says its decision also means that European companies will not be able to use the multimodal models even though they are being released under an open license. It could also prevent companies outside of the EU from offering products and services in Europe that make use of the new multimodal models. The company is also planning to release a larger, text-only version of its Llama 3 model soon. That will be made available for customers and companies in the EU, Meta said.”
The company insists EU user data is crucial to making sure its European products accurately reflect the region’s terminology and culture. Sure. That is almost a plausible excuse.
Cynthia Murrell, July 31, 2024
Llama Beans? Is That the LLM from Zuckbook?
August 4, 2023
Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.
We love open-source projects. Camelids that masquerade as such, not so much. According to The Register, “Meta Can Call Llama 2 Open Source as Much as It Likes, but That Doesn’t Mean It Is.” The company asserts its new large language model is open source because it is freely available for research and (some) commercial use. Are Zuckerberg and his team of Meta marketers fuzzy on the definition of open source? Writer Steven J. Vaughan-Nichols builds his case with quotes from several open source authorities. First up:
“As Erica Brescia, a managing director at RedPoint, the open source-friendly venture capital firm, asked: ‘Can someone please explain to me how Meta and Microsoft can justify calling Llama 2 open source if it doesn’t actually use an OSI [Open Source Initiative]-approved license or comply with the OSD [Open Source Definition]? Are they intentionally challenging the definition of OSS [Open Source Software]?'”
Maybe they are trying. After all, open source is good for business. And being open to crowd-sourced improvements does help the product. However, as the post continues:
“The devil is in the details when it comes to open source. And there, Meta, with its Llama 2 Community License Agreement, falls on its face. As The Register noted earlier, the community agreement forbids the use of Llama 2 to train other language models; and if the technology is used in an app or service with more than 700 million monthly users, a special license is required from Meta. It’s also not on the Open Source Initiative’s list of open source licenses.”
Next, we learn OSI’s executive director Stefano Maffulli directly states Llama 2 does not meet his organization’s definition of open source. The write-up quotes him:
“While I’m happy that Meta is pushing the bar of available access to powerful AI systems, I’m concerned about the confusion by some who celebrate Llama 2 as being open source: if it were, it wouldn’t have any restrictions on commercial use (points 5 and 6 of the Open Source Definition). As it is, the terms Meta has applied only allow some commercial use. The keyword is some.”
Maffulli further clarifies that Meta’s license specifically states Amazon, Google, Microsoft, ByteDance, Alibaba, and any startup that grows too much may not use the LLM. Such a restriction is a no-no in actual open source projects. Finally, Software Freedom Conservancy executive director Karen Sandler observes:
“It looks like Meta is trying to push a license that has some trappings of an open source license but, in fact, has the opposite result. Additionally, the Acceptable Use Policy, which the license requires adherence to, lists prohibited behaviors that are very expansively written and could be very subjectively applied.”
Perhaps most egregious for Sandler is the absence of a public drafting or comment process for the Llama 2 license. Llamas are not particularly speedy creatures.
Cynthia Murrell, August 4, 2023
Stanford: Llama Hallucinating at the Dollar Store
March 21, 2023
Editor’s Note: This essay is the work of a real, and still alive, dinobaby. No smart software involved with the exception of the addled llama.
What happens when folks at Stanford University use the output of OpenAI to create another generative system? First, a blog article appears; for example, “Stanford’s Alpaca Shows That OpenAI May Have a Problem.” Second, I am waiting for legal eagles to take flight. Some may already be aloft and circling.
A hallucinating llama which confused grazing on other wizards’ work with munching on mushrooms. The art was a creation of ScribbledDiffusion.com. The smart software suggests the llama is having a hallucination.
What’s happening?
The model trained from OWW or Other Wizards’ Work mostly works. The gotcha is that using OWW without any silly worrying about copyrights was cheap. According to the write up, the total (excluding wizards’ time) was $600.
The article pinpoints the issue:
Alignment researcher Eliezer Yudkowsky summarizes the problem this poses for companies like OpenAI: “If you allow any sufficiently wide-ranging access to your AI model, even by paid API, you’re giving away your business crown jewels to competitors that can then nearly-clone your model without all the hard work you did to build up your own fine-tuning dataset.” What can OpenAI do about that? Not much, says Yudkowsky: “If you successfully enforce a restriction against commercializing an imitation trained on your I/O – a legal prospect that’s never been tested, at this point – that means the competing checkpoints go up on BitTorrent.”
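The “nearly-clone” recipe Yudkowsky describes is simple in outline: harvest prompt/response pairs from the paid API, then fine-tune a small open model on them. Here is a minimal sketch of the harvesting step; the function names are mine and the teacher call is a placeholder, not Stanford’s actual pipeline:

```python
import json

def query_teacher(prompt: str) -> str:
    """Placeholder for a paid API call to the 'teacher' model;
    wire up whatever client you have access to."""
    raise NotImplementedError("swap in a real API client here")

def build_imitation_dataset(seed_prompts: list[str], path: str) -> None:
    """Collect (instruction, response) pairs, the 'I/O' Yudkowsky mentions."""
    with open(path, "w") as f:
        for prompt in seed_prompts:
            record = {"instruction": prompt, "output": query_teacher(prompt)}
            f.write(json.dumps(record) + "\n")

# The resulting JSONL file then feeds any standard supervised fine-tuning
# loop over a small open model: reportedly the ~$600 step in the write up.
```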
I love the rapid rise in smart software uptake and now the snappy shift to commoditization. The VCs counting on big smart software payoffs may want to think about why the llama in the illustration looks as if synapses are forming new, low-cost connections. Low cost as in really cheap, I think.
Stephen E Arnold, March 21, 2023
Does a LLamA Bite? No, But It Can Be Snarky
February 28, 2023
Everyone in Harrod’s Creek knows the name Yann LeCun. The general view is that when it comes to smart software, this wizard wrote or helped write the book. I spotted a tweet thread “LLaMA Is a New *Open-Source*, High-Performance Large Language Model from Meta AI – FAIR.” The link to the Facebook research paper “LLaMA: Open and Efficient Foundation Language Models” explains the innovation for smart software enthusiasts. In a nutshell, the Zuck approach is bigger, faster, and trained using only data available to everyone. Also, it does not require Googzilla-scale hardware for some applications.
That’s the first tip off that the technical paper has a snarky sparkle. Exactly what data have been used to train Google’s and other large language models? The implicit idea is that while the legal eagles flock to sue over copyright-violating actions, the Zuckers are allegedly flying in clean air.
Here are a few other snarkifications I spotted:
- Use small models trained on more data. The idea is that others train big Googzilla-sized models on data, some of which is not publicly available
- The Zuck approach uses an “efficient implementation of the causal multi-head attention operator.” The idea is that the Zuck method is more efficient; therefore, better
- In testing performance, the results are all over the place. The reason? The method for determining performance is not very good. Okay, still Meta is better. The implication is that one should trust Facebook. Okay. That’s scientific.
- And cheaper? Sure. There will be fewer legal fees to deal with pesky legal challenges about fair use.
What’s my take? Another open source tool will lead to applications built on top of the Zuckbook’s approach.
Now the developers and users will have to decide whether the LLamA can bite. Does Facebook have its wizardly head in the Azure clouds? Will the Sages of Amazon take note?
Tough questions. At first glance, llamas have other means of defending themselves. Teeth may not be needed. Yes, that’s snarky.
Stephen E Arnold, February 28, 2023
The Many Faces of Zuckbook
March 29, 2024
This essay is the work of a dumb dinobaby. No smart software required.
As evidenced by his business decisions, Mark Zuckerberg seems to be a complicated fellow. For example, a couple of recent articles illustrate this contrast: On one hand is his commitment to support open source software, an apparently benevolent position. On the other, Meta is once again in the crosshairs of EU privacy advocates for what they insist is its disregard for the law.
First, we turn to a section of VentureBeat’s piece, “Inside Meta’s AI Strategy: Zuckerberg Stresses Compute, Open Source, and Training Data.” In it, reporter Sharon Goldman shares highlights from Meta’s Q4 2023 earnings call. She emphasizes Zuckerberg’s continued commitment to open source software, specifically AI software Llama 3 and PyTorch. He touts these products as keys to “innovation across the industry.” Sounds great. But he also states:
“Efficiency improvements and lowering the compute costs also benefit everyone including us. Second, open source software often becomes an industry standard, and when companies standardize on building with our stack, that then becomes easier to integrate new innovations into our products.”
Ah, there it is.
Our next item was apparently meant to be sneaky, but who did Meta think it was fooling? The Register reports, “Meta’s Pay-or-Consent Model Hides ‘Massive Illegal Data Processing Ops’: Lawsuit.” Meta is attempting to “comply” with the EU’s privacy regulations by making users pay to opt in to them. That is not what regulators had in mind. We learn:
“Those of us with aunties on FB or friends on Instagram were asked to say yes to data processing for the purpose of advertising – to ‘choose to continue to use Facebook and Instagram with ads’ – or to pay up for a ‘subscription service with no ads on Facebook and Instagram.’ Meta, of course, made the changes in an attempt to comply with EU law. But privacy rights folks weren’t happy about it from the get-go, with privacy advocacy group noyb (None Of Your Business), for example, sarcastically claiming Meta was proposing you pay it in order to enjoy your fundamental rights under EU law. The group already challenged Meta’s move in November, arguing EU law requires consent for data processing to be given freely, rather than to be offered as an alternative to a fee. Noyb also filed a lawsuit in January this year in which it objected to the inability of users to ‘freely’ withdraw data processing consent they’d already given to Facebook or Instagram.”
And now eight members of the European Consumer Organisation (BEUC) have filed new complaints, insisting Meta’s pay-or-consent tactic violates the EU’s General Data Protection Regulation (GDPR). While that may seem obvious to some, Meta insists it is in compliance with the law. Because of course it does.
Cynthia Murrell, March 29, 2024
The Big Battle: Another WWF Show Piece for AI
August 2, 2024
This essay is the work of a dumb humanoid. No smart software required.
The Zuck believes in open source. It is like Linux. Boom. Market share. OpenAI believes in closed source (for now). Snap. You have to pay to get the good stuff. The argument about proprietary versus open source has been plodding along like Russia’s special operation for a long time. A typical response, in my opinion, is that open source is great because it allows a corporate interest to get cheap traction. Then with a surgical or not-so-surgical move, the big outfit co-opts the open source project. Boom. Semi-open source with a price tag becomes a competitive advantage. Proprietary software can be given away, licensed, or made available by subscription. Open source creates opportunities for training, special services, and feeling good about the community. But in the modern world of high technology, feeling good comes with sustainable flows of revenue and opportunities to raise prices faster than the local grocery store.
Where does open source software come from? Many students demonstrate their value by coding something useful for others. Thanks, OpenAI. Good enough.
I read “Consider the Llama: Are Closed Source AI Models Doomed?” The write up is good. It contains a passage which struck me as interesting; to wit:
OpenAI, Anthropic and the like—companies that sell access to AI models. These companies inherently require their products to be much better than open source in order to up-charge. They also don’t have some other product they sell that gets improved with better AI overall.
In my opinion, in the present business climate, the hope that a high-technology product gets better is an interesting one. The idea of continual improvement, however, is not part of the business culture of high-technology companies engaged in smart software. At this time, cooking up a model which can be used to streamline or otherwise enhance an existing activity is Job One. The first outfit to generate substantial revenue from artificial intelligence will have an advantage. That doesn’t mean the outfit won’t fail, but if one considers the requirements to play with a reasonable probability of winning the AI game, smart software costs money.
In the world of online, a company or open source foundation which delivers a product or service which attracts large numbers of users has an advantage. One “play” can shift the playing field, not just win the game. What’s going on at this time, in my opinion, is that those who understand the equivalent of a WWF (World Wrestling Federation) show piece know that winning allows the “winner take all” or at least the “winner takes two-thirds” of the market.
Monopolies (real or imagined) with lots of money have an advantage. Open source smart software has to get money from somewhere; otherwise, the costs of producing a winning service cannot be covered. If a large outfit with cash goes open source, that is a bold chess move which other outfits cannot afford to match. The feel-good, community aspect of a smart software solution that can be used in a large number of use cases is going to fade quickly when any money on the table is taken by users who neither contribute, pay for training, nor hire great open source coders as consultants. Serious players just take the software, innovate, and lock up the benefits.
“Who would do this?” some might ask.
How about China, Russia, or some nation state not too interested in the Silicon Valley way? How about an entrepreneur in Armenia or one of the Stans who wants to create a novel product or service and charge for it? Sure, US-based services may host the product or service, but the actual big bucks flow to the outfit that keeps the technology “secret.”
At this time, US companies which make high-value software available for free to anyone who can connect to the Internet and download a file are not helping American business. You may disagree. But I know that there are quite a few organizations (commercial and governmental) who think the US approach to open source software is just plain dumb.
Wrapping up an important technology with do-goodism and mostly faux hand waving about the community creates two things:
- An advantage for commercial enterprises who want to thwart American technical influence
- Free intelligence for nation-states who would like nothing more than to convert the US into a client republic.
I did a job for a bunch of venture people who were into the open source religion. The reality is that at this time an alleged monopoly like Google can use its money and control of information flows to cripple other outfits trying to train their systems. On the other hand, companies who just want AI to work may become captive to an enterprise software vendor who is also an alleged monopoly. The companies funded by this firm have little chance of producing sustainable revenue. The best exits will be gift wrapping the “innovation” and selling it to another group of smart software-hungry investors.
Does the world need dozens of smart software “big dogs”? The answer is, “No.” At this time, the US is encouraging companies to make great strides in smart software. These are taking place. However, the rest of the world is learning and may have little or no desire to follow the open source path to the big WWF face off in the US.
The smart software revolution is one example of how America’s technology policy does not operate in a way that will cause our adversaries to do anything but download, enhance, build on, and lock up increasingly smarter AI systems.
From my vantage point, it is too late to undo the damage; the wildness of the last few years cannot be remediated. The big winners in open source are not the individual products. Like the WWF shows, the winner is the promoter. Very American and decidedly different from what those in other countries might expect or want. Money, control, and power are more important than the open source movement. Proprietary may be that group’s preferred approach. Open source is software created by computer science students to prove they can produce code that does something. The “real” smart software is quite different.
Stephen E Arnold, August 2, 2024
Scale Is All You Need for Superintelligence: Interesting Assertion
July 2, 2024
This essay is the work of a dumb dinobaby. No smart software required. This blog post includes information from Claude 3.5 Sonnet. The information from smart software is in blue.
At this time, smart software boils down to the “transformer.” I navigated to Claude 3.5 Sonnet and asked the smart software to explain “transformer.” Here’s what the system generated:
The Google Transformer innovation, introduced in 2017, revolutionized natural language processing by introducing a novel architecture that relies entirely on attention mechanisms. This approach allows the model to process input sequences in parallel, rather than sequentially, leading to significant improvements in translation quality and training efficiency. The Transformer’s key features include self-attention, which enables it to capture relationships between words regardless of their position in a sentence, and positional encoding, which helps the model understand word order without relying on recurrent neural networks.
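Claude’s description boils down to a few matrix multiplications. Here is a minimal sketch of the self-attention step in Python with NumPy; the matrix sizes and random weights are illustrative assumptions of mine, not anything from Claude’s answer or the Etched write up:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project the input embeddings into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Scores: every position attends to every other position, in parallel.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns scores into attention weights over positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # The output mixes value vectors, regardless of word position.
    return weights @ V

# Toy run: 4 tokens with 8-dimensional embeddings (sizes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```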
I then asked, “Are there other ways to achieve smart software or AI information functions?” Claude 3.5 Sonnet spit out this list:
- Machine Learning Algorithms
- Expert Systems
- Neural Networks.
Options are good. But the buzz focuses on transformers, a Google “invention” introduced in 2017 (though some suggest its roots reach back into the mists of time). But let’s stick with the Google and 2017.
The future is on the horizon. Thanks, MSFT Copilot. Good enough and you spelled “future” correctly.
I read “Etched Is Making the Biggest Bet in AI.” That is an interesting statement. The company states what its chip is not:
By burning the transformer architecture into our chip, we can’t run most traditional AI models: the DLRMs powering Instagram ads, protein-folding models like AlphaFold 2, or older image models like Stable Diffusion 2. We can’t run CNNs, RNNs, or LSTMs either. But for transformers, Sohu is the fastest chip of all time.
What does the chip do? The company says:
With over 500,000 tokens per second in Llama 70B throughput, Sohu lets you build products impossible on GPUs. Sohu is an order of magnitude faster and cheaper than even NVIDIA’s next-generation Blackwell (B200) GPUs.
The company again points out the downside of its “bet the farm” approach:
Today, every state-of-the-art AI model is a transformer: ChatGPT, Sora, Gemini, Stable Diffusion 3, and more. If transformers are replaced by SSMs, RWKV, or any new architecture, our chips will be useless.
Yep, useless.
What is Etched’s big concept? The company says:
Scale is all you need for superintelligence.
This means, in my dinobaby-impaired understanding, that bigger delivers smarter smart software. Skip the power, pipes, and pings. Just scale everything. The company agrees:
By feeding AI models more compute and better data, they get smarter. Scale is the only trick that’s continued to work for decades, and every large AI company (Google, OpenAI / Microsoft, Anthropic / Amazon, etc.) is spending more than $100 billion over the next few years to keep scaling.
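For what it is worth, the scaling claim has a quantitative form in the research literature. DeepMind’s Chinchilla work, which Etched does not cite and which I am supplying for context, fits model loss as a power law in both parameter count N and training tokens D:

L(N, D) = E + A / N^alpha + B / D^beta

Loss keeps falling as either term grows, which is the empirical basis for “scale is all you need” pitches.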
Because existing chips are “hitting a wall,” a number of companies are in the smart software chip business. The write up mentions 12 of them, and I am not sure the list is complete.
Etched is different. The company asserts:
No one has ever built an algorithm-specific AI chip (ASIC). Chip projects cost $50-100M and take years to bring to production. When we started, there was no market.
The company walks through the problems of existing chips and delivers its knockout punch:
But since Sohu only runs transformers, we only need to write software for transformers!
Reduced coding and an optimized chip: Superintelligence is in sight. Does the company want you to write a check? Nope. Here’s the wrap up for the essay:
What happens when real-time video, calls, agents, and search finally just work? Soon, you can find out. Please apply for early access to the Sohu Developer Cloud here. And if you’re excited about solving the compute crunch, we’d love to meet you. This is the most important problem of our time. Please apply for one of our open roles here.
What’s the timeline? I don’t know. What’s the cost of an Etched chip? I don’t know. What’s the infrastructure required? I don’t know. But superintelligence is almost here.
Stephen E Arnold, July 2, 2024
Free AI Round Up with Prices
June 18, 2024
This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.
EWeek (once PCWeek and a big fat Ziff publication) has published what seems to be a mash up of MBA-report writing, a bit of smart software razzle dazzle, and two scoops of Gartner Group-type “insight.” The report is okay, and its best feature is that it is free. Why pay a blue-chip or mid-tier consulting firm to assemble a short monograph? Just navigate to “21 Best Generative AI Chatbots.”
A lecturer shocks those at the presentation with a hard truth: Human-generated reports are worse than those produced by a “leading” smart software system. Is this the reason a McKinsey professional told interns, “Prompts are the key to your future”? Thanks, MSFT Copilot. Good enough.
The report consists of:
A table with the “leading” chatbots presented in random order. Forget that alphabetization baloney. Sorting by “leading” chatbot name is so old timey. The table presents these evaluative/informative factors:
- Best for use case; that is, when one would use a specific chatbot, in the opinion of the EWeek “experts” I assume
- Query limit. This is baffling since recyclers of generative technology are eager to sell a range of special plans
- Language model. This column is interesting because it makes clear that of the “leading” chatbots 12 of them are anchored in OpenAI’s “solutions”; Claude turns up three times, and Llama twice. A few vendors mention the use of multiple models, but the “report” does not talk about AI layering or the specific ways in which different systems contribute to the “use case” for each system. Did I detect a sameness in the “leading” solutions? Yep.
- The baffling Chrome “extension.” I think the idea is that the “leading” solution with a Chrome extension runs in the Google browser. Five solutions do run as a Chrome extension. The other 16 don’t.
- Pricing. Now prices are slippery. My team pays for ChatGPT, but since the big 4o, the service seems to be free. We use a service not on the list, and each time I access the system, the vendor begs — nay, pleads — for more money. One vendor charges $2,500 per month paid annually. Now, that’s a far cry from Bing Chat Enterprise at $5 per month, which is not exactly the full six pack.
The bulk of the report is a subjective score for each service’s feature set, its ease of use, the quality of output (!), and support. What these categories mean is not spelled out in a definition of terms. Hey, everyone knows about “quality,” right? And support? Have you tried to contact a whiz-bang leading AI vendor? Let me know how that works out. The screenshots vary slightly, but the underlying sameness struck me. Each write up includes what I would call a superficial or softball listing of pros and cons.
The most stunning aspect of the report is the explanation of “how” the EWeek team evaluated these “leading” systems. Gee, knowing which systems were excluded and why would have been helpful, in my opinion. Let me quote the explanation of quality:
To determine the output quality generated by the AI chatbot software, we analyzed the accuracy of responses, coherence in conversation flow, and ability to understand and respond appropriately to user inputs. We selected our top solutions based on their ability to produce high-quality and contextually relevant responses consistently.
Okay, how many queries? How were queries analyzed across systems, assuming similar systems received the same queries? Which systems hallucinated or made up information? Which queries caused one or more systems to fail? What were the qualifications of those “experts” evaluating the system responses? Ah, so many questions. My hunch is that EWeek just skipped the academic baloney, ran some queries, plugged in a guess-ti-mate, and headed to Starbucks. I do hope I am wrong, but I worked at the Ziffer in the good old days of the big fat PCWeek. There was some rigor, but today? Let’s hit the gym.
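To be fair about what rigor would require, here is a minimal sketch of a reproducible side-by-side evaluation. The rubric categories come straight from EWeek’s stated criteria (accuracy, coherence, appropriateness); the chatbot interface and scoring function are hypothetical placeholders of mine, not EWeek’s actual process:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a "chatbot" is any function from prompt to response.
Chatbot = Callable[[str], str]

@dataclass
class Score:
    accuracy: float         # factual correctness of the response
    coherence: float        # conversational flow
    appropriateness: float  # fit between user input and response

def evaluate(bots: dict[str, Chatbot],
             prompts: list[str],
             rubric: Callable[[str, str], Score]) -> dict[str, float]:
    """Run one fixed prompt set through every bot and average rubric scores.

    The rubric (human judges or a grading model) is the part the EWeek
    report never specifies; here it is an explicit, swappable parameter.
    """
    results = {}
    for name, bot in bots.items():
        scores = [rubric(p, bot(p)) for p in prompts]
        results[name] = sum(
            (s.accuracy + s.coherence + s.appropriateness) / 3 for s in scores
        ) / len(scores)
    return results
```

Fixed prompts, named judges, and published per-query results would answer most of the questions above.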
What is the conclusion for this report about the “leading” chatbot services? Here it is:
Determining the “best” generative AI chatbot software can be subjective, as it largely depends on a business’s specific needs and objectives. Chatbot software is enormously varied and continuously evolving, and new chatbot entrants may offer innovative features and improvements over existing solutions. The best chatbot for your business will vary based on factors such as industry, use case, budget, desired features, and your own experience with AI. There is no “one size fits all” chatbot solution.
Yep, definitely worth the price of admission.
Stephen E Arnold, June 18, 2024
Meta Mismatch: Good at One Thing, Not So Good at Another
May 27, 2024
This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.
I read “While Meta Stuffs AI Into All Its Products, It’s Apparently Helpless to Stop Perverts on Instagram From Publicly Lusting Over Sexualized AI-Generated Children.” The main idea is that Meta has a problem stopping “perverts.” You know a “pervert,” don’t you? One can spot ’em when one sees ’em. The write up reports:
As Facebook and Instagram owner Meta seeks to jam generative AI into every feasible corner of its products, a disturbing Forbes report reveals that the company is failing to prevent those same products from flooding with AI-generated child sexual imagery. As Forbes reports, image-generating AI tools have given rise to a disturbing new wave of sexualized images of children, which are proliferating throughout social media — the Forbes report focused on TikTok and Instagram — and across the web.
What is Meta doing or not doing? The write up is short on technical details. In fact, there are no technical details. Is it possible that any online service allowing anyone to comment or upload content will end up hosting something “bad”? Online requires something that most operators don’t want to provide. The secret ingredient is spelling out an editorial policy and making decisions about what is appropriate or inappropriate for an “audience.” Note that I have converted digital addicts into an audience, albeit one that participates.
Two fictional characters are supposed to be working hard and doing their level best. Thanks, MSFT Copilot. How has that Cloud outage affected the push to more secure systems? Hello, hello, are you there?
Editorial policies require considerable intellectual effort, crafted workflow processes, and oversight. Who does the overseeing? In the good old days when publishing outfits like John Wiley & Sons-type or Oxford University Press-type outfits were gatekeepers, individuals who met the cultural standards were able to work their way up the bureaucratic rock wall. Now the mantra is the same as the probability-based game show with three doors and “Come on down!” Okay, “users” come on down, wallow in anonymity, exploit a lack of consequences, and surf on the darker waves of human thought. Online makes clear that people who read Kant, volunteer to help the homeless, and respect the rights of others are often at risk from the denizens of the psychological night.
Personally I am not a Facebook person, an Instagram user, or a person requiring the cloak of a WhatsApp logo. Futurism takes a reasonable stand:
it’s [Meta, Facebook, et al] clearly unable to use the tools at its disposal, AI included, to help stop harmful AI content created using similar tools to those that Meta is building from disseminating across its own platforms. We were promised creativity-boosting innovation. What we’re getting at Meta is a platform-eroding pile of abusive filth that the company is clearly unable to manage at scale.
How long has Meta been trying to be a squeaky-clean information purveyor? Is the article going overboard?
I don’t have answers, but after years of verbal fancy dancing, progress may be parked at a rest stop on the information superhighway. Who is the driver of the Meta construct? If you know, that is the person to whom one must address suggestions about content. What if that entity does not listen and act? Government officials will take action, right?
PS. Is it my imagination or is Futurism.com becoming a bit more strident?
Stephen E Arnold, May 27, 2024
Content Mastication: A Controversial Business Tactic
January 25, 2024
This essay is the work of a dumb dinobaby. No smart software required.
In the midst of the unfolding copyright issues, I found this post quite interesting. Torrent Freak published a story titled “Meta Admits Use of ‘Pirated’ Book Dataset to Train AI.” Is the story spot on? I sure don’t know. Nevertheless, the headline is a magnetic one. The story reports:
The cases allege that tech companies, including Meta and OpenAI, used the controversial Books3 dataset to train their models. The Books3 dataset has a clear piracy angle. It was created by AI researcher Shawn Presser in 2020, who scraped the library of ‘pirate’ site Bibliotik. This book archive was publicly hosted by digital archiving collective ‘The Eye’ at the time, alongside various other data sources.
A combination of old-fashioned content collection and smart systems moves information from Point A (a copyright owner’s night table) to a smart software system. MSFT’s second-class Copilot Bing thing created this cartoon. Sigh. Not even good enough now in my opinion.
What was in the Books3 data collection? The TF story elucidates:
The general vision was that the plaintext collection of more than 195,000 books, which is nearly 37GB…
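A quick sanity check on those numbers: 37 GB spread across roughly 195,000 books works out to about 190 KB of plain text per title.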
What did Meta allegedly do to make its Llama smarter than the average member of the Camelidae family? Let’s roll the TF quote:
Responding to a lawsuit from writer/comedian Sarah Silverman, author Richard Kadrey, and other rights holders, the tech giant admits that “portions of Books3” were used to train the Llama AI model before its public release. “Meta admits that it used portions of the Books3 dataset, among many other materials, to train Llama 1 and Llama 2,” Meta writes in its answer [to a court].
The article does not include any statements like “Thank you for the question” or “I don’t know. My team will provide the answer at the earliest possible moment.” Nope. Just an alleged admission.
How will the Meta and parallel copyright legal matter evolve? Beyond Search has zero clue. The US judicial system has deep and mysterious logic. One thing is certain: Senior executives do not like uncertainty and risk. The copyright litigation seems tailored to cause some techno feudalists to imagine a world in which laws, annoying regulators, and people yapping about intellectual property were nudged into a different line of work. One example which comes to mind is building secure bunkers or taking care of the lawn.
Stephen E Arnold, January 25, 2024