AI Crawlers Are Bullying Open Source: Stop Grousing and Go Away

April 25, 2025

AI algorithms are built on open source technology. Unfortunately, generative AI is harming its mother code, explains TechDirt in “AI Crawlers Are Harming Wikimedia, Bringing Open Source Sites To Their Knees, And Putting The Open Web At Risk.” To make generative AI work you need a lot of computer power, smart coding, and mounds of training data. Money can buy coding and power, but (quality) training data is incredibly difficult to obtain.

AI crawlers were unleashed on the Internet to scrape information for use in training models. The biggest information providers for crawlers are Wikimedia projects, and that is a big problem. Wikimedia, which claims to be “the largest collection of open knowledge in the world,” says most of its traffic is from crawlers and that the traffic is driving up its costs:

“Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.”

This is bad because the load is straining Wikimedia’s datacenter and budget. Wikimedia isn’t the only information source feeling the burn from AI crawlers. News sites and more are being wrung by crawlers for every scrap of information:

“It’s increasingly clear that the reckless and selfish way in which AI crawlers are being deployed by companies eager to tap into today’s AI hype is bringing many sites around the Internet to their knees. As a result, AI crawlers are beginning to threaten the open Web itself, and thus the frictionless access to knowledge that it has provided to general users for the last 30 years.”

Silicon Valley might have good intentions but dollars are more important. (Oh, I am not sure about the “good intentions.”)

Whitney Grace, April 25, 2025

JudyRecords: Is It Back or Did It Never Go Away?

April 22, 2025

Believe it or not, no smart software. Just a dumb and skeptical dinobaby.

I was delighted to see that JudyRecords is back online. Here’s what the service says as of April 19, 2025:

Judyrecords is a 100% free nationwide search engine that lets you instantly search hundreds of millions of United States court cases and lawsuits. judyrecords has over 100x more cases than Google Scholar and 10x more cases than PACER, the official case management system of the United States federal judiciary. As of Jul 2022, judyrecords now features free full-text search of all United States patents from 1/1/1976 to 07/01/2022 — over 8.1 million patents in total.

My thought is that lawyers, law students, and dinobabies like me will find the service quite useful.

The JudyRecords’ Web site adds:

The first 500K results are displayed instead of just the first 2K.

  • murder – 926K cases
  • fraud – 2.1 million cases
  • burglary – 3.7 million cases
  • assault – 8.2 million cases

Most people don’t realize that the other “free” search engines limit the number of hits shown to the user. The old-fashioned ideas of precision and recall are not operative with most of the people whom I encounter. At the Googleplex, precision and recall are treated like a snappy joke when the Sundar & Prabhakar Comedy Show appears in a major venue like courtrooms.

If you want to control the results, JudyRecords provides old-fashioned and definitely unpopular methods such as Boolean logic. I can visualize the GenZs rolling their eyes and mouthing, “Are you crazy, old man?”

Please, check out JudyRecords because the outstanding management visionaries at LexisNexis, Thomson Reuters, and other “professional” publishers will be taking a look themselves.

Stephen E Arnold, April 22, 2025

Sweden Has a Social Fabric: The Library Pattern

November 29, 2023

This essay is the work of a dumb dinobaby. No smart software required.

If a building is left unlocked and unattended, it’s all but guaranteed that people will trespass, rob, and vandalize. While we expect the worst from humanity, sometimes our faith in the species is restored with amazing stories like this from ZME Science: “A Door At A Swedish Library Was Accidentally Left Open-446 People Came In, Borrowed 245 Books. Every Single One Was Returned.”

Gothenburg librarian Anna Carin Elf noticed something odd when she went to work one day. The library was supposed to be closed because it was All Saints’ Day. People, however, were browsing shelves, reading, using computers, and playing in the children’s section. A member of the library staff had forgotten to shut one of the doors. The next day patrons took advantage of the open door and used the facilities.

When Elf saw the library was open she jumped into action:

“As people were coming in and out of the library, one librarian (Elf) walked by and noticed the people using the library. She realized what was happening, called her manager and a colleague, and then announced that the library was closing. The visitors calmly folded their books closed and left. But some left with books.”

When the library was accidentally left open, the people of Gothenburg borrowed 245 books and every single one was returned. It’s wonderful when communities recognize the importance of libraries and decide to respect them. Libraries continue to be an important part of cities as they provide access to information, Internet, books, activities, and more.

Whitney Grace, November 29, 2023

Racy Poetry Now Available

October 26, 2023

This essay is the work of a dumb humanoid. No smart software required.

My hunch is that you either have forgotten or were not aware of the Wife of Bath. Well, let me tell you that was a hot read in the 16th century. Now you can review the pre-1600 manuscripts of Chaucer’s works. Many years ago my professor for a 15 week class in Chaucer was one of the editors of the then standard text of Chaucer’s poetry. I think his name was J.J. Campbell.


Microsoft’s art generator thinks that the Wife of Bath looks like this machine-generated image. I don’t think the dreamy pix matches my reconstruction of the Wife of Bath, who wore red socks and had a method to generate hard cash on demand.

What he did, I think, was get students like me to undertake specific research and write papers about the topic. My assignments involved tracking references to the even more salacious volumes (at least in the 16th century) of the Apocrypha. Imagine the fun that was. The British Library has digitized the manuscripts and books. These are available at this link. How long will it take Alamy, Getty, and other image wizards to suck out the images and charge people for the use of content created centuries ago? Not long. Not long at all. By the way, watch out for friars in the woods.

Stephen E Arnold, October 26, 2023

Paper Envisions an Open Science Platform for Chemistry Researchers

July 14, 2022

What could be accomplished if machine learning were harnessed to help scientists connect, collaborate, and build on each other’s findings? A team of researchers ponders “Making the Collective Knowledge of Chemistry Open and Machine Actionable.” Researchers Kevin Maik Jablonka, Luc Patiny, and Berend Smit hope their suggestions will bring the field of chemistry closer to FAIR principles (findable, accessible, interoperable, and reusable). The paper, published by Nature Chemistry, observes:

“Chemical research is still largely centered around paper-based lab notebooks, and the publication of data is often more an afterthought than an integral part of the process. Here we argue that a modular open-science platform for chemistry would be beneficial not only for data-mining studies but also, well beyond that, for the entire chemistry community. Much progress has been made over the past few years in developing technologies such as electronic lab notebooks that aim to address data-management concerns. This will help make chemical data reusable, however it is only one step. We highlight the importance of centering open-science initiatives around open, machine-actionable data and emphasize that most of the required technologies already exist—we only need to connect, polish and embrace them.”

The authors go on to describe how to do just that using structured and open data with semantic tools. In order to make the transition as smooth as possible, the team suggests data capture should be similar to the way chemists already work. Data should also be generated in a standardized format other researchers can easily use. A formal ontology will be important here. For consistency and accessibility, the paper also recommends building a modular data-analysis platform with a common interface and standardized protocols. This open-science platform would replace the hodgepodge of different, often proprietary, tools currently in use. It would also make publication of data a seamless, and centralized, part of the process. See the paper for all the details. The authors conclude:

“We emphasize that the technology is here not only to facilitate the process of publishing data in a FAIR format to satisfy the sponsors, but also to ensure that the combination of chemical data, FAIR principles and openness gives scientists the possibility to harvest all data so that all chemists can have access to the collective knowledge of everybody’s successful, partly successful and even ‘failed’ experiments.”

Cynthia Murrell, July 14, 2022

ACM Opens Computing Literature Archive

May 30, 2022

The history of computers is fascinating. It starts thousands of years ago with some of history’s brilliant intellects, staggers, and then quickly advances in the twentieth century. We now have humanity’s collected knowledge in the palm of our hands…if the data or Wi-Fi connections work. The Association for Computing Machinery has documented the development of modern computing since 1947, and the organization has opened an archive: ACM Digital Library. Associations Now explains why ACM opened its archives in “‘The Way Things Were’: How The Association For Computing Machinery Is Opening The Doors To Its Archives.”

ACM wants people to realize how far the computing industry has come, and for its seventy-fifth anniversary it is opening up its archives to the public. In the past, these records were locked behind a paywall; now they are free. More than 117,500 articles from 1951-2000 are readable by the public. The archive is part of a greater ACM initiative:

“Vicki L. Hanson, the group’s CEO, noted that the ACM Digital Library initiative is part of a broader effort to make its archives available via open access by 2025. ‘Our goal is to have it open in a few years, but there’s very real costs associated with [the open-access work],’ Hanson said. ‘We have models so that we can pay for it.’ While the organization is still working through its open-access effort, it saw an opportunity to make its “backfile” of materials available, timed to the organization’s 75th anniversary.”

Hanson continued that opening the archive was not a big challenge, because ACM already had a system designed for public consumption. ACM wanted a creative way to announce the archive, so it used its seventy-fifth anniversary.

Organizations need to make money to support their research, but too much scientific information is kept behind paywalls. ACM’s move to share its research is a step more organizations should make.

Whitney Grace, May 30, 2022

Viva The Academic Publisher Boycott!

July 30, 2015

Academic databases provide access to quality research material, which is key for any student, professor, or researcher to succeed in their work.  One major drawback to academic databases is the high cost associated with subscription fees.  Individual researchers cannot justify subscribing to an academic database and purchasing a single article runs high.  This is why they rely on academic libraries to cover the costs.  Due to changing publishing trends, academic publishers are raising subscription fees.

Elsevier is one of the largest and best-known scientific journal databases, but it is also the most notorious for its expensive subscription fees, and universities are getting tired of it.  Univers reports that “Dutch Universities Start Their Elsevier Boycott.”  The Netherlands, led by state secretary Sander Dekker, wants all scientific content to be free online.  Under this model, the university or the financier pays for an article to be published.  All content by Dutch scientists will hopefully be open access by 2024.

In the meantime, the Association of Universities in the Netherlands has asked all Dutch scientists that work with Elsevier to resign from their positions.  As is to be expected, some are willing and others are more reluctant.  The goal is to pressure Elsevier to change its practices.

“In Univers nr. 8, in January, professor Jan Blommaert called the current publishing system ‘completely absurd’. Not only because of the costs for subscription, but also because the journals have a lot of power over the content: ‘A young PhD student who has been able to get an article accepted by a journal may still have to wait 18 months for it to be published, because the editors prefer well-known names. It is not unthinkable that if I would submit a love letter, it would be published sooner than an intelligent scholarly article by a young researcher.’ ”

The Dutch universities are setting a standard that many libraries and universities will also follow, but the hardest part is encouraging more to participate.  Libraries and universities have an obligation to provide needed materials to researchers, and a boycott will make that obligation harder to meet.  Large boycotts, rather than individual ones, will be more effective and instrumental in changing Elsevier’s practices.

Whitney Grace, July 30, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

OMICS Publishing Group Threatens Billion Dollar Suit Over Slanderous Blog Post

June 14, 2013

The Chronicle of Higher Education article titled “Publisher Threatens to Sue Blogger for One Billion” tells the story of Jeffrey Beall, who runs the blog Scholarly Open Access. The blog has a list of predatory or questionable publishers and journals. One of these publishers, OMICS Publishing Group from India, wants to sue Beall and put him in jail.

“The OMICS Group’s practices have received particular attention from Mr. Beall and some publications, including The Chronicle. In 2012, The Chronicle found that the group was listing 200 journals, but only about 60 percent had actually published anything… On his blog, Mr. Beall accuses OMICS of spamming scholars with invitations to publish, quickly accepting their papers, then charging them a nearly $3,000 publishing fee after a paper has been accepted.”

The letter sent to Beall accused him of racial discrimination as well as unprofessionalism. Whether the suit is a publicity stunt or sparked by legitimate outrage is unclear, but in India it is against the law to publish “menacing” information online under Section 66A of the Information Technology Act. Is pay to play content really such a contentious concept? The academics desperate to be published in legitimate journals who follow Beall’s blog would certainly say yes.

Chelsea Kerwin, June 14, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Useful Source of Open Books

May 10, 2013

The good folks at O’Reilly Media offer a roster of Open Books for your inner programmer. O’Reilly has not only navigated the open license landscape to offer these publications, but also got them digitized into e-books so they can be available to anyone with an Internet connection. Though the publisher has offered books under various open copyrights for years, it now has a concerted focus in this area.

The write-up makes sure to give credit where credit is due:

“We’re happy to have partnered with two innovative nonprofits, Creative Commons and the Internet Archive, to solve the licensing and digitizing challenges involved in bringing Open Books to readers.

“While the books listed here use various open licenses, since 2003 we’ve focused on using the licenses created by Creative Commons. O’Reilly has adopted the Creative Commons Founders’ Copyright, which we’re applying to hundreds of out-of-print and current titles, pending author approval.

“Through its Open Library project, the Internet Archive is scanning and hosting PDF versions of our open books. We posted the first book, the original edition of The Whole Internet User’s Guide & Catalog in October of 2005, as part of the launch of the Open Content Alliance (we and the Internet Archive are among the founding members of the alliance).”

O’Reilly expresses gratitude to Creative Commons and the Internet Archive, and suggests users consider donating to these initiatives. (We concur.) Check out the generous list—you might just pick up some crucial information for free.

Cynthia Murrell, May 10, 2013


Cengage: Time to Disengage?

March 25, 2013

Thomson Reuters in “Cengage Learning Hires Restructuring Advisers” reported that a former Thomson property is arranging a modest infusion of cash. “Modest” in this context is about $430 million, which is nothing when compared to the cost of a modern textbook. (See “Textbook Prices Are Inflating Even Faster Than Tuition Prices: New Boston University Classifieds for Students Makes Buying Textbooks More Affordable.”)

Cengage used to be Thomson Learning, a sprawling collection of publishing companies. Some of the firms had traditional textbooks; others had combinations of traditional textbooks and electronic versions. My recollection is that the technical infrastructure of the original Thomson Learning was quite diverse. “Diverse” publishing infrastructures in the same organization add significantly to the costs of doing business. “Diverse” is also a stuck brake on innovation because repurposing content is time consuming and labor intensive. Prior to spinning off Thomson Learning to Apax Partners and Omers Capital Partners, Thomson’s senior management were focusing their considerable talents on cost efficiencies. I assume that the technical infrastructure issues have been resolved.

Debt can be a burden, as this illustration from Shape Home Loans suggests. Does debt enhance agility or is it a financial play disconnected from structural changes such as those described in my “Gadzooks, It’s MOOCs: The Fuss over Open Source Learning” article?

One item in the Thomson Reuters news release caught my attention:

…the company said it had borrowed $430 million, almost all of its remaining credit facility to ensure its businesses have the cash they need. Stamford, Connecticut-based Cengage has a $1.5 billion term loan that matures next year and a total of $5.3 billion of debt as of Dec. 31.

Several observations:

First, this type of cash crunch in publishing is likely to become more common. I wrote a story for Online Searcher about the impact of online learning. There is also a chorus of “if you are smart, you can skip college” echoing around Kentucky. What if the online learning and the “you don’t have to go to college” blend? Companies depending upon the traditional purchasing patterns in education may find that new revenues are not sufficient to keep up with old revenue losses.

Second, the spillover from a Cengage-type of problem will have cascading effects. Examples which come to mind are revenues flowing to organizations such as Ebsco Electronic Publishing, ProQuest, and Wolters Kluwer. These companies are in the education food chain. If Cengage flu becomes contagious, these firms will face some additional financial challenges.

Third, the authors who provide content to the textbook giants have to be paid. With the shift to online courses, some of these authors may take their “fame” and their content and go in a new direction. It is now possible for some textbook superstar authors to try to become celebrities. If Google needs knowledge, the company just hires the superstar. Won’t the same approach become possible in the online learning space? Maybe an existing textbook company will corner this market? I am not sure traditional textbook companies have the agility necessary to pull off a slam dunk.

Fourth, the online services like Thomson Reuters’ WestLaw and Reed Elsevier’s LexisNexis may also feel the impact of a shift. On one hand, these systems could gain new content from disaffected textbook publishers and, therefore, more revenue pulling information. On the other hand, traditional online services have been caught flatfooted by the surge in online educational content and may be too late to ride the new revenue train.

Net net: Is it time for customers of Cengage to disengage? A larger question is, “Will the professional publishing and professional online services be able to adjust to yet another sapping of their life blood?” Changes are coming. Many of these shifts will not be gentle, kind, or slow, I fear.

Stephen E Arnold, March 25, 2013
