The British Library Channels University Microfilms and the Google

September 1, 2021

A quick Google search can yield pertinent information, but that information is hard to find. Why? Google search results are clogged with paid ads and Web sites that are not authoritative sources. Newspapers are still a valuable resource, especially newspapers from before the Internet’s invention. The brilliant news, as IanVisits shares, is that “The British Library Puts 1 Million Newspaper Pages Online For Free.”

The British Newspaper Archive contains over forty-four million newspaper pages that range from 1600 to 2009. The newspapers are from British and Irish sources, and they represent over 10% of the newspapers the British Library owns. Around half a million pages are added to the archive every month.

The newspapers currently require a subscription, but all funds go to scanning more pages for the archive. The British Newspaper Archive has released one million pages for free and plans to add another million over the next four years. Not all pages will be free, however:

“They won’t add all papers, as they say that while they consider newspapers made before 1881 to be in the public domain, that does not mean that will make all pre-1881 digitized titles available for free, as the archive is dependent on subscriptions to cover its costs. If like me you do a lot of historical research, then the cost of the full subscription is not that bad – just £80 a year for the full archive.”

The archive offers 158 free newspaper titles that range from 1720-1880. All of the newspapers that fall within this date range are in the public domain.

It would be awesome if all newspapers were available for free on the Internet, but money makes the world go round. Libraries and universities offer free access to newspaper databases, and subscription services, in most cases, are not that expensive.

The good news is that researchers may have access to news stories infused with some of that good old “real” journalistic wire tapping.

Whitney Grace, September 1, 2021

The Internet Archive Dons a Scholar Skin

April 23, 2021

Some of today’s biggest social faux pas are believing everything on the Internet, clicking the first link in search results, and buying items from questionable Internet ads. It is easy to forget that search engines like Google and Bing are for-profit businesses that put paid links at the top of search results. What is even worse is that scientific and scholarly information is locked behind expensive paywalls.

Wikipedia is often believed to be a reliable source, but despite the dedication of wiki editors the encyclopedia is not 100% accurate. There are free scholarly databases and newspapers often have their archives online, but that information is not widely known.

Thankfully the Internet Archive is fairly famous. The Internet Archive is a non-profit digital library that provides users with access to millions of free books, music, Web sites, videos, and software. They also allow users to peruse old Web sites with the Wayback Machine.

The Internet Archive recently introduced a brand new service that is sheer genius: Internet Archive Scholar. It is described as:

“This full text search index includes over 25 million research articles and other scholarly documents preserved in the Internet Archive. The collection spans from digitized copies of eighteenth century journals through the latest Open Access conference proceedings and pre-prints crawled from the World Wide Web.”

Why did no one at the Internet Archive think of doing this before? It is a brilliant idea that localizes millions of scholarly articles and other information without paywalls, university matriculation, or a library card. Most of the information available through the Internet Archive Scholar would otherwise remain buried in Google search results or on the Web, like old books gathering dust on library shelves.

Internet Archive Scholar is still in the beta phase and enhancements are a positive step.

Whitney Grace, April 23, 2021

IA Scholar: A Reminder That Existing Online Resources Are Not Comprehensive

March 10, 2021

We spotted this announcement from the Internet Archive in “Search Scholarly Materials Preserved in the Internet Archive.”

IA Scholar is a simple, access-oriented interface to content identified across several Internet Archive collections, including web archives, files, and digitized print materials. The full text of articles is searchable for users that are hunting for particular phrases or keywords. This complements our existing full-text search index of millions of digitized books and other documents on The service builds on Fatcat, an open catalog we have developed to identify at-risk and web-published open scholarly outputs that can benefit from long-term preservation, additional metadata, and perpetual access. Fatcat includes resources that may be useful to librarians and archivists, such as bulk metadata dumps, a read/write API, command-line tool, and file-level archival metadata. If you are interested in collaborating with us, or are a researcher interested in text analysis applications, we have a public chat channel or can be contacted by email at

I ran several queries. The system is set up to respond to a conference name, but free text entries worked fine; for example, NLP. Here are the results:


Worth checking out. In my experience people who are “experts” in online often forget that no online service is up to date, comprehensive, and set up to deliver full text. One other point: Corrections to online content are rarely, if ever, made. Business Dateline, produced by the Courier Journal and Louisville Times in the early 1980s, was one of the first commercial databases to include corrections. Thumbtypers may not care, but that’s the zippy modern world.

Stephen E Arnold, March 10, 2021

Comments about Web Search: Prompted by a Hacker News Thread

November 13, 2020

I spotted a Web search related thread on Hacker News. You can locate the comments at this link. Several observations:

  1. Metasearch. Confusion seems to exist between a dedicated Web search system like Bing, Google, and Yandex and metasearch systems like DuckDuckGo and Startpage. Dedicated Web search systems require considerable effort, but there is less appreciation for the depth of the crawl, the index updating cycle, and similar factors.
  2. Competitors to Google. The comments present a list of search systems which are relatively well known. Omitted are some other services; for example, iSeek, Swisscows, and 50kft.
  3. Bias. The comments do not highlight some of the biases of Web search systems; for example, when are pages reindexed, what pages are on a slow or never update cycle, blacklisted, or processed against a stop word list.

So what?

  1. Many profess to be experts at finding information online. The comments suggest that perception is different from reality.
  2. Locating content on publicly accessible Web sites is more difficult than at any other time in my professional career in the online information sector.
  3. Locating relevant information is increasingly time consuming because predictive, personalized, and wisdom of crowd results don’t work; for example, run this query on any of the search engines:

Voyager search

Did your results point to the Voyager Labs’s system, the UK HR company’s search engine, a venture capital firm, or a Lucene repackager in Orange County? What about Voyager patents?  What about Voyager customers?

How can one disambiguate when the index scope is unknown, entity extraction is almost nonexistent, and deduplication almost laughable? Real time? Ho ho ho.

One can do this work manually. Who wants to volunteer for that? The most innovative specialized search vendors try to automate the process. Some of these systems are helpful; most are not.

Is search getting better? Rerun that Voyager search. See for yourself.

Without field codes, Boolean, and a mechanism to search across publicly accessible content domains, Web search reveals its shortcomings to those who care to look.

Not many look, including professionals at some of the better known Web search outfits.

Stephen E Arnold, November 13, 2020

Science: Just Delete It

September 10, 2020

The information in “Dozens of Scientific Journals Have Vanished from the Internet, and No One Preserved Them” may remind some people that the “world’s information” and the “Internet archives” are marketing sizzle. The steak is the source document. The FBI has used the phrase “going dark” as shorthand for not being able to access certain information. The thrill of not having potentially useful information is one that most researchers prefer to reserve for thrill rides at Legoland.

The write up states:

Eighty-four online-only, open-access (OA) journals in the sciences, and nearly 100 more in the social sciences and humanities, have disappeared from the internet over the past 2 decades as publishers stopped maintaining them, potentially depriving scholars of useful research findings, a study has found. An additional 900 journals published only online also may be at risk of vanishing because they are inactive, says a preprint posted on 3 September on the arXiv server. The number of OA journals tripled from 2009 to 2019, and on average the vanished titles operated for nearly 10 years before going dark, which “might imply that a large number … is yet to vanish…

Flat earthers and those who believe that “just being” is a substitute for academic rigor are probably going to have a “thank goodness, these documents are gone” party. I won’t be attending.

Anti-intellectualism is really exciting. Plus, it makes life a lot easier for those in the top one percent of intellectual capability. Why? Extensive reading can fill in some blanks. Who wants to be comprehensive? Oh, I know: “Those who consume TikTok videos and devour Instagram while checking WhatsApp messages.”

Stephen E Arnold, September 10, 2020

A Librarian Looks at Google Dorking

August 24, 2020

To find solutions for their jobs, many people simply conduct a Google search. Everyone from teachers to executives to software developers searches Google for solutions. Software developers spend an inordinate amount of their time searching for code libraries and language tutorials. One developer named Alec had the brilliant idea to create “dorking.” What is dorking?

“Use advanced Google Search to find any webpage, emails, info, or secrets

cost: $0

time: 2 minutes

Software engineers have long joked about how much of their job is simply Googling things

Now you can do the same, but for free”

Dorking is free! That is great! How does it work? Dorking is a tip guide using Boolean operators and other Google advanced search options to locate information. Dorking, however, does need a bit of coding knowledge to understand how it works.
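For readers who want to see what such a tip looks like in practice, here is a minimal sketch that assembles a query string from a few widely documented Google operators: quoted phrases, `intitle:`, `site:`, and the minus sign for exclusion. The function names `build_query` and `search_url`, the parameter names, and the `example.com` domain are hypothetical and chosen only for illustration; they do not come from the dorking guide itself.

```python
from urllib.parse import quote_plus


def build_query(terms="", exact=None, intitle=None, site=None, exclude=()):
    """Assemble a search string from common Google search operators."""
    parts = []
    if exact:
        parts.append(f'"{exact}"')            # exact phrase match
    if terms:
        parts.append(terms)                   # plain keywords
    if intitle:
        parts.append(f'intitle:"{intitle}"')  # phrase must appear in the page title
    if site:
        parts.append(f"site:{site}")          # restrict results to one domain
    parts.extend(f"-{word}" for word in exclude)  # drop pages containing these words
    return " ".join(parts)


def search_url(query):
    """URL-encode the query so it can be pasted into a browser address bar."""
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    query = build_query("search engine", site="example.com", exclude=["ads"])
    print(query)
    print(search_url(query))
```

Whether Google honors every operator in a given query is a separate matter, as any librarian who has put a word in quotes can attest.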

Some of these tips can be plugged into a Google search box, such as finding similar sites and finding specific pages that must include a phrase in the Title text. Others need that coding knowledge to make them work. For example, finding every email on a Web page requires this:


Yep, dorking for everyone.

After a few practice trials, these dorking tips are sure to work for even the most novice of Googlers. They will also make anyone, not just software developers, look like an expert. As a librarian, I have to ask: why not assign field types and codes, return Boolean logic, and respect existing Google operators? Putting a word in quotes and then getting a result without the word is, how should I frame it? I know: dorky.

Whitney Grace, MLS, August 24, 2020

Kaggle ArXiv Dataset

August 7, 2020

“Leveraging ML to Fuel New Discoveries with the ArXiv Dataset” announces that more than 1.7 million journal-type papers are available without charge on Kaggle. DarkCyber learned:

To help make the ArXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable ArXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

What’s Kaggle? The article explains:

Kaggle is a destination for data scientists and machine learning engineers seeking interesting datasets, public notebooks, and competitions. Researchers can utilize Kaggle’s extensive data exploration tools and easily share their relevant scripts and output with others.

The ArXiv dataset contains metadata for each processed paper (document), including these fields:

  • ID: ArXiv ID (can be used to access the paper, see below)
  • Submitter: Who submitted the paper
  • Authors: Authors of the paper
  • Title: Title of the paper
  • Comments: Additional info, such as number of pages and figures
  • Journal-ref: Information about the journal the paper was published in
  • DOI: Digital Object Identifier
  • Abstract: The abstract of the paper
  • Categories: Categories / tags in the ArXiv system
  • Versions: A version history

Details about the data and their location appear at this link. You can use the ArXiv ID to download a paper.
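As a rough illustration of using the ID to reach a paper, the sketch below turns an ArXiv ID into the abstract and PDF addresses on arxiv.org. The write up does not spell out these URL patterns; the `/abs/` and `/pdf/` paths reflect how arxiv.org publicly serves papers and should be treated as an assumption, and the helper name `arxiv_urls` is hypothetical.

```python
def arxiv_urls(arxiv_id):
    """Return the abstract-page and PDF addresses for one ArXiv ID.

    The /abs/ and /pdf/ paths are an assumption based on how arxiv.org
    serves papers; the Kaggle write up does not document them.
    """
    arxiv_id = arxiv_id.strip()
    if not arxiv_id:
        raise ValueError("an ArXiv ID is required")
    return {
        "abstract": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
    }


if __name__ == "__main__":
    # "0704.0001" is used here only as a sample ID in the post-2007 format.
    for kind, url in arxiv_urls("0704.0001").items():
        print(kind, url)
```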

What if you want to search the collection? You may want to download the terabyte plus file and index the JSON using your favorite search utility. There’s a search system available from ArXiv, and you can use the site: operator on Bing or Google to see if one of those ad-supported services will point you to the document set you need.

DarkCyber wants to suggest that you download the corpus now (datasets can go missing) and use your favorite search and retrieval system or content processing system to locate and make sense of the ArXiv content objects.
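If you lack a favorite search utility, a do-it-yourself index is not much work. The sketch below is a bare-bones inverted index with Boolean AND matching, not a substitute for a real retrieval system. It assumes the metadata arrives as one JSON object per line with lowercase keys such as `id`, `title`, and `abstract`; that layout, the sample records, and the function names are assumptions made for illustration, so check the actual download before relying on them.

```python
import json
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())


def build_index(lines, fields=("title", "abstract")):
    """Map each token to the set of record IDs whose fields contain it."""
    index = defaultdict(set)
    titles = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)           # assumes one JSON object per line
        doc_id = record["id"]
        titles[doc_id] = record.get("title", "")
        for field in fields:
            for token in tokenize(record.get(field) or ""):
                index[token].add(doc_id)
    return index, titles


def search(index, query):
    """Boolean AND: return IDs of records containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical records standing in for lines of the real metadata file.
SAMPLE_LINES = [
    '{"id": "0001.0001", "title": "Neural search", "abstract": "Ranking documents with networks."}',
    '{"id": "0001.0002", "title": "Graph theory notes", "abstract": "Search trees and ranking."}',
]

if __name__ == "__main__":
    index, titles = build_index(SAMPLE_LINES)
    for doc_id in sorted(search(index, "graph ranking")):
        print(doc_id, titles[doc_id])
```

To run it against a real download, pass an open file handle to `build_index` instead of `SAMPLE_LINES`. The whole index lives in memory, so a very large file will want a proper search engine.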

Stephen E Arnold, August 7, 2020

French Computer Terminology

August 1, 2020

This is a helpful resource. However, the term for “spreadsheet” is not included. If you want that spreadsheet holding a summary of your electricity bills, be sure to know the word “tableur.” You can find the collection of terms at this link. The compilation is not une faute passible d’un coup franc, but let’s check with the video assisted referee to be sure.

Stephen E Arnold, August 1, 2020

Online Books

June 16, 2020

The Internet Archive has pulled in its digital tentacles. Are there collections of online books that will not attract lawsuits from increasingly stressed “real” publishers?

The answer is, “Sort of.”

For a listing of “over three million free books on the Web”, point your Mother Hen browser at “The Online Books Page.” Some exploration is needed. The categories are not exactly easy to use, but what online index is these days?

The “Search Our Listings” feature lets a user search by author’s last name and title. The problem, as many grade school students know, is that an author’s name can return many listings. To see what I mean, plug in “Plato”. There you go. A list of books that will dissuade some from locating the old guy who argued with Socrates (not the football playing medical doctor from Brazil).

You can also access a feature called “Exclude extended shelves.” Despite the name, the NOT function delivers the goods. Why make Boolean into something that makes little sense?

The new listings option delivers an earthworm result. If you like to browse, this is your Disneyland. Want magazines? Just click “Serials.” This page leads to more pages listing magazines. Some of the journals in the link to the Electronic Journals Library are not free. Well, free is relative, I suppose.

The effort to gather the information is admirable. Polishing, editorial control, and consistent presentation may arrive in the future.

Worth checking into an author with whom one is familiar. Browsing can be interesting. Years ago I told a former client that no firm had a comprehensive index of electronic books. That company’s young and confident managers did not believe me. Flash forward to 2020: the problem still exists. There you go.

Stephen E Arnold, June 16, 2020

Bookmarks and the Dynamic Web: Yes, Still a Problem

June 3, 2020

Apparently, bookmarks are a thing. Again. Memex from is an open source browser extension that allows users to annotate, search, and organize online information locally. The offline functionality supports both privacy and data ownership. It is available for Chrome, Firefox, and Brave browsers, and now offers a mobile app called Memex Go. The product page lists these features:

Full Text History Search: Automatically indexes websites you visit. Instantly recover anything you’ve seen without upfront work.

Highlights & Annotations: Keep your thoughts organized with their original context.

Tags, Lists & Bookmarks: Quickly organize content via the sidebar or keyboard shortcuts.

Quickly save & organize content on the go: Encrypted sync between your computer, iOS and Android devices.

Your Data and Attention are yours: Memex is offline first & introduced a cap on investor returns so we don’t exploit your attention and data to maximize investor profits.

The page illustrates each feature with a dynamic screen shot, so check it out for more details. You can also click here to learn more about their financial philosophy. The Basic version of Memex is free, while the Pro version costs € 2 per month or € 20 per year (after the 14-day free trial). hopes its software will contribute to a “well-informed and less polarized global society.” Based in Berlin, the company was founded in 2017.

Cynthia Murrell, June 3, 2020
