Internet Archive Scholar: Will Publishers Find a Way to Stomp This Free Knowledge Beast?

January 12, 2023

Here is a new search service worth noting. The Internet Archive Scholar was built to search the extensive, non-profit Internet Archive. The tool introduces itself:

“This full text search index includes over 25 million research articles and other scholarly documents preserved in the Internet Archive. The collection spans from digitized copies of eighteenth century journals through the latest Open Access conference proceedings and pre-prints crawled from the World Wide Web.”

Yes, that is a lot of information, and a dedicated search system is a welcome addition. If only it were easier to find what one is looking for; the search leaves some on the Arnold IT team wanting more functionality. But the service is young, and the page notes that “Metadata is being improved and features have not been finalized.”

The About page tells us more about how the tool works, where the metadata comes from (fatcat.wiki), and where to direct certain queries. It also addresses the issue of text and data mining:

“We intend to provide researcher access to the full corpus for text and data mining purposes. Derived datasets may also be posted publicly for analysis, for example a citation graph or N-gram frequencies by year. If you are interested or would like to see specific datasets made available, please contact us.

Currently snapshots of the full fatcat metadata corpus and upstream metadata sources are uploaded periodically to the Bulk Bibliographic Metadata collection on archive.org. Read more in the Fatcat Guide.”
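
For readers curious what a derived dataset like "N-gram frequencies by year" might involve, here is a minimal Python sketch. The function name, the (year, text) input format, and the two sample documents are our own illustrative assumptions; nothing here reflects how the Internet Archive actually builds or plans to build its datasets.

```python
from collections import Counter, defaultdict


def ngram_counts_by_year(docs, n=2):
    """Count word n-grams per year.

    docs is an iterable of (year, text) pairs (a hypothetical input format).
    Returns a dict mapping each year to a Counter of n-gram tuples.
    """
    counts = defaultdict(Counter)
    for year, text in docs:
        # Naive tokenization: lowercase and split on whitespace.
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[year][tuple(tokens[i:i + n])] += 1
    return dict(counts)


# Made-up sample documents, for illustration only.
sample_docs = [
    (1790, "open access to knowledge"),
    (2022, "open access preprints and open access proceedings"),
]

bigrams = ngram_counts_by_year(sample_docs, n=2)
print(bigrams[2022][("open", "access")])
```

A real pipeline would need proper tokenization, language handling, and normalization by the number of documents per year, but the basic shape of the computation is about this simple.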

We look forward to seeing what functionality improvements the team implements as the Scholar is developed further. Readers may want to check it out for themselves and/or bookmark the site for future use. We are also curious about publishers’ reactions.

Cynthia Murrell, January 12, 2023

Written by Stephen E. Arnold · Filed Under News, Publishing
