Newspaper Search: Another Findability Challenge
October 13, 2020
Here is an interesting project any American-history enthusiast could get lost in for hours: Newspaper Navigator. I watched the home page’s 15-minute video, which gives both an explanation of the search tool’s development and a demo. Then I played around with the tool for a bit. Here’s what I learned.
Created by Ben Lee, the Library of Congress's 2020 Innovator in Residence, Newspaper Navigator is built on the Library's Chronicling America, a search portal that allows one to perform keyword searches across 16 million pages of historical US newspapers using optical character recognition. That is a great resource, but how does one search the images in such a collection? That is where Newspaper Navigator comes in.
Lee used thousands of annotations of the collection's visual content, created by volunteers in the Library's 2017 Beyond Words crowdsourcing initiative, to train a machine-learning model to recognize visual content. (He released that dataset, which can be found here. He also created hundreds of prepackaged, downloadable datasets organized by year and type: maps, photos, cartoons, etc.) The Newspaper Navigator search interface lets users plumb 1.5 million high-confidence, public-domain photos from newspapers published between 1900 and 1963. The app supports standard keyword search, but the juicy bit is the ability to search by visual similarity using machine learning.
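For readers curious how "search by visual similarity" typically works under the hood: images are turned into numeric feature vectors by a vision model, and results are ranked by how close their vectors sit to the query's. The snippet below is a minimal, generic sketch with hand-made toy vectors, not Newspaper Navigator's actual pipeline.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(query, catalog, k=3):
    """Rank catalog images by cosine similarity to the query vector."""
    ranked = sorted(catalog.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy 3-dimensional "embeddings"; a real system would get these from
# a trained vision model run over the newspaper images.
catalog = {
    "sailboat_1": [0.9, 0.1, 0.0],
    "sailboat_2": [0.8, 0.2, 0.1],
    "portrait_1": [0.1, 0.9, 0.2],
    "map_1":      [0.0, 0.2, 0.9],
}
print(most_similar([1.0, 0.0, 0.0], catalog, k=2))
# → ['sailboat_1', 'sailboat_2']
```

A query vector resembling a sailboat pulls the two sailboat images to the top, which is the behavior the demo's "sailboat" search relies on at scale.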
Lee walks us through two demo searches, one beginning with the keyword “baseball” and the other with “sailboat.” One can filter by location and time frame, then hover over results for more information on the image itself and the paper in which it appeared. Select images to build a Collection, then tap into the AI prowess via the “Train my AI Navigators” button. The AI uses the selected images to generate a page of similar images, each with a clickable + or – button. Clicking these tells the tool which images are more, and which less, like what is desired. Click “Train my AI Navigators” again to generate a more refined page, and repeat until only (or almost only) the desired type of image appears. Then click the Save button to create a URL that will take one right back to those results later.
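The +/– loop described above resembles classic Rocchio-style relevance feedback: the query vector is nudged toward images marked + and away from images marked –, then similarity search is re-run. A minimal sketch follows; the function name, weights, and plain-list vectors are illustrative assumptions, not the app's actual trainer.

```python
def refine(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update: move the query toward the centroid of
    positive examples and away from the centroid of negative ones."""
    refined = []
    for i in range(len(query)):
        pos = sum(p[i] for p in positives) / len(positives) if positives else 0.0
        neg = sum(n[i] for n in negatives) / len(negatives) if negatives else 0.0
        refined.append(alpha * query[i] + beta * pos - gamma * neg)
    return refined

# One round of feedback: a "+" on a sailboat-like vector and a "–"
# on a portrait-like one pulls the query toward sailboats.
print(refine([0.5, 0.5], positives=[[1.0, 0.0]], negatives=[[0.0, 1.0]]))
# → [1.25, 0.25]
```

Repeating this update and re-ranking after each round mirrors clicking “Train my AI Navigators” again and again until the results converge on the desired kind of image.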
Lee notes that machine learning is not perfect, and some searches lend themselves to refinement better than others. He suggests starting over and retraining if results begin drifting in the wrong direction.
The video acknowledges the potential marginalization issues in any machine learning project. Click on the Data Archaeology tab to read about Lee’s investigation of the Navigator dataset and app from the perspective of bias.
I suggest curious readers play around with the search app for themselves. Lee closes by inviting users to share their experiences through LC-Labs@loc.gov or on Twitter @LC_Labs, #NewspaperNavigator.
Cynthia Murrell, October 13, 2020