Old Newspapers, Now in Jpeg Format

May 6, 2011

In a former work life, some of my colleagues had a brush with microfilming newspapers. Yep: microfilm, chemicals, scratches, and EPA-baiting effluvia. Imagine my surprise when I read about Smart Team’s new solution for hard-copy information in “OCR Newspaper, are you kidding me?” The company’s Steven Wang allegedly said:

Smart Team has built a solution for a global news agency to Full-Page-OCR its millions of decade old newspaper archive in jpeg format. The solution is built upon Autonomy Teleform and LiquidOffice. The whole solution let the agents to snatch paragraphs and pictures on the image, auto OCR the content, then verify and eventually export the content into the Newspaper Archive, a data warehouse, and index them into Autonomy IDOL.

This is an interesting use of technology. The write-up lists several challenges that have deterred such attempts in the past and describes how Smart Team has overcome them. The ability to convert optical scans into semantically searchable data is quite an accomplishment. According to the firm’s Web site:

Smart Team is a software service firm focused on providing Enterprise Software solutions, specialized on customization and integration around Autonomy, Zeus products. We continue to add new modules to our SMART Library and provide outstanding support services to support your organization…

The choice of Autonomy as an engine or platform is a good one. Hopefully the company will find a market. Libraries once were hungry for certain types of material in image form. With constrained budgets, sales may require time and effort.

Cynthia Murrell May 6, 2011

Freebie

Written by Stephen E. Arnold · Filed Under Business strategy, News, Technology, Text processing

Comments

One Response to “Old Newspapers, Now in Jpeg Format”

Charlie on May 6th, 2011 7:07 am

The article says the system “can generate a Text PDF automatically with 85% of the text recognized for images less than 150 dpi”. In my experience 85% is far too low to be useful – there are OCR solutions being used by some of our customers that perform much better than this, and we implement automatic OCR corrections in the later search stages to help with any errors.

It’s not much use having a system that (however inaccurately) claims to do Meaning Based Search when you’re missing 15% of the original content.

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.