Old Newspapers, Now in Jpeg Format
May 6, 2011
In a former work life, some of my colleagues had a brush with microfilming newspapers. Yep, microfilm, chemicals, scratches, and EPA-baiting effluvia. Imagine my surprise when I read about Smart Team’s new solution to hard copy information in “OCR Newspaper, are you kidding me?” The company’s Steven Wang allegedly said:
Smart Team has built a solution for a global news agency to Full-Page-OCR its millions of decade old newspaper archive in jpeg format. The solution is built upon Autonomy Teleform and LiquidOffice. The whole solution let the agents to snatch paragraphs and pictures on the image, auto OCR the content, then verify and eventually export the content into the Newspaper Archive, a data warehouse, and index them into Autonomy IDOL.
This is an interesting use of technology. The write-up lists several challenges that have deterred this attempt in the past and describes how Smart Team has overcome them. The ability to convert optical scans into semantically searchable data is quite an accomplishment. According to the firm’s Web site:
Smart Team is a software service firm focused on providing Enterprise Software solutions, specialized on customization and integration around Autonomy, Zeus products. We continue to add new modules to our SMART Library and provide outstanding support services to support your organization…
The choice of Autonomy as an engine or platform is a good one. Hopefully the company will find a market. Libraries once were hungry for certain types of material in image form. With constrained budgets, sales may require time and effort.
Cynthia Murrell May 6, 2011
Freebie
Comments
One Response to “Old Newspapers, Now in Jpeg Format”
The article says the system “can generate a Text PDF automatically with 85% of the text recognized for images less than 150 dpi”. In my experience 85% is far too low to be useful – there are OCR solutions being used by some of our customers that perform much better than this, and we implement automatic OCR corrections in the later search stages to help with any errors.
It’s not much use having a system that (however inaccurately) claims to do Meaning Based Search when you’re missing 15% of the original content.