Set Data Free from PDF Tables

April 13, 2015

The PDF file is a wonderful thing. It takes up less space than alternatives, and everyone with a computer should be able to open one. However, it is not so easy to pull data from a table within a PDF document. Now, Computerworld informs us about a “Free Tool to Extract Data from PDFs: Tabula.” Created by journalists with assistance from organizations like Knight-Mozilla OpenNews, the New York Times and La Nación DATA, Tabula plucks data from tables within these files. Reporter Sharon Machlis writes:

“To use, download the software from the project website. It runs locally in your browser and requires a Java Runtime Environment compatible with Java 6 or 7. Import a PDF and then select the area of a table you want to turn into usable data. You’ll have the option of downloading as a comma- or tab-separated file as well as copying it to your clipboard.

“You’ll also be able to look at the data it captures before you save it, which I’d highly recommend. It can be easy to miss a column and especially a row when making a selection.”

See the write-up for a video of Tabula at work on a Windows system. A couple of caveats: the tool will not work with scanned images, and the creators caution that, as of yet, Tabula works best with simple table formats. Any developers who wish to get in on the project should navigate to its GitHub page.
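Machlis's advice to review the captured data before saving can also be applied after export. The following is a minimal sketch in Python, not part of Tabula itself: it reads a comma- or tab-separated export and flags records whose field count differs from the most common one, which may point to a missed column. The function name `check_export` and the sample data are hypothetical, chosen only for illustration.

```python
import csv
import io
from collections import Counter


def check_export(text, delimiter=","):
    """Return (expected_width, problems) for delimited text exported from a table.

    problems is a list of (record_number, field_count) pairs for records
    whose field count differs from the most common field count.
    """
    records = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    # Number the records from 1 and drop blank ones so they are not
    # reported as short rows.
    numbered = [
        (number, record)
        for number, record in enumerate(records, start=1)
        if any(cell.strip() for cell in record)
    ]
    if not numbered:
        return 0, []
    # Assumption: the most common field count is the true table width.
    widths = Counter(len(record) for _, record in numbered)
    expected = widths.most_common(1)[0][0]
    problems = [
        (number, len(record))
        for number, record in numbered
        if len(record) != expected
    ]
    return expected, problems


# Hypothetical sample: the third record is missing its last column.
sample = "name,year,count\nA,2014,3\nB,2015\nC,2015,7\n"
expected_width, problems = check_export(sample)
print(expected_width, problems)
```

For a tab-separated export, pass `delimiter="\t"`. A check like this only catches ragged rows; a row that was left out of the selection entirely still has to be spotted by comparing against the original PDF.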

Cynthia Murrell, April 13, 2015

Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
