Adobe PDF: Maybe as Interesting As Flash?
March 4, 2020
Adobe Portable Document Format files flashed on DarkCyber’s radar in the mid-1980s. Adobe pitched the virtues of PDF to big publishing companies, and Stephen E Arnold worked at such an organization at the time. I was given the job of examining the early version of PDF, referred to by the code name Trapeze.
Trapeze artists fall to their deaths. Adobe Acrobat pulled off a spectacular trick, survived, became sort of open, and now seems to be a permanent part of a landscape decorated with dumpsters of burning Microsoft XPS Document Writer files.
A very good write up about the problems with PDF files is FilingDB’s “What’s So Hard about PDF Text Extraction?” The information in this write up makes explicit why PDFs are not easy to manipulate, analyze, and mine.
The write up provides the data needed to understand that when vendors say, “We process the hidden content in PDF files,” those vendors do not explain how much is omitted, ignored, and unindexed, or what that content is.
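One commonly cited reason extraction goes sideways: a PDF page can place glyphs at coordinates without storing any space characters, so the extractor has to guess where words begin and end. Here is a minimal sketch of that guess in Python. The glyph records, the `rebuild_line` function, and the `gap_ratio` threshold are all hypothetical, made up for illustration; this is not how FilingDB or any particular vendor does it.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x position, advance width),
# in arbitrary page units. Real PDF libraries expose richer data.
Glyph = Tuple[str, float, float]


def rebuild_line(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Join glyphs left to right, inserting a space where the gap looks wide.

    gap_ratio is an assumed heuristic: a gap larger than this fraction
    of the current glyph's width is treated as a word break.
    """
    # Glyphs may arrive in drawing order, not reading order, so sort by x.
    ordered = sorted(glyphs, key=lambda g: g[1])
    pieces = []
    prev_end = None
    for ch, x, width in ordered:
        if prev_end is not None and (x - prev_end) > gap_ratio * width:
            pieces.append(" ")
        pieces.append(ch)
        prev_end = x + width
    return "".join(pieces)


# Seven glyphs, deliberately out of order, with a 4-unit gap after "F".
sample: List[Glyph] = [
    ("t", 22.0, 6.0), ("e", 28.0, 6.0), ("x", 34.0, 6.0), ("t", 40.0, 6.0),
    ("P", 0.0, 6.0), ("D", 6.0, 6.0), ("F", 12.0, 6.0),
]

loose = rebuild_line(sample)                  # default threshold
strict = rebuild_line(sample, gap_ratio=1.0)  # stricter threshold
```

With the default threshold the sample comes back as two words; with the stricter one the same glyphs fuse into a single token. Same file, different index. That is the sort of detail a vendor’s “we process PDFs” claim leaves out.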
People believe that specifying a filetype: command to Bing or Google delivers comprehensive content from PDF files. No way, sad to say. The same problem exists for any search or content processing vendor’s connectors for PDF files.
This is important when one is conducting mission-critical data analysis, certain investigations, and other types of work in which “zero error” is the goal. Will the problem be remediated? Maybe, but I spotted it in the 1980s, and it persists today.
Stephen E Arnold, March 4, 2020