PDF Association: Flaws in Mueller Report Show What Not to Do With a PDF
The most famous PDF in the world—the long-awaited Mueller report—had critical technical flaws, according to the PDF Association, which used the opportunity to highlight best practices for the PDF format.
A lot of associations probably had interesting takes on the release of the Mueller report [PDF] last week. One that you may not have expected to hear from was the PDF Association, the organization that sets standards and best practices for the document format created by Adobe more than a quarter-century ago.
The group offered no opinion on the contents of the report, which details Special Counsel Robert Mueller’s findings from his nearly two-year investigation into Russian interference in the 2016 presidential election. Instead, it took a close look at the form in which the Department of Justice distributed the document to the public: as a scanned-in PDF file. In fact, the association said its analysis suggests the document may have been scanned in at least twice.
“The fact that DOJ chose to deliver an ‘images only’ PDF forces a much larger file size and loss of searchable text,” Executive Director Duff Johnson wrote in a blog post. “Effectively, this process ‘dumbed down’ the PDF to a set of images—the same type of content that comes out of a scanner.”
By scanning the 448-page printed report and saving it as an images-only PDF, DOJ ensured that the redactions peppered throughout were kept intact, but it also reduced accessibility and made the document impossible to search. Additionally, Johnson noted that the convoluted process used to create the PDF wasn’t necessary, as a properly designed “born-digital” PDF would allow for redactions without scanning.
The ability to access and select text was restored earlier this week, but while the document was improved, its limitations were hard to ignore. In a second post, Johnson said that, generally, the only way to search an image-only PDF is to use optical character recognition (OCR) software, but that the presence of redaction marks adjacent to text made using OCR “vastly more difficult,” as the software has a hard time distinguishing text from non-text content in images. (Even the updated version of the document runs into these issues.)
Johnson concluded that the only way to work around the problems was to reconstruct each page of the document, a task The New York Times undertook, painstakingly, with a team of 22 employees.
“Searches for names, dates, places, references, evidence … they all depend on text search,” Johnson wrote. “The DOJ’s choice in how they delivered this document has made accurate text search impossible for all downstream users of the document.”
According to Johnson, PDF was the perfect format for a sensitive document like the Mueller report, in that it’s difficult to modify and so confers a sense of trust and reliability that other formats lack. “Unfortunately, the image-based PDF the Department of Justice delivered is the least easy-to-use of any option they could have chosen,” he added.
The PDF format is widely used, but it’s also easy to misuse. It allows for lots of flaws, and it’s often used when another format would be a better choice. Associations, in particular, can be guilty of this.
If your association relies on PDFs for distributing documents or sharing information, consider the use case. You might be creating headaches for your members by making a document hard to read, download, and search through.
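One rough way to check a document before publishing it is to look for font resources in the file, since a PDF that contains images but no fonts usually has no text to search. The Python sketch below is a heuristic and not a real PDF parser. The function names are hypothetical, it only unpacks Flate-compressed streams, and it can misjudge encrypted or unusually encoded files.

```python
import re
import zlib

# Raw bytes between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# "/Font" as a whole name (does not match "/FontDescriptor").
FONT_RE = re.compile(rb"/Font\b")
# Image XObjects are declared with "/Subtype /Image".
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def searchable_bytes(pdf_bytes: bytes) -> bytes:
    """Return the raw file plus any streams that inflate as Flate data.

    Newer PDFs often keep their dictionaries inside compressed object
    streams, so scanning only the raw bytes could miss font entries.
    """
    chunks = [pdf_bytes]
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (for example a JPEG image stream); skip it.
            continue
    return b"\n".join(chunks)


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Heuristic: the file declares images but no font resources."""
    data = searchable_bytes(pdf_bytes)
    return bool(IMAGE_RE.search(data)) and not FONT_RE.search(data)
```

To try it on a real file, pass the file's bytes, for example `looks_image_only(Path("report.pdf").read_bytes())` with `pathlib.Path`, where `report.pdf` is a placeholder name. A result of `True` suggests the document needs OCR, or should be re-exported from its source file, before it goes out to members.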
Simply put, your PDFs shouldn’t look like the one that contained the Mueller report.
(Associations Now illustration)