WebAIM - Web Accessibility In Mind

E-mail List Archives

Re: PDF and searchable text for scanned documents

From: Jackson, Derek J
Date: Sep 29, 2020 12:52PM


Thank you Duff. That is exactly what I was looking for, and also thanks for pointing me to the correct spot in the PDF/UA spec. Super helpful.

And thank you again Steve. I will give the tool you recommend a try. If it can help correct PDFs like this it would be wonderful.

-Derek

On 9/29/20, 2:14 PM, "WebAIM-Forum on behalf of Duff Johnson" < <EMAIL REMOVED> on behalf of <EMAIL REMOVED> > wrote:

> I have a remediated scanned document, and it passes Adobe's Accessibility Check and PAC3. However, the underlying text does not correspond to the visible text. For example, the content container for a paragraph contains text like " =X's6- H -R, $E F I A*'a" that corresponds to an area on the PDF that is unrelated to the paragraph.

This fact implies that the creator did not correct the OCR results, which alone invalidates any PDF/UA or WCAG conformance claim.

> However all of the paragraph tags use the "Actual Text" field to provide the actual text of the paragraph.

This is an incorrect use of ActualText (as per ISO 32000).

> The consequence is that a screen reader will read the paragraph correctly

A screen-reader might, but another type of AT such as a zoom-reader would be clueless, so "cheating" with ActualText on a scanned paragraph (or scanned page) is unacceptable.

> but the document is not searchable, and copy and paste is not practical. So I am wondering whether this is an instance where a document meets the accessibility requirements but is still not functionally accessible, or whether there is something in PDF/UA that addresses this issue?

The document does not conform to PDF/UA, but not in a way that is easy for software to detect. Checking OCR results can be machine-assisted, but the final validation is fundamentally a human task.
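As a rough illustration of what "machine-assisted" could look like, here is a minimal Python sketch that compares a paragraph's underlying (OCR) text with its ActualText and flags large mismatches for a person to look at. It assumes both strings have already been pulled out of the PDF by some other tool (that step is not shown), and the function names and the 0.8 threshold are made up for this example. It only points a reviewer at suspicious paragraphs; it does not establish conformance.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(underlying: str, actual_text: str) -> float:
    """Return a ratio between 0 and 1 for how alike the two strings are."""
    return SequenceMatcher(None, normalize(underlying), normalize(actual_text)).ratio()


def needs_human_review(underlying: str, actual_text: str, threshold: float = 0.8) -> bool:
    """Flag the paragraph when its underlying text diverges from its ActualText.

    The threshold is an arbitrary assumption, not a value from any standard.
    """
    return similarity(underlying, actual_text) < threshold


if __name__ == "__main__":
    # Underlying text as quoted in the thread; the ActualText value is hypothetical.
    underlying = " =X's6- H -R, $E F I A*'a"
    actual_text = "The committee met on Tuesday to review the annual budget."
    if needs_human_review(underlying, actual_text):
        print("Mismatch: underlying text needs manual OCR correction")
```

A paragraph that gets flagged still has to be fixed by correcting the underlying text itself, not by adjusting ActualText.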

> I have looked through the PDF/UA spec and am not seeing anything but I readily admit that some of the technical jargon and details are beyond me.

It's 7.1, paragraph 10:

Documents consisting of raster-based images may be processed to generate machine-readable content. In such cases, errors resulting from the content-generation process shall be corrected and the content shall be tagged according to Clause 7.

Duff.