WebAIM - Web Accessibility In Mind

E-mail List Archives

Re: PDF and searchable text for scanned documents

From: Duff Johnson
Date: Sep 29, 2020 1:42PM


Hi Steve,

Yes, 1.1.1. If the text isn't readable, it isn't text, so 1.1.1 mandates OCR validation.

Duff.

> On Sep 29, 2020, at 15:10, Steve Green < <EMAIL REMOVED> > wrote:
>
> Duff, which WCAG success criteria would you say the document fails? You could argue that it fails 1.1.1 because the glyphs are non-text content and they do not have text equivalents that serve the same purpose. Or were you thinking of something else?
>
> Steve
>
>
> -----Original Message-----
> From: WebAIM-Forum < <EMAIL REMOVED> > On Behalf Of Jackson, Derek J
> Sent: 29 September 2020 19:53
> To: WebAIM Discussion List < <EMAIL REMOVED> >
> Subject: Re: [WebAIM] PDF and searchable text for scanned documents
>
> Thank you Duff. That is exactly what I was looking for, and also thanks for pointing me to the correct spot in the PDF/UA spec. Super helpful.
>
> And thank you again Steve. I will give the tool you recommend a try. If it can help correct PDFs like this it would be wonderful.
>
> -Derek
>
> On 9/29/20, 2:14 PM, "WebAIM-Forum on behalf of Duff Johnson" < <EMAIL REMOVED> on behalf of <EMAIL REMOVED> > wrote:
>
>> I have a remediated scanned document and it passes Adobe's Accessibility Check and PAC3. However the underlying text does not correspond to the visible text. For example the content container for a paragraph contains text like " =X's6- H -R, $E F I A*'a" that corresponds to an area on the PDF that is unrelated to the paragraph.
>
> This fact implies that the creator did not correct the OCR results, which alone invalidates any PDF/UA or WCAG conformance claim.
>
>> However all of the paragraph tags use the "Actual Text" field to provide the actual text of the paragraph.
>
> This is an incorrect use of ActualText (as per ISO 32000).
>
>> The consequence is that a screen reader will read the paragraph correctly
>
> A screen-reader might, but another type of AT such as a zoom-reader would be clueless, so "cheating" with ActualText on a scanned paragraph (or scanned page) is unacceptable.
>
>> but the document is not searchable, and copy and paste is not practical. So I am wondering whether this is an instance where a document meets the accessibility requirements but is still not functionally accessible, or whether there is something in PDF/UA that addresses this issue.
>
> The document does not conform to PDF/UA, but not in a way that is easy for software to detect. The validity of OCR results can be assisted by a machine but is fundamentally human-validated.
>
>> I have looked through the PDF/UA spec and am not seeing anything but I readily admit that some of the technical jargon and details are beyond me.
>
> It's 7.1, paragraph 10:
>
> Documents consisting of raster-based images may be processed to generate machine-readable content. In such cases, errors resulting from the content-generation process shall be corrected and the content shall be tagged according to Clause 7.
>
> Duff.
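
As a rough illustration of how a machine can assist (but not replace) human validation of OCR results, the Python sketch below flags text runs that contain few word-like tokens, such as the garbled sample quoted in this thread. The function names, the threshold, and the tokenization are assumptions made for illustration only; they do not come from PDF/UA, WCAG, or any checker tool, and every run still needs a person to compare it against the page image.

```python
import re

# Hypothetical cut-off: below this share of word-like tokens, flag for review.
WORDLIKE_THRESHOLD = 0.6

# Word-like: letters only, optionally joined by an internal apostrophe or hyphen.
_WORD = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")


def wordlike_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like words."""
    # Drop surrounding punctuation so "correctly." still counts as a word.
    tokens = [t.strip(".,;:!?\"()") for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    good = sum(
        1
        for t in tokens
        # Single letters count only when they are real English words.
        if _WORD.match(t) and (len(t) > 1 or t in ("a", "A", "I"))
    )
    return good / len(tokens)


def needs_human_review(text: str, threshold: float = WORDLIKE_THRESHOLD) -> bool:
    """Flag text whose word-like ratio is below the threshold (empty text is flagged)."""
    return wordlike_ratio(text) < threshold


if __name__ == "__main__":
    garbled = " =X's6- H -R, $E F I A*'a"
    clean = "The consequence is that a screen reader will read the paragraph correctly"
    print(needs_human_review(garbled), needs_human_review(clean))
```

A heuristic like this only narrows down where a reviewer should look first. It cannot confirm that plausible-looking text matches what is visible on the page, which is why the correction step remains a human task.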