WebAIM - Web Accessibility In Mind

E-mail List Archives

Re: PDF and searchable text for scanned documents


From: Steve Green
Date: Sep 29, 2020 12:01PM

You can put anything you want in ActualText. I often put entire paragraphs in there for various reasons, usually because the words are concatenated or fragmented and the root cause can't be fixed.

You can get a trial version of QuickFix from https://www.axes4.com/axespdf-quickfix-overview.html. Just open your file, go to the Unicode Mapping tool and it will show you how many mappings would need to be fixed. The trial version is fully functional, but it applies a watermark and changes some of the font colours when you save the document, so you can't send it to a client. Nevertheless, it's ideal for doing a proof of concept. We rapidly found we could justify the cost of a full license.


-----Original Message-----
From: WebAIM-Forum < <EMAIL REMOVED> > On Behalf Of Jackson, Derek J
Sent: 29 September 2020 18:41
To: WebAIM Discussion List < <EMAIL REMOVED> >
Subject: Re: [WebAIM] PDF and searchable text for scanned documents

I thought the same, Jonathan. I understood ActualText to be meant for very small amounts of text, not entire paragraphs. But I could have arrived at that assumption from my own experience rather than from what the ActualText property actually requires.

Steve, maybe what I have is not a technical error but something that a manual check should reveal as an accessibility error? I cannot share the document because it is not mine to distribute, but thank you for the offer.

Thank you again,

On 9/29/20, 1:21 PM, "WebAIM-Forum on behalf of Jonathan Avila" < <EMAIL REMOVED> on behalf of <EMAIL REMOVED> > wrote:

Seems like this is an incorrect use of the actualText property.


-----Original Message-----
From: WebAIM-Forum < <EMAIL REMOVED> > On Behalf Of Jackson, Derek J
Sent: Tuesday, September 29, 2020 10:24 AM
Subject: [WebAIM] PDF and searchable text for scanned documents

I have a remediated scanned document that passes Adobe's Accessibility Check and PAC3. However, the underlying text does not correspond to the visible text. For example, the content container for a paragraph contains text like " =X's6- H -R, $E F I A*'a" that corresponds to an area of the PDF unrelated to the paragraph. However, all of the paragraph tags use the "Actual Text" field to provide the actual text of the paragraph. The consequence is that a screen reader will read the paragraph correctly, but the document is not searchable and copy and paste is not practical. So I am wondering: is this a case where a document meets the accessibility requirements but is still not functionally accessible, or is there something in PDF/UA that addresses this issue? I have looked through the PDF/UA spec and am not seeing anything, but I readily admit that some of the technical jargon and details are beyond me.
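
The mismatch described above is the kind of thing a manual check could flag with a simple string comparison. The following is only a rough sketch, not part of PDF/UA, Acrobat, or PAC3: it assumes you have already obtained, by whatever means, the text from a tag's content container and the value of its "Actual Text" field as two plain strings. The function names and the 0.6 threshold are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def similarity(content_text: str, actual_text: str) -> float:
    """Return a ratio between 0 and 1 describing how alike the two strings are."""
    return SequenceMatcher(
        None, normalise(content_text), normalise(actual_text)
    ).ratio()


def looks_mismatched(
    content_text: str, actual_text: str, threshold: float = 0.6
) -> bool:
    """Flag a tag whose underlying text differs substantially from its Actual Text.

    The threshold is an assumed value, not taken from any standard or checker.
    """
    return similarity(content_text, actual_text) < threshold


if __name__ == "__main__":
    # Underlying text quoted in the message above, next to a made-up Actual Text value.
    underlying = " =X's6- H -R, $E F I A*'a"
    actual = "I have a remediated scanned document and it passes the checker."
    print(looks_mismatched(underlying, actual))
```

A tag flagged this way would be one where search and copy and paste are likely to return something other than what a screen reader announces.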

Thanks for the continued help!


Derek Jackson

Digital Accessibility Developer | Digital Accessibility Services
Harvard University Information Technology
1430 Massachusetts Ave, 4th Floor
Cambridge, MA 02138