WebAIM - Web Accessibility In Mind

E-mail List Archives

Re: PDF and searchable text for scanned documents


From: Steve Green
Date: Sep 29, 2020 10:34AM

From a technical perspective, there is nothing wrong with the document. Unicode mapping is perfectly acceptable, and no tool can know which mappings are intentional and which are not. In a sense it's no different from a change of language in a web page. As long as the change of language is indicated programmatically, a tool has no idea if the change is intentional or an error. As such, there is no fault to detect. I am not even sure I agree that there is an omission from the accessibility guidelines.

I would be very interested to know who remediated the document, because they should have identified such a fundamental issue as this.

If you want to send it to me, I can take a look with QuickFix and see how easy it would be to fix the Unicode mappings. If there is only one font, it might be 5 minutes' work. On the other hand, if there are lots of fonts or a very large character set, it could take much longer.


-----Original Message-----
From: WebAIM-Forum < <EMAIL REMOVED> > On Behalf Of Jackson, Derek J
Sent: 29 September 2020 17:20
To: WebAIM Discussion List < <EMAIL REMOVED> >
Subject: Re: [WebAIM] PDF and searchable text for scanned documents

Hi Steve,

Thanks! I am glad I am not alone. The document was originally just a scan from a Canon scanner, saved as a PDF. We asked someone to remediate the PDF and we got this result back, but I don't know what tool they used. It seems like any decent OCR tool would produce a better result. I am not relying on Acrobat's Accessibility Checker either, but I was a little surprised that it did not catch this, and the same goes for PAC3. That is what led me to wonder whether this is actually out of conformance with PDF/UA, or whether it is an example of a document that conforms to the guidelines but demonstrates an instance where the guidelines might fall short. Could someone say this type of document adheres to PDF/UA?



On 9/29/20, 11:31 AM, "WebAIM-Forum on behalf of Steve Green" < <EMAIL REMOVED> on behalf of <EMAIL REMOVED> > wrote:

I have encountered this several times, but I do not know what causes it. We use the axesPDF QuickFix tool to view and modify the mapping between the glyphs and the underlying Unicode characters, but we usually only need to fix one or two incorrect mappings. I guess you could go through all the mappings for all the fonts and replace the Unicode characters with the ones you want, but that sounds like a lot of work. There may be other ways to do it more efficiently.
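
To illustrate what fixing a mapping means, here is a minimal Python sketch. It is not how QuickFix works internally. It models a font's glyph-to-Unicode table as a plain dict (in a real PDF this information lives in the font's ToUnicode CMap) and shows how the extracted text changes once the wrong entries are corrected. The glyph codes, the table contents, and the function names are all hypothetical.

```python
# Hypothetical glyph-code -> Unicode table, standing in for a font's ToUnicode CMap.
broken_map = {1: "=", 2: "X", 3: "'", 4: "s"}

# Corrections an editor would enter by hand after looking at each glyph.
corrections = {1: "T", 2: "e", 3: "x", 4: "t"}


def extract_text(glyph_codes, to_unicode):
    """Return the text that search or copy-and-paste would see."""
    # Unmapped glyphs become U+FFFD so that gaps stay visible.
    return "".join(to_unicode.get(code, "\ufffd") for code in glyph_codes)


def apply_corrections(to_unicode, fixes):
    """Return a new table with the corrected entries applied."""
    fixed = dict(to_unicode)
    fixed.update(fixes)
    return fixed


page_glyphs = [1, 2, 3, 4]  # the glyphs drawn on the page, in order
fixed_map = apply_corrections(broken_map, corrections)

print(extract_text(page_glyphs, broken_map))  # =X's
print(extract_text(page_glyphs, fixed_map))   # Text
```

The glyphs drawn on the page never change; only the table that translates them into characters does. That is why the effort scales with the number of fonts and the size of each character set.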

Remember that Acrobat's Accessibility Check performs only a very small number of very simple tests. Passing it tells you almost nothing about the document's accessibility, other than that it is probably not as terrible as it might have been.

What application was the document authored in?

Steve Green
Managing Director
Test Partners Ltd

-----Original Message-----
From: WebAIM-Forum < <EMAIL REMOVED> > On Behalf Of Jackson, Derek J
Sent: 29 September 2020 15:24
Subject: [WebAIM] PDF and searchable text for scanned documents


I have a remediated scanned document and it passes Adobe's Accessibility Check and PAC3. However, the underlying text does not correspond to the visible text. For example, the content container for a paragraph contains text like " =X's6- H -R, $E F I A*'a" that corresponds to an area on the PDF that is unrelated to the paragraph. However, all of the paragraph tags use the "Actual Text" field to provide the actual text of the paragraph. The consequence is that a screen reader will read the paragraph correctly, but the document is not searchable, and copy and paste is not practical. So I am wondering: is this an instance where a document meets the accessibility requirements but is still not functionally accessible, or is there something in PDF/UA that addresses this issue? I have looked through the PDF/UA spec and am not seeing anything, but I readily admit that some of the technical jargon and details are beyond me.
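
One rough way to spot this kind of mismatch is sketched below in Python. It is a heuristic only, not a PDF/UA conformance test. It assumes the extracted text and the ActualText value for each paragraph have already been pulled out of the PDF by some other means (not shown), and simply compares the two strings. The function names, the 0.5 threshold, and the sample strings other than the garbled one quoted above are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(extracted, actual_text):
    """Ratio in [0, 1] of how closely two strings match, ignoring case and whitespace runs."""
    a = " ".join(extracted.split()).lower()
    b = " ".join(actual_text.split()).lower()
    return SequenceMatcher(None, a, b).ratio()


def flag_mismatches(paragraphs, threshold=0.5):
    """Return indexes of paragraphs whose extracted text differs sharply from ActualText."""
    return [
        i
        for i, (extracted, actual) in enumerate(paragraphs)
        if similarity(extracted, actual) < threshold
    ]


# (extracted text, ActualText) pairs; the ActualText values are hypothetical.
paragraphs = [
    (" =X's6- H -R, $E F I A*'a", "The committee met on Tuesday."),
    ("Annual report", "Annual Report"),
]

print(flag_mismatches(paragraphs))  # [0]
```

A low score only says the two strings disagree. It cannot say which one is wrong or whether the difference is intentional, so flagged paragraphs still need a human look.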

Thanks for the continued help!


Derek Jackson

Digital Accessibility Developer | Digital Accessibility Services
Harvard University Information Technology
1430 Massachusetts Ave, 4th Floor
Cambridge, MA 02138