E-mail List Archives
Thread: PDF and searchable text for scanned documents
Number of posts in this thread: 11 (In chronological order)
From: Jackson, Derek J
Date: Tue, Sep 29 2020 8:23AM
Subject: PDF and searchable text for scanned documents
Hello,
I have a remediated scanned document that passes both Adobe's Accessibility Check and PAC3. However, the underlying text does not correspond to the visible text. For example, the content container for a paragraph contains text like " =X's6- H -R, $E F I A*'a" that corresponds to an area of the PDF unrelated to the paragraph. All of the paragraph tags, however, use the "Actual Text" field to provide the actual text of the paragraph. The consequence is that a screen reader will read the paragraph correctly, but the document is not searchable and copy and paste is not practical. So I am wondering: is this an instance where a document meets the accessibility requirements but is still not functionally accessible, or is there something in PDF/UA that addresses this issue? I have looked through the PDF/UA spec and am not seeing anything, but I readily admit that some of the technical jargon and details are beyond me.
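The split described above can be illustrated with a small Python sketch. It is only a toy model, not a PDF parser: the `TaggedParagraph` class, both functions, and the sample sentence are made up for illustration, and the behavior of the screen reader and of text search is assumed from the description in this message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaggedParagraph:
    """Hypothetical stand-in for a paragraph tag in a tagged PDF."""
    content_text: str                  # text decoded from the page content (the OCR layer)
    actual_text: Optional[str] = None  # replacement text set on the tag, if any


def read_aloud(p: TaggedParagraph) -> str:
    """What a tag-aware screen reader would announce (assumed behavior)."""
    return p.actual_text if p.actual_text is not None else p.content_text


def find_in_page(paragraphs, query: str):
    """Indexes of paragraphs whose page content contains the query (assumed behavior)."""
    q = query.lower()
    return [i for i, p in enumerate(paragraphs) if q in p.content_text.lower()]


# Made-up sample: garbage in the content, a clean sentence on the tag.
doc = [
    TaggedParagraph(
        content_text="=X's6- H -R, $E F I A*'a",
        actual_text="The committee met on Tuesday to review the budget.",
    )
]

print(read_aloud(doc[0]))           # the sentence stored in actual_text
print(find_in_page(doc, "budget"))  # empty list: search never sees the tag text
```

Under these assumptions the spoken output is correct while search and copy operate on the garbage, which matches the symptoms reported here.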
Thanks for the continued help!
Derek
Derek Jackson
Digital Accessibility Developer | Digital Accessibility Services
Harvard University Information Technology
1430 Massachusetts Ave, 4th Floor
Cambridge, MA 02138
he/him/his
From: Steve Green
Date: Tue, Sep 29 2020 9:31AM
Subject: Re: PDF and searchable text for scanned documents
I have encountered this several times, but I do not know what causes it. We use the axesPDF QuickFix tool to view and modify the mapping between the glyphs and the underlying Unicode characters, but we usually only need to fix one or two incorrect mappings. I guess you could go through all the mappings for all the fonts and replace the Unicode characters with the ones you want, but that sounds like a lot of work. There may be other ways to do it more efficiently.
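For readers unfamiliar with glyph-to-Unicode mappings, here is a minimal Python sketch of the idea. It does not read a real PDF and is not how QuickFix works internally; the glyph IDs, the `to_unicode` table, and both functions are hypothetical. In a real PDF the corresponding table is stored with the font.

```python
# Hypothetical glyph IDs for the word "Text" as drawn on the page.
glyph_ids = [1, 2, 3, 4]

# A broken glyph-to-Unicode table: the shapes render fine, but the
# characters reported for search and copy are unrelated to them.
to_unicode = {1: "=", 2: "X", 3: "'", 4: "s"}


def extract(glyphs, mapping):
    """Decode glyph IDs to text; unmapped glyphs become U+FFFD."""
    return "".join(mapping.get(g, "\ufffd") for g in glyphs)


def fix_mappings(mapping, corrections):
    """Return a copy of the table with the given entries replaced."""
    fixed = dict(mapping)
    fixed.update(corrections)
    return fixed


before = extract(glyph_ids, to_unicode)
after = extract(glyph_ids, fix_mappings(to_unicode, {1: "T", 2: "e", 3: "x", 4: "t"}))
print(before)  # =X's
print(after)   # Text
```

Because the fix is made once per glyph, every occurrence of that glyph in the same font is corrected together, which is why repairing a handful of mappings can be quick while a large character set takes longer.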
Remember that Acrobat's Accessibility Check is only doing a very small number of very simple tests. Passing the test tells you almost nothing about the document's accessibility, other than it is probably not as terrible as it might have been.
What application was the document authored in?
Steve Green
Managing Director
Test Partners Ltd
From: Jackson, Derek J
Date: Tue, Sep 29 2020 10:20AM
Subject: Re: PDF and searchable text for scanned documents
Hi Steve,
Thanks! I am glad I am not alone. The document was originally just a scan from a Canon scanner, saved as a PDF. We asked someone to remediate the PDF and we got this result back, but I don't know what tool they used. It seems like any decent OCR tool would produce a better result. I am not relying on Acrobat's Accessibility Checker either, but I was a little surprised that it did not catch this, and the same goes for PAC3. That is what led me to wonder whether this is actually out of conformance with PDF/UA, or whether it is an example of a document that conforms to the guidelines but demonstrates an instance where the guidelines fall short. Could someone say this type of document adheres to PDF/UA guidelines?
Best!
Derek
From: Steve Green
Date: Tue, Sep 29 2020 10:34AM
Subject: Re: PDF and searchable text for scanned documents
From a technical perspective, there is nothing wrong with the document. Unicode mapping is perfectly acceptable, and no tool can know which mappings are intentional and which are not. In a sense it's no different from a change of language in a web page. As long as the change of language is indicated programmatically, a tool has no idea if the change is intentional or an error. As such, there is no fault to detect. I am not even sure I agree that there is an omission from the accessibility guidelines.
I would be very interested to know who remediated the document, because they should have identified such a fundamental issue as this.
If you want to send it to me I can take a look with QuickFix and see how easy it would be to fix the Unicode mappings. If there is only one font, it might be 5 minutes' work. On the other hand, if there are lots of fonts or a very large character set it could take much longer.
Steve
From: Jonathan Avila
Date: Tue, Sep 29 2020 11:21AM
Subject: Re: PDF and searchable text for scanned documents
Seems like this is an incorrect use of the actualText property.
Jonathan
From: Jackson, Derek J
Date: Tue, Sep 29 2020 11:41AM
Subject: Re: PDF and searchable text for scanned documents
I thought the same, Jonathan. I understood ActualText to be intended for very small amounts of text and not entire paragraphs. But I could have arrived at that assumption just from my own experience rather than from the actual requirements for the ActualText property.
Steve, maybe what I have is not a technical error but something that a manual check should reveal as an accessibility error? I cannot share the document because it is not mine to distribute but thank you for the offer.
Thank you again,
Derek
From: Steve Green
Date: Tue, Sep 29 2020 12:01PM
Subject: Re: PDF and searchable text for scanned documents
You can put anything you want in ActualText. I often put entire paragraphs in there for various reasons, usually because the words are concatenated or fragmented and the root cause can't be fixed.
You can get a trial version of QuickFix from https://www.axes4.com/axespdf-quickfix-overview.html. Just open your file, go to the Unicode Mapping tool and it will show you how many mappings would need to be fixed. The trial version is fully functional, but it applies a watermark and changes some of the font colours when you save the document, so you can't send it to a client. Nevertheless, it's ideal for doing a proof of concept. We rapidly found we could justify the cost of a full license.
Steve
From: Duff Johnson
Date: Tue, Sep 29 2020 12:14PM
Subject: Re: PDF and searchable text for scanned documents
> I have a remediated scanned document and it passes Adobe's Accessibility Check and PAC3. However the underlying text does not correspond to the visible text. For example the content container for a paragraph contains text like " =X's6- H -R, $E F I A*'a" that corresponds to an area on the PDF that is unrelated to the paragraph.
This fact implies that the creator did not correct the OCR results, which alone invalidates any PDF/UA or WCAG conformance claim.
> However all of the paragraph tags use the "Actual Text" field to provide the actual text of the paragraph.
This is an incorrect use of ActualText (as per ISO 32000).
> The consequence is that a screen reader will read the paragraph correctly
A screen-reader might, but another type of AT such as a zoom-reader would be clueless, so "cheating" with ActualText on a scanned paragraph (or scanned page) is unacceptable.
> but the document is not searchable, and copy and paste is not practical. So I am wondering if this is an instance where we have a document that meets the accessibility requirements but still it is not functionally accessible or is there something in PDF/UA that addresses this issue?
The document does not conform to PDF/UA, but not in a way that is easy for software to detect. Validating OCR results can be machine-assisted, but it is fundamentally a human task.
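As one possible form of machine assistance, the following Python sketch flags paragraphs whose page text and ActualText disagree, so that a person can review them. The function names, the sample strings, and the 0.8 threshold are arbitrary choices for illustration, and the texts are assumed to have been extracted already by some other tool. A low score only means "look at this"; it does not decide conformance.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how alike two strings are, ignoring case and extra spaces."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def needs_review(content_text: str, actual_text: str, threshold: float = 0.8) -> bool:
    """Flag a paragraph when its page text and ActualText disagree."""
    return similarity(content_text, actual_text) < threshold


# Made-up ActualText paired with the garbage quoted in this thread.
print(needs_review("=X's6- H -R, $E F I A*'a", "The committee met on Tuesday."))  # True
# Page text that matches its ActualText apart from case and spacing.
print(needs_review("The  committee met on Tuesday.", "the committee met on tuesday."))  # False
```

A check like this will miss small OCR errors that leave most of a paragraph intact, which is consistent with the point that a human still has to validate the result.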
> I have looked through the PDF/UA spec and am not seeing anything but I readily admit that some of the technical jargon and details are beyond me.
It's 7.1, paragraph 10:
Documents consisting of raster-based images may be processed to generate machine-readable content. In such cases, errors resulting from the content-generation process shall be corrected and the content shall be tagged according to Clause 7.
Duff.
From: Jackson, Derek J
Date: Tue, Sep 29 2020 12:52PM
Subject: Re: PDF and searchable text for scanned documents
Thank you Duff. That is exactly what I was looking for, and also thanks for pointing me to the correct spot in the PDF/UA spec. Super helpful.
And thank you again Steve. I will give the tool you recommend a try. If it can help correct PDFs like this it would be wonderful.
-Derek
From: Steve Green
Date: Tue, Sep 29 2020 1:10PM
Subject: Re: PDF and searchable text for scanned documents
Duff, which WCAG success criteria would you say the document fails? You could argue that it fails 1.1.1 because the glyphs are non-text content and they do not have text equivalents that serve the same purpose. Or were you thinking of something else?
Steve
From: Duff Johnson
Date: Tue, Sep 29 2020 1:42PM
Subject: Re: PDF and searchable text for scanned documents
Hi Steve,
Yes, 1.1.1. If the text isn't readable, it isn't text, so 1.1.1 mandates OCR validation.
Duff.