WebAIM - Web Accessibility In Mind

E-mail List Archives

Re: Fixing OCR issues in PDF with Adobe Acrobat Pro

From: Karen McCall
Date: May 15, 2021 7:07AM


You might be able to use the Edit PDF tools IF you haven't tagged the document yet. If you have, using this tool will destroy all tags either on that page or in the document...and you have to know what is wrong in the text before you can fix it. The Edit PDF capability may show you the correct spelling and spacing but the underlying OCR is wrong. I never recommend using this tool but offer it as an option if you can get it to do what you want it to do.

The problem using Actual Text for large pieces of content is that the Text-to-Speech tools have to use a different reading mode for images of text and sometimes lose the ability to follow along with highlighting. Same with something like ZoomText Fusion...you lose the ability of JAWS to highlight where you are reading. I recommend against using the Actual Text attribute for large pieces of text and entire documents.

I use ABBYY FineReader for any PDF document that I need to OCR. The latest version can even add form controls to the PDF (you have to do some remediation in the Tags Tree in Acrobat, but the fixes are minor). Others use OmniPage Pro, and either can be purchased on sale for a reasonable price, not a subscription.

FineReader has two ways of dealing with scanned documents:

As soon as you open a scanned document the OCR is done and you can resave the document as a searchable PDF without looking at any suspects or issues of spacing between words and characters. I use this when I want to just read a PDF that isn't tagged or is a scan because I can also send the document to Word.

The other tool in FineReader (and OmniPage Pro) is the ability to create an OCR project, open the PDF and access their text editor. I can use JAWS in the text editor so I can hear when words aren't correct, when there are no spaces between words, or when there are spaces between characters. There is a sort of Styles pane where you can add structure; text, images and tables are identified in the document; and I have the ability to find and replace optional hyphens. I had a really horrible scan of a book with handwritten notes, doodles and diagrams in the margins and around the text, and within a few hours had a PDF document that was readable with my screen reader.
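For text that has already been exported from an OCR project, the optional-hyphen cleanup can be sketched in a few lines of Python. This is only an illustration and assumes that "optional hyphen" corresponds to the Unicode soft hyphen (U+00AD); the function names are hypothetical and not part of FineReader or OmniPage.

```python
# Assumption: the OCR tool's "optional hyphen" is the Unicode SOFT HYPHEN.
SOFT_HYPHEN = "\u00ad"


def count_optional_hyphens(text):
    """Return how many soft hyphens the exported text contains."""
    return text.count(SOFT_HYPHEN)


def strip_optional_hyphens(text):
    """Remove soft hyphens so each word is stored as one unbroken string."""
    return text.replace(SOFT_HYPHEN, "")


sample = "reme\u00addiation of a scanned docu\u00adment"
print(count_optional_hyphens(sample))
print(strip_optional_hyphens(sample))
```

Visible hyphens (the ordinary "-" character) are left alone, since those are often legitimate.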

I never use the Acrobat OCR for the reason mentioned...I can't rely on it telling "the truth" about what it found and what it missed. I ended up spending time remediating the scanned PDF only to find that words were wrong, some paragraphs had no spaces between words and others had spaces between characters in words. I save time by using one of the stand-alone OCR tools.
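The two spacing problems described above (spaces between characters, and no spaces between words) can be roughly screened for once the OCR text is available as a plain string. The following is a minimal sketch with hypothetical names and arbitrary thresholds; it is a heuristic, not a substitute for listening to the text or proofreading it, and it does not catch misrecognized words.

```python
import re

# Four or more single letters separated by single spaces, e.g. "q u i c k".
# The threshold of four is an assumption; short runs like "I a m" are ignored.
SPACED_RUN = re.compile(r"(?<!\w)(?:[^\W\d_] ){3,}[^\W\d_](?!\w)")

# Alphabetic tokens longer than 25 letters are treated as probably fused words.
# The limit of 25 is also an assumption and depends on the language.
FUSED_WORDS = re.compile(r"[^\W\d_]{26,}")


def find_spacing_suspects(text):
    """Return (kind, snippet) pairs for likely OCR spacing errors."""
    suspects = []
    for match in SPACED_RUN.finditer(text):
        suspects.append(("spaced-out", match.group()))
    for match in FUSED_WORDS.finditer(text):
        suspects.append(("fused", match.group()))
    return suspects


sample = "The q u i c k brown fox jumpedoverthelazydogandkeptrunning away."
for kind, snippet in find_spacing_suspects(sample):
    print(kind, "->", snippet)
```

Anything the check flags still needs a human decision, because long technical terms and deliberately letter-spaced headings will also match.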

Cheers, Karen



-----Original Message-----
From: WebAIM-Forum < <EMAIL REMOVED> > On Behalf Of Philip Kiff
Sent: Saturday, May 15, 2021 8:53 AM
To: <EMAIL REMOVED>
Subject: Re: [WebAIM] Fixing OCR issues in PDF with Adobe Acrobat Pro

I haven't worked on a challenging OCR'd PDF in a year or two, but I could have sworn there was a way to get to a mode that would allow you to edit *any* of the OCR'd text, not just the suspect text, without switching to a replacement font. The interface was terrible and the way to switch from editing suspect text to editing any text was not at all obvious.
Mmmm....I can't find a sample of a case where I did this, so maybe I'm mis-remembering, and I actually used the "actual text" property - which you already indicated wouldn't meet your needs.

I've never tried the other methods you propose. And yes, it does seem that Acrobat has an entirely separate set of hidden object layers it uses to manage OCR'd text. And I don't think axesPDF QuickFix provides any access to it, either.

Phil.

On 2021-05-14 19:26, Jonathan Avila wrote:
> Hi all, I still have not found a great way within Acrobat to address optical character recognition (OCR) errors. The situation is that the text was incorrectly recognized but Acrobat does not perceive the issues as suspect and thus the tools typically in Acrobat to fix OCR suspects are not available. I'm not sure if there is a way to flag the content as suspect somehow - but it seems silly to not allow you to edit any of the OCR text unless it's a suspect.
>
> OCR'd content appears to have hidden objects that represent the text for the tags structure, but this text is not itself editable. While Acrobat does have an edit text option in the last couple of versions that does a good job of letting you edit the visual content in a typeface that looks like the OCR'd text - I am dealing with a document that can't be edited in that way for legal reasons. I need to edit the hidden text.
>
> In addition, hacks like the use of actual text don't work with mobile devices, so that approach is not an option. The only way I have found is to artifact the object, create a new text box, put the text in that, and hide it behind the image. That does work across desktop and mobile assistive technology.
>
> I also played with the preflight option to make OCR text into layers. It does a good job converting the OCR text into a different layer that can be edited. The challenge is then merging or flattening the layers back into one. When I try that I either lose the content in all the tags or I end up with duplicated text on screen even though I have chosen to not display the layer and mark the layer as a reference layer. Has anyone had luck with this method?
>
> Does anyone have any thoughts on how best to edit OCR text in Acrobat when you cannot edit the visual text and OCR suspects are not detected? I've tried axesPDF QuickFix but it doesn't seem to have any options for this either. I believe some programs like ABBYY FineReader could be used, but my license for that is very old.
>
> Best Regards,
>
> Jonathan