WebAIM - Web Accessibility In Mind

E-mail List Archives

Re: Fixing OCR issues in PDF with Adobe Acrobat Pro

From: Jonathan Avila
Date: May 15, 2021 4:34PM


Thanks for the tips, Phil and Karen. The preflight feature to make layers works well for getting at the incorrect text, and you can also select the text in the document and press Ctrl+C to copy it. But I was still not successful with any of your tips for editing only the hidden text. Re-OCRing the page and turning on "Review recognized text" made no difference: it grays out the text, and clicking around doesn't allow me to edit any of the other text. Using the edit text feature won't work, as it changes the visual appearance of the text. When I re-OCR'd I even selected searchable exact text, hoping that it might work better, but no luck.

I agree that actualText is not meant for this, and it also has the effect of hiding the visual rectangle from some assistive technologies. It's also not well supported on mobile, so it's not an option for several reasons!

Best Regards,

Jonathan

-----Original Message-----
From: WebAIM-Forum < <EMAIL REMOVED> > On Behalf Of Philip Kiff
Sent: Saturday, May 15, 2021 10:22 AM
To: <EMAIL REMOVED>
Subject: Re: [WebAIM] Fixing OCR issues in PDF with Adobe Acrobat Pro


Oh, and here's a tip on using this method (if you can get it to work).

As Karen noted, "you have to know what is wrong in the text before you can fix it." In such cases, I have used the "screenreader preview" view of the PAC tool (or in my case, axesPDF Quick Fix's integrated version of PAC) to get a copy of the "hidden" text that Acrobat is showing to screenreaders. You can cut and paste the output from that PAC window into your favourite text editor, and then I've used that output to run spell-check and/or review the outputted text so I can find the OCR errors that I want to fix.
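[Editor's sketch, not part of Phil's message] The review step described above could be partly automated once the text has been pasted into a file. The following is a minimal Python sketch, assuming you supply your own word list; the function name `find_ocr_suspects`, the sample text, and the tiny vocabulary are all hypothetical and only illustrate the idea of flagging tokens for manual review.

```python
import re

def find_ocr_suspects(text, known_words):
    """Return (line_number, token) pairs that may be OCR errors.

    A token is flagged if it mixes letters and digits (e.g. "c0urt"),
    or if it is purely alphabetic and not in the supplied word list.
    This is a rough heuristic, not a real spell-checker.
    """
    suspects = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in re.findall(r"[A-Za-z0-9']+", line):
            has_letter = re.search(r"[A-Za-z]", token) is not None
            has_digit = re.search(r"\d", token) is not None
            unknown = token.isalpha() and token.lower() not in known_words
            if (has_letter and has_digit) or unknown:
                suspects.append((line_no, token))
    return suspects

# Hypothetical sample: text copied from a screen reader preview window.
sample_text = "The c0urt finds\nthe defendarnt liable"
vocabulary = {"the", "court", "finds", "defendant", "liable"}

for line_no, token in find_ocr_suspects(sample_text, vocabulary):
    print(f"line {line_no}: {token}")
```

The flagged tokens still have to be checked against the scanned image by hand; the sketch only narrows down where to look.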

And I am sure Karen is right that ABBYY FineReader or OmniPage Pro are better tools for all these things.

Phil.

On 2021-05-15 10:04, Philip Kiff wrote:
> Just a quick follow-up on the Adobe Acrobat Pro DC interface for OCR.
> I found a file that I edited last year, and Acrobat Pro does seem to
> allow editing the way I remember?
>
> When I open this scanned PDF and have the original image displayed
> (i.e. not replacement font but an exact copy of the original image), I
> can open up the Scan & OCR Tool, and then select "Recognize Text" in
> the toolbar, and there is a checkbox "Review recognized text" that
> appears on the left in the sub-toolbar menu that opens below it. When
> I select that, initially only suspects appear editable even though
> I've selected the checkbox - the suspects are surrounded in red boxes.
> But on that screen I can then double click randomly on a piece of text
> and it will allow me to change the interpreted text for that snippet
> by editing the "image ... recognized as ..." entry for that newly
> selected box?
>
> It is for sure a terrible interface. And it does not actually seem
> like you can edit text. I had to flip back and forth between several
> pages before it started to work. But you can edit the text that way -
> or at least I can in this PDF?
>
> My interface looks similar to what I see under step 2 under "How to
> correct OCR errors" on this page from OneLegal (about whom I know
> nothing, but whose page I just found now because they happen to have
> instructions that seem to match what I'm seeing):
> https://www.onelegal.com/blog/how-to-correct-ocr-errors-using-adobe-acrobat/
>
>
> Phil.
>
> On 2021-05-15 08:52, Philip Kiff wrote:
>> I haven't worked on a challenging OCR'd PDF in a year or two, but I
>> could have sworn there was a way to get to a mode that would allow
>> you to edit *any* of the OCR'd text, not just the suspect text
>> without switching to a replacement font. The interface was terrible
>> and the way to switch from editing suspect text to editing any text
>> was not at all obvious. Mmmm....I can't find a sample of a case where
>> I did this, so maybe I'm mis-remembering, and I actually used the
>> "actual text" property - which you already indicated wouldn't meet
>> your needs.
>>
>> I've never tried the other methods you propose. And yes, it does seem
>> that Acrobat has an entirely other set of hidden object layers it uses
>> to manage OCR'd text. And I don't think axesPDF QuickFix provides any
>> access to it, either.
>>
>> Phil.
>>
>> On 2021-05-14 19:26, Jonathan Avila wrote:
>>> Hi all, I still have not found a great way within Acrobat to address
>>> optical character recognition (OCR) errors. The situation is that
>>> the text was incorrectly recognized but Acrobat does not perceive
>>> the issues as suspect and thus the tools typically in Acrobat to fix
>>> OCR suspects are not available. I'm not sure if there is a way to
>>> flag the content as suspect somehow - but it seems silly to not
>>> allow you to edit any of the OCR text unless it's a suspect.
>>>
>>> OCR'd content appears to have hidden objects that represent the text
>>> for the tags structure but this text is not editable itself. While
>>> Acrobat does have an edit text option in the last couple versions
>>> that does a good job in allowing you to edit the visual content in a
>>> typeface that looks like OCR'd text - I am dealing with a document
>>> that can't be edited in that way for legal reasons. I need to edit
>>> the hidden text.
>>>
>>> In addition, hacks like use of actual text don't work with mobile
>>> devices so using that approach is not an option. The only way I
>>> have found is to artifact the object and create a new text box - put
>>> the text in that and hide it behind the image. That does work across
>>> desktop and mobile assistive technology.
>>>
>>> I also played with the preflight option to make OCR text into
>>> layers. It does a good job converting the OCR text into a different
>>> layer that can be edited. The challenge is then merging or
>>> flattening the layers back into one. When I try that I either lose
>>> the content in all the tags or I end up with duplicated text on
>>> screen even though I have chosen to not display the layer and mark
>>> the layer as a reference layer. Has anyone had luck with this method?
>>>
>>> Does anyone have any thoughts on how best to edit OCR text in
>>> Acrobat when you cannot edit the visual text and OCR suspects are
>>> not detected? I've tried axesPDF QuickFix but it doesn't seem to
>>> have any options for this either. I believe some programs like
>>> ABBYY FineReader could be used but my license for that is very old.
>>>
>>> Best Regards,
>>>
>>> Jonathan
>>> archives at http://webaim.org/discussion/archives