WebAIM - Web Accessibility In Mind

E-mail List Archives

Re: Fixing OCR issues in PDF with Adobe Acrobat Pro


From: Philip Kiff
Date: May 15, 2021 8:22AM


Oh, and here's a tip on using this method (if you can get it to work).

As Karen noted, "you have to know what is wrong in the text before you
can fix it." In such cases, I have used the "screenreader preview" view
of the PAC tool (or in my case, axesPDF QuickFix's integrated version
of PAC) to get a copy of the "hidden" text that Acrobat is showing to
screenreaders. You can cut and paste the output from that PAC window
into your favourite text editor, and then I've used that output to run
spell-check and/or review the outputted text so I can find the OCR
errors that I want to fix.
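
If you would rather script the review step than eyeball it, something like the following rough Python sketch could flag candidate OCR errors in the pasted text. It is only an illustration: the function name find_suspect_words, the sample sentence, and the tiny word list are all made up here, and in practice you would load a real dictionary file instead.

```python
import re
from collections import Counter


def find_suspect_words(text, known_words, min_length=3):
    """Count tokens in `text` that are not found in `known_words`.

    `known_words` is assumed to be a set of lowercase dictionary words.
    Tokens shorter than `min_length` and pure numbers are skipped.
    """
    # A token is a run of letters/digits, optionally joined by ' or -
    tokens = re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)
    suspects = Counter()
    for token in tokens:
        if len(token) < min_length or token.isdigit():
            continue
        if token.lower() not in known_words:
            suspects[token] += 1
    return suspects


if __name__ == "__main__":
    # Hypothetical text pasted from the screenreader preview window
    pasted = "The docurnent was scanned in 2O19 and the tex1 layer is hidden."
    # Hypothetical, deliberately tiny word list
    words = {"the", "document", "was", "scanned", "and", "text", "layer", "hidden"}
    for word, count in find_suspect_words(pasted, words).most_common():
        print(word, count)
```

Mixed letter/digit tokens are kept on purpose, since OCR often swaps characters like O and 0 or l and 1. Proper nouns and jargon will be flagged as well, so the result is a list to review by hand, not a list of confirmed errors.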

And Karen, I am sure, is right that ABBYY FineReader or OmniPage Pro are
better tools for all these things.

Phil.

On 2021-05-15 10:04, Philip Kiff wrote:
> Just a quick follow-up on the Adobe Acrobat Pro DC interface for OCR. 
> I found a file that I edited last year, and Acrobat Pro does seem to
> allow editing the way I remember?
>
> When I open this scanned PDF and have the original image displayed
> (i.e. not replacement font but an exact copy of the original image), I
> can open up the Scan & OCR Tool, and then select "Recognize Text" in
> the toolbar, and there is a checkbox "Review recognized text" that
> appears on the left in the sub-toolbar menu that opens below it. When
> I select that, initially only suspects appear editable even though
> I've selected the checkbox - the suspects are surrounded in red boxes.
> But on that screen I can then double click randomly on a piece of text
> and it will allow me to change the interpreted text for that snippet
> by editing the "image ... recognized as ..." entry for that newly
> selected box?
>
> It is for sure a terrible interface. And it does not actually seem
> like you can edit text. I had to flip back and forth between several
> pages before it started to work. But you can edit the text that way -
> or at least I can in this PDF?
>
> My interface looks similar to what I see under step 2 under "How to
> correct OCR errors" on this page from OneLegal (about whom I know
> nothing, but whose page I just found now because they happen to have
> instructions that seem to match what I'm seeing):
> https://www.onelegal.com/blog/how-to-correct-ocr-errors-using-adobe-acrobat/
>
>
> Phil.
>
> On 2021-05-15 08:52, Philip Kiff wrote:
>> I haven't worked on a challenging OCR'd PDF in a year or two, but I
>> could have sworn there was a way to get to a mode that would allow
>> you to edit *any* of the OCR'd text, not just the suspect text
>> without switching to a replacement font. The interface was terrible
>> and the way to switch from editing suspect text to editing any text
>> was not at all obvious. Mmmm....I can't find a sample of a case where
>> I did this, so maybe I'm mis-remembering, and I actually used the
>> "actual text" property - which you already indicated wouldn't meet
>> your needs.
>>
>> I've never tried the other methods you propose. And yes, it does seem
>> that Acrobat has an entirely separate set of hidden object layers it uses
>> to manage OCR'd text. And I don't think axesPDF QuickFix provides any
>> access to it, either.
>>
>> Phil.
>>
>> On 2021-05-14 19:26, Jonathan Avila wrote:
>>> Hi all, I still have not found a great way within Acrobat to address
>>> optical character recognition (OCR) errors.  The situation is that
>>> the text was incorrectly recognized but Acrobat does not perceive
>>> the issues as suspect and thus the tools typically in Acrobat to fix
>>> OCR suspects are not available.  I'm not sure if there is a way to
>>> flag the content as suspect somehow - but it seems silly to not
>>> allow you to edit any of the OCR text unless it's a suspect.
>>>
>>> OCR'd content appears to have hidden objects that represent the text
>>> for the tags structure but this text is not editable itself.  While
>>> Acrobat does have an edit text option in the last couple versions
>>> that does a good job in allowing you to edit the visual content in a
>>> type face that looks like OCR'd text - I am dealing with a document
>>> that can't be edited in that way for legal reasons.   I need to edit
>>> the hidden text.
>>>
>>> In addition, hacks like use of actual text don't work with mobile
>>> devices so using that approach is not an option.  The only way I
>>> have found is to artifact the object and create a new text box, put
>>> the text in that, and hide it behind the image. That does work across
>>> desktop and mobile assistive technology.
>>>
>>> I also played with the preflight option to make OCR text into
>>> layers.  It does a good job converting the OCR text into a different
>>> layer that can be edited.  The challenge is then merging or
>>> flattening the layers back into one.  When I try that I either lose
>>> the content in all the tags or I end up with duplicated text on
>>> screen even though I have chosen to not display the layer and mark
>>> the layer as a reference layer.  Has anyone had luck with this method?
>>>
>>> Does anyone have any thoughts on how best to edit OCR text in
>>> Acrobat when you cannot edit the visual text and OCR suspects are
>>> not detected?   I've tried axesPDF QuickFix but it doesn't seem to
>>> have any options for this either.  I believe some programs like
>>> ABBYY FineReader could be used but my license for that is very old.
>>>
>>> Best Regards,
>>>
>>> Jonathan