E-mail List Archives
Thread: Fixing OCR issues in PDF with Adobe Acrobat Pro
Number of posts in this thread: 6 (In chronological order)
From: Jonathan Avila
Date: Fri, May 14 2021 5:26PM
Subject: Fixing OCR issues in PDF with Adobe Acrobat Pro
Hi all, I still have not found a great way within Acrobat to address optical character recognition (OCR) errors. The situation is that the text was incorrectly recognized but Acrobat does not perceive the issues as suspect and thus the tools typically in Acrobat to fix OCR suspects are not available. I'm not sure if there is a way to flag the content as suspect somehow - but it seems silly to not allow you to edit any of the OCR text unless it's a suspect.
OCR'd content appears to have hidden objects that represent the text for the tags structure, but this text is not itself editable. Acrobat has had an edit text option for the last couple of versions that does a good job of letting you edit the visual content in a typeface that looks like the OCR'd text, but I am dealing with a document that can't be edited in that way for legal reasons. I need to edit the hidden text.
In addition, hacks like the use of actual text don't work with mobile devices, so that approach is not an option. The only way I have found is to artifact the object, create a new text box, put the text in that, and hide it behind the image. That does work across desktop and mobile assistive technology.
I also played with the preflight option to make OCR text into layers. It does a good job converting the OCR text into a different layer that can be edited. The challenge is then merging or flattening the layers back into one. When I try that I either lose the content in all the tags or I end up with duplicated text on screen, even though I have chosen to not display the layer and marked the layer as a reference layer. Has anyone had luck with this method?
Does anyone have any thoughts on how best to edit OCR text in Acrobat when you cannot edit the visual text and OCR suspects are not detected? I've tried axesPDF QuickFix but it doesn't seem to have any options for this either. I believe some programs like ABBYY FineReader could be used, but my license for that is very old.
Best Regards,
Jonathan
From: Philip Kiff
Date: Sat, May 15 2021 6:52AM
Subject: Re: Fixing OCR issues in PDF with Adobe Acrobat Pro
I haven't worked on a challenging OCR'd PDF in a year or two, but I could
have sworn there was a way to get to a mode that would allow you to edit
*any* of the OCR'd text, not just the suspect text without switching to
a replacement font. The interface was terrible and the way to switch
from editing suspect text to editing any text was not at all obvious.
Mmmm....I can't find a sample of a case where I did this, so maybe I'm
mis-remembering, and I actually used the "actual text" property - which
you already indicated wouldn't meet your needs.
I've never tried the other methods you propose. And yes, it does seem
that Acrobat has an entirely separate set of hidden object layers it uses to
manage OCR'd text. And I don't think axesPDF QuickFix provides any
access to it, either.
Phil.
From: Karen McCall
Date: Sat, May 15 2021 7:07AM
Subject: Re: Fixing OCR issues in PDF with Adobe Acrobat Pro
You might be able to use the Edit PDF tools IF you haven't tagged the document yet. If you have, using this tool will destroy all tags either on that page or in the document...and you have to know what is wrong in the text before you can fix it. The Edit PDF capability may show you the correct spelling and spacing but the underlying OCR is wrong. I never recommend using this tool but offer it as an option if you can get it to do what you want it to do.
The problem using Actual Text for large pieces of content is that the Text-to-Speech tools have to use a different reading mode for images of text and sometimes lose the ability to follow along with highlighting. Same with something like ZoomText Fusion...you lose the ability of JAWS to highlight where you are reading. I recommend against using the Actual Text attribute for large pieces of text and entire documents.
I use ABBYY FineReader for any PDF document that I need to OCR. The latest version even has the capabilities to add form controls to the PDF (you have to do some remediation in the Tags Tree in Acrobat but these are minor). Others use OmniPage Pro and either can be purchased on sale for a reasonable price, not a subscription.
FineReader has two ways of dealing with scanned documents:
As soon as you open a scanned document the OCR is done, and you can resave the document as a searchable PDF without looking at any suspects or issues of spacing between words and characters. I use this when I just want to read a PDF that isn't tagged or is a scan, because I can also send the document to Word.
The other tool in FineReader (and OmniPage Pro) is the ability to create an OCR project, open the PDF and access their text editor. I can use JAWS in the text editor, so I can hear when words aren't correct, when there are no spaces between words, or when there are spaces between characters. There is a sort of Styles pane where you can add structure; text, images and tables are identified in the document; and I have the ability to find and replace optional hyphens. I had a really horrible scan of a book with handwritten notes, doodles and diagrams in the margins and around the text, and within a few hours I had a PDF document that was readable with my screen reader.
I never use the Acrobat OCR for the reason mentioned...I can't rely on it telling "the truth" about what it found and what it missed. I ended up spending time remediating the scanned PDF only to find that words were wrong, some paragraphs had no spaces between words and others had spaces between characters in words. I save time by using one of the stand-alone OCR tools.
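The two spacing failures described above (letters separated by spaces, and words run together) can also be screened for automatically once the recognized text has been copied out of the PDF. The following is only a minimal sketch under stated assumptions: the function name `find_spacing_problems` and the thresholds `max_word_len` and `min_run` are hypothetical choices for illustration, not part of Acrobat, FineReader or OmniPage, and the heuristics will miss some errors and flag some legitimate text.

```python
import re

def find_spacing_problems(text, max_word_len=20, min_run=4):
    """Flag two OCR spacing failures in plain text (heuristic, hypothetical helper).

    Returns a list of (problem_kind, matched_text) tuples.
    """
    problems = []

    # Spaces between characters: a run of at least `min_run` single letters,
    # each separated by one space, e.g. "s c a n n e d".
    spaced_letters = re.compile(
        r"(?<!\S)(?:[A-Za-z] ){" + str(min_run - 1) + r",}[A-Za-z](?!\S)"
    )
    for match in spaced_letters.finditer(text):
        problems.append(("spaced-out letters", match.group(0)))

    # No spaces between words: an alphabetic token longer than `max_word_len`.
    # The threshold is an assumption; long legitimate words will also be flagged.
    for token in re.findall(r"[A-Za-z]+", text):
        if len(token) > max_word_len:
            problems.append(("possible missing spaces", token))

    return problems


sample = "The s c a n n e d page hasnospacesbetweenthewordsatall here."
for kind, snippet in find_spacing_problems(sample):
    print(kind, "->", snippet)
```

A check like this only points at places to look; the corrections themselves still have to be made in whichever OCR tool produced the text.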
Cheers, Karen
From: Philip Kiff
Date: Sat, May 15 2021 8:04AM
Subject: Re: Fixing OCR issues in PDF with Adobe Acrobat Pro
Just a quick follow-up on the Adobe Acrobat Pro DC interface for OCR. I
found a file that I edited last year, and Acrobat Pro does seem to allow
editing the way I remember?
When I open this scanned PDF and have the original image displayed (i.e.
not replacement font but an exact copy of the original image), I can
open up the Scan & OCR Tool, and then select "Recognize Text" in the
toolbar, and there is a checkbox "Review recognized text" that appears
on the left in the sub-toolbar menu that opens below it. When I select
that, initially only suspects appear editable even though I've selected
the checkbox - the suspects are surrounded in red boxes. But on that
screen I can then double click randomly on a piece of text and it will
allow me to change the interpreted text for that snippet by editing the
"image ... recognized as ..." entry for that newly selected box?
It is for sure a terrible interface. And it does not actually seem like
you can edit text. I had to flip back and forth between several pages
before it started to work. But you can edit the text that way - or at
least I can in this PDF?
My interface looks similar to what I see under step 2 under "How to
correct OCR errors" on this page from OneLegal (about whom I know
nothing, but whose page I just found now because they happen to have
instructions that seem to match what I'm seeing):
https://www.onelegal.com/blog/how-to-correct-ocr-errors-using-adobe-acrobat/
Phil.
From: Philip Kiff
Date: Sat, May 15 2021 8:22AM
Subject: Re: Fixing OCR issues in PDF with Adobe Acrobat Pro
Oh, and here's a tip on using this method (if you can get it to work).
As Karen noted, "you have to know what is wrong in the text before you
can fix it." In such cases, I have used the "screenreader preview" view
of the PAC tool (or in my case, axesPDF Quick Fix's integrated version
of PAC) to get a copy of the "hidden" text that Acrobat is showing to
screenreaders. You can cut and paste the output from that PAC window
into your favourite text editor, and then I've used that output to run
spell-check and/or review the outputted text so I can find the OCR
errors that I want to fix.
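A minimal sketch of that review step, assuming the text has already been pasted out of the screen-reader preview into a plain string. The function name `find_ocr_suspects` and the `known_words` list are hypothetical; in practice the word list would come from whatever dictionary you trust, and this is not how PAC or Acrobat checks text. It flags words that are missing from the list and words that mix letters with digits, a common OCR substitution.

```python
import re

def find_ocr_suspects(text, known_words):
    """Return (line_number, word) pairs that look like OCR errors (heuristic)."""
    known = {w.lower() for w in known_words}
    suspects = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # A token is a run of letters, digits, or apostrophes.
        for token in re.findall(r"[A-Za-z0-9']+", line):
            word = token.strip("'")
            if not word or word.isdigit():
                continue  # skip empty strings and plain numbers such as years
            has_digit = any(c.isdigit() for c in word)
            has_letter = any(c.isalpha() for c in word)
            # Flag letter/digit mixes ("qu1ck") and anything not in the word list.
            if (has_digit and has_letter) or word.lower() not in known:
                suspects.append((line_number, word))
    return suspects


# Hypothetical word list and pasted text, for illustration only.
words = ["the", "quick", "brown", "fox", "was", "signed", "in"]
pasted = "The qu1ck brown fox\nwas sigried in 2021"
for line_number, word in find_ocr_suspects(pasted, words):
    print(f"line {line_number}: {word}")
```

The line numbers refer to the pasted text, not to PDF pages, so each hit still has to be located in the document by hand.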
And Karen I am sure is right that ABBYY FineReader or OmniPage Pro are
better tools for all these things.
Phil.
From: Jonathan Avila
Date: Sat, May 15 2021 4:34PM
Subject: Re: Fixing OCR issues in PDF with Adobe Acrobat Pro
Thanks for the tips, Phil and Karen. The preflight feature to make layers works well for getting at the incorrect text, and you can also select the text in the document and press Ctrl+C to copy it. But I was still not successful with any of your tips for editing only the hidden text. Re-OCRing the page and turning on "Review recognized text" made no difference: it grays out the text, and clicking around doesn't allow me to edit any of the other text. Using the edit text feature won't work as it changes the visual appearance of the text. When I re-OCR'd I even selected searchable exact text hoping that it might work better, but no luck.
I agree that actualText is not meant for this, and it also has the effect of hiding the visual rectangle from some assistive technologies. In this case it's not well supported on mobile, so it's not an option for several reasons!
Best Regards,
Jonathan