Re: (PDF) missing white space between words


From: Philip Kiff
Date: Apr 27, 2020 3:47PM

I'm not sure of all the possible reasons why this happens, but I have
run into this numerous times and I have a couple of theories.

Generally speaking, I notice this happening almost exclusively in files
that were not created with the "tagged PDF" feature of whatever software
generated the PDF.

It can definitely happen, for example, with current versions of Adobe
Illustrator and older versions of InDesign: text is often generated
in separate chunks, one unit per line (these are not placed in
"containers"). When you tag a series of such text chunks manually, the
individual chunks may not have a blank space at the end of the line, so
the last word of one line and the first word of the next end up merged
together in your paragraph.
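
To illustrate the mechanism, here is a minimal Python sketch. It does not
use any PDF library; the function names and the sample lines are made up
for the example, and it only assumes that each line arrives as a separate
string with no trailing space.

```python
def naive_join(chunks):
    """Concatenate chunks exactly as they are, as a tag that simply
    collects the line chunks would."""
    return "".join(chunks)


def join_with_spaces(chunks):
    """Concatenate chunks, inserting one space at any boundary where
    neither side already has whitespace."""
    result = ""
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks
        if result and not result[-1].isspace() and not chunk[0].isspace():
            result += " "
        result += chunk
    return result


# Hypothetical per-line chunks, none ending in a space
lines = ["text will often be", "generated in separate", "chunks"]

print(naive_join(lines))        # words at line ends are merged
print(join_with_spaces(lines))  # a space is restored at each boundary
```

A real fix happens in the PDF content or tags rather than in a string,
but the boundary check is the same idea.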

I have also seen space issues appear in PDFs after you use the
"auto-tag" feature in Adobe Acrobat Pro. In some files, using this
feature results in all the spaces on a page being collected together in
a single artifact tag at the end of a physical page. It's unclear to me
why this happens, but I think again it is related to weird ways that
some source software generates PDF structures. Most of the time, these
autotagged files have spaces between words that appear just fine despite
an entire other set of "artifacted" spaces. I've wondered if this is a
formatting side-effect in software that applies custom line spacing or
kerning to text, but I really have no idea.

Shadow effects and outlined text can also sometimes produce weird
effects with duplicate text. If you are not careful when artifacting the
duplicate text snippets (usually line by line), you can end up with text
that is missing spaces: usually only one of the duplicate text pieces
actually has spaces between the words, so you have to keep the right one.
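
The "select the right one" step can be sketched in Python as follows.
This is only an illustration on plain strings: the function name and the
sample snippets are hypothetical, and it assumes the duplicates differ
only in their whitespace.

```python
def pick_spaced_duplicate(candidates):
    """Given duplicate text snippets that differ only in whitespace,
    return the one that contains the most whitespace characters."""
    # Check that the candidates really are the same text apart from spaces
    stripped = {"".join(c.split()) for c in candidates}
    if len(stripped) != 1:
        raise ValueError("candidates are not duplicates of the same text")
    return max(candidates, key=lambda s: sum(ch.isspace() for ch in s))


# Hypothetical duplicate snippets from a shadowed heading
duplicates = ["Annualreport2020", "Annual report 2020"]
keep = pick_spaced_duplicate(duplicates)
print(keep)  # the copy to keep tagged; the other copy gets artifacted
```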

Also, note that some PDF remediation software has features that allow
you to insert spaces back in between words in documents that are missing
them.


Philip Kiff
D4K Communications

On 2020-04-27 17:21, Birkir R. Gunnarsson wrote:
> Gang
> One problem I keep having with PDF files that have been remediated by
> various sources is the merging of two words into one.
> This seems to happen fairly randomly, at least non-visually.
> Why this behavior?
> Could it be due to reading order tagging of text presented in a column
> layout where there is no space between the last word in one column and
> the first word in the next?
> Any words from the wise (at least wiser than I, which isn't hard)
> would be helpful.