WebAIM - Web Accessibility In Mind

E-mail List Archives

Re: tagged PDF and files size


From: Olaf Drümmer
Date: May 11, 2015 5:04AM


Anybody wishing to get a good idea of how much tagging increases the file size of his/her own PDFs can do the following:
- save your tagged PDF as optimised PDF in Adobe Acrobat (enabling object compression, but not removing data nor re-compressing images nor reducing image resolution)
- measure file size
- save the already optimised tagged PDF again as optimised PDF in Acrobat, this time enabling Discard Objects -> Discard Document Tags
- measure file size
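
The two measurements can be turned into a delta and a percentage with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and it assumes the percentage is taken relative to the size of the file with tags removed.

```python
import os


def tag_overhead(tagged_bytes: int, untagged_bytes: int) -> tuple[int, float]:
    """Return (delta, delta as a percentage of the untagged size)."""
    if untagged_bytes <= 0:
        raise ValueError("untagged size must be positive")
    delta = tagged_bytes - untagged_bytes
    return delta, 100.0 * delta / untagged_bytes


def tag_overhead_for_files(tagged_path: str, untagged_path: str) -> tuple[int, float]:
    """Same calculation, reading both sizes from files on disk."""
    return tag_overhead(os.path.getsize(tagged_path),
                        os.path.getsize(untagged_path))
```

Calling `tag_overhead_for_files("with-tags.pdf", "tags-removed.pdf")` (both file names are placeholders) on the two files saved in the steps above gives the figures directly.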

As an example, I have taken the Matterhorn Protocol document (text heavy, almost no image data, some non-trivial amount of tagging due to table structures) and applied the procedure described above:
with tags: 348 KB
tags removed: 314 KB
delta: 34 KB; ca. 11% file size increase

Taking the 32 page magazine style publication "Øjeblikket nr. 1 · Juni 2013" from the Danish association of visually impaired people (quite a number of images, as is typical for magazines), I get:
with tags: 3038 KB
tags removed: 2995 KB
delta: 43 KB; ca. 1.5% file size increase

Note: The Matterhorn Protocol and the Danish "Øjeblikket" issue are part of the "PDF/UA Reference Suite" (http://www.pdfa.org/publication/pdfua-reference-suite/) from the PDF Association (download is of course free of charge).

I would say this is the typical range to expect. In PDFs that contain images, the extra file size due to data structures for tags will hardly be noticed; in PDFs that contain just text, the file size increase could be up to ca. 10%.

Obviously the amount of file size increase due to tagging also depends on the density / granularity of structural data.

One important reason why the file size increase is typically relatively minor is that most tagged PDF files use compressed object streams. This is a mechanism in the PDF syntax where many data objects are compressed together as a single stream (as a user you would never notice this). Given the nature of the data structures related to tags (mostly 'text' based pieces of information that use the same keywords over and over again, i.e. high redundancy), compression works pretty well.
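
To get a feeling for how well such redundant data compresses, here is a rough Python sketch. It does not read a real PDF: the structure element dictionaries are synthetic and only loosely imitate what tagging data looks like, and zlib stands in for the Flate compression used for PDF streams.

```python
import zlib


def synthetic_struct_elems(count: int) -> bytes:
    """Build made-up structure element dictionaries, one per line."""
    lines = [
        f"<< /Type /StructElem /S /P /P {i + 1} 0 R /Pg {i % 32 + 1} 0 R /K {i} >>"
        for i in range(count)
    ]
    return "\n".join(lines).encode("ascii")


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)


if __name__ == "__main__":
    raw = synthetic_struct_elems(2000)
    print(f"raw: {len(raw)} bytes, ratio: {compression_ratio(raw):.2f}")
```

The exact ratio depends on the data, but the repeated keywords (/Type, /StructElem, /Pg and so on) are what makes it small.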


In terms of reducing the file size of an already tagged PDF:
After all editing (if any) is done, always do a "Save as", as this will remove unnecessary data (whether related to tags or not). Using "Save as optimised PDF" in Acrobat is one good option for that; other PDF tools provide similar features. Enable the use of "compressed object streams" when doing such a "Save as". If your PDF has lots of images, then there are probably bigger fish to fry than tags if you want to bring your PDF's file size down. Also, some versions of some tools (older versions of Creative Suite, e.g. CS4) had a tendency to accumulate enormous amounts of XMP metadata (for the document itself, but also through imported images and graphics). You can find out whether this might be the case by using (in Acrobat) Save as -> Save as optimised PDF -> Settings -> Audit space usage, and then looking at 'document overhead' (space used by metadata is included in this count, as is 'private data', i.e. data embedded by some PDF tool for some proprietary purpose). The "Audit space usage" info is a good diagnostic tool in general when it comes to file sizes. It's a shame it is hidden so well.
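
For a quick check outside Acrobat of whether XMP metadata might be bloating a file, the following Python sketch scans the raw bytes for XMP packets and adds up their sizes. It is no replacement for "Audit space usage": it assumes the packets are stored uncompressed, it will miss any that sit inside compressed streams, and the function names are invented for this example.

```python
import re

# One XMP packet, from its opening to its closing processing instruction.
XMP_PACKET = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)


def xmp_packet_sizes(data: bytes) -> list[int]:
    """Byte length of every uncompressed XMP packet found in the data."""
    return [len(m.group(0)) for m in XMP_PACKET.finditer(data)]


def xmp_share(data: bytes) -> float:
    """Fraction of the data occupied by uncompressed XMP packets."""
    return sum(xmp_packet_sizes(data)) / len(data) if data else 0.0


# Usage (the file name is a placeholder):
#   with open("document.pdf", "rb") as f:
#       data = f.read()
#   print(len(xmp_packet_sizes(data)), "packets,", f"{xmp_share(data):.1%} of file")
```

Many packets, or a large share, would suggest that metadata rather than tagging is worth looking at first.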


That said: from what I have seen in real-world (tagged) PDF files, in most cases unnecessary file size was due to reasons other than the presence of tags…


Olaf


On 11 May 2015, at 12:14, Steve Faulkner < <EMAIL REMOVED> > wrote:

> Any info on expected file size increase when a PDF is tagged and any advice
> on how to reduce file size of tagged PDF's?
> --
>
> Regards
>
> SteveF
> HTML 5.1 <http://www.w3.org/html/wg/drafts/html/master/>