WebAIM - Web Accessibility In Mind

E-mail List Archives

Re: Title, tags, lang - where are they in a PDF document? Beginning or end?

From: Duff Johnson
Date: May 24, 2017 12:35PM


Hi Corrine,

> I think I need to clarify - this is a scan using code not a physical
> scanner. We've developed a scan for our Moodle instance. It can recognize
> text vs. an image of text but we are working on refining that scan
> further. Large documents take up a lot of cpu/memory so we are thinking we
> might be able to limit our scan the first 5-10 pages to see if there is a
> title, tags, etc. I'm just not sure where that data is stored - at the
> beginning or at the end of the PDF.

If the question is "how do I find out if this PDF is tagged?": the information denoting tags (structure elements) in the PDF is located in the body of the file. The nature of PDF is such that it's not easy to predict where in the file the information specific to structure elements (tags) is.

If your tool can parse, even a little, you would do well to spend a few minutes with the PDF specification - ISO 32000-1. It's available for free from here.

http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

The short version: you are looking for a 'structure element dictionary'. See clauses 14.7 and 14.8.
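
As a rough illustration only, here is a minimal sketch of a first-pass check, assuming the whole file can be read as bytes and that the relevant dictionaries are not stored inside compressed object streams (in many PDF 1.5+ files they are, and this scan would then miss them). The function name `quick_pdf_checks` and the result keys are hypothetical. The key names it looks for (/StructTreeRoot, /MarkInfo with /Marked, /Lang, /Title) come from ISO 32000-1; the byte-scan approach itself is a shortcut, not a substitute for a real parser.

```python
import re


def quick_pdf_checks(data: bytes) -> dict:
    """Heuristic token scan over raw PDF bytes (hypothetical helper).

    Each flag only says the key appears somewhere in the uncompressed
    part of the file. It does not prove the key sits in the document
    catalog or the Info dictionary, so false positives and false
    negatives are both possible.
    """
    return {
        # The catalog of a tagged PDF references a structure tree root.
        "struct_tree_root": b"/StructTreeRoot" in data,
        # The MarkInfo dictionary carries /Marked true for Tagged PDF.
        "marked_true": re.search(rb"/Marked\s+true", data) is not None,
        # /Lang followed by a literal or hex string; it may also appear
        # on individual structure elements, not only in the catalog.
        "lang_key": re.search(rb"/Lang\s*[(<]", data) is not None,
        # /Title followed by a string; outline (bookmark) items use the
        # same key, so this alone does not confirm a document title.
        "title_key": re.search(rb"/Title\s*[(<]", data) is not None,
    }
```

Because the scan covers the entire byte string rather than the first few pages, it does not depend on where in the file a given object happens to be written. Anything it reports should be confirmed with a parser that follows the cross-reference table to the catalog.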

> I know this is very technical question and maybe obscure but I figured this
> might be the right group.

Not really. If you are interested in developing PDF technology you might consider joining the PDF Association - see pdfa.org. Full disclosure: I'm the Executive Director of the PDF Association. Feel free to ask me any questions offlist.

Thanks,

Duff.

>
> ---------- Forwarded message ----------
> From: Corrine Schoeb < <EMAIL REMOVED> >
> Date: Wed, May 24, 2017 at 9:38 AM
> Subject: Title, tags, lang - where are they in a PDF document? Beginning or
> end?
> To: <EMAIL REMOVED>
>
>
> We are working on creating a scan of PDF documents, some of which are 100+
> pages. Rather than scan the full document to find out if it is tagged, has
> a title and language we thought we might be able to do the first 5-10 pages
> but I'm not sure where the title, tag, lang data is stored in a PDF.
>
> So my question is, are title, tag, lang attributes of a PDF stored at the
> beginning of a PDF or at the end?
>
> --
>
> Corrine Schoeb
> Technology Accessibility Coordinator, ITS
> 610-957-6208
>
> *** Swarthmore College ITS will never ask you for your password, including
> by email. Please keep your passwords private to protect yourself and the
> security of our network.
>
> To learn more about web security visit
> http://www.swarthmore.edu/its/security