WebAIM - Web Accessibility In Mind

E-mail List Archives

Re: Html from a .pdf file, what is the best way?

From: Sean Keegan
Date: Apr 9, 2010 1:33AM


>> There is a big government report being published in a couple of weeks
>> in my home country, over 2000 pages, but one which will interest a lot
>> of people. I was contacted this morning and asked what would be the
>> best way to make its contents accessible to our blind/VI users.
>> They have it as plain text and as a series of .pdf files.

Hi Birkir,

A few questions and thoughts:
- Is the PDF marked up in any way? Is it a tagged PDF with any kind
of semantic markup (e.g., headings, etc.)?
- What was the source application used to create the PDF?
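
As a rough first check on the tagging question, the sketch below scans the raw bytes of a PDF for the two catalog keys a tagged PDF normally carries (/StructTreeRoot and /MarkInfo). The function names are made up for illustration. This is only a heuristic: a newer PDF can store its catalog in a compressed object stream, so a negative result proves nothing, and a positive one does not say how good the tags are.

```python
def tagging_hints(pdf_bytes: bytes) -> dict:
    """Look for the catalog keys that usually accompany a tagged PDF.

    Heuristic only: the keys can be hidden inside compressed object
    streams, so False means "not seen", not "not tagged".
    """
    return {
        "has_struct_tree": b"/StructTreeRoot" in pdf_bytes,
        "has_mark_info": b"/MarkInfo" in pdf_bytes,
    }


def tagging_hints_for_file(path: str) -> dict:
    """Convenience wrapper (hypothetical helper) that reads a file from disk."""
    with open(path, "rb") as handle:
        return tagging_hints(handle.read())
```

If both values come back True, it is worth opening the file in a PDF tool that can display the tag tree before deciding on a conversion route.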

Trying to do semantic markup on a few thousand pages of text can be
tedious (not to mention coordinating page information), but if the PDF
has any markup, you might be able to try a few file conversions. We
often use Nuance applications (mostly OmniPage, but have also used PDF
Converter) to convert a PDF document into a MS Word/.doc file that
maintains a PDF page to MS Word page relationship. From there, you
could go to DAISY or HTML. You can also go straight from OmniPage to
HTML, but I have not tried the last two versions to verify how clean
the HTML code is. In part, it depends on the PDF and what information
that document contains.

Whatever you do, I would highly recommend using HTML Tidy or some
other type of document "cleaner" to remove all the extraneous MS Word
markup that finds its way into the document if you are going to HTML.
A bit of CSS and you are all set.
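
To give an idea of what such a cleaner removes, here is a minimal sketch using only regular expressions from the Python standard library. It is not HTML Tidy and not a replacement for it; the function name and the list of patterns are assumptions chosen for illustration, covering conditional comments, Office namespace tags such as <o:p>, Mso* class names, and inline style attributes.

```python
import re


def strip_word_markup(html: str) -> str:
    """Remove common MS Word export residue from an HTML string."""
    # Drop conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop Office namespace tags such as <o:p> and </o:p>, keeping any text
    html = re.sub(r"</?[ovw]:[^>]*>", "", html, flags=re.IGNORECASE)
    # Drop class attributes that name Word styles, e.g. class="MsoNormal"
    html = re.sub(r"\s+class=(\"Mso[^\"]*\"|'Mso[^']*'|Mso\w+)", "", html,
                  flags=re.IGNORECASE)
    # Drop quoted inline style attributes; presentation moves to CSS
    html = re.sub(r"\s+style=(\"[^\"]*\"|'[^']*')", "", html,
                  flags=re.IGNORECASE)
    return html
```

Regular expressions are fragile on real-world HTML, so on a 2000-page report a proper parser-based tool is the safer choice; the sketch only shows the kind of residue to expect.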

With respect to DAISY vs. HTML - the advantage DAISY offers is that of
page navigation (i.e., you can enter the page number you want and
instantly go to that "page"). There are free plug-ins for MS Word as
well as Open Office that allow for the creation of a DAISY XML book
or even a full-text/full-audio DAISY book. More information about
conversion tools is at: http://www.daisy.org/tools/conversion
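
The page navigation idea can be sketched in a few lines: page markers in the text carry an id, and a lookup table maps a page number to that id. The sketch below assumes page markers are written as span elements whose class begins with "page-" and that each has an id attribute; that markup convention, the class name PageIndex, and the helper build_page_index are assumptions for illustration, so check what your conversion tool actually emits.

```python
from html.parser import HTMLParser


class PageIndex(HTMLParser):
    """Collect {page number text: element id} from page-marker spans."""

    def __init__(self):
        super().__init__()
        self.pages = {}
        self._pending_id = None  # id of the page span we are currently inside

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        css_class = attributes.get("class") or ""
        # Assumed convention: <span class="page-normal" id="p12">12</span>
        if tag == "span" and css_class.startswith("page-") and attributes.get("id"):
            self._pending_id = attributes["id"]

    def handle_data(self, data):
        if self._pending_id is not None and data.strip():
            self.pages[data.strip()] = self._pending_id
            self._pending_id = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._pending_id = None


def build_page_index(xhtml: str) -> dict:
    """Parse a document and return its page-number lookup table."""
    parser = PageIndex()
    parser.feed(xhtml)
    parser.close()
    return parser.pages
```

With such a table, "go to page 12" on the HTML version becomes a link to the fragment whose id is stored under "12".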

Ron S. has already outlined some of the advantages in a previous
message so I will not repeat here. One thing to keep in mind is that
the text information in a DAISY file is an XHTML document. So, you are
still dealing with (X)HTML regardless of your choice between DAISY and a
Web page. Some people I have met create a DAISY book, and then copy
out the XHTML file and then add in some customized CSS to
differentiate the two versions. Site visitors are then offered
different versions for consumption (much like Web pages that have a
PDF link for users to download).
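
A minimal sketch of that last step, assuming the XHTML file copied out of the DAISY book has an ordinary head element: it inserts a link to a separate stylesheet just before the closing head tag. The function name and the stylesheet file name are hypothetical.

```python
from xml.sax.saxutils import quoteattr


def add_stylesheet(xhtml: str, href: str) -> str:
    """Insert a stylesheet link just before </head> in an XHTML string."""
    # quoteattr escapes the value and wraps it in quotes for us
    link = f'<link rel="stylesheet" type="text/css" href={quoteattr(href)} />'
    head_end = xhtml.lower().find("</head>")
    if head_end == -1:
        raise ValueError("no </head> found in document")
    return xhtml[:head_end] + link + "\n" + xhtml[head_end:]
```

For example, add_stylesheet(text, "web.css") leaves the DAISY copy untouched and gives the Web copy its own presentation.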

You are absolutely correct in that if you have a properly marked up
source document, you can go into different formats, whether it be
DAISY, HTML, etc. The trick is getting that properly marked up source
document! <grin>

Take care,
Sean