E-mail List Archives
Thread: Html from a .pdf file, what is the best way?
Number of posts in this thread: 15 (In chronological order)
From: Birkir Gunnarsson
Date: Wed, Apr 07 2010 10:00AM
Subject: Html from a .pdf file, what is the best way?
Hey gang
I apologize if this question is borderline topic.
There is a big government report being published in a couple of weeks in my
home country, over 2000 pages, but one which will interest a lot of people.
I was contacted this morning and asked what would be the best way to make
its contents accessible to our blind/VI users.
They have it as plain text and as a series of .pdf files.
I believe a .pdf file of this size (each of them is over 150 pages) may cause
problems with Adobe Reader accessibility unless the buffer is set to 30
pages or fewer (please correct me if I am wrong here).
Also, if there is a link on page 2 of that document that refers to page,
say, 120, what happens in Adobe Reader? Assume the reader clicks on the
link: will Adobe Reader load page 120 and the following 30 pages into a
buffer?
I am just not sure that .pdf is a good format, and I am not sure the .txt
format is good either, since it does not allow for any text links and it is
an awfully large document.
But, assuming I get to the person with the source document, how hard is it
to export it to marked-up HTML (headings etc.)?
I would think that would be the ideal format for a very large document in
many sections, for a blind user.
If anyone has an opinion on this it would be most appreciated.
Thanks
-Birkir
From: Ron Stewart
Date: Wed, Apr 07 2010 10:09AM
Subject: Re: Html from a .pdf file, what is the best way?
I would recommend DAISY as the ideal format, and HTML as the second best.
DAISY would provide full navigability and better searchability than HTML.
Ron Stewart
From: Simius Puer
Date: Wed, Apr 07 2010 11:30AM
Subject: Re: Html from a .pdf file, what is the best way?
Hi Birkir
That's not borderline at all - it's a very valid question.
I would argue that a single HTML format for both disabled and non-disabled
would be the most accepted approach. Having more than one format creates
additional work in managing the content (unless it is managed from a single
source and output to multiple formats).
Converting PDF to HTML is a huge topic (as is the accessibility of PDFs) and
if you scan the archives on this list you should find plenty of material
covering it.
The key point from all the discussions is "it all depends on the quality and
consistency of the source document". Typically most PDFs are generated from
source documents (such as MS Word) that have little or no structural
mark-up. So converting that PDF (or indeed the source document) to HTML by
an automated tool will ultimately fail. This is not the fault of the source
format, nor the auto-converter, but the skills of the people creating the
source documentation.....rubbish-in, rubbish-out!
From experience, if this is a one-off document then it would probably benefit
from a manual conversion carried out by a trained professional.
*Important caveat*: the HTML document would only be as accessible as the
website via which it is made available!
One article from our website may be of interest to you:
- Convert PDF, Word (and other formats) to accessible, semantic and lean
HTML
http://www.simiusweb.ie/document_conversion_to_xhtml.html
I'd suggest your best approach would be to find an in-house resource who
understands HTML and accessibility to convert the document manually, or to
use a bureau service provided by a company (that speaks the native language
of the document).
Best regards
From: John Foliot
Date: Wed, Apr 07 2010 1:18PM
Subject: Re: Html from a .pdf file, what is the best way?
Ron Stewart wrote:
>
> I would recommend DAISY as the idea format, and HTML as the second
> best.
> DAISY would provide for full navigable and better search ability than
> HTML.
I would have to disagree. DAISY is great if you have a DAISY player, but
for sighted folks like myself, I would much rather have an HTML document
over PDF, and either of those over DAISY (as I have no local means of
consuming it).
I am concerned when single-user-type solutions are proposed that exclude
other types of user, regardless of ability or disability.
If you are going to put in the effort to make this document accessible, go
for the bigger win (with HTML) rather than a limited win. (Depending on
how the document originated, you *might* be able to convert it to HTML
3.2, which could then be cleaned up to be more 'current' -
http://www.adobe.com/products/acrobat/access_onlinetools.html)
Just my $0.02 worth
JF
From: Ron Stewart
Date: Wed, Apr 07 2010 1:54PM
Subject: Re: Html from a .pdf file, what is the best way?
John, once again I guess we need to agree to disagree. As a reading medium,
many feel, myself included, that DAISY3 is superior to standard HTML,
particularly when we are talking about large documents.
The original conversation was about accessibility to folks who have B/VI
related disabilities and with a document of 2000 pages I would not look to
HTML first. I for one would not want to have to navigate a document of this
size in HTML.
Here is my reasoning:
1. There are free DAISY players available.
2. Structured HTML as well as indexed MP3s are byproducts of the DAISY
production process.
3. Individual preferences and the "needs" of individuals with
print-related disabilities can easily be met by a single production
process.
4. The expressed need for page-level navigation once again, I think,
points us back to a DTB.
5. The production time to DAISY and to HTML in this instance are equivalent.
In my opinion DAISY is a much more flexible format when it comes to meeting
the needs of a lot of users, but then we all have our opinions.
Ron Stewart
From: Simius Puer
Date: Wed, Apr 07 2010 1:57PM
Subject: Re: Html from a .pdf file, what is the best way?
Apologies - that link should have been:
http://www.simiusweb.ie/document_conversion_to_xhtml.htm
...I'm just getting old ;)
oh, and I have to say I agree with John on this one...HTML "is the Web"
From: Birkir Gunnarsson
Date: Wed, Apr 07 2010 4:00PM
Subject: Re: Html from a .pdf file, what is the best way?
Simius
Thanks very much for that link, however it seems more an ad for
transcription services than technical analysis of the problem (although I
have some material on what that is and have sent it on).
If you are willing to transcribe Icelandic documents you can certainly
contact me off-list *grin* but there might be language barriers.
Cheers and thanks for all these comments, they are all helpful.
It really boils down to having the source and marking it up properly, after
which html or Daisy is more of a matter of preference in some ways, and both
could be produced without too much overhead.
-Birkir
From: Eoin Campbell
Date: Thu, Apr 08 2010 3:27AM
Subject: Re: Html from a .pdf file, what is the best way?
Since it's 2,000 pages, the best thing you can do to ensure accessibility
is to have a shorter executive summary that people might actually read.
You should also ensure that there is an internal search facility limited
to the document itself, so that people can quickly search through it to
find bits of relevance to them.
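A minimal sketch of such a document-only search, assuming the report has
already been split into sections held in a Python dict of title to text.
The names build_index and search are illustrative, not taken from any
existing tool. Python's \w matches Unicode letters in str patterns, so
Icelandic characters should be tokenized like any others:

```python
import re
from collections import defaultdict

def build_index(sections):
    """Map each lowercased word to the set of section titles containing it."""
    index = defaultdict(set)
    for title, text in sections.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(title)
    return index

def search(index, query):
    """Return the sorted titles of sections containing every query word."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    # index.get() avoids creating empty entries for unknown words.
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

Because the index is built only from the report's own sections, results
cannot leak in from the rest of the site.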
The document should be broken into reasonably sized chunks too,
probably to the section level within chapters.
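As a rough sketch of that chunking step, assuming the report already exists
as one HTML string in which sections start at a given heading level (the
function name and the default of h2 are assumptions for illustration):

```python
import re

def split_into_sections(html, level=2):
    """Split an HTML string into chunks, each starting at a heading of `level`.

    Any text before the first such heading is returned as its own chunk.
    """
    starts = [m.start() for m in
              re.finditer(rf"<h{level}\b", html, flags=re.IGNORECASE)]
    # Chunk boundaries: start of document, every heading, end of document.
    bounds = sorted(set([0] + starts + [len(html)]))
    chunks = [html[a:b] for a, b in zip(bounds, bounds[1:])]
    return [chunk for chunk in chunks if chunk.strip()]
```

Each chunk could then be written to its own page, with links between them.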
You should ask for the original source, which must be something like
QuarkXPress or InDesign, I imagine, as this is probably much easier to convert
to structured HTML than PDF or plain text.
Depending on your budget (do you have a budget?), there are companies
and products that attempt to convert PDF into structured formats, but
I am not sure whether they would handle the Icelandic characters and language.
Perhaps a manually formatted accessible summary, and a searchable but less
accessible full-text set of document sections would be a reasonable
compromise.
You should be cautious about the plain-text version. If it is generated
from the PDF, the text might not be in the correct reading order.
--
Eoin Campbell
= EMAIL ADDRESS REMOVED =
From: Simius Puer
Date: Thu, Apr 08 2010 2:24PM
Subject: Re: Html from a .pdf file, what is the best way?
Hi Birkir
It wasn't meant to be a technical analysis as such - more an overview of a
common situation that a great many people suffer and a look at different
ways to solve the issue depending on the scale of the problem.
And yes, one solution we offer is a bureau transcription service - I think
you can forgive us for advertising on our own website.
Alas, my Icelandic is a little rusty ;] Whilst it is perfectly possible to
do the conversion without speaking the language, it is impossible to
guarantee accessibility without being able to check the content itself. I
only mention this as there are companies out there who make outrageous
guarantees of accessibility to clients who are not usually in a position to
know any better.
Just a note on marking up the source document for auto-conversion (as
that sounds like the route you have chosen) - even if you get this right
*and* you have a reliable conversion tool, you still need to check the final
document (in code view - HTML/XML or whatever you decide) to ensure it is
accurate. Just because it looks right visually and validates/parses does
not guarantee accessibility - not by a long way. This was why I suggested
the manual conversion route given your scenario, as you can either:
a) mark-up the source document, auto-convert (having evaluated the
conversion tool) to your chosen code format, and then check the output is
both accurate (in terms of reading order and structure) and accessible
or
b) mark-up the content directly in your chosen code format
Sure, you still need to check the document that comes out of option b), but
it is normally a lot less work and more accurate.
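One small part of that checking can be automated. The sketch below, a rough
regex-based check rather than a full HTML parser, flags places where a
heading skips a level (for example h1 followed directly by h3), which is
one way structure can be wrong even when the page validates. The function
name is hypothetical, and passing this check says nothing about reading
order or the rest of accessibility:

```python
import re

def skipped_heading_levels(html):
    """Return (previous_level, level) pairs where a heading skips a level.

    A previous_level of 0 means the document's first heading is not h1.
    """
    levels = [int(n) for n in
              re.findall(r"<h([1-6])\b", html, flags=re.IGNORECASE)]
    problems = []
    previous = 0
    for level in levels:
        if level > previous + 1:  # moving deeper by more than one level
            problems.append((previous, level))
        previous = level
    return problems
```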
From: Stephan Wehner
Date: Thu, Apr 08 2010 2:39PM
Subject: Re: Html from a .pdf file, what is the best way?
On Wed, Apr 7, 2010 at 8:03 AM, Birkir Gunnarsson
< = EMAIL ADDRESS REMOVED = > wrote:
> Hey gang
>
> I apologize if this question is borderline topic.
> There is a big government report being published in a couple of weeks in my
> home country, over 2000 pages, but one which will interest a lot of people.
> I was contacted this morning and asked what would be the best way to make
> its contents accessible to our blind/VI users.
> They have it as plain text and as a series of .pdf files.
Isn't plain text the best in terms of accessibility? I prefer it
myself when it comes to reading lengthy stuff.
You might also consider tools like Markdown,
http://daringfireball.net/projects/markdown/
and Textile, http://www.textism.com/tools/textile/
It doesn't seem likely they generated the PDF from the plain text
version. Do you know how they produced the PDFs?
Stephan
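To illustrate the lightweight-markup idea mentioned above, here is a toy
sketch that turns "#"-style heading lines in a plain-text file into HTML
headings. It is not the Markdown tool itself and covers only a tiny,
assumed subset of the syntax (headings, plus one paragraph per non-blank
line); the function name is made up for the example:

```python
import html
import re

def headings_to_html(text):
    """Convert '#'-style heading lines to <hN>; wrap other lines in <p>."""
    out = []
    for line in text.splitlines():
        match = re.match(r"(#{1,6})\s+(.*)", line)
        if match:
            level = len(match.group(1))  # number of '#' gives heading level
            title = html.escape(match.group(2).strip())
            out.append(f"<h{level}>{title}</h{level}>")
        elif line.strip():
            out.append(f"<p>{html.escape(line.strip())}</p>")
    return "\n".join(out)
```

The point is only that a few marker characters in the .txt source are
enough to recover headings that a screen reader can navigate.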
From: Birkir Gunnarsson
Date: Thu, Apr 08 2010 5:15PM
Subject: Re: Html from a .pdf file, what is the best way?
Hi Simius
Oh, do not get me wrong, I did not mean to criticize you for your link or
advertising, merely to point out that even if I were in a position to contract
such things to you, the language is a bit of a barrier *smiles*.
Thanks to you and all the other folks on this list for the very helpful input.
I am trying to find out what the source of the document is and I have
forwarded the advice overview from this list to the ministry, after that the
issue is more or less out of my hands.
My problem with plain text files (they have a .txt) is that when it comes to
a report of over 2000 printed pages with chapters and subchapters, tables,
lists etc., I prefer a structure that allows me to jump to a particular section
that interests me, or to jump past a long list of names or a table of
numbers that does not. The first 3 pages are a list of
contributors and "special thanks" people, and with Jaws and the down-arrow
key, having no idea how long the list will be, just getting past that list
takes a minute or two. For a document of, say, up to 10 or so pages that can
easily be read as a whole, plain text is great; for anything this bulky I
feel some indexing and link structure is vital to the quick perusal of said
text.
Thanks
-B
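The structure described above is essentially what heading markup provides.
As a sketch, assuming the report is available as HTML, the following lists
every heading with its level, which is roughly the outline a screen-reader
user can jump through instead of arrowing down line by line. The class and
function names are illustrative only:

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect (level, text) for every h1-h6 element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None

def outline(html):
    """Return the document's headings as a list of (level, text) tuples."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings
```

An empty result for a 2000-page document would be a strong hint that the
file offers no more navigation than the .txt version.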
From: Sean Keegan
Date: Fri, Apr 09 2010 1:33AM
Subject: Re: Html from a .pdf file, what is the best way?
>> There is a big government report being published in a couple of weeks
>> in my home country, over 2000 pages, but one which will interest a lot
>> of people.
>> I was contacted this morning and asked what would be the best way to
>> make its contents accessible to our blind/VI users.
>> They have it as plain text and as a series of .pdf files.
Hi Birkir,
A few questions and thoughts:
- Is the PDF marked up in any way? Is it a tagged PDF with any kind
of semantic markup (e.g., headings, etc.)?
- What was the source application used to create the PDF?
Trying to do semantic markup on a few thousand pages of text can be
tedious (not to mention coordinating page information), but if the PDF
has any markup, you might be able to try a few file conversions. We
often use Nuance applications (mostly OmniPage, but have also used PDF
Converter) to convert a PDF document into a MS Word/.doc file that
maintains a PDF page to MS Word page relationship. From there, you
could go to DAISY or HTML. You can also go straight from OmniPage to
HTML, but I have not tried the last two versions to verify how clean
the HTML code is. In part, it does depend on the PDF and what
information is in that document type.
Whatever you do, I would highly recommend using HTML Tidy or some
other type of document "cleaner" to remove all the extraneous MS Word
markup that finds its way into the document if going to HTML. A bit of CSS
and you are all set.
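To give a feel for what such cleaning involves, here is a deliberately
crude regex sketch that strips three common kinds of Word export residue:
<o:p> tags, class attributes beginning with "Mso", and style attributes
containing mso- properties. It is not a substitute for HTML Tidy, the
function name is invented for the example, and note that it drops a whole
style attribute even if only one property in it is Word-specific:

```python
import re

def strip_word_markup(html):
    """Remove some common MS Word export residue from an HTML string."""
    # Word's empty paragraph-marker tags: <o:p> and </o:p>
    html = re.sub(r"</?o:p\s*>", "", html, flags=re.IGNORECASE)
    # Word style classes such as class="MsoNormal" or class=MsoNormal
    html = re.sub(r'\s+class="?Mso\w*"?', "", html, flags=re.IGNORECASE)
    # Inline styles that contain Word-only mso-* properties
    html = re.sub(r'\s+style="[^"]*mso-[^"]*"', "", html, flags=re.IGNORECASE)
    return html
```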
With respect to DAISY vs. HTML - the advantage DAISY offers is that of
page navigation (i.e., you can enter the page number you want and
instantly go to that "page"). There are free plug-ins for MS Word as
well as Open Office that allow for the creation of a DAISY XML book
or even a full-text/full-audio DAISY book. More information about
conversion tools is at: http://www.daisy.org/tools/conversion
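Plain HTML can approximate that page navigation with one anchor per print
page. The sketch below assumes the text version marks page breaks with a
form-feed character, which may or may not be true of the file in question;
the "page-N" id scheme and the function name are likewise assumptions:

```python
import html

def pages_to_html(text):
    """Wrap each form-feed-separated page in a <div> with an id like 'page-3'."""
    parts = []
    for number, page in enumerate(text.split("\f"), start=1):
        body = html.escape(page.strip())
        parts.append(f'<div id="page-{number}"><p>{body}</p></div>')
    return "\n".join(parts)
```

A link such as <a href="#page-120"> would then jump straight to page 120,
although a DAISY player's "go to page" command is still more direct.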
Ron S. has already outlined some of the advantages in a previous
message so I will not repeat them here. One thing to keep in mind is that
the text information in a DAISY file is an XHTML document. So, you are
still dealing with (X)HTML regardless of your choice between DAISY or a
Web page. Some people I have met create a DAISY book, and then copy
out the XHTML file and then add in some customized CSS to
differentiate the two versions. Site visitors are then offered
different versions for consumption (much like Web pages that have a
PDF link for users to download).
You are absolutely correct in that if you have a properly marked up
source document, you can go into different formats, whether it be
DAISY, HTML, etc. The trick is getting that properly marked up source
document ! <grin>
Take care,
Sean
From: Moore,Michael (DARS)
Date: Fri, Apr 09 2010 8:06AM
Subject: Re: Html from a .pdf file, what is the best way?
Ron,
I have a question about the Daisy3 format.
You mentioned that the Daisy production process results in MP3 and structured HTML output. If one of the outputs is structured HTML, wouldn't you be able to access it with any browser, thus negating the need for a Daisy player? Or am I missing something?
Mike Moore
From: Ron Stewart
Date: Fri, Apr 09 2010 8:36AM
Subject: Re: Html from a .pdf file, what is the best way?
Morning,
The XHTML file would have to be extracted from the DTB folder in order to
do it directly, and the process for this depends on what type of DTB you are
creating. In a DAISY3 format you will need to extract it from the package
file; this can be done with the DAISY Pipeline or with EasyConverter, a
commercial package from Dolphin. There are probably other tools as well, but
these are the two that I use. A crude way to do this, if you want just the
straight text, is to change the .XML file extension to .HTML and then open it in a
browser. This will typically give you the linear text but none of the
navigability.
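As a hedged illustration of what the navigation information in that XML
looks like, the sketch below reads a DAISY 3 text file and lists its
headings and page numbers. It assumes the file uses DTBook-style element
names (h1 to h6 and pagenum) and ignores namespaces; a real book may need
more handling, and this is not how the DAISY Pipeline or EasyConverter work
internally:

```python
import xml.etree.ElementTree as ET

NAV_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "pagenum"}

def dtbook_outline(xml_text):
    """List (tag, text) for headings and page numbers, in document order."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        tag = element.tag.rsplit("}", 1)[-1]  # drop any '{namespace}' prefix
        if tag in NAV_TAGS:
            found.append((tag, "".join(element.itertext()).strip()))
    return found
```

That list is the raw material a player uses for heading and page
navigation, which is why simply renaming the file loses it.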
If it is in a DAISY2 format you can actually copy and paste the source from
the DTB folder. You will find two HTML docs: the NCC and another that has the
document title. It is this second file that you would open in the web
browser of your choice.
Ron Stewart
From: Christophe Strobbe
Date: Tue, Apr 13 2010 9:27AM
Subject: Re: Html from a .pdf file, what is the best way?
Hi,
In response to comments on DAISY in Sean's message and several other ones...
At 08:35 9/04/2010, Sean Keegan wrote:
>(...)
>With respect to DAISY vs. HTML - the advantage DAISY offers is that of
>page navigation (i.e., you can enter the page number you want and
>instantly go to that "page"). There are free plug-ins for MS Word as
>well as Open Office that allows for the creation of a DAISY XML book
>or even a full-text/full-audio DAISY book. More information about
>conversion tools at: http://www.daisy.org/tools/conversion
Note that this page is not fully up to date. For example, new
releases of the OpenOffice.org extension - now called odt2daisy -
were made on 9 November last year and on 10 April 2010. See
<http://odt2daisy.sourceforge.net/>.
>Ron S. has already outlined some of the advantages in a previous
>message so I will not repeat here. One thing to keep in mind is that
>the text information in a DAISY file is a XHTML document. So, you are
>still dealing with (X)HTML regardless of you choice between DAISY or a
>Web page. Some people I have met create a DAISY book, and then copy
>out the XHTML file and then add in some customized CSS to
>differentiate the two versions. Site visitors are then offered
>different versions for consumption (much like Web pages that have a
>PDF link for users to download).
In DAISY 2.x, the text content is really HTML, but this is not the
case in DAISY 3.0, which also supports MathML and SVG. The
OpenOffice.org extension mentioned above outputs both DAISY 2.02 and
DAISY 3.0 (and since the latest release also playlists for media
players). When you export to DAISY, this extension allows you to
create a CSS file for viewing the DAISY 3 XML in a web browser that
supports XML (i.e. most modern standards-compliant browsers). Of
course, using a software-based DAISY player gives you more options,
for example navigation. For a list of DAISY players (some of which
are browser extensions), see the Wikipedia article on DAISY at
<http://en.wikipedia.org/wiki/DAISY_Digital_Talking_Book>.
Best regards,
Christophe Strobbe
--
Christophe Strobbe
K.U.Leuven - Dept. of Electrical Engineering - SCD
Research Group on Document Architectures
Kasteelpark Arenberg 10 bus 2442
B-3001 Leuven-Heverlee
BELGIUM
tel: +32 16 32 85 51
http://www.docarch.be/
---
"Better products and services through end-user empowerment"
http://www.usem-net.eu/
---
Please don't invite me to LinkedIn, Facebook, Quechup or other
"social networks". You may have agreed to their "privacy policy", but
I haven't.