E-mail List Archives

Re: Lang attribute and "old" latin


From: Jukka K. Korpela
Date: Apr 25, 2008 12:20AM

John Foliot - Stanford Online Accessibility Program wrote:

> As far as I know, current screen reading technology only supports a
> limited number of languages.

Rather limited, I'm afraid. Moreover, support for language switching on
the basis of language markup (lang or xml:lang attributes) is much more
limited.

In practical terms, using language markup at the top level (<html> or
<body> element) is a good move: it takes a very small effort, and it
helps some people. (But then it should be _correct_. It often isn't, so
e.g. Google does not use the information.)
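
As a rough illustration of checking the top-level declaration, here is a minimal Python sketch that reports the lang attribute of the <html> element, if there is one. The class name RootLangFinder, the helper root_lang and the sample markup are made up for this example; only the standard library's html.parser module is used. It only tells you whether a language is declared, not whether the declaration is _correct_.

```python
from html.parser import HTMLParser


class RootLangFinder(HTMLParser):
    """Records the lang attribute of the first <html> start tag, if any."""

    def __init__(self):
        super().__init__()
        self.root_lang = None
        self.seen_root = False

    def handle_starttag(self, tag, attrs):
        # html.parser hands us tag and attribute names already lowercased.
        if tag == "html" and not self.seen_root:
            self.seen_root = True
            self.root_lang = dict(attrs).get("lang")


def root_lang(markup):
    """Return the lang value declared on <html>, or None if it is absent."""
    finder = RootLangFinder()
    finder.feed(markup)
    return finder.root_lang


# Hypothetical sample page: English at the top level, Latin text inside.
sample = '<html lang="en"><body><p>Carpe diem</p></body></html>'
print(root_lang(sample))
```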

Using language markup at other markup levels, e.g. for individual
paragraphs or even words, is rather pointless, sad to say. There isn't
much support worth mentioning. (I use it, but mostly as a matter of
principle, or habit, and not very consistently. Many W3C pages,
including pages that declare that it should be used, don't use it. Most
web pages don't even try, so what motivation is there for
software developers to support it?)

That's the big picture. In details, there's a lot that could be said,
especially about the problems, but this doesn't seem to be an
interesting topic to most people. However, mostly for "academic"
interest, I'll comment on your specific issues:

> I am in the process of reviewing a number of web documents that
> feature, in part, a fair bit of "old Latin" (circa 13th century -
> it's a cool academic project).

I took "old" Latin as referring to pre-classic Latin... Anyway, there's
no useful standardized way to distinguish between different forms of
Latin in language codes. You could use country codes, e.g. "la-GB" to
refer to Latin as used in the United Kingdom, but this would be
anachronistic for 13th century language and also useless.

> At any rate, W3C guidance states
> "Clearly identify changes in the natural language of a document's
> text and any text equivalents (e.g., captions)."

I'm afraid nobody, including the W3C, takes that seriously. It's just
too much trouble with little if any tangible benefit. It's based on
theoretical ideas - largely raw, poorly analyzed ideas - on the
_possible_ usefulness of language markup, rather than actual experience.

> *AND* the ISO code
> for Latin is either "LA" (ISO 639-1) or "LAT" (ISO 639-2) so clearly
> this *CAN* be done.

The technically correct language code for use in markup is "la", with
lowercase as the recommended spelling. HTML and XML specifications refer
to specifications that mandate the use of two-letter codes for languages
that have one.
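
To make the "two-letter code, lowercase" rule concrete, here is a small Python sketch that normalizes a code such as "LAT" or "LA" to "la". The function name normalize_lang and the two-entry THREE_TO_TWO table are assumptions for illustration only; a real tool would need the complete ISO 639 tables, and this sketch leaves any subtag after the first hyphen untouched.

```python
# Hypothetical, deliberately tiny mapping from ISO 639-2 three-letter codes
# to ISO 639-1 two-letter codes. A real tool would need the full tables.
THREE_TO_TWO = {"lat": "la", "eng": "en"}


def normalize_lang(code):
    """Lowercase the primary language code and prefer its two-letter form."""
    primary, sep, rest = code.strip().partition("-")
    primary = primary.lower()
    # Swap in the two-letter code when the table knows one.
    primary = THREE_TO_TWO.get(primary, primary)
    # Anything after the first hyphen (e.g. "GB") is passed through as-is.
    return primary + sep + rest


for raw in ("LAT", "LA", "la-GB"):
    print(raw, "->", normalize_lang(raw))
```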

> As well, wikipedia suggests that "Screen readers without Unicode
> support will read a character outside Latin-1 as a question mark,

Character support is a different issue and should not depend on language
markup, and mostly doesn't.

Generally, in special software like screen readers or specialized
browsers, we should expect character support to be more restricted than
in common modern browsers. Even Latin-1 isn't as safe as in "normal"
browsing. For example, what would a screen reader do upon encountering a
special character like "¶"? Would it recognize it as having a special
meaning (paragraph separator) and make a pause? Hardly. It probably
spells it out. This might mean saying "pilcrow sign", perhaps
independently of the language being used (since character names aren't
widely localized - most characters don't even _have_ a name in most
languages), which might be complete gibberish even to people who
understand normal English.
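
For the curious, the name a program would most likely fall back on can be looked up in the Unicode character database. The following Python sketch uses the standard unicodedata module; the variable names are mine, and whether any given screen reader actually speaks this name is an assumption, not something I have tested.

```python
import unicodedata

ch = "\u00b6"  # the paragraph sign discussed above

# The formal Unicode character name; it exists in English only.
name = unicodedata.name(ch)

# Latin-1 (ISO 8859-1) covers code points 0 through 255.
in_latin1 = ord(ch) <= 0xFF

print(name, in_latin1)  # prints: PILCROW SIGN True
```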

> The question is, is there any real advantage gained by adding this
> information (lang="lat") to the content?

Very little, if any. But if used, it should be lang="la".

> I am at a loss to explain any real value
> in doing it to the client as at the end of the day I cannot myself
> find a "real justification" that would improve the accessibility of
> the document.

The best explanation that I could use (if someone offered to pay me for
adding such markup and I needed to soup up "internal" and "moral"
motivation) is the following (and it's lame, so this tells a lot):

If a user opens your HTML page in a word processor like Microsoft Word,
it will use the language markup, and this can be relevant when spelling
checks are "on", i.e. words classified as misspelled are highlighted.
Declaring Latin words as Latin prevents the program from applying
English spelling rules to them. (The copy of Word I just tested seems to
be Latin-ignorant. That is, it recognizes the words as being in Latin but
does not flag anything as misspelled and does not even hyphenate Latin
words. But even this is probably better than treating them as English or
some other language.)

On some browsers, like Firefox, the user can right-click on a word and
get information about its language. Sometimes it is useful to know that
a word is Latin. (But what are the odds that a user knows about such

Style sheets, either page or user style sheets, could be used to style
words in a particular language as different from others, using a
selector like [lang="la"] or :lang(la). However, this does not work e.g.
on IE 6, which does not recognize such selectors.
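
The two selectors are not equivalent, and the difference can be sketched in a few lines of Python. The function names attr_equals and lang_pseudo are invented here, and the model is simplified: [lang="la"] is treated as an exact string comparison against the element's own attribute, while :lang(la) is tested against the element's effective language (possibly inherited from an ancestor) and also matches when that language begins with "la" followed by a hyphen.

```python
def attr_equals(own_lang_attr, value):
    """Model of [lang="value"]: the element's own attribute must equal value."""
    return own_lang_attr is not None and own_lang_attr == value


def lang_pseudo(effective_lang, wanted):
    """Model of :lang(wanted): equal to wanted, or wanted followed by '-'."""
    if not effective_lang:
        return False
    have = effective_lang.lower()
    want = wanted.lower()
    return have == want or have.startswith(want + "-")


# A <span> with no lang attribute of its own, inside <p lang="la">:
print(attr_equals(None, "la"), lang_pseudo("la", "la"))        # False True

# An element marked lang="la-GB":
print(attr_equals("la-GB", "la"), lang_pseudo("la-GB", "la"))  # False True
```

So a style sheet that is meant to catch all Latin text, including text that only inherits its language, would need :lang(la) rather than the attribute selector.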

Moreover, some day some browsers or other software could make real use
of the markup.

Jukka K. Korpela ("Yucca")