Re: Lang Codes

for

From: Jukka K. Korpela
Date: Jan 22, 2005 4:38PM


On Fri, 21 Jan 2005, MacDIllon wrote:

> I'm writing to ask experts about the HTML Language
> attribute for accessibility compliance
> W3C
> Priority 1 - 4.1 check point also related W3C
> Priority 3 - Check Point 4.3 Identifying the primary
> language.

First you might wish to note that the WCAG 1.0 document itself violates
the Priority 1 rule. It contains a large number of proper names that are
not in English, but the markup says they are in English. Yet the document
shows off by containing the "W3C WAI-AAA WCAG 1.0" icon, which claims
_full_ compliance with _all_ WCAG 1.0 rules.

Second, the practical impact of language markup is still very small,
and in part even harmful, though not very seriously.

> When I look deeper into the actual specification I
> find different defined language codes.

There is a large number of incompatible language code systems, but the
HTML specification defines the system to be used rather unambiguously,
though not optimally clearly. It first specifies that language codes are to
be used as defined in RFC 1766, which leaves many things open. In the references,
it says: "RFC1766 is expected to be updated by
http://www.ietf.org/internet-drafts/draft-alvestrand-lang-tags-v2-00.txt,
currently a work in progress." This is sloppy for a specification, but
apparently the intent is to say that whatever replaces RFC 1766 as an RFC
shall be used. And now that we have RFC 3066, I would say that it applies.

Thus, two-letter codes according to ISO 639-1 shall be used for languages
that have one, and three-letter codes according to ISO 639-2 for the rest.
(Note that A-Prompt, the nice accessibility checking and repair utility
that's unfortunately no longer maintained or developed, incorrectly
generates language codes like "eng" instead of "en".) And then there are
the subcodes that may be used (but are supported only to a very limited
extent).
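
As a rough illustration of that selection rule, here is a minimal Python
sketch. The two tables are tiny hand-picked excerpts of ISO 639-1 and
ISO 639-2, and the function name primary_code is made up for this example;
nothing here comes from a standard library or from the HTML specification.

```python
# Hand-picked excerpts of the ISO 639 code tables (illustration only;
# the real tables are far longer).
ISO_639_1 = {"English": "en", "Finnish": "fi"}
ISO_639_2 = {
    "English": "eng",
    "Finnish": "fin",
    "Ancient Greek": "grc",  # has no two-letter code
    "Hawaiian": "haw",       # has no two-letter code
}


def primary_code(language):
    """Prefer the two-letter code; fall back to the three-letter one."""
    if language in ISO_639_1:
        return ISO_639_1[language]
    return ISO_639_2[language]


for name in ("English", "Hawaiian"):
    print(name, "->", primary_code(name))
```

So a generator that emits "eng" for English is picking from the wrong table,
even though "eng" is a perfectly valid ISO 639-2 code as such.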

> I assume 639-2 is newer than the previous making it
> most recent and recognize these as synonyms.

The point is that 639-1 defined 2-letter codes for a relatively small set
of languages, and 639-2 defines 3-letter codes for a larger set. It would
be possible to use only the latter, but the W3C decided that in HTML a
more complicated system is used.

> Except I
> do not see the code for "en-US".

The code "en" is a two-letter code by ISO 639-1, hence a correct code to
be used for the English language in general. The subcode "US" is a country
identifier, denoting United States. You may use "en-US" to indicate the
so-called American English, and similarly "en-GB" (not "en-UK" as even
Dublin Core documents say!) for British English. In theory, you could even
use "en-FI" for English as spoken in Finland, whatever that means.
I have heard that IBM Home Page Reader selects the pronunciation style
differently for "en-US" than for "en-GB". Whether that's a good idea is
debatable. I would normally prefer a possibility of selecting such a thing
in my user preferences (i.e., use British English pronunciation for all
documents in English), but I can imagine situations where it could make
sense to set such things at the authoring side.
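
To make the structure of such codes concrete, here is a small Python sketch
that splits a code into its primary part and its country subcode. The
function names and the three-entry country table are assumptions made for
this example only; a real check would need the full ISO 3166 list.

```python
from typing import Optional, Tuple

# Hand-picked excerpt of ISO 3166 two-letter country codes (illustration only).
COUNTRY_CODES_EXCERPT = {"US", "GB", "FI"}


def split_lang_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split e.g. "en-US" into ("en", "US"); the subcode is None if absent."""
    parts = tag.split("-")
    primary = parts[0].lower()
    country = parts[1].upper() if len(parts) > 1 else None
    return primary, country


def has_known_country(tag: str) -> bool:
    """True if the tag carries a country subcode found in the excerpt table."""
    _, country = split_lang_tag(tag)
    return country in COUNTRY_CODES_EXCERPT


for tag in ("en", "en-US", "en-GB", "en-UK"):
    print(tag, split_lang_tag(tag), has_known_country(tag))
```

With this table "en-GB" passes and "en-UK" does not, which is the
distinction made above.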

Basically, I would suggest using just the primary code, like "en", without
subcodes. WCAG 1.0 doesn't really take a position on this. Someone might
argue that the language of a document and its parts should be identified
in as much detail as possible.

> a).

Correct.

> b).

Not correct.

> c).

Correct, but not advisable at present IMHO. However, as a matter of
principle, it would be more informative and _potentially_ more useful
(assuming the text is really US English). Moreover, there's a minor
potential gain: if you open an HTML document in (a sufficiently new
version of) MS Word in order to check the spelling and grammar,
"en-US" tells it to use US rules.

> I see this example used every
> where in source code but not in the ISO (curious).

The ISO 639 standards only define the primary language codes. The country
subcode "US" comes from a different standard, ISO 3166.

The widespread use of "en-US" is probably caused by authoring software
that spits it out. Needless to say, it is often quite incorrect, since the
actual language might be something different but the software didn't
bother telling its user "hey, I'm labelling your documents as US English,
try to prevent me!".

> 2. If the W3C is a model for teaching standards and
> accessibility, why do they and other "Experts" use the
> older 639-1 two character coding if the 639-2 three
> character is most recent?

In the specific issue of using two-letter language codes, they play
consistently, since it's part of the Internet language code approach to prefer
those codes. And ISO 639-2 does not supersede ISO 639-1. They are two
different standards, or two parts of a standard, just as you like.

> I'm just looking for a quick answer because it's
> confusing, not a lecture.
> responses.!

Well, you just got a short lecture which says that any quick and short
answer is wrong. :-) Sorry, I have no longer lecture to recommend, since
everything written on this seems to be either wrong or shallow, often
both. (I have written a detailed treatise on language codes, but it's only
available in Finnish, sorry.)

But as I wrote, it's really not such a big issue pragmatically. Quick
tips:
- use two-letter language codes (unless the text is in a language that
doesn't have one)
- put a lang attribute into the <html> tag to tell the main language
- for any longish text (say, longer than a few words) in another
language, use an appropriate lang attribute, adding <span> or <div>
markup if the piece of text doesn't already constitute a logical
element of its own.
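
A minimal sketch of what those tips amount to, written as a Python check
over a made-up sample page. The sample markup (an html element carrying
"en" and a span carrying "fi") and the class name LangCollector are
assumptions chosen for illustration; only the html.parser module from the
Python standard library is used.

```python
from html.parser import HTMLParser

# Hypothetical sample page: "en" as the main language on the root element,
# and a span marking one Finnish word inside an English paragraph.
SAMPLE = """<html lang="en">
<body>
<p>The Finnish word <span lang="fi">kiitos</span> means "thank you".</p>
</body>
</html>"""


class LangCollector(HTMLParser):
    """Collect (element name, lang value) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "lang":
                self.found.append((tag, value))


collector = LangCollector()
collector.feed(SAMPLE)
print(collector.found)  # [('html', 'en'), ('span', 'fi')]
```

Both codes in the sample are plain two-letter primary codes without
subcodes, in line with the first tip.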

--
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/