E-mail List Archives

Language markup

From: Jukka Korpela
Date: Aug 21, 2002 10:36PM


Marek Prokop wrote:

> there is a
> problem, what I did not find out a solution of so far -- proper
> mixing of languages in HTML elements and their 'title'
> attributes, so the result would be still accesible for people
> with screenreaders.

First, I think it needs to be noted that support for language markup is still
rather limited in user agents, so this question is not (yet) something we
need to despair over, even if we can't find good solutions. Besides,
the principles of language markup are still rather vague in the
specifications; the idea looks simpler than it is, and a rule like "Clearly
identify changes in the natural language of a document's text and any text
equivalents (e.g., captions)" is just a beginning for specifying how it
should be done. Note that for e.g. the ALT attribute, it is _impossible_ to
indicate language changes within it, since attribute values are by
definition plain text.
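
As a small illustration of the ALT limitation (the file name and texts are
invented for the example), consider an English page with an image whose text
equivalent contains a French phrase. The lang attribute on the element covers
the whole alt value, so there is no place to say that only "coq au vin" is
French:

```html
<!-- Hypothetical image; lang="en" applies to the entire alt value. -->
<!-- No markup can go inside alt, so "coq au vin" cannot be tagged lang="fr". -->
<img src="menu.png" lang="en" alt="Today's special: coq au vin">
```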

My rule of thumb is: specify the lang attribute for the document as a whole
(in <html lang="...">) and for any part in a different language if it is
longer than a few words, like a block quotation or a book title; for new
documents, use lang markup at the level of words too, in clear-cut cases,
and don't worry too much about the rest. It would be just too much work to
edit all the existing documents to mark up all names etc.
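
A minimal sketch of that rule of thumb, assuming an English document that
quotes a French passage and mentions a French book title (the page content is
made up for illustration):

```html
<!DOCTYPE html>
<!-- The document as a whole is declared English. -->
<html lang="en">
<head><title>Language markup sketch</title></head>
<body>
<p>Descartes put it this way:</p>
<!-- A block quotation in another language gets its own lang. -->
<blockquote lang="fr">
<p>Je pense, donc je suis.</p>
</blockquote>
<!-- A clear-cut case at the level of a few words: a book title. -->
<p>Hugo wrote <cite lang="fr">Les Misérables</cite>.</p>
</body>
</html>
```

Names such as "Descartes" and "Hugo" are left unmarked here, in line with not
worrying too much about the rest.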

One reason behind the rule of thumb is that people can probably get along
with situations where they hear a foreign name or word pronounced wrong
(effectively, read according to the rules of the language of the surrounding
text), but listening to a paragraph in English read according to Finnish or
Czech rules would be rather inconvenient. (It's funny for the first few
times.)

> To make
> them clear I'm writing them only in English, marking the primary
> language of the document with lower case, while foreign language
> (English) with upper case:
>
> <abbr lang="en" title="CASCADING STYLE SHEETS (cascading style
> sheets)">CSS</abbr>

(I guess you mean that instead of (cascading style sheets) you have that
expression in Czech.)

Is the case difference a good idea here? It's kind of pseudo-markup, in a
context (attribute value) where you cannot use real markup. Maybe it's
useful in some situations, but it could be confusing, since lower case
versus upper case has so many other uses as well.

Anyway, by HTML specifications, the lang attribute specifies the language of
the content and all attributes. That is, all of the title value, and the
string "CSS". This means that a speech synthesizer that recognizes lang
markup would read "CSS" using the names of the letters in English (unless it
expands all abbreviations with a title using the title, which might be some
people's idea of how title should be used - but it would have rather comic
effects in general, when abbreviations abound). I don't know about the Czech
rules, but in Finnish, abbreviations that are initialisms are almost always
pronounced using the Finnish names of the letters, irrespective of their origin.
This implies that we could in principle indicate the few exceptions by using
markup like <abbr lang="en">BBC</abbr>.
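
To spell out why lang on abbr is only safe when everything it covers is in
one language, here is a sketch that assumes the enclosing document is marked
as Finnish; the title text is my addition for the example:

```html
<!-- Assumed context: an ancestor element carries lang="fi". -->
<!-- lang="en" covers both the content "BBC" and the title value. -->
<!-- Both are English here, so the markup gives no wrong information. -->
<abbr lang="en" title="British Broadcasting Corporation">BBC</abbr>
```

If the title held an expansion in the document's own language instead, the
same lang="en" would mislabel it.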

> <a href=".." hreflang="en" title="THE ORIGINAL NAME OF THE
> ARTICLE">interesting article about something</a>

In this case you could, in principle, indicate the language of the title
attribute as different from the content of the element, but that would
require extra markup:
<a href=".." hreflang="en" lang="en" title="THE ORIGINAL NAME OF THE
ARTICLE"><span lang="cs">interesting article about something</span></a>

In practice, I wouldn't bother. In this, as well as in the previous example,
I would just omit the lang attribute. It's better to say nothing about
language than to say something that gives wrong information, and it just
isn't useful enough to add extra markup like that.

> Another issue is that Czech uses inflection of nouns. For
> instance, if I want to say "the article by John Smith", I have
> to say "clanek Johna Smithe" in Czech. Notice those 'a' and 'e'
> at the end of the names. Should I code it like:
>
> clanek <span lang="en">John</span>a <span
> lang="en">Smith</span>e

I would say in principle yes. Here, too, a practical option is to omit
language markup when it might give wrong information or would be too complex
or would have practical drawbacks. "When in doubt, leave it out", that's my
motto here.

My experience with IBM Home Page Reader is that it can reasonably well
switch from one language to another in general (between the few languages it
supports) but not that well inside a word - there's a noticeable pause, and
it effectively pronounces the parts separately. This results in rather odd
pronunciations especially when the suffix (in a language other than the base
word) is something like "n" or "t". In principle, this is a technical
difficulty that _could_ be solved in speech synthesis technology.

> It doesn't look well and moreover it's even not possible in some
> cases, because inflection could sometimes add some letters
> inside the word.

Indeed, and inflection could affect the base word in other ways too. For
example, for Vergil, the poet, we use the Latin name "Vergilius" in Finnish,
but as adapted to the Finnish system of inflection, so that e.g. the genitive
is "Vergiliuksen", where "Vergiliukse-" is the inflected base and "-n" is
the genitive suffix. Now how should I mark _that_ up? (A practical answer is
to regard that name as Finnish and not Latin any more, just as "Vergil" would
better be regarded as English and not Latin. But this would not apply that
well to all names.)

> Hope it's clear enough.

Your questions were certainly clear enough. My answers probably weren't, but
neither is this problem area.

--
Jukka Korpela, senior adviser
TIEKE Finnish Information Society Development Centre
http://www.tieke.fi
Phone: +358 9 4763 0397 Fax: +358 9 4763 0399

