E-mail List Archives

RE: Language markup


From: Jukka Korpela
Date: Aug 22, 2002 6:00AM

Marek Prokop wrote:

> screenreaders try to pronounce text according to the lang
> attribute, don't they?

As far as I know, few of them do. Lack of such support seems to have been a
fundamental reason for not including the guideline on language markup in the
US "Section 508" rules. (Another reason might be that it would be very
difficult - virtually impossible in general - to check automatically whether
the guideline has been obeyed.) The document
explains that decision:
"The Trace Center advised that only two assistive technology programs could
interpret such coding or markup language, Homepage Reader from IBM and
PwWebspeak from Isound. These programs contain the browser, screen reading
functions, and the speech synthesizer in a single highly integrated program.
However, the majority of persons who are blind use a mainstream browser such
as Internet Explorer or Netscape Navigator in conjunction with a screen
reader. There are also several speech synthesizers in use today, but the
majority of those used in the United States do not have the capability of
switching to the processing of foreign language phonemes. As a result, the
proposed provision that web pages alert a user when there is a change in the
natural language of a page has been deleted in the final rule."

The situation might have improved, but not very much, I'm afraid.

This doesn't mean we shouldn't use language markup. But we shouldn't expect
too much either. A speech synthesizer might be able to speak several
languages, but typically only when the user explicitly selects the language.
The lang markup is intended to improve the situation by letting software
select it automatically, but it's just one part of the game.

> And what about search engines -- do they
> recognize keywords in different language?

I'm afraid not. Some of them give information about the language of a page
and have the option of searching for pages in a particular language, but as
far as I know, they _guess_ the language. That is, they use "heuristic"
methods based on the textual content, ignoring the markup. Such methods work
most of the time for sufficiently long documents, if the intention is just
to recognize the dominant language. Even very simple heuristics, like
counting character frequencies or the frequencies of short words, usually
give the right answer when deciding between just a few dozen languages.
But that's very coarse: it treats a document as a whole and says nothing
about passages in other languages inside it.
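
To make the short-word heuristic concrete, here is a minimal sketch in
Python. It is only an illustration, not how any particular search engine
works: the word lists are tiny and hand-picked, and the names COMMON_WORDS
and guess_language are made up for this example.

```python
import re

# Hypothetical, hand-picked lists of frequent short words. A real detector
# would use much longer lists and many more languages.
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "fi": {"ja", "on", "ei", "että", "se", "oli", "mutta", "kun"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
}


def guess_language(text):
    """Return the code of the language whose common words occur most often.

    Returns None when no listed word occurs at all. Ties go to the
    language listed first in COMMON_WORDS.
    """
    # Split into runs of letters (Unicode-aware), ignoring digits and "_".
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for word in words if word in common)
        for lang, common in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Note that the function returns a single answer for the whole text, which is
exactly the coarseness described above: a Finnish page with one English
paragraph would simply come out as Finnish.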

[ about using <span> inside <a> element to assign different language to
the content and an attribute: ]
> Why do you think it isn't useful enough? Don't screenreaders
> read the title attributes of the A tags?

Basically because it's a lot of work when carried out for all cases that
might apply, and the lang attribute as a whole is poorly supported. Besides,
I think we _should_ have more advanced methods for specifying the language
of different texts, and maybe we will. Maybe software vendors will provide
general support for lang attributes only after the idea has been described
well enough and widely adopted and applied by authors.
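
For illustration, here is a rough Python sketch of what software would have
to do to make use of the <span>-inside-<a> construct discussed above: track
the lang value inherited from enclosing elements, apply an element's own
lang to its title attribute, and let an inner <span> override the language
of the link text. The names LangTracker and lang_segments are invented for
this example, and the sketch assumes well-formed markup with properly
nested tags; it does not describe any existing screen reader.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class LangTracker(HTMLParser):
    """Collect (language, text) pairs for text content and title attributes."""

    def __init__(self, default_lang=None):
        super().__init__()
        self.stack = [default_lang]   # language in effect at each nesting level
        self.segments = []            # collected (lang, text) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An element's own lang wins; otherwise it inherits from its parent.
        lang = attrs.get("lang") or self.stack[-1]
        if attrs.get("title"):
            # The lang attribute also applies to the element's attributes.
            self.segments.append((lang, attrs["title"]))
        if tag not in VOID_ELEMENTS:
            self.stack.append(lang)

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append((self.stack[-1], text))


def lang_segments(html, default_lang=None):
    """Parse a markup fragment and return its (lang, text) pairs in order."""
    parser = LangTracker(default_lang)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.segments
```

Given a link such as <a title="Suomen kieli" lang="fi"><span lang="en">Finnish</span></a>,
the sketch reports the title as Finnish and the link text as English. Doing
this markup by hand for every such link is the work referred to above.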

Jukka Korpela, senior adviser
TIEKE Finnish Information Society Development Centre
Phone: +358 9 4763 0397 Fax: +358 9 4763 0399

To subscribe, unsubscribe, or view list archives,
visit http://www.webaim.org/discussion/