Document and Content Language
The Importance of Identifying Language
Screen readers can "speak" various languages, as long as the content language is identified. If the screen reader does not support or cannot speak the defined language, the user might be informed of the content language, even if that content cannot be properly read.
Defining the document language also supports automated translation of content using tools like Google Translate.
For Level A conformance with the Web Content Accessibility Guidelines (WCAG), the document language must be programmatically defined (Success Criterion 3.1.1, Language of Page). For Level AA conformance, any part of a page that is in a different language from the rest of the page must also be identified (Success Criterion 3.1.2, Language of Parts). This tells the screen reader to switch to that language (if it is able).
Specifying the "language of parts" of the page is only necessary for other-language content that is not generally understood in the document's primary language. "Los Angeles" and "piñata", for example, are Spanish words that are understood by English readers, so it would not be necessary to identify these as being Spanish on an English web page.
Defining the content language also allows the browser to display the correct quotation marks for each language when the <q> element is used. The following examples are defined as German and French, and the browser has generated the localized quotation marks appropriate to each language.
„Mein Computer spricht Deutsch.“
«Mon ordinateur parle français.»
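As a minimal sketch, markup along the following lines could produce the two examples above. The surrounding <p> elements are an assumption made for illustration; only the <q> element and the language values come from the description. Note that no quotation marks are typed in the markup, because the browser generates them from the language.

```html
<!-- German: the browser supplies German-style quotation marks around the <q> content -->
<p lang="de"><q>Mein Computer spricht Deutsch.</q></p>

<!-- French: the same element, rendered with French-style quotation marks -->
<p lang="fr"><q>Mon ordinateur parle français.</q></p>
```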
Additionally, if the language is specified, the browser can present:
The appropriate characters for non-Latin text
Localized date and time inputs (such as using MM/DD/YYYY vs. DD/MM/YYYY or 24-hour time vs. AM/PM time)
Numbers with appropriate comma or period thousands separators
Proper-language spellchecking for inputs
The lang attribute
The lang attribute is used to identify the language of the web page. This attribute must always be added to the <html> tag. It is given a value that identifies the natural language of the page. Adding <html lang="en">, for example, would specify that the page is in English.
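A minimal sketch of a page that declares its language might look like this. The title and paragraph text are placeholders chosen for illustration.

```html
<!DOCTYPE html>
<!-- lang="en" on the root element identifies the whole page as English -->
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Example page</title>
</head>
<body>
  <p>This page is identified as English.</p>
</body>
</html>
```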
Similarly, the lang attribute can be added to other HTML elements within a page to indicate their natural language. <p lang="ja">, for example, would indicate Japanese as the language for the paragraph.
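For instance, a Japanese paragraph on an otherwise English page could be marked up roughly as follows. The sample sentences are illustrative and assume the page itself is declared with <html lang="en">.

```html
<!-- Inherits the page language (assumed to be English) -->
<p>The Japanese phrase below means "Good morning."</p>

<!-- Identified as Japanese so a screen reader can switch voices if it is able -->
<p lang="ja">おはようございます。</p>
```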
Do not use the lang attribute to specify the language of content that is being linked or navigated to. If a link on an English web page to a Spanish translation presents text of "Spanish", the lang attribute is not used because "Spanish" is an English word. If the link instead presents text of "Español", then lang="es" should be defined on the link.
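The two link cases described above can be sketched as follows. The /es/ address is a hypothetical location for the Spanish translation.

```html
<!-- Link text is the English word "Spanish": no lang attribute is needed -->
<a href="/es/">Spanish</a>

<!-- Link text is the Spanish word "Español": identify the link text as Spanish -->
<a href="/es/" lang="es">Español</a>
```

In both cases the lang attribute describes the link text itself, not the page at the other end of the link.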
The Impact of Incorrect lang Values
When text in one language is read with the pronunciation rules of another, the results can make the content inaccessible. Below is a passage of text in English. The audio recording is a screen reader pronouncing this text as if it had lang="cs" (Czech).
Most people today can hardly conceive of life without the Internet. Some have argued that no other single invention has been more revolutionary since Gutenberg’s printing press in the 1400s. Now, at the click of a mouse, the world can be “at your fingertips”—that is, if you can use a mouse... and see the screen... and hear the audio—in other words, if you don't have a disability of any kind.
Identifying the document language is also important for Acrobat PDF files. The document language can be specified in Acrobat Professional or other PDF editing software.
The primary language is the major language of the web content. Two-character codes are available for nearly all primary languages. lang="en" is English, lang="de" is German, lang="zh" is Chinese, and lang="ar" is Arabic, for example.
Keep the lang attribute value as short as is appropriate. If the two-character primary language code is sufficient to identify the content language, use it.
Many primary languages have different sub-languages or dialects. English has different variants for Great Britain, Australia, and India, for example. Chinese has Mandarin and Cantonese and numerous other dialects, some of which are not mutually intelligible. Despite this, specifying the primary language in a web page is typically sufficient.
In short, the language for the vast majority of web page content can be properly identified with the appropriate two-character primary language code. In some very rare cases a three-character code might be used for a very uncommon primary language, but only if a two-character code does not exist. Even though some ISO standards may define three-character codes for some primary languages (such as "spa" for Spanish), these are not found in the IANA registry and support for these in screen readers is very poor.
Sub-languages or extended languages are available for seven primary languages: Arabic (ar), Chinese (zh), Malay (ms), Swahili (sw), Uzbek (uz), Konkani (kok), and sign languages (sgn). Cantonese and Mandarin, for example, are extended languages of Chinese. It's possible to specify these extended languages using extensions to the primary language, such as lang="zh-yue" for Cantonese vs. lang="zh-cmn" for Mandarin, or via three-character language identifiers such as lang="yue" for Cantonese and lang="cmn" for Mandarin.
Support for extended languages is very poor in screen readers. It is strongly recommended to use either the primary language alone (see above) or, if necessary, an appropriate region subtag (see below).
Content can sometimes be presented using a script or characters that are different from the primary script for a language. For example, 汉语 are the simplified Chinese characters for the word "Chinese". Written in Latin characters this is "Hànyǔ". The word "Hànyǔ" could be identified as Chinese written in Latin script with lang="zh-Latn". A script subtag is always four characters long and is added after the primary language and a hyphen.
However, screen reader support for script subtags is poor, and they are very rarely needed. The script identifier is often ignored, so the screen reader applies the primary language alone, which typically fails entirely. <p lang="zh-Latn">Hànyǔ</p> on an English page, for example, would likely be treated as Chinese because of the "zh" primary language. The screen reader would then expect Chinese characters, not Latin characters, so the content would likely be unreadable. Omitting the lang attribute altogether would cause the Latin characters to be read in the page's defined language, English in this case.
In short, script subtags should typically be avoided.
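The contrast described above can be sketched as follows, assuming an English page. How a given screen reader handles the first paragraph will vary, so treat the comments as the likely outcome rather than a guarantee.

```html
<!-- Risky: "Latn" is often ignored, so a Chinese voice may be applied to Latin letters -->
<p lang="zh-Latn">Hànyǔ</p>

<!-- Usually safer: no lang attribute, so the word is read in the page's language (English) -->
<p>Hànyǔ</p>
```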
If it becomes necessary to differentiate content in various dialects or sub-languages—such as a page that is highlighting differences between Spanish in Spain and Spanish in Mexico—or if the written content aligns with a dialect that has distinct regional differences, then a region subtag can be used. For example, lang="es-ES" identifies Peninsular Spanish as typically spoken or read in Spain as opposed to lang="es-MX" which identifies Spanish as spoken or read in Mexico.
If a screen reader supports the regional differences—such as having language voices installed for both Peninsular and Mexican Spanish—then the screen reader may switch to the appropriate dialect.
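As a rough sketch, a page comparing the two dialects might mark up its examples like this. The sample sentences are illustrative and are not taken from a real page. They rely on "ordenador" being the usual word for "computer" in Spain and "computadora" in Mexico.

```html
<!-- Peninsular Spanish, as spoken or read in Spain -->
<p lang="es-ES">Mi ordenador habla español.</p>

<!-- Mexican Spanish -->
<p lang="es-MX">Mi computadora habla español.</p>
```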
However, region subtags are typically ignored, especially if the screen reader's default language matches the primary language specified. This is because it is presumed the user will prefer and better understand their default dialect over a different dialect of that same language. The numerous screen reader users in Great Britain, for example, will typically hear Great Britain English on U.S. web sites, even if the page has lang="en-US" specified and the user has the US English language voice installed.
Only use region subtags when it is necessary to differentiate content in different dialects that may not be mutually intelligible. A web site that provides content in both Mandarin and Cantonese (one of which may not be understood by speakers of the other) would typically differentiate them using lang="zh-CN" and lang="zh-HK" respectively. Region subtags are much more reliably supported than extended language codes. Because the various dialects of English are mutually intelligible, and because the screen reader user will define their preferred dialect, using lang="en" is typically sufficient even for a site that provides U.K., India, Australian, U.S., or other English-language versions.
There are other variations for the lang attribute value that are permissible, but the rules above apply to the vast majority of web page content.
Keep the language identifier as short as possible. In most cases a two-character language identifier is sufficient and optimal.
Three-character language identifiers, extended language subtags, and script subtags should typically be avoided (or, at a minimum, well tested in screen readers).
Region subtags can be used in some cases where it is vital to differentiate dialects.
Screen Reader Support
Two-character lang attribute values are usually adequate for screen reader support. Support for three-character codes and for extended language, script, and region subtags varies based on the browser and screen reader in use, and on the language voices that are supported and installed. When in doubt, test. Support for inline language changes, such as on a <span> or <img> element, also varies. When possible, it is best to define the lang attribute on a block-level element, such as a <p>, <blockquote>, or similar.
To read the content in the defined language, the screen reader must support that language. All modern screen readers have support for numerous languages. In some screen readers, the user must manually install or configure language voices or "language packs".
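A minimal sketch of the two placements follows. The sample text is illustrative and assumes an English page.

```html
<!-- Preferred where possible: lang on a block-level element -->
<blockquote lang="fr">
  <p>Mon ordinateur parle français.</p>
</blockquote>

<!-- Inline language change: valid markup, but screen reader support varies, so test it -->
<p>The French word for "computer" is <span lang="fr">ordinateur</span>.</p>
```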
If a screen reader encounters a lang attribute which specifies a language for which a matching language voice is not installed or supported, it will usually identify the language of the content. The screen reader might pronounce "Spanish", for example, for content with lang="es" if a Spanish language voice is not enabled or installed.
Screen readers will typically attempt to read content that is pronounceable, even if the defined language is not supported. Polish content, for example, is written in Latin characters, so it will be read by a screen reader with an English default voice (though without proper pronunciation, inflections, etc., perhaps sounding like a beginner Polish class). Chinese characters, on the other hand, are not directly pronounceable in English, so the screen reader would not read them, though it may announce "Chinese" to inform the user that Chinese language content is present.