WebAIM - Web Accessibility In Mind

Real-time Captioning

The Real-time Dilemma

Web multimedia is increasingly used to deliver real-time, live content over the Internet - from video conferencing to VoIP (Voice over Internet Protocol) to live video streaming. Accessibility standards require that equivalent alternatives be provided for audio and visual content. For real-time web multimedia, this means that visual content must be provided in an auditory form and auditory content in a visual form. These equivalents must also be synchronized with the presentation - that is, delivered to the end user at the same time as the main content (e.g., captions for audio must display at the same time the audio would be heard).

Audio description

The alternative to visual content in standard web media often takes the form of audio descriptions, in which visual content not conveyed by the multimedia's audio track is described by a narrator. Audio descriptions are very difficult to incorporate into real-time web broadcasts. As an alternative, you can ensure that any visual content is natively described in the audio. For instance, a person speaking on the video could audibly describe any additional visual content displayed in it, removing the need for a secondary audio description track. This is the only feasible way to make live web broadcasts that include visual information accessible to individuals who are blind or have low vision. If the broadcast is produced with this in mind, and those involved are aware of the requirement and provide these descriptions, the multimedia will be accessible to these audiences.


Captions

The alternative to auditory content in standard web media is usually synchronized captions. Captions provide a textual equivalent of all audible information. The difficulties in generating real-time captions are:

  1. Audio information must be converted into text in real time.
  2. The text captions must be delivered to the end user so they are synchronized with the audio.

Generating Real-time Text

Converting audio information into text in real time is difficult; few typists can type fast enough to keep pace with the spoken word. Two primary technologies are used to accomplish this.

Stenography/Real-time transcription

Stenography involves having a trained transcriptionist (often called a stenographer or court reporter) who uses a special typewriter-like device called a steno machine to transcribe the spoken word into text in real time. The steno machine has fewer keys (usually 22) than a typical keyboard. Rather than typing each letter, the stenographer presses key combinations that represent phonetic parts of words or phrases, or special codes representing whole words. Software then analyzes the phonetic information and forms words. This technology allows a trained transcriptionist to generate a text version of audible conversation in real time.

Stenography allows audible information to be converted to text in real time (or within a second or so of it being spoken). While accuracy levels are high, it is common for words to be mistyped or misinterpreted by the steno software. Real-time transcription can also be expensive, usually costing around $70-$120 USD per hour.

Voice recognition

While voice recognition offers great possibilities for real-time generation of captions, the technology is not yet reliable enough for general use. In certain settings, such as when a single speaker is using well-trained voice recognition software, it may be a viable option. Even in such settings, however, there are weaknesses, such as a lack of punctuation, poor accuracy, and the inability to caption other speakers.

While voice recognition technology is improving and promises highly accurate, speaker-independent, multi-user recognition in the future, its current feasibility for generating caption text is limited to a few situations.

Delivery of Real-time Captions

As soon as the text equivalent of the audio has been generated, that text must be delivered to the end user so it is synchronized with the audio stream. Unfortunately, few real-time multimedia technologies have native support for captioning. Thus, the real-time captions must usually be delivered through a different technology running parallel to the multimedia software or hardware. This is often done through dedicated applications or through clients that are built into a web page and run in a web browser.
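One common pattern for such a parallel client is a small script in the page that receives caption lines from a separate feed (for example, a WebSocket or server-sent events connection) and shows the most recent lines, much like a television caption window. The sketch below is a minimal, hypothetical illustration of the display side only; the class and method names are assumptions, not any particular product's API, and the transport is deliberately left out.

```typescript
// Hypothetical rolling caption display buffer for a web-page caption client.
// A real client would feed lines in from a parallel transport (WebSocket,
// server-sent events, etc.); here we model only the display logic.
class CaptionDisplay {
  private lines: string[] = [];

  constructor(private maxLines: number = 3) {}

  // Called whenever a new caption line arrives from the caption feed.
  push(line: string): void {
    this.lines.push(line);
    // Keep only the most recent lines, as a TV-style caption window would.
    if (this.lines.length > this.maxLines) {
      this.lines.shift();
    }
  }

  // The text the page script would render into a live region for the user.
  render(): string {
    return this.lines.join("\n");
  }
}

// Usage: as lines arrive, older ones scroll out of the two-line window.
const display = new CaptionDisplay(2);
display.push("Welcome to the broadcast.");
display.push("Our first speaker is ready.");
display.push("Please hold your questions.");
console.log(display.render());
```

Because the buffer is independent of the transport, the same display logic can sit alongside whatever multimedia player the page uses.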

For video conferencing and voice chats, where the audio is delivered in real time, the captions must be generated, converted into a format for broadcast across the Internet, and then delivered to the end user - all in real time. For streaming video, there is often a delay between when the media is captured and when it displays to the end user, usually due to encoding and buffering. In these cases, the delivery mechanism for the real-time captions must ensure that the captions display at roughly the same time the audio would be heard, even if the delay between caption generation and delivery is substantial.
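The timing requirement above reduces to simple arithmetic: if the media pipeline introduces a known delay, each caption should be held back by the same amount before it is shown. A hypothetical sketch of that scheduling, assuming the caption feed timestamps each line when it is generated (the names and the 15-second delay are illustrative assumptions):

```typescript
// Hypothetical caption scheduling: hold each caption back by the stream's
// encoding/buffering latency so it appears when the audio is actually heard.
interface TimedCaption {
  text: string;
  generatedAtMs: number; // when the transcriptionist produced the line
}

// Returns the clock time at which the caption should be displayed,
// given the measured delay between capture and playback.
function displayTimeMs(caption: TimedCaption, streamDelayMs: number): number {
  return caption.generatedAtMs + streamDelayMs;
}

// Example: a line generated at t = 1000 ms in a stream with 15 s of
// encoding/buffering delay should display at t = 16000 ms.
const line: TimedCaption = { text: "Hello, everyone.", generatedAtMs: 1000 };
console.log(displayTimeMs(line, 15000)); // 16000
```

In practice the stream delay would be measured or configured per broadcast, and the caption client would queue each line until its display time arrives.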


While captioning real-time web multimedia is not always easy, it is possible and should always be done when real-time multimedia is being delivered. Fortunately, the technologies are improving to a level that allows real-time captioning to be both easy and financially viable in most situations.

The technologies used to provide real-time captions over the web are not limited to captioning web-based multimedia. The same caption delivery systems can also provide captions for non-web technologies such as radio, television, and video conferencing, extending accessibility to all forms of live, real-time multimedia.