WebAIM - Web Accessibility In Mind

Real-time Captioning

The Real-time Dilemma

The web delivers real-time content in many forms, from video conferencing, to VoIP, to live video streaming. Accessibility standards require that equivalent, synchronized alternatives be provided for real-time audio and visual content.

Audio description

In standard web multimedia, the alternative to visual content usually takes the form of audio description, in which important visual information is narrated, often in a separate audio track. Because a separate descriptive audio track is very difficult to incorporate into real-time web broadcasts, ensure that any important visual content is natively described in the audio. For example, if a person is speaking on the video, they should audibly describe any additional visual content displayed on screen.

Captions

The alternative to auditory content in standard web media is usually synchronized captions. Captions provide a textual equivalent of all audible information. The difficulties in generating real-time captions are:

  1. Audio information must be converted into text in real time.
  2. The text captions delivered to the end user must be synchronized with the audio.

Generating Real-time Text

Because few touch-typists can keep pace with natural human speech, transcribers primarily use two technologies to generate text in real time.

Stenography

Stenography involves a skilled transcriber operating a stenotype keyboard essentially in real time. Because stenotype writing relies on phonemes and phonetic word forms, the device has fewer keys than a standard computer keyboard, and skilled operators can achieve over 200 words per minute. Software converts the phonemes or other entered codes to properly spelled words and can also be used to make quick corrections. Stenography can be relatively expensive due to the expertise required of the transcriber.

Accuracy levels for stenography are very high, but proofreading is still recommended. The corrected transcript can also be used for captions for any archives of the live broadcast.

Voice recognition

Two forms of voice recognition can be used to generate real-time captions. Both involve computer systems and artificial intelligence converting speech to text, with varying levels of quality. Although accessibility guidelines do not define a quality level for real-time captions, they must be "equivalent," which implies a high level of accuracy.

Well-trained voice recognition or shadow speaking

A "shadow speaker" listens to a live broadcast and repeats all spoken content into a microphone. Specialized voice recognition software, tuned to their voice, interprets the speech as text and facilitates text correction and customization. A trained speaker can also add other important information to the live captions, such as punctuation, identification of who is speaking, or brief descriptions of other visual or audio-only content. Shadow speaking can provide a high level of accuracy, though comes at a cost due to the expertise required.

Automated voice recognition

YouTube, PowerPoint, and other media and streaming technologies can automatically convert speech to text in real time using voice recognition technology. Because of variability in audio quality, differences among speaker voices (especially when there are multiple speakers, and more so when they talk over one another), background noise, and other factors, raw automated captions tend to contain inaccuracies. These technologies may also omit punctuation.

In certain controlled settings, such as when a single person is speaking with very clear, well-enunciated audio, purely automated captioning may sometimes be sufficient. Automated captions can also be useful in generating a transcript and captions for archival media, after a human review and correction process.
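As an illustration of this review-then-publish workflow, the following sketch generates a draft WebVTT caption file from a recording of a broadcast. It assumes the open-source Whisper speech recognition library (openai-whisper); the article does not prescribe a particular engine, and the file name is hypothetical. The output is only a starting point that a human editor would still review and correct.

# A minimal sketch: produce draft WebVTT captions from an archived recording.
# Assumes the open-source Whisper library (openai-whisper); any speech-to-text
# engine that returns timed segments would work similarly. The result is a
# draft only and still needs human review and correction.
import whisper


def to_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"


def draft_webvtt(audio_path: str, model_size: str = "base") -> str:
    """Transcribe an audio file and return draft captions in WebVTT format."""
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)

    lines = ["WEBVTT", ""]
    for segment in result["segments"]:
        lines.append(f"{to_timestamp(segment['start'])} --> {to_timestamp(segment['end'])}")
        lines.append(segment["text"].strip())
        lines.append("")  # blank line separates cues
    return "\n".join(lines)


if __name__ == "__main__":
    # "webinar-recording.mp3" is a hypothetical file name.
    print(draft_webvtt("webinar-recording.mp3"))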

Delivery of Real-time Captions

The generated text must be synchronized with the audio stream as closely as possible. Many real-time multimedia technologies, such as YouTube and Zoom, support captioning directly in the interface. When captioning is not directly supported, it may be delivered through dedicated applications or browser-based apps running in parallel with the multimedia software or hardware.
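As a rough sketch of one such parallel, browser-based delivery channel, the code below pushes caption lines to connected viewer pages over WebSockets. It assumes the third-party Python websockets package; the host, port, and the source feeding caption text are hypothetical, and a real deployment would also need to handle authentication and reconnection.

# A minimal sketch of a parallel, browser-based caption channel. Assumes the
# third-party "websockets" package; the host, port, and the source feeding
# caption lines are hypothetical.
import asyncio

import websockets

viewers = set()  # currently connected caption viewer pages


async def viewer_handler(websocket):
    """Track a connected viewer until it disconnects.

    Note: older releases of the websockets package also pass a `path`
    argument to this handler.
    """
    viewers.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        viewers.discard(websocket)


async def push_captions(caption_queue: asyncio.Queue):
    """Send each caption line to every connected viewer as soon as it arrives."""
    while True:
        line = await caption_queue.get()
        for viewer in set(viewers):
            try:
                await viewer.send(line)
            except websockets.ConnectionClosed:
                viewers.discard(viewer)


async def main():
    caption_queue = asyncio.Queue()
    # In a real system, the transcriber's or speech recognition engine's output
    # would be placed on caption_queue; here we enqueue a single example line.
    await caption_queue.put("Speaker 1: Welcome, everyone, to today's broadcast.")

    # Viewer pages connect to ws://localhost:8765 (hypothetical host and port).
    async with websockets.serve(viewer_handler, "localhost", 8765):
        await push_captions(caption_queue)


if __name__ == "__main__":
    asyncio.run(main())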

Conclusion

Although real-time captioning is not always easy, it is possible—and vital for accessibility. Fortunately, technologies are improving to make real-time captioning easier and cheaper in most situations. Beyond the web, these technologies can also be applied to radio, television, video conferencing, etc., facilitating accessibility across all forms of live media.