The Real-time Dilemma
The web delivers real-time content in many forms, from video conferencing, to VoIP, to live video streaming. Accessibility standards require that equivalent, synchronized alternatives be provided for real-time audio and visual content.
The alternative to visual content in standard web multimedia often takes the form of audio descriptions, in which the visual content is narrated, often on a separate audio track. Because a separate descriptive audio track is very difficult to incorporate into real-time web broadcasts, ensure that any important visual content is natively described in the audio. For example, if there is a person speaking on the video, they should audibly describe any additional visual content that is displayed on screen.
Generating Real-time Text
Because few touch-typists can keep pace with natural human speech, transcribers primarily use two technologies to generate text in real time.
Stenography involves a skilled transcriber operating a stenotype keyboard essentially in real time. Since stenotype relies on phonemes and phonetic forms of words, the device has fewer keys than a standard computer keyboard, and skilled operators can achieve over 200 words per minute. Software converts the phonemes or other entered codes to properly spelled words, and the transcriber can also use it to make corrections quickly. Stenography can be relatively expensive due to the expertise required of the transcriber.
Accuracy levels for stenography are very high, but proofreading is still recommended. The corrected transcript can also be used for captions for any archives of the live broadcast.
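The code-to-word conversion step described above can be pictured as a dictionary lookup. The following is only a minimal sketch, assuming a greedy longest-match strategy. The stroke codes, the `STROKE_DICT` table, and the `translate` function are hypothetical names invented for illustration and are not taken from any real stenography software or dictionary.

```python
# Hypothetical stroke dictionary: each key is a tuple of one or more
# steno-style codes, each value is the spelled-out word.
# The entries are illustrative only, not a real steno dictionary.
STROKE_DICT = {
    ("-T",): "the",
    ("KAT",): "cat",
    ("KAP", "SHUN"): "caption",  # a word entered as two strokes
}


def translate(strokes, dictionary=STROKE_DICT):
    """Convert a sequence of stroke codes to text using greedy longest match."""
    max_len = max(len(key) for key in dictionary)
    words = []
    i = 0
    while i < len(strokes):
        # Try the longest possible multi-stroke entry first.
        for size in range(min(max_len, len(strokes) - i), 0, -1):
            key = tuple(strokes[i:i + size])
            if key in dictionary:
                words.append(dictionary[key])
                i += size
                break
        else:
            # Unknown stroke: keep it visible so a person can correct it.
            words.append(f"[{strokes[i]}]")
            i += 1
    return " ".join(words)


print(translate(["-T", "KAT", "KAP", "SHUN", "XYZ"]))
```

Leaving unknown strokes marked in brackets reflects the proofreading step: anything the lookup cannot resolve stays visible for later correction.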
Two forms of voice recognition can be used to generate real-time captions. Both involve computer systems and artificial intelligence converting speech to text, with varying levels of quality. Although accessibility guidelines do not define a quality level for real-time captions, they must be "equivalent," which implies a high level of accuracy.
Well-trained voice recognition or shadow speaking
A "shadow speaker" listens to a live broadcast and repeats all spoken content into a microphone. Specialized voice recognition software, tuned to the speaker's voice, interprets the speech as text and facilitates text correction and customization. A trained speaker can also add other important information to the live captions, such as punctuation, identification of who is speaking, or brief descriptions of other visual or audio-only content. Shadow speaking can provide a high level of accuracy, though it comes at a cost due to the expertise required.
Automated voice recognition
YouTube, PowerPoint, and other media and streaming technologies can automatically convert speech to text in real time using voice recognition technology. Because of variability in audio quality, differing speaker voices (especially if there are multiple speakers, and more so if they speak at the same time), levels of background noise, etc., raw automated captions tend to have inaccuracies. These technologies may also omit punctuation.
In certain controlled settings, such as when one person is speaking with very clear and enunciated audio, purely automated captioning may sometimes be sufficient. Automated captions can also be useful in generating a transcript and captions for archival media, but only after a human review and correction process.
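One common way to put a number on caption inaccuracy is word error rate (WER): the count of word substitutions, deletions, and insertions needed to turn the captions into a human-corrected reference, divided by the number of words in the reference. This metric is not named in the guidance above. The sketch below is one hedged way to compare raw automated captions with a reviewed transcript, and the function name `word_error_rate` is an assumption for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Return word-level edit distance divided by the reference word count.

    Words are compared case-insensitively after splitting on whitespace.
    Punctuation is NOT stripped, so "fox." and "fox" count as different words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra caption word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


corrected = "captions must be accurate"
automated = "captions be accurate"
print(word_error_rate(corrected, automated))
```

A lower value means the automated captions are closer to the corrected transcript. What value counts as "equivalent" is a judgment the guidelines leave undefined.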
Delivery of Real-time Captions
The generated text must be synchronized with the audio stream as closely as possible. Many real-time multimedia technologies, such as YouTube and Zoom, support captioning directly in the interface. When captioning is not directly supported, it may be delivered through dedicated applications or browser-based apps running parallel to the multimedia software or hardware.
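Synchronization depends on each piece of caption text carrying a start and end time. As a hedged illustration, the sketch below formats timed caption segments as WebVTT, a caption format widely used with web video. The text above does not say which format any particular platform uses, so treat the choice of WebVTT as an assumption. The helper names `format_timestamp` and `to_webvtt` are hypothetical.

```python
def format_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(cues):
    """Build WebVTT text from (start_seconds, end_seconds, text) tuples.

    This sketch does not escape cue text, so the text must not contain
    blank lines or the sequence "-->".
    """
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        if end <= start:
            raise ValueError("each cue must end after it starts")
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line separates cues
    return "\n".join(lines)


print(to_webvtt([
    (0.0, 2.5, "Welcome to the live broadcast."),
    (2.5, 6.0, "[Speaker points to a chart on screen]"),
]))
```

In a live setting the cues would be produced continuously rather than as one file, but the same pairing of text with timestamps is what lets a player display captions in step with the audio.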
Although real-time captioning is not always easy, it is possible—and vital for accessibility. Fortunately, technologies are improving to make real-time captioning easier and cheaper in most situations. Beyond the web, these technologies can also be applied to radio, television, video conferencing, etc., facilitating accessibility across all forms of live media.