Compliance reference

WCAG SC 1.2.2 Captions (Prerecorded): the actual requirement

SC 1.2.2 is the one WCAG success criterion every training-video program bumps into first. It is Level A, which means even the most permissive accessibility floor demands it. Here is the exact text, what "equivalent" means in an audit, and the three operational failures that trip teams up.

TL;DR

SC 1.2.2 requires captions for all prerecorded audio content in synchronized media, except where the media is itself a media-alternative for text and clearly labeled as such. It is Level A — meaning it applies at every WCAG conformance level, including the AA that ADA Title II and the EAA reference. Captions must be synchronized, equivalent to the audio (dialogue plus significant non-speech sounds), and present any time the video has an audio track.

The exact wording

From WCAG 2.1, SC 1.2.2 (unchanged in WCAG 2.2):

Captions are provided for all prerecorded audio content in synchronized media, except when the media is a media alternative for text and is clearly labeled as such. (Level A)

Three terms are load-bearing and worth unpacking.

"Synchronized media" — what counts

A file is synchronized media if it has both an audio track and time-based visual content. A narrated training video, a product-demo capture with voiceover, a talking-head onboarding module, a screen recording with live narration — all synchronized media. 1.2.2 applies.

Edge cases:

  1. A silent screen recording has no audio track, so it is not synchronized media; it falls under SC 1.2.1 (video-only), not 1.2.2.
  2. Audio-only content (a podcast episode, an audio lesson) also falls under SC 1.2.1, which asks for a transcript rather than captions.
  3. Live video is covered by SC 1.2.4 (Level AA); SC 1.2.2 is scoped to prerecorded media only.
  4. The sole exemption is the one in the SC text itself: media that is an alternative for on-page text and is clearly labeled as such.

"Equivalent to the audio" — what captions have to contain

WCAG is careful to say captions must be equivalent, not identical. A hearing viewer gets words from dialogue and also information from non-speech sounds — an alarm going off, a door slamming, the specific musical sting that signals a scene change. A WCAG-compliant caption track conveys both. The DCMP Captioning Key is the operational reference auditors use, and it lists the content requirements as:

  1. Verbatim dialogue, not summary. Preserve technical terms — product names, SDK symbols, drug names. This is where YouTube auto-captions most frequently fail on training content: kubectl becomes "cube control", Docebo becomes "doh-say-boh".
  2. Speaker identification when the speaker is off-camera or ambiguous.
  3. Meaningful non-speech sounds, bracketed: [alarm], [laughter], [door slams].
  4. Music descriptions when the music communicates something: [ominous music], [upbeat intro].
  5. Accuracy — the DCMP Captioning Key uses ≈99% word-level accuracy as the floor for educational content.
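A cue meeting these requirements might look like the following WebVTT fragment (timings and wording are illustrative; the bracketed speaker label follows the DCMP convention, though WebVTT also supports `<v Speaker>` voice tags):

```
WEBVTT

00:00:12.000 --> 00:00:15.500
[NARRATOR] Run kubectl apply to deploy the manifest.

00:00:15.500 --> 00:00:17.000
[alarm blaring]
```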

"Synchronized" — what timing has to look like

Caption cues must appear with the audio they correspond to. WCAG does not put a number on it, but DCMP's Captioning Key specifies within about 3 seconds of the corresponding audio. Modern authoring tools — and GlossCap — land under 500 ms. Reading speed matters too: caption cues should not exceed ~160 words per minute for adult content, or they become impossible to read before the next cue appears.
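The reading-speed arithmetic is simple enough to sketch. This is a hypothetical helper, not part of any caption spec or real library; the 160 wpm ceiling comes from the guidance above:

```python
def words_per_minute(text: str, start_s: float, end_s: float) -> float:
    """Reading speed of one caption cue, given start/end times in seconds."""
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("cue must have positive duration")
    return len(text.split()) / duration * 60.0

def too_fast(text: str, start_s: float, end_s: float, limit: float = 160.0) -> bool:
    """Flag a cue whose reading speed exceeds the adult-content ceiling."""
    return words_per_minute(text, start_s, end_s) > limit

# A 10-word cue shown for 3 seconds runs at 200 wpm, over the ceiling.
cue = "Run kubectl apply to deploy the manifest to the cluster"
print(too_fast(cue, 12.0, 15.0))  # True (200 wpm > 160)
```

Stretching the same cue to 4 seconds brings it to 150 wpm, back under the limit, which is why authoring tools fix speed violations by extending cue duration before they resort to paraphrasing.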

The three failures teams hit most often

From accessibility audits on training-video libraries, the failures cluster tightly:

  1. YouTube auto-captions left as-is. They hit ~80–90% accuracy on generic content and ~60–70% on technical or medical content. SC 1.2.2 fails the moment a sampled segment contains a mangled technical term.
  2. Non-speech sounds omitted. Auto-caption pipelines don't emit them at all; hand-corrected tracks often skip them to save time.
  3. No speaker labels on voiceover segments. When the speaker is off-camera — narrator, instructor voiceover, secondary speaker in a panel — the caption track needs a label.
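Failure #1 is the easiest to catch mechanically. A minimal audit sketch, assuming you maintain a table of known mis-transcriptions for your glossary terms (the `MANGLINGS` table and function name here are hypothetical, for illustration):

```python
# Known auto-caption manglings of glossary terms -> the correct term.
MANGLINGS = {
    "cube control": "kubectl",
    "doh-say-boh": "Docebo",
}

def flag_mangled(lines: list[str]) -> list[tuple[int, str, str]]:
    """Return (line number, mangled text, correct term) for each suspect cue."""
    hits = []
    for i, line in enumerate(lines, 1):
        lowered = line.lower()
        for wrong, right in MANGLINGS.items():
            if wrong in lowered:
                hits.append((i, wrong, right))
    return hits

captions = [
    "Open cube control and apply the manifest.",
    "Welcome back to the module.",
]
print(flag_mangled(captions))  # [(1, 'cube control', 'kubectl')]
```

A string-match pass like this only catches manglings you have already seen; it does not replace a human review pass, but it turns a known failure mode into a cheap regression check.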

How GlossCap helps

GlossCap exports SRT and WebVTT caption files that meet SC 1.2.2 out of the box. The accuracy we spend engineering on — specifically, keeping kubectl, tirzepatide, product names, acronyms, and proper nouns verbatim — is what moves a caption track from "~85% accurate" to "audit-ready at ~99%." We pull your company's glossary from Notion, Confluence, or a Google Docs folder and bias it into the speech model's decoder, so the first caption pass preserves your terms instead of mangling them. Speaker labels and non-speech sound markers are defaults, not options.


Related questions

What's the difference between 1.2.2 (captions) and 1.2.5 (audio description)?

SC 1.2.2 covers the audio side — anyone who can't hear the audio needs captions conveying it. SC 1.2.5 covers the visual side — anyone who can't see the screen needs audio description of significant visual information that isn't already in the dialogue. You need both for Level AA.

Can we rely on SC 1.2.1 "media alternative" instead of 1.2.2?

Only if your video is literally a media alternative for text that already exists — e.g., a read-along of a document. For typical training content, 1.2.2 is the applicable criterion. The 1.2.1 exception is narrow and explicitly scoped in the spec.

If captions are slightly paraphrased for reading speed, is that still "equivalent"?

Yes. The spec permits compression for readability when spoken pace exceeds ~160 wpm. What is not permitted is compressing so aggressively that meaning is lost, or skipping technical terms because the tool got them wrong. See our captions vs transcripts reference for the full reading-speed discussion.

Further reading