Compliance reference
WCAG 2.1 AA captions: the exact spec, explained for training video
Every training-ops lead eventually asks the same question: "what do captions actually have to do to pass WCAG 2.1 Level AA?" This is the short, specific answer — the success criteria, the accuracy threshold auditors use, and the non-speech sound rule most teams miss.
TL;DR
For prerecorded training video, WCAG 2.1 Level AA requires synchronized captions for all prerecorded audio content (SC 1.2.2, Level A) plus audio description for visual-only content (SC 1.2.5, Level AA). Captions must be verbatim for significant dialogue, identify speakers, and include non-speech sounds that carry meaning (music cues, laughter, alarms). The spec does not name a numeric accuracy target, but the DCMP Captioning Key — the reference most U.S. auditors cite — uses ≈99%.
Which success criteria actually apply
Level AA is cumulative: it inherits every Level A criterion, then adds its own. For prerecorded video specifically, three success criteria govern captions and timed media:
- SC 1.2.1 Audio-only and Video-only (Prerecorded) — Level A. A transcript alternative for pure-audio, and either an audio track or a transcript for pure-video. If your training asset has both audio and video (the normal case), 1.2.1 is not the load-bearing criterion.
- SC 1.2.2 Captions (Prerecorded) — Level A. Synchronized captions for all prerecorded audio content in synchronized media. This is the one that matters for training video. Covered in depth on our SC 1.2.2 page.
- SC 1.2.5 Audio Description (Prerecorded) — Level AA. Audio description narrating significant visual information that isn't in the dialogue — on-screen text, slide changes, demos. This is the AA-specific bump most teams underestimate.
Live training (webinars, town halls) also picks up SC 1.2.4 Captions (Live) at Level AA — a different workflow with different accuracy expectations. GlossCap is prerecorded-only at v1, so 1.2.4 is out of scope here.
What the captions themselves must contain
WCAG 2.1 does not specify a numeric accuracy threshold. It requires that captions be equivalent — conveying the same information a hearing viewer would get from the audio track. In practice, auditors use the Described and Captioned Media Program's Captioning Key as the operational benchmark. Its guidance compresses to:
- Verbatim for dialogue. Transcribe what was said, not a summary. Preserve technical terms exactly — kubectl, tirzepatide, product names, drug names, SDK symbols. This is where YouTube's auto-caption track falls apart on training content and where the hand-correction time goes.
- Identify speakers when not visible. [Alex]:, [Instructor]:, or speaker-change marks.
- Include non-speech sounds that carry meaning. [alarm], [laughter], [background music]. Sounds that do not affect comprehension (ambient noise, breathing) are omitted.
- Synchronize within ≈3 seconds of the corresponding audio. Industry practice is tighter — most caption authoring tools land under 500 ms.
- Reading speed ≤160 words per minute for adult content; compress or break across cues if the spoken pace exceeds it.
- ≈99% accuracy. DCMP's Captioning Key treats 99% as the minimum acceptable word-level accuracy for educational captioning; this is the number procurement questionnaires quote back.
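The timing and reading-speed rules above are mechanical enough to check in code. A minimal sketch (the function names are illustrative, not from any standard tooling) that computes the reading speed of a single SRT cue:

```python
from datetime import timedelta

def parse_srt_time(ts: str) -> timedelta:
    """Parse an SRT timestamp like '00:00:01,500' into a timedelta."""
    hms, ms = ts.split(",")
    h, m, s = (int(x) for x in hms.split(":"))
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=int(ms))

def reading_speed_wpm(start: str, end: str, text: str) -> float:
    """Words per minute for one caption cue."""
    duration = (parse_srt_time(end) - parse_srt_time(start)).total_seconds()
    return len(text.split()) * 60.0 / duration

# An 8-word cue shown for 4 seconds reads at 120 wpm,
# comfortably under the 160 wpm ceiling for adult content.
wpm = reading_speed_wpm("00:00:01,000", "00:00:05,000",
                        "Run kubectl get pods to list the pods")
```

A cue that comes out above 160 wpm is the signal to compress the text or break it across two cues.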
The non-speech sound rule most teams miss
Auto-caption output is fundamentally a speech-to-text pipeline. It does not know a fire alarm sounded or that the audience laughed, so it emits nothing in those moments. A WCAG-compliant caption track has to. An auditor sampling a 45-minute compliance training will look for three specific markers: speaker labels on voiceover sections, bracketed non-speech sounds where they matter, and timing that doesn't drift. Missing any of these on a sampled segment is typically a Level A finding, not just a warning.
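Two of those three markers can be spot-checked with simple pattern matching. A rough sketch of that kind of scan, assuming the bracket conventions used in this article; the regexes are illustrative heuristics, not an official audit procedure:

```python
import re

# [Alex]: ... at the start of a cue — a speaker label.
SPEAKER = re.compile(r"^\[[A-Z][\w ]*\]:")
# [alarm], [laughter] — lowercase bracketed non-speech cues.
NON_SPEECH = re.compile(r"\[[a-z][\w ]*\]")

def audit_markers(cue_texts):
    """Report which cue texts carry a speaker label or a non-speech cue."""
    return {
        "speaker_labels": [t for t in cue_texts if SPEAKER.search(t)],
        "non_speech": [t for t in cue_texts if NON_SPEECH.search(t)],
    }

cues = [
    "[Alex]: Welcome to the onboarding module.",
    "[alarm] Evacuate the build floor.",
    "This cue has neither marker.",
]
report = audit_markers(cues)
```

Timing drift, the third marker, needs the actual audio and can't be caught by text inspection alone.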
How GlossCap helps
GlossCap treats the WCAG 2.1 AA requirements as defaults, not options. Every exported SRT and VTT track is synchronized to sub-second precision, marks speaker changes, and includes non-speech sound cues out of the box. The part we spend our engineering on — and what separates us from the SDK→"as decay" mangling that general-purpose speech models produce on engineering and medical content — is the verbatim-for-dialogue requirement. Your company glossary is pulled from Notion, Confluence, or Google Docs, then biased into Whisper's decoder as logit boosts on the BPE tokens for each term. The first time you caption a Kubernetes onboarding module, kubectl comes out right. That is where auditors sample, and that is where the hand-correction hour goes.
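The logit-boost idea can be illustrated with a toy example. Everything here is hypothetical — the tiny vocabulary, the boost value, and the helper name are stand-ins, not GlossCap's actual implementation:

```python
# Toy BPE vocabulary: "kubectl" tokenizes (hypothetically) as "kub" + "ectl",
# while the acoustically similar mis-transcription would pick "cube" + "control".
toy_vocab = {"kub": 0, "ectl": 1, "cube": 2, "control": 3}

def glossary_boost(logits, glossary_token_ids, boost=4.0):
    """Add a fixed boost to the logit of every glossary BPE token."""
    return [v + boost if i in glossary_token_ids else v
            for i, v in enumerate(logits)]

glossary_ids = {toy_vocab["kub"], toy_vocab["ectl"]}
logits = [1.0, 0.5, 1.2, 0.9]                   # raw decoder scores, one step
biased = glossary_boost(logits, glossary_ids)   # glossary pieces now win
```

The biasing happens per decoding step, so the glossary term's token pieces outrank their sound-alike competitors without retraining the model.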
Related questions
Does WCAG 2.1 AA require audio description on every training video?
Yes for visual-only information that the audio track doesn't cover — demos, slide transitions, on-screen text, silent visual steps. SC 1.2.5 is the AA-level criterion for this, and it's where compliance programs most often have a gap because teams treat "we have captions" as the whole answer.
Is a transcript on the page enough instead of captions?
For video with audio, no. SC 1.2.1 allows a transcript alternative only for audio-only or video-only content. Training video is synchronized media, which puts it under SC 1.2.2 — you need time-synchronized captions. A transcript is a useful addition, not a substitute. See our captions vs transcripts reference.
What accuracy do auditors actually measure?
In the U.S., DCMP's 99% word-level accuracy is the operational benchmark. In the U.K., Ofcom's guidelines reference 98% for prerecorded content. In practice, if your company-specific terms — product names, acronyms, drug names — are wrong on a sampled segment, that segment fails regardless of its aggregate accuracy score.
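To see why a single mangled term sinks a short sampled segment, here is a rough word-accuracy sketch using Python's standard difflib. This is a simplified proxy, not DCMP's exact scoring procedure:

```python
import difflib

def word_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference words reproduced, via longest matching blocks.
    A rough proxy for word-level accuracy, not an official metric."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)

ref = "run kubectl get pods in the staging namespace"
hyp = "run cube control get pods in the staging namespace"
acc = word_accuracy(ref, hyp)   # 7 of 8 reference words survive -> 0.875
```

One mis-transcribed term in an 8-word segment lands at 87.5%, far below either the 99% or the 98% benchmark, which is exactly why auditors sample the segments dense with company-specific vocabulary.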