Compliance reference
WCAG 2.1 AA captions: the exact spec, explained for training video
Every training-ops lead eventually asks the same question: "what do captions actually have to do to pass WCAG 2.1 Level AA?" This is the short, specific answer — the success criteria, the accuracy threshold auditors use, and the non-speech sound rule most teams miss.
TL;DR
For prerecorded training video, WCAG 2.1 Level AA requires synchronized captions for all prerecorded audio content (SC 1.2.2, Level A) plus audio description for visual-only content (SC 1.2.5, Level AA). Captions must be verbatim for significant dialogue, identify speakers, and include non-speech sounds that carry meaning (music cues, laughter, alarms). The spec does not name a numeric accuracy target, but the DCMP Captioning Key — the reference most U.S. auditors cite — uses ≈99%.
Which success criteria actually apply
Level AA is cumulative: it inherits every Level A criterion, then adds its own. For prerecorded video specifically, three success criteria govern captions and timed media:
- SC 1.2.1 Audio-only and Video-only (Prerecorded) — Level A. A transcript alternative for pure-audio, and either an audio track or a transcript for pure-video. If your training asset has both audio and video (the normal case), 1.2.1 is not the load-bearing criterion.
- SC 1.2.2 Captions (Prerecorded) — Level A. Synchronized captions for all prerecorded audio content in synchronized media. This is the one that matters for training video. Covered in depth on our SC 1.2.2 page.
- SC 1.2.5 Audio Description (Prerecorded) — Level AA. Audio description narrating significant visual information that isn't in the dialogue — on-screen text, slide changes, demos. This is the AA-specific bump most teams underestimate.
Live training (webinars, town halls) also picks up SC 1.2.4 Captions (Live) at Level AA — a different workflow with different accuracy expectations. GlossCap is prerecorded-only at v1, so 1.2.4 is out of scope here.
What the captions themselves must contain
WCAG 2.1 does not specify a numeric accuracy threshold. It requires that captions be equivalent — conveying the same information a hearing viewer would get from the audio track. In practice, auditors use the Described and Captioned Media Program's Captioning Key as the operational benchmark. Its guidance compresses to:
- Verbatim for dialogue. Transcribe what was said, not a summary. Preserve technical terms exactly — kubectl, tirzepatide, product names, drug names, SDK symbols. This is where YouTube's auto-caption track falls apart on training content and where the hand-correction time goes.
- Identify speakers when not visible. [Alex]:, [Instructor]:, or speaker-change marks.
- Include non-speech sounds that carry meaning. [alarm], [laughter], [background music]. Sounds that do not affect comprehension (ambient noise, breathing) are omitted.
- Synchronize within ≈3 seconds of the corresponding audio. Industry practice is tighter — most caption authoring tools land under 500 ms.
- Reading speed ≤160 words per minute for adult content; compress or break across cues if the spoken pace exceeds it.
- ≈99% accuracy. DCMP's Captioning Key treats 99% as the minimum acceptable word-level accuracy for educational captioning; this is the number procurement questionnaires quote back.
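The timing and reading-speed rules above are mechanical enough to check in code. A minimal sketch (the function names are illustrative, not from any standard tooling) that computes the reading speed of a single SRT cue:

```python
from datetime import timedelta

def parse_srt_time(ts: str) -> timedelta:
    """Parse an SRT timestamp like '00:00:01,500' into a timedelta."""
    hms, ms = ts.split(",")
    h, m, s = (int(x) for x in hms.split(":"))
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=int(ms))

def reading_speed_wpm(start: str, end: str, text: str) -> float:
    """Words per minute for one caption cue."""
    duration = (parse_srt_time(end) - parse_srt_time(start)).total_seconds()
    return len(text.split()) * 60.0 / duration

# An 8-word cue shown for 4 seconds reads at 120 wpm,
# comfortably under the 160 wpm ceiling for adult content.
wpm = reading_speed_wpm("00:00:01,000", "00:00:05,000",
                        "Run kubectl get pods to list the pods")
```

A cue that comes out above 160 wpm is the signal to compress the text or break it across two cues.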
The non-speech sound rule most teams miss
Auto-caption output is fundamentally a speech-to-text pipeline. It does not know a fire alarm sounded or that the audience laughed, so it emits nothing in those moments. A WCAG-compliant caption track has to. An auditor sampling a 45-minute compliance training will look for three specific markers: speaker labels on voiceover sections, bracketed non-speech sounds where they matter, and timing that doesn't drift. Missing any of these on a sampled segment is typically a Level A finding, not just a warning.
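Two of those three markers can be spot-checked with simple pattern matching. A rough sketch of that kind of scan, assuming the bracket conventions used in this article; the regexes are illustrative heuristics, not an official audit procedure:

```python
import re

# [Alex]: ... at the start of a cue — a speaker label.
SPEAKER = re.compile(r"^\[[A-Z][\w ]*\]:")
# [alarm], [laughter] — lowercase bracketed non-speech cues.
NON_SPEECH = re.compile(r"\[[a-z][\w ]*\]")

def audit_markers(cue_texts):
    """Report which cue texts carry a speaker label or a non-speech cue."""
    return {
        "speaker_labels": [t for t in cue_texts if SPEAKER.search(t)],
        "non_speech": [t for t in cue_texts if NON_SPEECH.search(t)],
    }

cues = [
    "[Alex]: Welcome to the onboarding module.",
    "[alarm] Evacuate the build floor.",
    "This cue has neither marker.",
]
report = audit_markers(cues)
```

Timing drift, the third marker, needs the actual audio and can't be caught by text inspection alone.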
How GlossCap helps
GlossCap treats the WCAG 2.1 AA requirements as defaults, not options. Every exported SRT and VTT track is synchronized to sub-second precision, marks speaker changes, and includes non-speech sound cues out of the box. The part we spend our engineering on — and what separates us from the SDK→"as decay" mangling that general-purpose speech models produce on engineering and medical content — is the verbatim-for-dialogue requirement. Your company glossary is pulled from Notion, Confluence, or Google Docs, then biased into Whisper's decoder as logit boosts on the BPE tokens for each term. The first time you caption a Kubernetes onboarding module, kubectl comes out right. That is where auditors sample, and that is where the hand-correction hour goes.
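The logit-boost idea can be illustrated with a toy example. Everything here is hypothetical — the tiny vocabulary, the boost value, and the helper name are stand-ins, not GlossCap's actual implementation:

```python
# Toy BPE vocabulary: "kubectl" tokenizes (hypothetically) as "kub" + "ectl",
# while the acoustically similar mis-transcription would pick "cube" + "control".
toy_vocab = {"kub": 0, "ectl": 1, "cube": 2, "control": 3}

def glossary_boost(logits, glossary_token_ids, boost=4.0):
    """Add a fixed boost to the logit of every glossary BPE token."""
    return [v + boost if i in glossary_token_ids else v
            for i, v in enumerate(logits)]

glossary_ids = {toy_vocab["kub"], toy_vocab["ectl"]}
logits = [1.0, 0.5, 1.2, 0.9]                   # raw decoder scores, one step
biased = glossary_boost(logits, glossary_ids)   # glossary pieces now win
```

The biasing happens per decoding step, so the glossary term's token pieces outrank their sound-alike competitors without retraining the model.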
Related questions
Does WCAG 2.1 AA require audio description on every training video?
Yes for visual-only information that the audio track doesn't cover — demos, slide transitions, on-screen text, silent visual steps. SC 1.2.5 is the AA-level criterion for this, and it's where compliance programs most often have a gap because teams treat "we have captions" as the whole answer.
Is a transcript on the page enough instead of captions?
For video with audio, no. SC 1.2.1 allows a transcript alternative only for audio-only or video-only content. Training video is synchronized media, which puts it under SC 1.2.2 — you need time-synchronized captions. A transcript is a useful addition, not a substitute. See our captions vs transcripts reference.
What accuracy do auditors actually measure?
In the U.S., DCMP's 99% word-level accuracy is the operational benchmark. In the U.K., Ofcom's guidelines reference 98% for prerecorded content. In practice, if your company-specific terms — product names, acronyms, drug names — are wrong on a sampled segment, that segment fails regardless of its aggregate accuracy score.
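To see why a single mangled term sinks a short sampled segment, here is a rough word-accuracy sketch using Python's standard difflib. This is a simplified proxy, not DCMP's exact scoring procedure:

```python
import difflib

def word_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference words reproduced, via longest matching blocks.
    A rough proxy for word-level accuracy, not an official metric."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)

ref = "run kubectl get pods in the staging namespace"
hyp = "run cube control get pods in the staging namespace"
acc = word_accuracy(ref, hyp)   # 7 of 8 reference words survive -> 0.875
```

One mis-transcribed term in an 8-word segment lands at 87.5%, far below either the 99% or the 98% benchmark, which is exactly why auditors sample the segments dense with company-specific vocabulary.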