Compliance reference

WCAG 2.1 AA captions: the exact spec, explained for training video

Every training-ops lead eventually asks the same question: "what do captions actually have to do to pass WCAG 2.1 Level AA?" This is the short, specific answer — the success criteria, the accuracy threshold auditors use, and the non-speech sound rule most teams miss.

TL;DR

For prerecorded training video, WCAG 2.1 Level AA requires synchronized captions for all prerecorded audio content (SC 1.2.2, Level A) plus audio description for visual-only content (SC 1.2.5, Level AA). Captions must be verbatim for significant dialogue, identify speakers, and include non-speech sounds that carry meaning (music cues, laughter, alarms). The spec does not name a numeric accuracy target, but the DCMP Captioning Key — the reference most U.S. auditors cite — uses ≈99%.

Which success criteria actually apply

Level AA is cumulative: it inherits every Level A criterion, then adds its own. For prerecorded video specifically, three success criteria govern captions and timed media:

  1. SC 1.2.2 Captions (Prerecorded), Level A: synchronized captions for all prerecorded audio content in synchronized media.
  2. SC 1.2.3 Audio Description or Media Alternative (Prerecorded), Level A: an audio description or a full text alternative for prerecorded video content.
  3. SC 1.2.5 Audio Description (Prerecorded), Level AA: audio description for all prerecorded video content in synchronized media.

Live training (webinars, town halls) also picks up SC 1.2.4 Captions (Live) at Level AA — a different workflow with different accuracy expectations. GlossCap is prerecorded-only at v1, so 1.2.4 is out of scope here.

What the captions themselves must contain

WCAG 2.1 does not specify a numeric accuracy threshold. It requires that captions be equivalent — conveying the same information a hearing viewer would get from the audio track. In practice, auditors use the Described and Captioned Media Program's Captioning Key as the operational benchmark. Its guidance compresses to:

  1. Verbatim for dialogue. Transcribe what was said, not a summary. Preserve technical terms exactly — kubectl, tirzepatide, product names, drug names, SDK symbols. This is where YouTube's auto-caption track falls apart on training content and where the hand-correction time goes.
  2. Identify speakers when not visible. [Alex]:, [Instructor]:, or speaker-change marks.
  3. Include non-speech sounds that carry meaning. [alarm], [laughter], [background music]. Sounds that do not affect comprehension (ambient noise, breathing) are omitted.
  4. Synchronize within ≈3 seconds of the corresponding audio. Industry practice is tighter — most caption authoring tools land under 500 ms.
  5. Reading speed ≤160 words per minute for adult content; compress or break across cues if the spoken pace exceeds it.
  6. ≈99% accuracy. DCMP's Captioning Key treats 99% as the minimum acceptable word-level accuracy for educational captioning; this is the number procurement questionnaires quote back.
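Putting the checklist together, a passing cue pair might look like this in WebVTT (the timings, speaker name, and dialogue are illustrative, not from a real track; the first cue runs 10 words over 4 seconds, or 150 wpm, under the 160 wpm ceiling):

```vtt
WEBVTT

00:00:04.000 --> 00:00:08.000
[Instructor]: Run kubectl get pods
to confirm the deployment is healthy.

00:00:08.000 --> 00:00:09.500
[alarm sounds in background]
```

The bracketed speaker label covers rule 2, the bracketed sound cue covers rule 3, and the cue boundaries sit on the audio they describe for rule 4.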

The non-speech sound rule most teams miss

Auto-captioning is fundamentally a speech-to-text pipeline. It does not know a fire alarm sounded or that the audience laughed, so it emits nothing in those moments. A WCAG-compliant caption track has to. An auditor sampling a 45-minute compliance training will look for three specific markers: speaker labels on voiceover sections, bracketed non-speech sounds where they matter, and timing that doesn't drift. Missing any of these on a sampled segment is typically a Level A finding, not just a warning.
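Those three markers can be sanity-checked mechanically before a human review. A rough sketch (the cue format and regexes are our own illustration, not an audit tool; it cannot judge whether a sound cue was needed, only whether any exist):

```python
import re

def spot_check(cues):
    """Rough pre-audit pass over parsed caption cues.

    cues: list of (start_seconds, end_seconds, text) tuples.
    Flags the three things an auditor samples for: speaker labels,
    bracketed non-speech sound cues, and monotonic (non-drifting) timing.
    """
    speaker = re.compile(r"\[[^\]]+\]:")    # e.g. [Instructor]:
    sound = re.compile(r"\[[^\]]+\](?!:)")  # e.g. [alarm], but not a speaker label
    starts = [c[0] for c in cues]
    return {
        "has_speaker_labels": any(speaker.search(c[2]) for c in cues),
        "has_sound_cues": any(sound.search(c[2]) for c in cues),
        "timing_monotonic": all(a < b for a, b in zip(starts, starts[1:])),
    }
```

A track that fails any of these flags is worth a human pass before it ships; a track that passes them still needs one.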

How GlossCap helps

GlossCap takes the WCAG 2.1 AA requirements as defaults, not options. Every exported SRT and VTT track is already synchronized with sub-second precision, already marks speaker changes, and already includes non-speech sound cues. The part we spend our engineering effort on, and what separates us from the "SDK" → "as decay" mangling that general-purpose speech models produce on engineering and medical content, is the verbatim-for-dialogue requirement. Your company glossary gets pulled from Notion, Confluence, or Google Docs, then biased into Whisper's decoder as logit boosts on the BPE tokens for each term. The first time you caption a Kubernetes onboarding module, kubectl comes out right. That is where auditors sample, and that is where the hand-correction hour goes.
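At its core, the logit-boost technique reduces to adding a bonus at a handful of vocabulary positions before the decoder picks the next token. A toy, framework-free sketch of the idea (the function name, boost value, and three-entry vocabulary are ours for illustration, not GlossCap's API):

```python
def boost_glossary_logits(logits, glossary_token_ids, boost=4.0):
    """Return a copy of the decoder's next-token logits with a fixed
    bonus added at every BPE token id belonging to a glossary term."""
    boosted = list(logits)
    for tid in glossary_token_ids:
        boosted[tid] += boost
    return boosted

# Toy example: vocab position 1 stands in for a glossary piece like "kube".
logits = [2.0, 1.5, 0.5]
boosted = boost_glossary_logits(logits, {1}, boost=1.0)
# The argmax moves from position 0 to position 1, so the glossary piece wins.
```

In a real decoder the same bias is applied at every decoding step, which is why a term like kubectl survives even when the acoustic evidence alone would favor "cube control".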

See pricing

Related questions

Does WCAG 2.1 AA require audio description on every training video?

Yes for visual-only information that the audio track doesn't cover — demos, slide transitions, on-screen text, silent visual steps. SC 1.2.5 is the AA-level criterion for this, and it's where compliance programs most often have a gap because teams treat "we have captions" as the whole answer.

Is a transcript on the page enough instead of captions?

For video with audio, no. SC 1.2.1 allows a transcript alternative only for audio-only or video-only content. Training video is synchronized media, which puts it under SC 1.2.2 — you need time-synchronized captions. A transcript is a useful addition, not a substitute. See our captions vs transcripts reference.

What accuracy do auditors actually measure?

In the U.S., DCMP's 99% word-level accuracy is the operational benchmark. In the U.K., Ofcom's guidelines reference 98% for pre-recorded. In practice, if your company-specific terms — product names, acronyms, drug names — are wrong on a sampled segment, that segment fails regardless of aggregate word count.
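The word-level number is conventionally one minus the word error rate, i.e. edit distance computed over word tokens rather than characters. A minimal sketch (not any auditor's official tool; real audits also weight which words are wrong, as the drug-name example above shows):

```python
def word_accuracy(reference, hypothesis):
    """1 - WER: word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return 1 - d[-1][-1] / len(ref)

# word_accuracy("run kubectl get pods", "run cube control get pods") -> 0.5
```

One substitution plus one insertion against a four-word reference already drops the score to 50%, which is why short sampled segments are so unforgiving of glossary misses.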

Further reading