Technical · Published 2026-04-25

Why 99% caption accuracy matters: the WCAG 2.1 AA threshold, with real training-video examples

Every captioning vendor in the L&D market quotes a 99% accuracy figure. Most of them are not lying — they are measuring different things. This post unpacks where the 99% number comes from (a 1996 standard most buyers have never read), why the way it is measured matters more than the number itself, and what happens when you score the same nine-minute engineering onboarding clip under that protocol with the same transcription engine run two ways. The result: Whisper-default 91.4%, Whisper-with-glossary 99.2%. The interesting part is which words made up the missing 7.8 points.

TL;DR

The 99% accuracy threshold cited in WCAG 2.1 AA audits comes from the DCMP Captioning Key — a 1996-vintage standard maintained by the Described and Captioned Media Program, federally funded under IDEA. DCMP measures accuracy at the word level on sampled passages with a four-category error taxonomy: substitution, insertion, deletion, and formatting. Most "99%" claims you will read are actually Word Error Rate (WER) numbers measured on conversational-English benchmarks (LibriSpeech, TED-LIUM) where the input vocabulary is mainstream English. Training video has 5–15× the technical-term density of those benchmarks, which is why the WER number on the marketing page rarely survives contact with your domain content. We ran a real audit on a public engineering-onboarding clip and the difference between domain-aware and domain-blind transcription showed up exactly where you would expect: low-frequency proper nouns, command-line tokens, and acronyms. The fix is not "a better model" — it is biasing the decoder you already have toward your glossary.

Where the 99% number actually comes from

Read enough WCAG audit reports and you keep seeing the same sentence: captions must be at least 99% accurate. The WCAG 2.1 standard itself does not contain that number. Success Criterion 1.2.2 (Captions, Prerecorded) requires that "captions are provided for all prerecorded audio content," and the Understanding document talks about equivalence and synchronization, not numeric accuracy. The 99% figure comes from somewhere else.

That somewhere else is the DCMP Captioning Key, the operational standard used by the U.S. Described and Captioned Media Program, which is funded by the Department of Education under IDEA Part D and maintained jointly with the National Association of the Deaf. The Captioning Key is the document captioning vendors quote when they are quoting anything specific. Its accuracy section reads, paraphrased: captions should match the audio content with at least 99% accuracy, including correct identification of speakers, sound effects, and on-screen text where relevant to comprehension. Auditors then cite DCMP because WCAG itself stays silent on the number.

The point is not that 99% is arbitrary; it is that 99% comes with a measurement protocol attached, and that protocol matters. DCMP measures word-level error rate over sampled passages, scoring four error classes — substitution (wrong word), insertion (added word), deletion (missing word), and formatting (incorrect speaker ID, sound effects, or punctuation). Each error counts equally per occurrence. The total accuracy is one minus the error rate over the sample.
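Here is a minimal sketch of that scoring, assuming a hand-verified reference transcript and taking the formatting tally as a hand-counted input (speaker IDs and sound-effect cues are not recoverable from a word alignment alone). Python's difflib stands in for a true edit-distance aligner:

```python
from difflib import SequenceMatcher

def dcmp_accuracy(reference: str, hypothesis: str, formatting_errors: int = 0) -> dict:
    """Word-level DCMP-style score: substitutions, insertions, deletions,
    plus a hand-counted formatting tally, each error weighted equally."""
    ref, hyp = reference.split(), hypothesis.split()
    subs = ins = dels = 0
    for op, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=hyp, autojunk=False).get_opcodes():
        if op == "replace":
            subs += min(i2 - i1, j2 - j1)           # aligned wrong words
            ins += max(0, (j2 - j1) - (i2 - i1))    # extra hypothesis words
            dels += max(0, (i2 - i1) - (j2 - j1))   # missing reference words
        elif op == "insert":
            ins += j2 - j1
        elif op == "delete":
            dels += i2 - i1
    errors = subs + ins + dels + formatting_errors
    return {"substitutions": subs, "insertions": ins, "deletions": dels,
            "formatting": formatting_errors, "accuracy": 1 - errors / len(ref)}
```

difflib's matcher is heuristic rather than optimal, so a production scorer would swap in a Levenshtein alignment, but the error taxonomy and the accuracy formula are exactly the ones above.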

How the auto-caption industry quietly redefines accuracy

If you read the methodology footnote on a vendor's accuracy page, you will usually find something like this: Word Error Rate measured on the LibriSpeech test-clean corpus. That sentence quietly does three things to make the number easier to hit.

First, LibriSpeech is a corpus of audiobook readings of pre-1923 public-domain literature. The vocabulary distribution of The Picture of Dorian Gray is not the vocabulary distribution of your AWS-onboarding video. Audiobook narrators speak slowly, clearly, in studio-quality audio, with no overlapping speakers and no domain-specific jargon. Modern transcription models hit the high 90s on LibriSpeech because the corpus is well within their training distribution.

Second, WER on a corpus is an average. A single training video might score 99% in some passages and 70% in others. The average tells you nothing about the worst case — and the worst case is where compliance complaints come from. A learner who needs captions does not benefit from the fact that 99 out of 100 sentences were correct if the missing sentence was the one explaining how to enable two-factor authentication.

Third, WER excludes formatting errors. Speaker identification, on-screen text labels, "[applause]" and "[laughter]" cues, the difference between a colon and a period at a sentence boundary — these are part of DCMP accuracy, but they are not part of WER. A caption file can score 98% WER and still fail DCMP if it omits speaker labels in a multi-speaker training scenario.
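To make the second point (the averaging problem) concrete, here is a toy illustration. The passages are hypothetical; jiwer is a standard open-source WER implementation:

```python
import jiwer  # pip install jiwer

# Hypothetical per-passage scores for one training video: the video-level
# average can look fine while a single passage quietly fails the learner.
passages = [
    ("click the security tab and enable two factor authentication",) * 2,
    ("the onboarding checklist lives in the team wiki",) * 2,
    ("your laptop ships with the base image preinstalled",) * 2,
    ("run kubectl apply against the staging cluster",
     "run cube control apply against the staging cluster"),
]
rates = [jiwer.wer(ref, hyp) for ref, hyp in passages]
print(f"mean WER:  {sum(rates) / len(rates):.0%}")  # ~7% -> "93% accurate"
print(f"worst WER: {max(rates):.0%}")               # ~29% -> the complaint
```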

None of this is a vendor lie. The vendors are measuring what is easy to measure on the dataset that is easy to use. The mismatch is between that measurement and the audit protocol the buyer is being graded against.

The hard part of training video, in one paragraph

The vocabulary density problem is the spine of the rest of this post, so let us put a number on it. We sampled the transcripts of fifteen public engineering-onboarding videos (Kubernetes, AWS, GitHub Actions, and similar) and counted unique low-frequency tokens — words that appear fewer than 100 times in a standard 1-billion-word English web corpus. The mean was 32 unique low-frequency tokens per 10 minutes of video, with substantial repetition (a 30-minute "Intro to Helm" clip might say "Helm" 80 times). On LibriSpeech, the equivalent count is roughly 4 unique low-frequency tokens per 10 minutes, almost all proper nouns from period literature. That is the 5–15× factor at the top of this post in measured form. The model's job in your training video is genuinely harder than its job on its benchmark, by a factor that is not small. See our engineering onboarding captions reference page for the per-domain breakdown.
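For a rough reproduction of that count, the open-source wordfreq package can stand in for a web-corpus frequency table; its word_frequency function returns a proportion, so "fewer than 100 occurrences per billion words" translates to a 1e-7 threshold. A sketch:

```python
import re
from wordfreq import word_frequency  # pip install wordfreq

def low_freq_tokens(transcript: str, threshold: float = 1e-7) -> set[str]:
    """Unique tokens rarer than ~100 per billion words of general web English."""
    tokens = set(re.findall(r"[a-z][a-z0-9-]+", transcript.lower()))
    return {t for t in tokens if word_frequency(t, "en") < threshold}

# "kubectl" and "kubeconfig" land below the threshold; "cluster" does not.
```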

The side-by-side: a real nine-minute clip, scored two ways

To make this concrete, we picked a short, public, technically dense source: a nine-minute engineering onboarding video on a popular cloud platform's getting-started flow. The video is published under a CC license; we are not republishing it, but we ran two transcription passes against it and scored both under the DCMP protocol against a hand-corrected reference transcript.

Both passes used the same underlying model — Whisper-large, run locally on identical audio. The only difference was decoder-side conditioning: one pass had no prior context, the other received a 36-term glossary of platform-specific vocabulary as a soft prompt that biased the decoder's beam search toward those terms when phonetically plausible alternatives existed.

| Metric | Whisper-default | Whisper-with-glossary |
| --- | --- | --- |
| Total words in reference | 1,427 | 1,427 |
| Substitutions | 91 | 9 |
| Insertions | 14 | 2 |
| Deletions | 17 | 0 |
| Formatting errors (speaker/punct) | 0 | 0 |
| Total errors | 122 | 11 |
| DCMP accuracy | 91.4% | 99.2% |
| Passes WCAG 2.1 AA bar? | No | Yes |

91.4% is not a bad number on its face. Played at 1× speed, the default transcript is followable; you can tell what the video is about; sentence-level meaning survives. Where it fails the audit is the substitution count — 91 wrong words in nine minutes — and which words those are.

The substitutions, by category

Of the 91 substitutions in the default pass, only 4 were generic English errors, the kind any transcript accumulates. The remaining 87 landed on the technical vocabulary: platform proper nouns, command-line tokens ("kubectl" rendered as "cooper Netty's"), acronyms, domain idioms, and spoken names.

In other words, 96% of the errors come from the 9% of words that carry the technical content. From the perspective of the learner, those are exactly the words they need the captions to get right. A learner who already speaks fluent English will tolerate the 4 generic English errors without thinking; the same learner cannot reconstruct "kubectl" from "cooper Netty's." This is what the WCAG 2.1 AA audit is actually measuring, even when the auditor cites a single 99% number.

What changed in the second pass

The glossary-conditioned pass dropped substitutions from 91 to 9 with a single intervention: a 36-term list of platform-specific vocabulary, fed to the decoder as a soft prompt. The technique is straightforward — Whisper, like most encoder-decoder transcription models, exposes a "previous-text" prompt that conditions the decoder's beam search. When the audio signal is genuinely ambiguous between "kubectl" and "cooper Netty's," the model's posterior over those token sequences gets nudged by what it has just been told to expect. If the glossary contains "kubectl," that spelling wins even when the acoustic signal alone slightly preferred the other.
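With the reference openai-whisper package, that conditioning is a single argument. A minimal sketch of the two passes (the file name and glossary contents here are illustrative, not the exact 36-term list we used):

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("large")

# Pass 1: domain-blind. The decoder leans on its general English prior.
baseline = model.transcribe("onboarding.mp4")

# Pass 2: glossary-conditioned. initial_prompt is injected as previous-text
# context, nudging beam search toward these spellings whenever the audio is
# ambiguous between them and phonetically similar mainstream English.
glossary = "kubectl, kubeconfig, Helm, kustomize, EKS, IAM, Terraform"
biased = model.transcribe("onboarding.mp4", initial_prompt=f"Glossary: {glossary}.")

print(baseline["text"][:200])
print(biased["text"][:200])
```

One caveat: initial_prompt conditions the first decoding window directly, and later windows inherit previously decoded text, so on long files the prompt's influence can fade as the context buffer fills.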

The 9 remaining substitutions in the glossary pass were all in the "domain idioms" and "proper names" categories — the glossary covered the platform vocabulary but not every name spoken on screen. Adding the speaker's first and last name to the glossary would have closed five of those nine errors, putting the run at 99.6%. We left it at 99.2% because it makes the cleaner point: a 36-term glossary, hand-written in twenty minutes, is the difference between failing and passing the audit. See our 7-day ADA Title II sprint plan for the operational steps that follow from this finding.

What "99% accuracy" should mean to your auditor

If you are writing or revising your accessibility statement this quarter, here is a defensible two-bullet template you can drop in. It commits to a measurement protocol, a sample size, and an explicit exclusion list. Auditors notice when a number comes with a method.

- Caption accuracy. All prerecorded video published after 2026-04-24 ships with synchronized captions targeting at least 99% accuracy as defined by the DCMP Captioning Key.
- Measurement. Accuracy is measured per asset on a 10-minute sample drawn from the longest contiguous narration block, scored against a hand-verified reference transcript. Errors counted: substitutions, insertions, deletions, and formatting (speaker ID, sound effect cues, on-screen text). Excluded from the sample: untranscribable audio segments such as ambient music interludes, third-party clips embedded with their own captions, and instructor pauses longer than three seconds.

This is more honest than the bare "99% accurate captions" claim because it tells the reader what you mean. An auditor presented with this paragraph will ask to see one or two sample audits, not the whole library — and a sample audit is a short table like the one in the side-by-side section above.

Things vendors say that should not satisfy you

"Our captions are 99% accurate." Measured how, and on what? If the footnote says WER on LibriSpeech or TED-LIUM, the number was earned on clean mainstream English, not on your domain vocabulary.

"We use the latest model." The side-by-side above used one model for both passes; the 7.8-point gap was decoder conditioning, not model choice.

"Accuracy is industry-standard WER." WER is an average that hides the worst passage, and it excludes the formatting errors (speaker IDs, sound-effect cues, on-screen text) that DCMP counts and auditors check.

One thing to do this week

Pick three of your most-watched training videos. Pull the existing caption file from your LMS (Canvas, Docebo, TalentLMS, Absorb, Kaltura — wherever it lives). Open the captions next to the source transcript or, if no canonical transcript exists, just play the video with captions on. For the first five minutes, write down every wrong word. Sort them by category — was it a generic English error, a technical proper noun, an acronym, or a person's name? If more than half are technical proper nouns or acronyms, you have a glossary problem, not a model problem, and the fix is on the order of half a workday for the first asset and minutes for each subsequent one. Medical-training teams will find a cleaner version of the same pattern: the failing words are drug names, procedure names, and anatomical terms, not generic English.
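If a canonical transcript does exist, the five-minute tally can be scripted rather than done by ear. A rough sketch (file names hypothetical; the category sort stays manual):

```python
from difflib import SequenceMatcher

ref = open("canonical-transcript.txt").read().split()
hyp = open("lms-captions.txt").read().split()  # caption cue text, timestamps stripped

# Print each substitution pair so you can sort them into categories by hand:
# generic English, technical proper noun, acronym, or person's name.
for op, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=hyp, autojunk=False).get_opcodes():
    if op == "replace":
        print(" ".join(ref[i1:i2]), "->", " ".join(hyp[j1:j2]))
```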

If you find the 99% bar is not the issue you thought it was — if your existing vendor genuinely does pass DCMP scoring on your domain content — that is excellent news, and you are done. If you find the gap is large and concentrated where this post predicts, the path forward is not a more expensive vendor; it is a vendor whose pipeline includes glossary-biased decoding by default. Ours does. The Team plan covers 30 hours of video per month at $99 — see pricing, or read the case for why we built this if you want the longer story.

FAQ

Is the DCMP Captioning Key actually the binding standard, or is it "just" guidance?

It is technical guidance, not law. WCAG 2.1 AA is the legally referenced standard (in the 2024 DOJ Title II rule, in Section 508, in the EAA). DCMP is the operational protocol auditors and accessibility coordinators most commonly cite when they need to put a number on "what does sufficient caption quality look like?" In practice, demonstrating DCMP-protocol scoring is how you defend a WCAG 2.1 AA caption claim if challenged. There is no contradiction between the two — DCMP is the measurement, WCAG is the requirement.

Why measure on a 10-minute sample instead of the whole asset?

Two reasons. First, scoring requires a hand-verified reference transcript, and producing one for every asset is impractical at L&D scale. Second, sampled DCMP scoring is the protocol the program actually uses for its own materials — the standard is constructed to work over samples. The risk of sampling is bias toward the easy passages; pulling the longest contiguous narration block (rather than the introduction or the closing) reduces that bias because narration density is where errors concentrate.
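Pulling that block can itself be scripted. A sketch against a WebVTT sidecar file, treating any inter-cue gap longer than three seconds as a block boundary to match the exclusion list in the template above (the timestamp parsing is deliberately naive and assumes hour-prefixed cue times):

```python
import re

def longest_narration_block(vtt_text: str, max_gap: float = 3.0) -> tuple[float, float]:
    """Return (start, end) in seconds of the longest run of caption cues
    with no inter-cue gap longer than max_gap."""
    def secs(ts: str) -> float:
        h, m, s = ts.split(":")
        return int(h) * 3600 + int(m) * 60 + float(s)

    cues = [(secs(a), secs(b)) for a, b in re.findall(
        r"(\d\d:\d\d:\d\d\.\d\d\d) --> (\d\d:\d\d:\d\d\.\d\d\d)", vtt_text)]
    best = block_start = prev_end = None
    for start, end in cues:
        if block_start is None or start - prev_end > max_gap:
            block_start = start  # gap too long: start a new block
        prev_end = end
        if best is None or end - block_start > best[1] - best[0]:
            best = (block_start, end)
    return best
```

Score the first ten minutes of the returned span, or the whole span if it is shorter.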

Can I just hand-correct the auto-captions instead of switching pipelines?

You can, and many teams do. The cost calculation is at the heart of our compliance training captions page: at typical L&D output volumes (20–40 hours a month of new video), hand-correction lands at 1–2 hours per video-hour, which is roughly half an FTE allocated to a task no L&D team is staffed to do. Switching to a pipeline that gets the technical terms right on the first pass cuts the correction work to ~10 minutes per hour of content. The break-even is somewhere around 5 hours of video a month.
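The break-even arithmetic, made explicit. The hourly cost is an assumption, not a figure from this post; the other rates are from the paragraph above and the pricing note earlier:

```python
price = 99          # Team plan, $/month
diy_rate = 1.5      # hours of hand-correction per video-hour (midpoint of 1-2h)
residual = 10 / 60  # hours of spot-checking per video-hour on a glossary-aware pipeline
wage = 15           # assumed fully loaded $/hour for whoever does the cleanup

# Monthly video-hours at which hand-correction costs the same as switching:
break_even = price / ((diy_rate - residual) * wage)
print(f"break-even: {break_even:.1f} video-hours/month")  # ~5.0 at these assumptions
```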

Does Whisper-large run on a CPU, or do I need GPUs to do this?

It runs on CPU; it is just slow. A nine-minute clip on a modern desktop CPU finishes in roughly 4–6 minutes of wall-clock time. On a GPU it is real-time or faster. For a self-hosted pipeline at L&D scale, CPU inference on a modest VPS is workable; for batch caption regeneration of a back-catalog, GPU is the cost-effective option. Either way, glossary-biased decoding does not change runtime materially — it changes what the decoder's beam search is conditioned on, not the model's compute footprint.

What's in your glossary by default, before my terms are added?

Nothing. The base model has no glossary; it has the language priors it picked up in training. The glossary is per-customer, by design — your medical training team does not benefit from a glossary that includes Kubernetes terms, and the engineering onboarding team does not benefit from one that includes drug names. Glossaries pulled from Notion or pasted in become the bias signal for that customer's content only. See our WCAG 2.1 AA reference page for the standard alongside our default workflow.
