Quality Operations · Published 2026-06-26

Caption quality error rate calculator: applying the DCMP Captioning Key formula to compare vendors, track progress, and report to leadership

The phrase “99% accuracy” appears in almost every captioning vendor proposal. It also appears in WCAG 2.1 AA Success Criterion 1.2.2 and the related WCAG prerecorded captions guidance. But “99% accuracy” is not self-defining. A vendor who counts errors differently from DCMP Captioning Key protocol can claim 99% on output that actually scores 91% using a consistent methodology. An L&D team that has never calculated error rate from first principles cannot compare two vendor proposals on a meaningful common basis, cannot set a defensible pass/fail threshold in a QA workflow, and cannot tell leadership whether the programme is above or below the compliance line. This post covers the full calculation: what counts as an error under DCMP Captioning Key, what doesn’t, how to draw a valid sample from a production video library, how to run the formula, how to convert the result to an accuracy percentage, how to use the number to compare vendor quotes, and how to track it in a monthly compliance report. An inline tracking template is included at the end.

TL;DR

Five things you need to know before you calculate caption error rate:

  1. The formula is: WER = (S + D + I) / N × 100 — where S is substitutions (wrong word), D is deletions (missing word), I is insertions (extra word), and N is total captionable words in the reference transcript. Accuracy = 100 − WER. The 99% accuracy threshold means WER ≤ 1.0%.
  2. Not every deviation counts as an error. Trivial punctuation differences, acceptable caption splits, and speaker-identification formatting that doesn’t affect meaning do not count. Errors that affect meaning do count, and that includes timing errors >2 seconds that cause words to appear visibly out of sync with the audio.
  3. Sampling matters as much as the formula. You cannot calculate error rate from a 90-second clip you hand-picked because it looked clean. DCMP protocol requires a minimum 10-minute random sample from content >30 minutes, drawn without cherry-picking, with the reference transcript prepared before you look at the caption output.
  4. Vendors measure differently. Some count only substitution errors. Some use machine-scored WER against a transcript they generated themselves (no ground-truth comparison). Some use a different denominator (total caption words rather than reference words). A vendor claiming 99% with no methodology disclosure has told you nothing you can act on.
  5. The number belongs in your monthly compliance report. Error rate is the one metric that directly maps programme state to regulatory requirement. Tracking it monthly by content category (compliance, technical, general) gives leadership a leading indicator of risk before a learner complaint, an OCR enquiry, or a DOL audit surfaces.

Why the calculation methodology matters more than the headline number

Caption accuracy is not a fixed property of an ASR model or a captioning service. It is a function of three variables: the content being transcribed, the methodology used to measure the output, and the ground-truth reference transcript used for comparison. When a vendor says their accuracy is 99%, they are giving you the result of a specific measurement process applied to a specific corpus. Unless you know what that process was, the number is not comparable to any other vendor’s number, and it may not correspond to what you will actually receive on your content.

The Described and Captioned Media Program (DCMP) Captioning Key establishes the standard evaluation protocol for educational media captions in the United States. It is the methodology most frequently cited by accessibility attorneys, OCR complaint respondents, and compliance programme architects when they need to document caption accuracy. Using a consistent, disclosed methodology — and specifically the DCMP methodology — does three things the headline-percentage approach cannot do.

First, it makes vendor comparison possible. If you evaluate three vendor proposals using the same DCMP protocol on the same test corpus, the scores are directly comparable because the denominator, the error categories, and the sampling method are identical. A vendor that scores 98.6% DCMP on your healthcare compliance content is measurably different from one that scores 95.1% DCMP on the same content, and you can document the difference.

Second, it creates a defensible compliance record. If your organisation ever faces an OCR accessibility complaint, a DOL audit, or an ADA Title II enforcement action, your compliance posture is materially stronger if you can demonstrate that you measured accuracy using the standard protocol rather than accepting a vendor’s self-reported number. The caption QA methodology post covers how to structure the review process; this post covers the calculation that sits at the centre of it.

Third, it enables programme-level tracking. A DCMP error rate calculated monthly across a random sample of your production output is a leading indicator of programme health. It tells you whether accuracy is trending toward or away from the 99% threshold before a learner complaint identifies a specific failure. The caption feedback loop post explains how accuracy compounds over time when a production glossary is maintained correctly — the monthly error rate is the metric that makes that compound effect visible.

The baseline question “what is our current caption accuracy?” cannot be answered without a measurement methodology. This post gives you the methodology, the formula, and the tracking infrastructure to answer it.

Error taxonomy: what counts and what doesn’t

The DCMP Captioning Key defines caption errors in four categories. Every deviation you mark during a scoring session must fall into one of these categories to count toward the error rate. Deviations that fall outside the categories do not count, regardless of how they look to a casual reader.

Category 1: Word errors

Word errors are the largest category and the most straightforward to score. There are three types:

Substitutions (S): A word in the caption track that differs from the corresponding word in the reference transcript. This includes phonetic substitutions (the ASR model heard a word that sounds similar to the correct word and substituted it), vocabulary failures (proper nouns, product names, technical terms, or acronyms transcribed incorrectly), and grammatical substitutions (a wrong verb form, tense, or article that changes meaning). Examples:

Deletions (D): A word in the reference transcript that is absent from the caption track. The word was spoken and captionable but not transcribed. This commonly happens at the end of fast speech, across sentence boundaries, and on unstressed function words that the ASR model drops in the presence of background noise. Examples:

Insertions (I): A word in the caption track that has no corresponding word in the reference transcript. The ASR model added a word that was not spoken. This most often occurs at conversational pauses, across speaker changes, or when background audio is mis-interpreted as speech. Examples:

Category 2: Punctuation errors that affect meaning

Not all punctuation differences count. Trivial punctuation differences — an Oxford comma present in one and absent in another, a semicolon where a comma would also be acceptable, em-dash versus en-dash — do not count. What does count is punctuation that changes the meaning of a sentence or creates a compliance risk. The test is: does this punctuation deviation cause a reasonable viewer to misunderstand what was said?

Examples that count:

Examples that do not count:

Category 3: Speaker identification errors

In multi-speaker content — panel discussions, interview formats, dialogue-heavy compliance scenarios, or any content with more than one clearly distinguishable voice — the caption track must correctly attribute speech to the correct speaker when speaker identity is relevant to meaning. Errors in this category include:

Note: Speaker identification conventions vary by caption style guide. Single-speaker narrated training content typically does not require speaker identification. The error category applies only where speaker identity is relevant to comprehension.

Category 4: Timing errors affecting synchronisation

WCAG 2.1 AA Success Criterion 1.2.2 requires that captions be synchronised with the audio. The DCMP Captioning Key treats a timing error as a captionable-word-level error when the visible caption leads or lags the corresponding audio by more than approximately 2 seconds in a way that a viewer would perceive as out of sync. This threshold is based on the psychoacoustic research that informed WCAG prerecorded caption guidance.

Timing errors are counted differently from word errors. A single timing event (a caption block appearing 3.5 seconds after the words are spoken, for example) counts as one timing error for scoring purposes, regardless of how many words are in the block. This prevents a single audio-processing glitch from generating dozens of word-level errors.
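
The one-event-per-block rule can be sketched in a few lines. This is an illustrative sketch only: the `CaptionBlock` structure and `count_timing_events` function are hypothetical names, the 2-second threshold is the approximate figure given above, and the code cannot make the human judgement of whether a viewer would actually perceive the block as out of sync.

```python
from dataclasses import dataclass

SYNC_THRESHOLD_S = 2.0  # approximate threshold described in the text


@dataclass
class CaptionBlock:
    audio_start_s: float    # when the words are actually spoken
    caption_start_s: float  # when the caption block appears on screen
    word_count: int         # recorded for reference; not used in the count


def count_timing_events(blocks, threshold_s=SYNC_THRESHOLD_S):
    """Count one timing error per out-of-sync block, whatever its word count."""
    return sum(
        1 for b in blocks
        if abs(b.caption_start_s - b.audio_start_s) > threshold_s
    )


blocks = [
    CaptionBlock(audio_start_s=10.0, caption_start_s=10.4, word_count=9),
    CaptionBlock(audio_start_s=15.0, caption_start_s=18.5, word_count=12),  # 3.5 s late
    CaptionBlock(audio_start_s=21.0, caption_start_s=19.5, word_count=7),   # 1.5 s early
]
timing_errors = count_timing_events(blocks)
print(timing_errors)  # 1
```

The 12-word block that arrives 3.5 seconds late contributes 1 to T, not 12.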

What does not count as an error

The following deviations are explicitly excluded from error counting under DCMP Captioning Key protocol:

Sampling methodology: how to draw a valid sample

The most common mistake in caption quality measurement is not a formula error. It is a sampling error. An organisation that measures accuracy on a 90-second clip the QA reviewer chose because it looked clean — or on the same three demonstration modules used during the vendor sales cycle — has not measured caption accuracy. It has measured those specific clips. The DCMP Captioning Key protocol specifies sampling requirements precisely because selection bias in caption QA is pervasive and because the content types most likely to produce errors are also the content types QA reviewers are least likely to volunteer for a demonstration.

Minimum sample size by content duration

Content duration | Minimum sample size | Sampling method
Under 10 minutes | Full content | Score entire video; no sampling needed
10–30 minutes | Full content or 10 minutes minimum | If sampling, use a random start-point generator; avoid first 60 seconds and final 60 seconds
30 minutes to 2 hours | 10 minutes minimum; 15 minutes recommended | Two randomly drawn 5-minute segments from different thirds of the video; avoid intro and outro
Over 2 hours | 20 minutes minimum | Four randomly drawn 5-minute segments, one from each quartile; avoid intro and outro
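
A random start-point generator for the table above might look like the following sketch. The function name `draw_sample` is hypothetical. Two things are assumptions rather than anything the table states: "intro and outro" is treated as 60 seconds at each end for every tier, and content shorter than 12 minutes is scored in full because a 10-minute window cannot avoid both edges.

```python
import random

EDGE_MIN = 1      # assumed: skip the first and final 60 seconds in every tier
SEGMENT_MIN = 5   # segment length used for content over 30 minutes


def draw_sample(duration_min, rng):
    """Return a list of (start, end) minute pairs to score."""
    if duration_min < 10 + 2 * EDGE_MIN:
        # No room for a 10-minute window clear of both edges: score everything.
        return [(0.0, float(duration_min))]
    if duration_min <= 30:
        start = rng.uniform(EDGE_MIN, duration_min - EDGE_MIN - 10)
        return [(start, start + 10)]
    # Longer content: thirds (pick two) up to 2 hours, quartiles (pick all four) beyond.
    strata, picks = (3, 2) if duration_min <= 120 else (4, 4)
    width = (duration_min - 2 * EDGE_MIN) / strata
    segments = []
    for k in sorted(rng.sample(range(strata), picks)):
        stratum_start = EDGE_MIN + k * width
        start = rng.uniform(stratum_start, stratum_start + width - SEGMENT_MIN)
        segments.append((start, start + SEGMENT_MIN))
    return segments


rng = random.Random(2026)  # fixed seed so the draw can be reproduced in an audit
for minutes in (8, 22, 60, 180):
    print(minutes, [(round(a, 1), round(b, 1)) for a, b in draw_sample(minutes, rng)])
```

Recording the seed alongside the score is one way to show later that the segment was not hand-picked.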

For programmatic QA of a vendor batch — evaluating an entire month’s caption production, for example — the correct approach is stratified random sampling across content types and source subjects, not cherry-picking a representative sample. Draw 5–10% of a batch randomly, with the sample proportionally distributed across compliance content, technical content, and soft-skills content in the same ratio as the overall batch.
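
As a rough sketch of that stratified draw: the function name, the module IDs, and the batch of 100 modules below are all hypothetical. Rounding each stratum up and taking at least one module per content type are assumptions made here so that no category goes unmeasured; they are not rules stated by the protocol.

```python
import math
import random
from collections import defaultdict


def stratified_sample(modules, percent, rng):
    """Randomly draw about `percent`% of a batch within each content type.

    `modules` is a list of (module_id, content_type) pairs.
    """
    by_type = defaultdict(list)
    for module_id, content_type in modules:
        by_type[content_type].append(module_id)

    sample = {}
    for content_type, ids in sorted(by_type.items()):
        # Round up, and never let a content type drop out of the sample.
        k = max(1, math.ceil(len(ids) * percent / 100))
        sample[content_type] = sorted(rng.sample(ids, k))
    return sample


# Hypothetical month's batch: 60 compliance, 25 technical, 15 soft-skills modules.
modules = (
    [(f"COMP-{n:04d}", "compliance") for n in range(60)]
    + [(f"TECH-{n:04d}", "technical") for n in range(25)]
    + [(f"SOFT-{n:04d}", "soft-skills") for n in range(15)]
)
picked = stratified_sample(modules, percent=10, rng=random.Random(7))
print({content_type: len(ids) for content_type, ids in picked.items()})
# {'compliance': 6, 'soft-skills': 2, 'technical': 3}
```

Because the draw happens inside each content type, the sample keeps roughly the same mix as the batch.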

Preparing the reference transcript before scoring

This step is where most in-house QA processes fail. The reference transcript — the word-perfect written record of what was said in the audio, used as the ground truth for error comparison — must be prepared before you look at the caption output. If you prepare the reference transcript while looking at the captions, you are unconsciously anchoring to the caption wording, which causes you to undercount errors. The correct sequence is:

  1. Select the sample segment using the random method above.
  2. Listen to the audio (or watch with captions hidden) and produce the reference transcript verbatim. Include every spoken word, including false starts and filler words, even if the captioning style guide requires them to be omitted from the captions (so you can verify they were omitted correctly). Mark any genuinely unintelligible segments.
  3. Count the total captionable words in the reference transcript. This is your N in the formula. Exclude unintelligible segments and any intentionally omitted elements (filler words, legal-disclaimer speed-reads).
  4. Now compare the reference transcript to the caption output, marking each deviation according to the error taxonomy above.
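
For the word-error part of step 4, a standard minimum-edit-distance alignment can produce a first-pass tally of S, D, and I for the scorer to review. The sketch below is a scorer's aid, not the DCMP scoring procedure itself. The function name is hypothetical, tokens are split on whitespace and compared exactly (case and punctuation included), and the meaning-based categories (punctuation, speaker identification, timing) still need human judgement.

```python
def tally_word_errors(reference, caption):
    """Align two transcripts word by word and return (S, D, I)."""
    ref, hyp = reference.split(), caption.split()
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            diff = 0 if ref[i - 1] == hyp[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + diff,  # match or substitution
                cost[i - 1][j] + 1,         # deletion: reference word missing
                cost[i][j - 1] + 1,         # insertion: extra caption word
            )
    # Walk back from the corner, counting each error type on the way.
    s = d = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = int(i > 0 and j > 0 and ref[i - 1] != hyp[j - 1])
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            s += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            d += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return s, d, ins


reference = "PHI may be disclosed only for treatment purposes under HIPAA"
caption = "FHI may be disclosed for treatment purposes under the HIPAA"
print(tally_word_errors(reference, caption))  # (1, 1, 1)
```

When several alignments have the same total cost, the split between S, D, and I can differ while the sum stays the same, which is one more reason to treat the output as a draft tally.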

The reference transcript preparation step takes 3–4 hours of human time per 10 minutes of audio for a careful scorer. This is the real cost of rigorous caption QA. The caption vendor accuracy evaluation methodology post covers how to construct a diagnostic test corpus that makes the reference transcript preparation investment efficient by concentrating it on content types that expose the widest accuracy variance between vendors.

Which content to sample

For ongoing production QA (monthly tracking rather than vendor evaluation), the sample should reflect your actual content mix. If your library is 60% compliance content, 25% technical training, and 15% soft-skills, your monthly sample should draw from those categories in the same proportions. This matters because accuracy varies substantially by content type: a service or model that performs at 98% on soft-skills content may perform at 87% on the same organisation’s compliance terminology. A sample drawn exclusively from soft-skills will overestimate accuracy on the full library.

For the initial error-rate baseline when you are setting up a tracking programme, sample deliberately across your hardest content types first. This gives you the worst-case accuracy figure, which is the figure that determines your actual compliance exposure. The why 99% caption accuracy matters post quantifies what it means for a learner to encounter a caption track at 91% accuracy on a compliance module where the specific vocabulary matters for behaviour change and liability documentation.

The DCMP Captioning Key formula: WER and accuracy conversion

Once you have a reference transcript with the word count N and a completed error-count tally of substitutions S, deletions D, and insertions I, the calculation is two steps.

Step 1: Word Error Rate (WER)

WER (%) = (S + D + I) / N × 100

  • S = number of substitution errors (wrong word)
  • D = number of deletion errors (missing word)
  • I = number of insertion errors (added word)
  • N = total captionable words in the reference transcript

The denominator N is total captionable words in the reference transcript, not total words in the caption output. This distinction matters. A caption track with many insertions will have more total words than the reference transcript. Using caption word count as the denominator artificially inflates N and understates WER. The reference transcript word count is the correct denominator.
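
A small numeric sketch shows the size of the effect. The figures are invented for illustration (a 1,400-word reference with 10 substitutions, 5 deletions, and 60 insertions), and `wer_percent` is a hypothetical helper name.

```python
def wer_percent(errors, denominator):
    return errors / denominator * 100


reference_words = 1_400
substitutions, deletions, insertions = 10, 5, 60

# Words on screen: the reference, minus dropped words, plus spurious ones.
caption_words = reference_words - deletions + insertions   # 1,455
errors = substitutions + deletions + insertions            # 75

correct_wer = wer_percent(errors, reference_words)
understated_wer = wer_percent(errors, caption_words)
print(f"reference denominator: {correct_wer:.2f}%")      # 5.36%
print(f"caption denominator:   {understated_wer:.2f}%")  # 5.15%
```

The same 75 errors look about a fifth of a percentage point better when divided by the caption word count.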

Step 2: Accuracy percentage

Accuracy (%) = 100 − WER

The WCAG 2.1 AA requirement for caption accuracy is interpreted under DCMP Captioning Key as a maximum WER of 1.0%, which corresponds to an accuracy of 99.0% or higher. A WER of 1.0% on a 10-minute sample with 1,400 reference words means 14 errors or fewer total (across substitutions, deletions, insertions, and timing errors combined).

Handling timing errors in the formula

Timing errors counted during the sample scoring session are added to the total error count (S + D + I + T) in the numerator, where T is the number of timing events exceeding the 2-second synchronisation threshold. Each timing event counts as 1 toward the numerator, regardless of the number of words in the affected caption block.
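
Putting both steps and the timing term together, a minimal sketch of the calculation could look like this. The function names are hypothetical, and the tally used in the usage line (6 substitutions, 3 deletions, 1 insertion over 1,400 reference words) is illustrative.

```python
def dcmp_wer(s, d, i, n, t=0):
    """WER (%) = (S + D + I + T) / N x 100, with N from the reference transcript."""
    if n <= 0:
        raise ValueError("N must be a positive reference word count")
    return (s + d + i + t) / n * 100


def accuracy(wer):
    """Accuracy (%) = 100 - WER."""
    return 100 - wer


wer = dcmp_wer(s=6, d=3, i=1, n=1_400)  # 10 errors, no timing events
print(f"WER {wer:.2f}%  accuracy {accuracy(wer):.2f}%")  # WER 0.71%  accuracy 99.29%
```

Keeping T as a separate argument makes it easy to report word errors and timing events side by side while still combining them in the numerator.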

In practice, for well-produced professional captioning output, timing errors are rare. They are most common in auto-generated captions from LMS-native engines with poor audio pre-processing, in live-caption archives that have been post-processed without timing correction, and in format conversions (SRT to TTML, for example) where the conversion script introduced a systematic offset.

Worked example: calculating error rate on a 10-minute compliance module

The following example walks through a complete calculation on a fictional 10-minute HIPAA compliance training module. The figures are representative of what a mid-range professional captioning service without a domain-specific glossary would produce on healthcare compliance content.

Setup

Content: 10-minute HIPAA Privacy Rule training module for a hospital system’s nursing staff onboarding programme. Audio quality: standard conference-room recording, single speaker (female, regional US accent), minimal background noise, medium pacing.

Reference transcript word count (N): 1,380 words. (10-minute modules at medium pacing typically run 1,200–1,600 words depending on the presenter’s delivery rate and pause frequency.)

Error tally

Error type | Count | Examples
Substitutions (S) | 18 | PHI → FHI (×4), HIPAA → HIPA (×2), covered entity → cover entity (×3), minimum necessary → minimal necessary (×2), de-identification → D identification (×2), Notice of Privacy Practices → Notice of Privacy Practice (×3), BAA → be a (×2; the spelled-out “business associate agreement” was transcribed correctly)
Deletions (D) | 7 | Missing “the” before regulatory citations (×3), missing “for” in “required for written authorisation” (×2), missing “not” in a negative obligation statement (×1) [high-severity: changes meaning], missing “only” in “disclosed only for treatment purposes” (×1) [high-severity]
Insertions (I) | 3 | “the” inserted before a proper noun (×2), “is” inserted at a speaker pause boundary (×1)
Timing errors (T) | 1 | Single caption block appearing 2.8 seconds after the corresponding audio during a slide transition
Total errors | 29 |

Calculation

WER = (18 + 7 + 3 + 1) / 1,380 × 100 = 29 / 1,380 × 100 = 2.10%

Accuracy = 100 − 2.10 = 97.90%

Interpretation

This result (97.90%) is below the WCAG 2.1 AA threshold of 99.0%. A caption track at this accuracy level on a HIPAA compliance module requires correction before publication if the organisation’s compliance programme mandates 99% DCMP accuracy. The error count of 29 over a 10-minute module means a learner will encounter approximately 3 errors per minute on average (29 / 10 = 2.9), a rate that DCMP research has found sufficient to disrupt comprehension for viewers who rely on captions as their primary means of access.

At the DCMP correction rate of 4× real-time (a trained reviewer spending 40 minutes correcting a 10-minute track), the correction labour cost for this module at an L&D coordinator fully-loaded rate of $45/hour is $30. The hidden half-FTE cost post documents how this correction labour accumulates across a library of 300 modules and why organisations that treat auto-generated captions as “close enough” absorb more labour cost than the cost of a professional service that delivers at 99%.
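
The arithmetic behind the $30 figure can be sketched as follows. The function name is hypothetical. The library-wide line is a deliberately simplified assumption (every one of the 300 modules runs 10 minutes and needs a full correction pass), included only to show how the per-module cost scales.

```python
def correction_cost(track_minutes, hourly_rate, correction_multiple=4):
    """Labour cost of correcting a track at `correction_multiple` x real-time."""
    correction_minutes = track_minutes * correction_multiple
    return correction_minutes * hourly_rate / 60


per_module = correction_cost(track_minutes=10, hourly_rate=45)
print(per_module)  # 30.0

# Simplifying assumption: 300 modules, all 10 minutes long, all needing correction.
library_cost = 300 * per_module
print(library_cost)  # 9000.0
```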

The two high-severity deletion errors in this example, the missing “not” and the missing “only”, are separately significant. Both change the meaning of a regulatory obligation statement: the missing “not” reverses the obligation, and the missing “only” removes a restriction. A learner accessing the module through captions who reads “PHI may be disclosed for treatment purposes” instead of “PHI may be disclosed only for treatment purposes” receives a factually incorrect statement about a regulatory obligation. This category of error (word deletion in a regulatory obligation sentence) is not merely a quality issue but a potential liability documentation failure if the organisation uses caption-based access as evidence of training completion.

The 99% accuracy threshold: what it requires in practice

The 99% threshold is not an industry convention. It is the accuracy level at which DCMP research found that caption users — specifically Deaf and hard-of-hearing viewers using captions as their primary communication channel — could access educational content without significant comprehension impairment. Below 99%, comprehension begins to degrade in proportion to error rate and error type. The 99% accuracy post covers the research in detail; this section focuses on what 99% means as a calculation target.

What 1% error rate allows

On a 10-minute sample at 1,400 reference words, a 1.0% WER allows 14 total errors across all four categories combined (S + D + I + T ≤ 14). This sounds permissive, but 14 errors distributed across 10 minutes of audio is approximately 1.4 errors per minute — still detectable by a viewer relying on captions for access.

Module duration | Approx. word count (N) | Maximum errors at 99% (WER ≤ 1%)
5 minutes | 700 | 7
10 minutes | 1,400 | 14
15 minutes | 2,100 | 21
30 minutes | 4,200 | 42
60 minutes | 8,400 | 84

These word count estimates assume a moderate pacing of 140 words per minute, which is typical for professional narration in training video production. Faster presenters (160–180 wpm) produce higher word counts, so the absolute number of errors allowed for the same duration rises in proportion; the 1% ceiling itself does not change.
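
The allowance column can be reproduced for any duration and pacing with a short helper. This is a sketch: `max_errors` is a hypothetical name, and it uses integer arithmetic so that the result is always the largest whole error count that keeps WER at or below the ceiling.

```python
def max_errors(duration_min, wpm=140, max_wer_percent=1):
    """Largest whole error count that keeps WER at or below the ceiling."""
    n = duration_min * wpm              # approximate reference word count
    return n * max_wer_percent // 100   # floor of N x ceiling


for minutes in (5, 10, 15, 30, 60):
    print(minutes, minutes * 140, max_errors(minutes))
# 5 700 7
# 10 1400 14
# 15 2100 21
# 30 4200 42
# 60 8400 84
```

In practice N should come from the actual reference transcript; the words-per-minute estimate is only useful for planning before a transcript exists.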

Content type affects the threshold sensitivity

A 99% threshold on soft-skills content (communication training, general leadership development) means something different from 99% on HIPAA compliance training, Series 7 exam preparation, or OSHA hazard communication instruction. In soft-skills content, most of the words in any given sentence are high-frequency general-vocabulary words that the viewer can reconstruct from context even if the caption is wrong. In compliance content, a single erroneous word — particularly a deleted negation or a substituted regulatory term — can reverse the meaning of a mandatory obligation statement.

For this reason, many caption compliance programmes apply a tiered accuracy threshold: 99% accuracy (WER ≤ 1%) as the minimum across all content, with a supplemental requirement for zero meaning-reversing errors in compliance content (no deleted negations, no substituted regulatory citations). A caption track that scores 99.2% DCMP accuracy but contains one instance of a missing “not” before a mandatory obligation still fails the supplemental compliance-content standard.
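
A tiered rule of this kind is simple to encode in a QA checklist or tracker. The sketch below uses hypothetical names, and it assumes the scorer has already flagged meaning-reversing errors by hand during the scoring session.

```python
def passes_tiered_threshold(accuracy_percent, content_type, meaning_reversing_errors):
    """Apply the 99% floor, plus the zero-tolerance rule for compliance content."""
    if accuracy_percent < 99.0:
        return False
    if content_type == "compliance" and meaning_reversing_errors > 0:
        return False
    return True


# 99.2% accurate, but one deleted "not" in a compliance module: still a fail.
print(passes_tiered_threshold(99.2, "compliance", meaning_reversing_errors=1))   # False
print(passes_tiered_threshold(99.2, "soft-skills", meaning_reversing_errors=1))  # True
```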

Using error rate to compare vendor proposals

The standard captioning vendor RFP process produces proposals that include accuracy commitments phrased in ways that are not directly comparable to each other or to DCMP protocol. Common phrasings include: “99% accuracy on professionally produced audio”; “98% or better on standard English audio”; “99% word accuracy on clear, single-speaker audio”; “accuracy guaranteed for audio meeting our quality requirements.” Each of these phrasings contains scope limitations and measurement disclaimers that can substantially affect what accuracy figure you actually receive on your content.

The correct approach is to conduct your own evaluation using DCMP Captioning Key protocol on a test corpus you control, rather than relying on vendor-provided accuracy figures. The vendor accuracy evaluation post covers how to construct the test corpus; this section covers how to use the error-rate results to compare and negotiate.

Comparison table structure

When you evaluate three vendors on the same test corpus using DCMP protocol, your comparison table should include the following fields for each vendor:

Vendor | Soft-skills accuracy | Compliance accuracy | Technical accuracy | Medical accuracy | With glossary | Without glossary | Turnaround SLA | Price/min
Vendor A | 98.8% | 94.2% | 91.6% | 88.3% | 98.9% avg | 93.2% avg | 24h | $1.50
Vendor B | 97.4% | 96.8% | 95.2% | 93.7% | 99.1% avg | 95.8% avg | 48h | $2.10
Vendor C | 99.2% | 97.1% | 96.0% | 89.4% | 98.4% avg | 95.4% avg | 24h | $1.75

This structure reveals differences that a headline accuracy claim obscures. Vendor A has a significant accuracy cliff between soft-skills content (98.8%) and technical/medical content (91.6%/88.3%). If your content library is 40% technical and 20% medical, Vendor A’s average accuracy on your specific content mix is substantially lower than 98.8%. Vendor B performs more evenly across content types, which matters more than a higher peak on soft-skills. Vendor C performs extremely well on soft-skills but has the same medical accuracy cliff as Vendor A.
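
To put a number on "your specific content mix", the per-category scores can be weighted by each category's share of the library. In the sketch below the 40% technical and 20% medical shares come from the example above; the split of the remaining 40% (20% soft-skills, 20% compliance) is an assumption made purely for illustration, as are the names. Weighting by share of modules is an approximation; weighting by reference word count would be closer to a pooled WER.

```python
VENDOR_ACCURACY = {
    "Vendor A": {"soft-skills": 98.8, "compliance": 94.2, "technical": 91.6, "medical": 88.3},
    "Vendor B": {"soft-skills": 97.4, "compliance": 96.8, "technical": 95.2, "medical": 93.7},
    "Vendor C": {"soft-skills": 99.2, "compliance": 97.1, "technical": 96.0, "medical": 89.4},
}

# Library composition in whole percent; the last two shares are assumed.
CONTENT_MIX = {"technical": 40, "medical": 20, "soft-skills": 20, "compliance": 20}


def mix_weighted_accuracy(by_type, mix_percent):
    """Average the per-category accuracy, weighted by library share."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("content mix must sum to 100%")
    return sum(by_type[t] * share for t, share in mix_percent.items()) / 100


for vendor, scores in VENDOR_ACCURACY.items():
    print(f"{vendor}: {mix_weighted_accuracy(scores, CONTENT_MIX):.2f}%")
```

Under this assumed mix, Vendor A comes out at roughly 92.9%, Vendor B at roughly 95.7%, and Vendor C at roughly 95.5%, which reverses the ranking implied by the soft-skills column alone.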

The glossary column is particularly important. The difference between with-glossary and without-glossary accuracy shows how much correction labour you will absorb if the vendor’s glossary tooling fails, if your glossary file is rejected during ingestion, or if a new content run starts before the glossary has been updated with new product names. For Vendor A, the glossary failure mode drops average accuracy by 5.7 percentage points. At 30 modules per month, that is the difference between 1.5 expected correction incidents per month and 9 expected correction incidents per month.

Negotiating accuracy SLAs using the formula

When you negotiate the captioning services agreement, the accuracy SLA should specify the measurement methodology explicitly. A vendor accuracy SLA should contain:

  1. The accuracy threshold: “99% or greater DCMP Captioning Key protocol accuracy on all delivered caption tracks.”
  2. The measurement method: “Accuracy measured using DCMP Captioning Key scoring against a reference transcript prepared before caption output review. Word error rate calculated as (S + D + I + T) / N × 100 where N is total captionable words in the reference transcript.”
  3. The remediation trigger: “Any delivered caption track scoring below 97% DCMP accuracy on Client’s spot-check review is subject to free re-delivery within [X] business days.”
  4. The monthly audit right: “Client may conduct DCMP Captioning Key spot-checks on up to 5% of delivered caption tracks per calendar month. Results of such checks shall be shared with Vendor within 10 business days of completion.”

The vendor SLA contract review checklist post documents the full set of clauses an L&D team should negotiate before signature; this accuracy measurement clause is the most important single clause in the agreement.

Accuracy-adjusted cost comparison

The per-minute price alone is not the correct cost metric when comparing vendor proposals. The correct metric is accuracy-adjusted total cost: the per-minute price plus the correction labour cost required to bring the delivered track to 99% DCMP accuracy on your content type mix.

Correction labour formula: expected errors to correct = words × excess WER, where words is the word count of the content (duration × words per minute) and excess WER is the gap between delivered accuracy and 99%. Correction time = expected errors / correction rate, where the correction rate is the number of errors a reviewer resolves per unit of time (the DCMP benchmark of 4× real-time works out to 40 minutes of correction per 10 minutes of content at 2% WER). Correction cost = correction time × coordinator hourly rate.

Example at 140 words/minute, a vendor delivering 96% accuracy on technical content, coordinator at $45/hour:

The cheaper vendor is 8.7× more expensive in total cost on technical content. This is an extreme example because the 96% delivery accuracy on technical content is low, but it illustrates why the accuracy-adjusted cost calculation often inverts the apparent price ranking of vendor proposals.

Monthly tracking template

The following template supports monthly caption accuracy tracking for an L&D compliance programme. It is designed to generate a single-page dashboard that a programme manager can present to the CHRO, CLO, or accessibility coordinator without requiring the recipient to understand DCMP methodology in detail.

Each row in the tracker represents one scored sample from the month’s production output. The summary rows at the bottom aggregate to the programme-level accuracy figure for the period.

Row-level tracking (one row per scored sample)

Date | Module ID | Content type | Duration (min) | Sample start (min) | Sample end (min) | N (ref words) | S | D | I | T | Total errors | WER (%) | Accuracy (%) | Pass / Fail | Action required
2026-06-02 | COMP-0147 | Compliance (HIPAA) | 12 | 02:00 | 12:00 | 1,380 | 18 | 7 | 3 | 1 | 29 | 2.10% | 97.90% | FAIL | Return to vendor for re-delivery
2026-06-05 | LEAD-0038 | Soft-skills (Leadership) | 8 | 00:00 | 08:00 | 1,120 | 4 | 2 | 1 | 0 | 7 | 0.63% | 99.37% | PASS | None
2026-06-09 | TECH-0092 | Technical (Engineering) | 22 | 06:00 | 16:00 | 1,400 | 31 | 5 | 2 | 0 | 38 | 2.71% | 97.29% | FAIL | Escalate: glossary not applied (engineering terms)
2026-06-12 | COMP-0151 | Compliance (OSHA) | 15 | 03:00 | 13:00 | 1,400 | 6 | 3 | 1 | 0 | 10 | 0.71% | 99.29% | PASS | None
2026-06-19 | MED-0014 | Medical (Clinical procedures) | 18 | 04:00 | 14:00 | 1,350 | 27 | 6 | 4 | 2 | 39 | 2.89% | 97.11% | FAIL | Return to vendor; check medical glossary version

Summary rows (monthly aggregation for leadership reporting)

Content type | Samples scored | Total ref words (N) | Total errors | WER (%) | Accuracy (%) | Pass rate | vs. target (99%)
Compliance | 2 | 2,780 | 39 | 1.40% | 98.60% | 50% | −0.40 pp
Technical | 1 | 1,400 | 38 | 2.71% | 97.29% | 0% | −1.71 pp
Soft-skills | 1 | 1,120 | 7 | 0.63% | 99.37% | 100% | +0.37 pp
Medical | 1 | 1,350 | 39 | 2.89% | 97.11% | 0% | −1.89 pp
All content | 5 | 6,650 | 123 | 1.85% | 98.15% | 40% | −0.85 pp
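
The summary rows are pooled figures: errors and reference words are summed per content type before dividing, rather than averaging the per-sample percentages. A sketch of that aggregation, using the five tracker rows above, is shown here. The function names and dictionary keys are hypothetical, and rounding half-up to two decimals is an assumption chosen to match the figures in the tables.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

TARGET = Decimal("99.00")

# (module_id, content_type, N, S, D, I, T) copied from the tracker rows above
ROWS = [
    ("COMP-0147", "Compliance", 1380, 18, 7, 3, 1),
    ("LEAD-0038", "Soft-skills", 1120, 4, 2, 1, 0),
    ("TECH-0092", "Technical", 1400, 31, 5, 2, 0),
    ("COMP-0151", "Compliance", 1400, 6, 3, 1, 0),
    ("MED-0014", "Medical", 1350, 27, 6, 4, 2),
]


def wer_percent(errors, n):
    """WER in percent, rounded half-up to two decimals."""
    ratio = Decimal(errors * 100) / Decimal(n)
    return ratio.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def summarise(rows):
    """Pool errors and reference words per content type, plus an overall row."""
    totals = defaultdict(lambda: {"samples": 0, "n": 0, "errors": 0, "passes": 0})
    for _module_id, content_type, n, s, d, i, t in rows:
        errors = s + d + i + t
        passed = errors * 100 <= n  # WER <= 1%, i.e. accuracy >= 99%
        for key in (content_type, "All content"):
            totals[key]["samples"] += 1
            totals[key]["n"] += n
            totals[key]["errors"] += errors
            totals[key]["passes"] += int(passed)

    summary = {}
    for key, g in totals.items():
        wer = wer_percent(g["errors"], g["n"])
        acc = Decimal(100) - wer
        summary[key] = {
            "samples": g["samples"],
            "n": g["n"],
            "errors": g["errors"],
            "wer": wer,
            "accuracy": acc,
            "pass_rate": g["passes"] * 100 // g["samples"],
            "vs_target": acc - TARGET,
        }
    return summary


summary = summarise(ROWS)
for content_type, row in summary.items():
    print(content_type, row["wer"], row["accuracy"], f'{row["pass_rate"]}%', row["vs_target"])
```

Pooling matters when samples differ in length: a short clean sample and a long poor one should not carry equal weight in the programme-level figure.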

How to present this to leadership

The leadership reporting format for the monthly accuracy tracker should be one page, not a spreadsheet. The caption compliance reporting post covers the full reporting structure, but the accuracy section specifically should contain:

  1. Programme-level accuracy vs target: A single sentence — “Programme-wide DCMP accuracy for June 2026 was 98.15% (target: 99.0%); pass rate was 40% (2 of 5 scored samples).”
  2. Content type breakdown: Two sentences identifying which content types are above target and which are below.
  3. Root cause for below-target results: One sentence per failing category identifying the root cause from the tracker action column: “Technical content failures traced to glossary not applied (engineering terms); medical content failures traced to outdated medical glossary version.”
  4. Action taken or planned: One sentence per failure identifying whether the modules were returned to vendor for re-delivery, whether a glossary update was triggered, or whether a vendor escalation is pending.
  5. Trend vs prior month: One sentence noting whether programme accuracy improved or declined from the prior month’s figure.

The programme-level tracking connects directly to the annual caption programme review framework, where the 12-month accuracy trend across all content categories is one of the six agenda items. An organisation that tracks error rate monthly using this template will have 12 data points for the annual review rather than a single self-reported vendor figure.

Eight failure modes in error rate measurement

These are the most common ways caption QA programmes produce misleading error rate results. Each failure mode produces accuracy figures that are higher than the true figure on your content, which means the compliance exposure is higher than the tracking programme shows.

  1. Reference transcript prepared after looking at the caption output

    Anchoring to the caption wording when writing the reference transcript causes the scorer to unconsciously write the caption’s words rather than the audio’s words. This is the most common single failure mode. The reference transcript must be prepared from audio alone, before the caption output is opened.

  2. Sample drawn from content that was manually reviewed before scoring

    If the vendor or the L&D team reviewed and corrected the caption track before submitting it for QA scoring, the score reflects the corrected track, not the delivered track. Vendor QA scoring must be done on the track as delivered, before any internal review or correction step.

  3. Sample drawn exclusively from soft-skills content in a mixed-content library

    A programme with 40% technical content that runs all its QA samples on leadership and communication training will overestimate average accuracy by 3–7 percentage points depending on the content type accuracy differential. Sampling must be stratified by content type in proportion to the library composition.

  4. Using vendor-supplied accuracy figures without independent verification

    Vendor-reported accuracy figures are marketing claims, not DCMP protocol measurements. A vendor that uses machine-scored WER against its own ASR transcript (rather than a human-prepared reference transcript) is reporting a lower-bound error count that systematically misses proper noun failures and contextual substitutions. Independent DCMP scoring on your content is the only source of an actionable accuracy figure.

  5. Counting only substitution errors and omitting deletions

    Some in-house QA processes identify substitution errors because they are easy to spot (wrong word in the caption) but miss deletion errors because a missing word is harder to notice than a wrong word. Deletions are frequently the most consequential error type in compliance content because a missing negation or qualifier reverses the meaning of a regulatory obligation statement. The scoring process must actively look for deletions by reading the reference transcript word by word and verifying that each word appears in the caption output.

  6. Not counting proper noun substitutions as errors because the substituted word “sounds like” the correct word

    A scorer who reads “PHI” in the reference transcript and “FHI” in the caption and thinks “close enough, the learner will understand” is applying a comprehension-inference standard that is explicitly excluded from DCMP Captioning Key methodology. Every substitution that produces a different word is an error, regardless of phonetic similarity. A captioning programme serves Deaf viewers, who cannot use the audio to resolve the ambiguity, so phonetic proximity cannot justify an incorrect transcription.

  7. Using the caption word count as the denominator instead of the reference word count

    As noted in the formula section: the denominator N is always the reference transcript word count. Using caption word count as the denominator inflates N in tracks with many insertions, producing an artificially lower WER. The error is particularly consequential when comparing vendors, because a vendor that produces more insertions will appear to have a lower WER than a vendor with fewer insertions if caption word count is used as the denominator.

  8. Treating a single batch evaluation as an ongoing measurement

    A QA exercise conducted once during vendor onboarding does not tell you what accuracy you are receiving in month 8 of the contract. Accuracy typically degrades over time if the vendor’s acoustic model is not updated with new content types, if the glossary is not maintained as the organisation’s product and terminology evolve, or if the vendor’s production team changes. Monthly spot-check sampling is the minimum frequency for a compliance programme to maintain a current accuracy figure. The annual review framework documents how to use 12 months of monthly accuracy figures to identify drift before it becomes a compliance event.
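Failure modes 5 and 7 are easy to see on a toy example. The following is a minimal Python sketch, not a DCMP scoring tool: it aligns a reference sentence against a caption with a standard word-level edit-distance alignment, counts substitutions, deletions, and insertions separately, and then computes WER with both denominators. The function names (tokenize, count_errors), the tokenisation rule, and the example sentences are assumptions made for illustration. A mechanical count like this treats every word-level deviation equally and ignores timing, so it does not replace human scoring.

```python
import re


def tokenize(text):
    # Assumption: lowercase and drop punctuation so trivial differences are ignored.
    return re.findall(r"[a-z0-9']+", text.lower())


def count_errors(reference, caption):
    """Return (S, D, I, N) from a minimum-edit word alignment.

    N is the reference word count, which is the WER denominator.
    """
    ref, hyp = tokenize(reference), tokenize(caption)
    # cost[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = i
    for j in range(1, len(hyp) + 1):
        cost[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diagonal = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back through the table to label each edit.
    s = d = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            s += mismatch          # substitution if the words differ, else a match
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            d += 1                 # reference word missing from the caption
            i -= 1
        else:
            ins += 1               # extra word in the caption
            j -= 1
    return s, d, ins, len(ref)


# Hypothetical example: a dropped negation, a proper-noun error, and filler words.
reference = "Employees must not disclose PHI without written authorization."
caption = "Employees must disclose FHI without written authorization okay so yeah"

s, d, ins, n = count_errors(reference, caption)
errors = s + d + ins
wer_reference = errors / n * 100                      # correct: reference word count
wer_caption = errors / len(tokenize(caption)) * 100   # failure mode 7: caption word count
```

On this toy pair the alignment finds one substitution, one deletion (the dropped “not”), and three insertions against eight reference words. The reference denominator gives a WER of 62.5%, while the caption denominator gives 50.0% for the same five errors, because the insertions enlarge the caption word count.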

FAQ

Can I use an automated speech recognition tool to score caption accuracy instead of a human scorer?
Not as the primary measurement tool for a DCMP Captioning Key compliance score. Automated WER tools (typically used in ASR research) compare two text strings using Levenshtein distance, which counts every word-level deviation equally and has no concept of “meaning-affecting” versus “non-meaning-affecting” errors. They also cannot detect timing errors. Automated scoring is useful as a rapid first-pass screening tool to identify which caption tracks in a large batch are likely to fail human scoring (flagging tracks above 3% automated WER for human review, for example), but it cannot substitute for DCMP human scoring in a compliance programme. Some captioning services use automated WER internally for production QA; this is a legitimate use, but the figure they report is not equivalent to a DCMP Captioning Key score.
What is the difference between WER (word error rate) and CER (character error rate)?
Word error rate counts errors at the word level: a substitution of “Privacy Role” for “Privacy Rule” is one substitution error, regardless of how many characters in the word are wrong. Character error rate counts at the character level: the same substitution would count as 1 character error (“o” substituted for “u”; the rest of the word is unchanged). DCMP Captioning Key uses word error rate. CER is used in ASR research to distinguish between model architectures; it is not the standard for caption compliance evaluation.
How should I handle content with significant non-English vocabulary, accents, or bilingual segments?
For content that is primarily English but contains terminology from another language (Spanish safety labels, French product names, Japanese technical specifications), those terms are captionable words and failures to transcribe them correctly count as substitution errors. The reference transcript must include the correct form of the term in whatever language it was spoken. For genuinely bilingual content where a segment switches to another language, the DCMP Captioning Key methodology applies to each language independently within its segment. The reference transcript for the non-English segment must be prepared by a fluent speaker of that language. If no fluent scorer is available for the non-English segment, it should be marked as out-of-scope for the WER calculation with a note in the tracker, and the WER reported as applying only to the English-language portion of the sample.
Our vendor’s contract says “99% accuracy on clear audio with a single speaker.” How do I interpret this?
The phrase “clear audio with a single speaker” is a scope limitation that excludes a substantial portion of real-world training video content from the accuracy guarantee. Content with mild background noise, accented English presenters, multi-speaker scenarios, or technical vocabulary that the vendor’s general ASR model handles poorly may not qualify as “clear audio” under the vendor’s definition, which the vendor controls and which is rarely documented in the contract. The correct contractual language specifies the methodology (“DCMP Captioning Key protocol accuracy”), the threshold (“99% or greater”), and the measurement right (“as measured by Client’s independent DCMP spot-check”). The contract review checklist post includes the full clause language for negotiation.
How many samples do I need each month for the tracking to be statistically valid?
At a production volume of 30–50 modules per month, a 5-sample monthly spot-check (as illustrated in the tracking template above) provides a reasonable basis for trend detection but not high statistical confidence in the exact accuracy figure. For a larger library (100+ modules/month), a 10% random sample gives you 10+ scored modules per month, which is sufficient for content-type disaggregation at meaningful confidence. For a small production volume (fewer than 20 modules/month), score every module in the highest-risk content category (compliance, medical, technical) and sample randomly from the remaining categories. The goal is not statistical precision but early warning: a monthly tracker that catches a new failure mode (glossary drift, new vendor team, new content type entering production) before it accumulates into a significant compliance exposure.
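The monthly draw can be sketched in a few lines of Python. This is one possible approach, assuming the library is already tagged by content type: it allocates the month's sample slots in proportion to each content type's share of the library (the stratification that failure mode 3 calls for) and then picks modules at random within each type. The function names, the largest-remainder rounding rule, and the module IDs are illustrative assumptions, not part of DCMP protocol.

```python
import random


def allocate(counts, total):
    """Split `total` sample slots across content types in proportion to
    their module counts, using largest-remainder rounding."""
    library_size = sum(counts.values())
    exact = {k: total * c / library_size for k, c in counts.items()}
    alloc = {k: int(v) for k, v in exact.items()}          # round down first
    leftover = total - sum(alloc.values())
    # Hand the remaining slots to the types with the largest fractional parts.
    by_remainder = sorted(counts, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def draw_sample(modules_by_type, total, seed=None):
    """Randomly pick modules within each content type (assumes total <= library size)."""
    rng = random.Random(seed)   # a recorded seed makes the draw auditable
    counts = {k: len(v) for k, v in modules_by_type.items()}
    alloc = allocate(counts, total)
    return {k: rng.sample(modules_by_type[k], alloc[k]) for k in modules_by_type}


# Hypothetical month: 40 modules, 40% technical, 5-sample spot-check.
month = {
    "technical": [f"TECH-{i:02d}" for i in range(16)],
    "soft_skills": [f"SOFT-{i:02d}" for i in range(24)],
}
slots = allocate({k: len(v) for k, v in month.items()}, 5)
picked = draw_sample(month, 5, seed=2026)
```

With 16 technical and 24 soft-skills modules, the 5 slots split into 2 technical and 3 soft-skills samples. Recording the seed in the tracker lets a reviewer confirm that the sample was not hand-picked.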
Does the 99% threshold apply to translated captions as well as English originals?
Yes. WCAG 2.1 AA Success Criterion 1.2.2 applies to captions for audio content in any language, and the 99% DCMP accuracy standard applies equally to translated caption tracks. Translated captions have additional failure modes beyond ASR accuracy: translation errors (wrong word in the target language), text-expansion timing misalignment (French and German translations of English source text typically run 15–25% longer, which can break synchronisation if the caption track was timed against the English source), and font/character encoding issues in East Asian language caption tracks. The multilingual caption workflow post covers the full translation pipeline; error-rate scoring for translated captions uses the same DCMP formula applied to a reference transcript in the target language.
If a module fails the QA score, should I return it to the vendor or correct it in-house?
Vendor return is strongly preferred for modules that fail by more than 1 percentage point (accuracy below 98%). In-house correction of a moderately accurate track (97–98%) is often faster for small batches, but it shifts the correction labour cost from the vendor to your team, normalises the vendor’s below-threshold delivery, and creates a question about whether the in-house corrections were applied consistently and documented. A contract with a free re-delivery clause for tracks below 97% DCMP accuracy removes this choice: vendor re-delivery is the contractual remedy. For modules that fail by less than 1 percentage point (98–99% accuracy), in-house correction with a targeted pass to fix the specific error types identified in the QA log is typically faster than vendor re-submission, provided the errors are not systematic (which would indicate a deeper production failure requiring escalation).
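The routing described above can be written down as a small decision rule so that every scorer applies it the same way. This is a hedged sketch: the function name, the return labels, and the `systematic` flag (set by the scorer after reading the QA log) are assumptions for illustration, and the cut-offs are the 99% pass line and the 98% vendor-return line from the answer.

```python
def triage(accuracy_pct, systematic=False):
    """Route a scored module based on its DCMP accuracy percentage.

    `systematic` is the scorer's judgement that the errors share a root
    cause (for example a missing glossary term) rather than being isolated.
    """
    if accuracy_pct >= 99.0:
        return "pass"
    if accuracy_pct < 98.0:
        return "vendor_return"          # fails by more than 1 percentage point
    # 98-99%: a near miss; fix in-house unless the errors point to a deeper problem.
    return "escalate" if systematic else "in_house_correction"
```

For example, a module scored at 98.4% with isolated errors routes to in-house correction, while the same score with a repeated glossary failure routes to escalation.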

Track caption accuracy without the spreadsheet overhead

GlossCap logs DCMP-protocol error rates automatically on every caption job — substitutions, deletions, insertions, and timing events — and aggregates them into a monthly compliance dashboard your leadership team can read in 90 seconds. No manual scoring setup required.
