Quality Operations · Published 2026-06-04

Caption QA for training video production: how to run spot-checks, set pass/fail thresholds, and fix systematic errors

There is a gap that most L&D teams never close: the gap between "we reviewed the captions" and "we have defensible caption accuracy." Editorial review and QA are not the same process. When a content producer reads through a caption file, catches a few obvious errors, and marks the clip as done, that is editorial review. It is necessary and valuable, but it produces no measured accuracy rate, no error breakdown by type, and no evidence that the result clears a compliance threshold. When an auditor from the DOJ or OCR, or a plaintiff's attorney, asks "how do you know your captions are compliant?", "we reviewed them" is not an answer. A QA process is.

A caption QA process is a systematic sampling and measurement procedure. It takes a structured sample of captioned content, counts every word against a defined protocol, scores the result against a threshold, classifies errors by type, and generates a root-cause hypothesis that drives a remediation action. When a QA process is designed correctly, it does three things editorial review cannot: it produces a measured accuracy rate that can be cited in an accessibility statement, it generates an error classification that enables root-cause analysis rather than one-at-a-time corrections, and it creates an audit trail that demonstrates ongoing compliance over time rather than a one-time review.

This post is the operational guide to building and running that process. It covers the DCMP spot-check protocol in enough detail that you can apply it on a specific clip tomorrow, the four error types and what each one signals about the root cause, how to set pass/fail thresholds by content category, LMS-specific QA workflows for eight of the platforms we see most often in enterprise L&D environments, the five-step triage protocol for distinguishing systematic failures from random errors, who owns what in a QA workflow and why that matters, what tooling the process requires, and the eight failure modes that cause teams to plateau at 94–96% accuracy when they should be at 99%. The prior posts in this series — why 99% is the threshold that matters, how to audit your LMS caption library, and the feedback loop that compounds accuracy over time — provide the compliance context and the improvement architecture. This post covers the QA step that sits between "captions exist" and "captions compound."

TL;DR — three things that matter about caption QA

  1. QA is measurement, not review. The DCMP spot-check protocol counts every word in a sampled passage, classifies each error by type, and produces a percentage score. That score either passes or fails a defined threshold. If you do not count every word, you are doing editorial review, not QA — the two processes produce different outputs and serve different purposes.
  2. Error type determines root cause. Substitution errors (wrong word decoded) almost always point to a vocabulary gap — the term is missing from the glossary or has a wrong phonetic entry. Formatting errors (wrong capitalization, missing speaker ID) point to a policy gap. Insertion and deletion errors usually point to audio quality. Treating all errors as equivalent produces the wrong remediation action for each type.
  3. QA that doesn't feed the glossary only catches failures — it doesn't prevent them. A correction that closes a single clip's errors but doesn't update the vocabulary model means the same error will appear in the next clip on the same topic. The QA process is only driving accuracy improvement if every substitution-type error triggers a glossary update. Without that routing, QA is documentation overhead rather than a quality engine.

What caption QA is — and what it is not

The term "QA" is used loosely enough in L&D workflows that it is worth being precise about what we mean. In most team conversations, "QA" means "someone looked at it before it went live." That is not what we mean here, and the distinction matters because the two activities produce different outputs and justify different compliance claims.

Editorial review versus QA

Editorial review is the process of reading a caption file, identifying errors, and correcting them. It is qualitative and subjective: a reviewer uses their judgment to determine whether a caption accurately represents the spoken audio. The output of editorial review is a corrected caption file. Editorial review is necessary — it is how you fix individual errors before content goes live. But it produces no score, no error rate, and no protocol-based verdict. "We reviewed it" describes a process; it does not describe a result.

Quality assurance, in the context of caption accuracy, is a structured measurement process applied to a sample of captioned content. The process is defined by a protocol — specifically, which passages are sampled, how errors are counted, and what accuracy threshold constitutes a pass. The output of QA is a score, an error breakdown, and a pass/fail verdict. "We QA'd it and it passed at 99.2% on the DCMP protocol" describes a result that can be reproduced, cited, and defended.

The caption compliance program a well-structured L&D team builds over time depends on QA results, not editorial review records. When your accessibility statement says "our training video captions are maintained at 99% accuracy," that claim requires a measurement methodology, a sampling cadence, and a record of results. Editorial review produces none of those. QA produces all three.

What a QA process produces

A correctly designed caption QA process produces three things:

A measured accuracy rate. A specific number — 98.7%, 99.1%, 97.4% — calculated by the same method every time on a representative sample. This number is what you cite in an accessibility statement, what you report to a vendor when requesting remediation, and what you compare across months and content types to track improvement trends.

An error breakdown by type. The four error types — substitution, insertion, deletion, and formatting — have different root causes and require different remediation actions. A QA process that produces only an accuracy rate without an error breakdown is like a blood test that reports only "something is off" without specifying which value is out of range. The breakdown is what makes the result actionable.

A root-cause hypothesis. Not every error needs to be traced to its origin, but systematic errors — errors that appear in multiple clips, always on the same vocabulary domain — do. The QA process should produce at least a working hypothesis for each systematic error pattern: this looks like a vocabulary gap, or this looks like an audio quality issue, or this looks like a formatting policy the vendor is not applying. The hypothesis drives the remediation action.

When QA is and isn't the right tool

QA is the right tool for:

QA is not the right tool for:

The most productive framing is this: editorial review is what you do to a specific clip; QA is what you do to a production pipeline. One fixes a file; the other measures and improves a system.

The DCMP spot-check protocol

The most important measurement standard for training video captions is the one published by the Described and Captioned Media Program — the DCMP Captioning Key. The DCMP protocol defines both the accuracy threshold (99% for compliance) and the measurement method: word-level accuracy on sampled passages, with all four error types counted.

The DCMP protocol is the reference point for WCAG 2.1 AA captions because the WCAG normative guidance on SC 1.2.2 Captions (Prerecorded) requires "accurate" captions without quantifying what accuracy means — the DCMP Captioning Key is the most widely accepted operationalization of "accurate" in the US L&D market. When the DOJ's ADA Title II implementing guidance and the revised Section 508 standards reference accuracy, the DCMP 99% threshold is the industry benchmark.

How the protocol works

The DCMP spot-check protocol has four components: sample selection, word count, error classification, and score calculation.

1. Sample selection

The DCMP protocol recommends sampling 5-minute contiguous passages from different sections of the captioned content. For practical application in a training-video production pipeline, we recommend a minimum of three samples per batch:

For a 20-minute compliance training video, three 3-minute samples represent a 45% sampling rate (9 of 20 minutes), which is high enough to produce a reliable accuracy estimate. For a library of 50 short-form videos (5–10 minutes each), a 10% spot-check rule is workable: QA 5 clips selected to represent different instructors, different topics, and different production dates.
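The sampling arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: `select_spot_check`, `sampling_rate`, and the clip IDs are hypothetical names, and the selection is plain random sampling. Making sure the sample covers different instructors, topics, and production dates still needs a manual check or a stratified variant.

```python
import random


def select_spot_check(clip_ids, rate=0.10, seed=None):
    """Pick a simple random sample of clips for a spot-check.

    Selects at least one clip when the library is not empty. Stratifying
    by instructor, topic, or production date is not handled here.
    """
    clip_ids = list(clip_ids)
    if not clip_ids:
        return []
    count = max(1, round(len(clip_ids) * rate))
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return sorted(rng.sample(clip_ids, count))


def sampling_rate(sample_minutes, sample_count, clip_minutes):
    """Fraction of a clip's runtime covered by the sampled passages."""
    return (sample_minutes * sample_count) / clip_minutes


# Hypothetical library of 50 short-form clips.
library = [f"clip-{n:02d}" for n in range(1, 51)]
picked = select_spot_check(library, rate=0.10, seed=7)
print(len(picked))               # 5
print(sampling_rate(3, 3, 20))   # 0.45
```

Recording the seed in the QA log lets a second reviewer reproduce exactly which clips were drawn.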

2. Word count

Count every word in the caption file for your sampled passage. "Every word" means exactly that: every content word, every function word (the, a, and, for), every proper noun, every number, every unit abbreviation. Do not skip words that are "probably right" — the protocol works only if the denominator is complete.

The easiest way to count is to extract the caption text from the .SRT or .VTT file (strip the timestamps and sequence numbers), paste it into a word processor or text editor, and use the word count function. On a 3-minute passage of typical training content, expect 300–500 words. The exact count is your denominator.
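If you would rather script this step than paste into a word processor, the following is a minimal sketch. It assumes a well-formed .SRT file (sequence number, timing line, then caption text); a .VTT file would need extra handling for its header and cue settings. The function names and the sample captions are illustrative only.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING_LINE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt):
    """Return only the caption text, without sequence numbers or timings."""
    kept = []
    # Caption blocks in an SRT file are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Drop the sequence number only when it sits directly above a timing
        # line, so a caption that is just a number is still counted.
        if len(lines) >= 2 and lines[0].isdigit() and TIMING_LINE.match(lines[1]):
            lines = lines[2:]
        kept.extend(lines)
    return " ".join(kept)


def count_words(text):
    """Whitespace-delimited word count: the denominator for the score."""
    return len(text.split())


SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Welcome to the onboarding session.

2
00:00:03,600 --> 00:00:06,000
Today we cover Kubernetes
and the service mesh.
"""

print(count_words(srt_to_text(SAMPLE)))  # 13
```

Whitespace splitting matches what most word processors report, so the scripted count and a manual count should agree on the denominator.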

3. Error classification

Compare the caption file word by word against a reference — either the script (if one exists) or a reference transcription you produce yourself by listening carefully to the audio. Each discrepancy is an error. Classify each error into one of four types:

Substitution: A word that was spoken was decoded as a different word. "Kubernetes" → "Cube Nettis" is a substitution. "TalentLMS" → "talent ems" is a substitution. "Verbit" → "verb it" is a substitution. Count each substituted word as one error.

Insertion: A word appeared in the caption that was not spoken. Insertions are usually phoneme-level artifacts — the model decoded background noise, breath sounds, or filler sounds as words. Count each inserted word as one error.

Deletion: A word that was spoken does not appear in the caption. Deletions are common when speech is rapid, when two speakers overlap, or when audio quality degrades. Count each deleted word as one error.

Formatting: The word is correct, but the presentation is wrong. Formatting errors include: wrong capitalization (a brand name that should be title-cased rendered in all lowercase), missing punctuation at a clause boundary, wrong speaker identification label, line breaks that split a phrase across frames (phrase unity violation), and caption frames that contain more than two lines. Count each formatting violation as one error. Note that formatting errors are counted separately in the full DCMP taxonomy — if you are producing a score that maps directly to the WCAG accuracy threshold, formatting errors should be counted alongside the other types, because a mis-identified speaker in a training video can alter the meaning of the content.
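A rough first pass at this classification can be automated before a human reviewer confirms it. The sketch below is one possible approach, not part of the DCMP protocol: it aligns the reference and caption word lists with Python's standard `difflib.SequenceMatcher` and maps the alignment onto the four error types. It detects only capitalization-type formatting errors (speaker IDs, line breaks, and punctuation need manual review), and when one spoken word is decoded as several caption words its counts can differ from a hand count. `classify_errors` and the sample sentences are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def classify_errors(reference, caption):
    """Count word-level errors by type from a word alignment."""
    ref, cap = reference.split(), caption.split()
    # Align case-insensitively so a pure capitalization difference is seen
    # as the same word and can be counted as a formatting error.
    matcher = SequenceMatcher(
        a=[w.lower() for w in ref], b=[w.lower() for w in cap], autojunk=False
    )
    counts = Counter(substitution=0, insertion=0, deletion=0, formatting=0)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_n, cap_n = i2 - i1, j2 - j1
        if tag == "equal":
            # Same word, different casing.
            counts["formatting"] += sum(
                r != c for r, c in zip(ref[i1:i2], cap[j1:j2])
            )
        elif tag == "replace":
            paired = min(ref_n, cap_n)
            counts["substitution"] += paired
            counts["deletion"] += ref_n - paired
            counts["insertion"] += cap_n - paired
        elif tag == "delete":
            counts["deletion"] += ref_n
        elif tag == "insert":
            counts["insertion"] += cap_n
    return counts


reference = "Open the Istio dashboard and check the service mesh"
caption = "Open the istio dashboard and um check service mash"
print(dict(classify_errors(reference, caption)))
# {'substitution': 1, 'insertion': 1, 'deletion': 1, 'formatting': 1}
```

Treat the output as a draft error log: the reviewer still decides whether each flagged discrepancy is a real error and whether the type is right.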

4. Score calculation

Accuracy percentage = ((total words − total errors) / total words) × 100.

A 350-word passage with 4 errors of any type scores (346/350) × 100 = 98.86%. A 350-word passage with 3 errors scores 99.14%. The 99% threshold on a typical 3-minute passage means you are allowed approximately 3 errors per 300 words, or 5 errors per 500 words.

That sounds lenient until you remember what those errors look like in a compliance training context: "OSHA" rendered as "oshia" is one error, but it is also a regulatory agency name misspelled in a document that may be cited as compliance evidence. "Tier 2 quantity" rendered as "tier too quantity" is two errors, but it is also a material categorization in a HazCom training video. The threshold is 99% because the errors that appear in training content are rarely random — they are systematically the errors that matter most.
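The score calculation in step 4 is simple enough to script. A minimal sketch, with hypothetical function names, that reproduces the figures above:

```python
def accuracy_pct(total_words, total_errors):
    """Word-level accuracy: ((total words - total errors) / total words) x 100."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return (total_words - total_errors) / total_words * 100


def max_errors_allowed(total_words, threshold_pct=99.0):
    """Largest whole number of errors that still meets the threshold."""
    return int(total_words * (100 - threshold_pct) // 100)


print(round(accuracy_pct(350, 4), 2))  # 98.86
print(round(accuracy_pct(350, 3), 2))  # 99.14
print(max_errors_allowed(300))         # 3
print(max_errors_allowed(500))         # 5
```

Computing the error budget before you start scoring is a quick sanity check: on a 350-word passage, the fourth error is the one that fails the clip.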

A worked example: 3-minute engineering onboarding clip

To make the protocol concrete, here is a worked example from the type of content we see most often: a 3-minute opening section of an engineering onboarding video covering a Kubernetes-based microservices architecture.

Step 1 — Extract the caption text. Open the .SRT file, strip timestamps and sequence numbers, paste the text into a word processor. Word count: 412 words.

Step 2 — Produce the reference. Listen to the audio with the script in hand (or, if no script exists, transcribe the passage yourself as you listen). Mark every word in the caption text against what was spoken.

Step 3 — Count errors.

| Error | Caption text | Correct text | Type |
|---|---|---|---|
| 1 | cube nettis | Kubernetes | Substitution (2 errors: "cube" + "nettis") |
| 2 | cont container ization | containerization | Insertion ("cont") + Substitution ("ization" → wrong word boundary) |
| 3 | the service mesh | The service mesh | Formatting (sentence-start capitalization missing), 1 error |
| 4 | istio | Istio | Formatting (proper noun capitalization), 1 error |
| 5 | etcd | etcd | Correct (this one is correctly lower-cased) |
| 6 | [Speaker 1] | Alex Chen: | Formatting (wrong speaker ID label), 1 error |

Total errors: 7 (2 substitutions for "Kubernetes", 1 insertion, 1 substitution for word boundary, 1 capitalization, 1 proper noun cap, 1 speaker ID). Score: (412 − 7) / 412 × 100 = 98.30%. This clip fails the 99% threshold.

Step 4 — Classify by type. Substitutions: 3 (all on technical terms). Insertions: 1 (phoneme artifact). Deletions: 0. Formatting: 3. Root-cause hypotheses: the substitutions point to a vocabulary gap ("Kubernetes" is likely missing from the glossary or has a wrong phonetic entry); the formatting errors point to a capitalization rule (for "Istio") and a speaker ID policy that are not being applied.

This is a failing clip with a diagnosable root cause. The remediation is specific: add "Kubernetes" and "Istio" to the glossary with correct phonetic entries; verify that the speaker ID policy is set correctly for this production team's format. Those two changes will likely bring the next clip on this topic to 99%+.
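For teams that keep the error log in a script or spreadsheet export, the tally for this worked example might look like the sketch below. The tuple layout of the log is an assumption for illustration; the entries and the 412-word count come from the example above.

```python
from collections import Counter

# One entry per counted error: (caption text, error type).
errors = [
    ("cube", "substitution"),
    ("nettis", "substitution"),
    ("cont", "insertion"),
    ("ization", "substitution"),
    ("the", "formatting"),
    ("istio", "formatting"),
    ("[Speaker 1]", "formatting"),
]
total_words = 412

by_type = Counter(kind for _, kind in errors)
score = (total_words - len(errors)) / total_words * 100

for kind in ("substitution", "insertion", "deletion", "formatting"):
    print(f"{kind}: {by_type[kind]}")
print(f"score: {score:.2f}%")  # score: 98.30%
```

Keeping the per-type counts next to the score is what lets the next reviewer see at a glance that this clip failed mainly on vocabulary and formatting policy, not audio.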

Error taxonomy: what each type signals

The four error types do not have equal weight as signals for root cause. Understanding what each type typically indicates allows you to move from a QA result to a remediation action without having to investigate every error individually.

Substitution errors

A substitution error occurs when the speech recognition model decoded a spoken word as a different word. Substitution errors are the dominant error type in training video QA — they typically account for 55–70% of all errors in domain-specific content.

What substitutions signal. Almost all substitution errors in training video are proper noun substitution failures: a brand name, product name, technical term, person name, or regulatory acronym was decoded incorrectly because the term was not in the glossary or had a wrong phonetic entry. The proper noun failure taxonomy describes 15 categories — product names, API names, chemical names, abbreviations, internal terminology, certification names — each of which produces a characteristic substitution pattern.

Common examples by vertical:

Remediation action for substitutions. Add the term to the glossary with the correct canonical form and phonetic entry. If the term is already in the glossary but still being substituted, check the phonetic entry for correctness and check whether the model is applying the glossary to the content type where the error appears. The feedback loop described in the preceding post is precisely the system that turns each substitution event into a glossary update.

Insertion errors

An insertion error occurs when a word appears in the caption that was not spoken. Insertions are the rarest of the four error types in well-produced training content — they typically account for 8–15% of all errors.

What insertions signal. Insertions are most often caused by audio quality issues: background noise that the model decoded as speech, breath sounds or filler sounds (um, uh) that were transcribed as words, microphone handling noise, and room echo that created phoneme artifacts. In professionally produced training video with a controlled recording environment, insertions are rare. In screen-recorded webinar content, town halls recorded in conference rooms, or field-recorded safety training, insertions can be much more common.

Occasionally, insertions are caused by model hallucination — the model "completing" a phrase it predicts based on the preceding context. This is more common in low-confidence segments and can be distinguished from noise-caused insertions because the inserted word is grammatically plausible even if it was not spoken.

Remediation action for insertions. If insertions are isolated and rare, correct them as individual clip errors. If insertions are systematic in content from a specific production environment (always in the screen-recorded content, never in the studio-recorded content), escalate to the production team for audio quality review. Insertions caused by model hallucination in a specific vocabulary domain can sometimes be reduced by adding context-signal terms to the glossary, but audio quality is the primary lever.

Deletion errors

A deletion error occurs when a word that was spoken does not appear in the caption. Deletions account for approximately 15–25% of errors in training video content.

What deletions signal. Deletions are typically caused by three conditions: rapid speech (words elided by a fast speaker), unclear enunciation (a word spoken at low volume or with heavy accent reduction), and cross-talk (two speakers overlapping, with one voice partially masking the other). All three are primarily audio quality issues rather than vocabulary gaps.

However, a consistent pattern of deleted proper nouns — where the word is clearly audible but simply absent from the caption — often indicates a model confidence issue: the model recognized a phoneme sequence as a low-probability term and deleted it rather than inserting a wrong word. This "deletion-instead-of-substitution" behavior is seen when a term is not in the glossary and the model's confidence in any substitution is below the deletion threshold. If you see deletions on what sounds like a properly spoken proper noun, check the glossary first before investigating audio quality.

Remediation action for deletions. Isolated deletions caused by audio quality are best addressed at the production level (re-record the segment, or manually transcribe the deleted words into the caption file). Systematic deletions of proper nouns are addressable via glossary additions. If the deletion rate is high and broadly distributed across all content types, it indicates a systemic audio quality problem that the caption process cannot compensate for — the audio needs to meet a minimum quality standard before captioning can achieve the 99% threshold.

Formatting errors

A formatting error occurs when the word is correctly decoded but the presentation is wrong. Formatting errors account for 10–20% of errors in our QA data, but they have disproportionate compliance significance because some formatting errors alter the meaning or usability of the caption.

Formatting error types:

Remediation action for formatting errors. Capitalization and speaker ID errors are addressable via glossary and formatting policy settings — most professional caption vendors allow you to specify capitalization rules per term and speaker ID conventions per production project. Persistent line-break violations indicate that the caption segmentation algorithm is not phrase-aware; escalate to the vendor as a formatting policy issue. Timing errors require either re-alignment (a vendor-side technical fix) or clip re-processing.

Setting pass/fail thresholds

The DCMP 99% threshold is the right default for training video content subject to WCAG 2.1 AA compliance requirements, but there are cases where the threshold should be adjusted — and cases where a single threshold is insufficient and you need a tiered framework.

The 99% default and its basis

The 99% threshold comes from the DCMP Captioning Key, which specifies that captions must achieve 99% or higher word-level accuracy to be considered "accurate" under the DCMP standard. This threshold is referenced in WCAG success criterion 1.2.2 guidance and is the number most commonly cited by plaintiff attorneys and DOJ investigators when evaluating whether captions meet the "accurate" requirement under the ADA.

The reason 99% is the right threshold for compliance training (rather than, say, 95%) is that errors in compliance content have materially higher stakes than errors in soft-skills content. Not every substitution in a HIPAA training video changes the meaning, and a single slip may seem minor. But a substitution that converts "breach notification rule" to "reach modification rule" is not minor. In content where the words matter legally, the error rate has to be low enough that no single error changes a material fact.

Threshold tiers by content category

A practical tiered framework for an enterprise L&D library:

| Content category | Threshold | Rationale |
|---|---|---|
| Regulatory compliance (OSHA, HIPAA, FSMA, ADA, Title II) | 99.0% | Regulatory exposure; content may be cited as compliance evidence |
| Safety training (HazCom, lockout/tagout, fall protection) | 99.0% | Procedural accuracy is a safety requirement; errors in chemical names or safety procedures are high-risk |
| Product training / sales enablement | 99.0% | Brand accuracy; product name errors damage credibility with sales staff |
| Technical onboarding (engineering, IT, finance) | 99.0% | High proper-noun density; technical terminology errors affect the learner's ability to apply the content |
| Medical / clinical training | 99.0% | Drug names, dosing information, clinical terminology; error impact is patient safety |
| Leadership / soft skills development | 98.5% | Lower proper-noun density; errors in conversational content are less consequential |
| Executive communications (town halls, strategic updates) | 98.0% | Higher audio variability (Q&A, multiple speakers, room audio); some degradation is acceptable if main message is clear |
| Live-to-recorded (webinar replay, screen recording) | 97.5% | Source audio quality is variable; threshold adjusted to reflect production constraints, not accuracy target |

Note that "live-to-recorded" at 97.5% is an acknowledgment of a production constraint, not an accuracy standard. If a webinar replay is being used as a formal compliance training asset, it should be cleaned to 99% before being given that designation — even if that requires additional editorial review work.
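One way to keep these tiers consistent across reviewers is to store them as a lookup instead of re-reading the table each time. The sketch below assumes hypothetical category keys and a `compliance_asset` flag that models the rule in the note above: any clip designated as a formal compliance asset is held to 99% regardless of its source category.

```python
# Thresholds from the tiered framework; the keys are illustrative names.
THRESHOLDS = {
    "regulatory_compliance": 99.0,
    "safety_training": 99.0,
    "product_training": 99.0,
    "technical_onboarding": 99.0,
    "medical_clinical": 99.0,
    "leadership_soft_skills": 98.5,
    "executive_communications": 98.0,
    "live_to_recorded": 97.5,
}


def passes(category, total_words, total_errors, compliance_asset=False):
    """True when the sampled passage meets the threshold for its category."""
    threshold = 99.0 if compliance_asset else THRESHOLDS[category]
    accuracy = (total_words - total_errors) / total_words * 100
    return accuracy >= threshold


# A webinar replay at 97.75% passes as live-to-recorded content...
print(passes("live_to_recorded", 400, 9))                         # True
# ...but not once it is designated a formal compliance asset.
print(passes("live_to_recorded", 400, 9, compliance_asset=True))  # False
```

An unknown category raises a `KeyError`, which is deliberate in this sketch: a clip with no assigned category should not pass by default.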

The 3-tier QA model

Not every clip needs the same intensity of QA. A practical framework for matching QA intensity to risk level:

Tier 1 — Spot-check (10% sample, random selection). For routine batches of content in a mature vocabulary domain where the vendor's accuracy is well-established. Sample 10% of clips, 3 minutes per clip. If all sampled clips pass at threshold, approve the batch. If any clip fails, move the batch to Tier 2.

Tier 2 — Targeted QA (all clips in a defined subset). For new content types, new instructors, new vocabulary domains, or any batch that failed Tier 1 spot-check. QA every clip in the targeted subset, using the full 3-sample protocol per clip. Identify the failure mode and remediate before approval.

Tier 3 — Full review (all clips, full pass). For compliance-flagged content (OSHA, HIPAA, ADA), for content produced by a new vendor before that vendor is approved for regular production, and for any clip that will appear in an accessibility statement as evidence of compliance. Full review is editorial review (correct all errors) plus QA measurement (score the result against threshold after corrections are made).

The Tier 1 → Tier 2 → Tier 3 escalation logic is important: Tier 1 passes produce no action; Tier 1 failures escalate to Tier 2; Tier 2 failures on compliance-category content escalate to Tier 3. This keeps the total QA burden proportional to the risk level of the content being reviewed.
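The escalation logic can be written down as a small decision function so that it is applied the same way on every batch. This is a sketch under stated assumptions: `next_step` is a hypothetical name, the input is one pass/fail result per sampled clip, and the "remediate and re-score" label for failures that do not escalate is our wording for the remediation described in Tiers 2 and 3.

```python
def next_step(tier, clip_results, compliance_category=False):
    """Decide what happens to a batch after QA at the given tier.

    clip_results holds one boolean per sampled clip, True where the clip
    met its threshold.
    """
    if not clip_results:
        raise ValueError("no clips were scored")
    if all(clip_results):
        return "approve"
    if tier == 1:
        return "escalate to Tier 2"
    if tier == 2 and compliance_category:
        return "escalate to Tier 3"
    return "remediate and re-score"


print(next_step(1, [True, True, True]))                  # approve
print(next_step(1, [True, False, True]))                 # escalate to Tier 2
print(next_step(2, [False], compliance_category=True))   # escalate to Tier 3
print(next_step(2, [False]))                             # remediate and re-score
```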

Writing a pass/fail decision rule

A pass/fail decision rule is the sentence (or two) that defines what "pass" and "fail" mean for a specific content category. Decision rules should be written down and stored with the QA log so that reviewers are applying the same threshold across different batches and different time periods.

Example decision rules:

The phrase "no material factual error" in the third example is deliberate. For content where the proper-noun density is low, a single substitution that changes a factual claim — "retention increased by 40%" → "retention decreased by 40%" — is disqualifying even if the measured accuracy rate would technically pass. The decision rule should capture this.
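A decision rule that combines a threshold with a "no material factual error" condition can also be stored as structured data next to the QA log. The sketch below is illustrative: `DecisionRule`, its fields, and the category name are hypothetical, and whether an error is "material" remains a reviewer's judgment that is passed in as a count.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRule:
    category: str
    threshold_pct: float

    def verdict(self, accuracy_pct, material_errors=0):
        """Fail on any material factual error, even above the threshold."""
        if material_errors > 0:
            return "fail"
        return "pass" if accuracy_pct >= self.threshold_pct else "fail"


rule = DecisionRule(category="leadership_soft_skills", threshold_pct=98.5)
print(rule.verdict(99.4))                     # pass
print(rule.verdict(99.4, material_errors=1))  # fail
print(rule.verdict(98.1))                     # fail

# Stored with the QA log so every batch is judged against the same rule.
print(json.dumps(asdict(rule)))
```

Serializing the rule alongside each result means a later audit can see not just the score but the exact rule it was judged against.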

LMS-specific QA workflows

The mechanical steps of running a QA spot-check — extracting the caption file, comparing it to the audio, scoring it, re-uploading the corrected version — look different in each LMS. Here are the specific workflows for eight platforms we see most often in enterprise L&D environments.

TalentLMS

TalentLMS stores captions as an associated file on the video unit. To extract a caption file for QA: navigate to the course → content unit → Edit → the caption file is attached as an .SRT or .VTT depending on what was uploaded. Download the file directly from the Edit screen. After correction, re-upload via the same interface — TalentLMS will replace the existing caption file without requiring you to rebuild the course structure. See our TalentLMS captions guide for format-specific requirements.

One TalentLMS-specific issue: if captions were auto-generated by TalentLMS's built-in AI captioning (available in certain plan tiers), the file is stored in a different location than vendor-uploaded captions and may not be exportable without a workaround. In that case, use the browser's developer tools to extract the VTT file from the network requests during video playback. This is a clunky workaround; we recommend uploading vendor-provided caption files rather than using the platform's auto-captioning for any compliance-relevant content.

Docebo

Docebo's caption handling depends on whether the video is hosted in Docebo or in an external player (YouTube, Vimeo, Kaltura). For Docebo-hosted video: navigate to the Content Library → select the video asset → the Subtitles tab shows all caption tracks. Download the .SRT file from the Subtitles tab. After correction, upload the corrected file to the same location — Docebo supports multiple language tracks and will prompt you to confirm which language track you are replacing. See our Docebo captions guide for the full subtitle management workflow.

For Docebo courses with externally hosted video, QA must happen at the source platform (Kaltura, Vimeo, etc.) — Docebo passes the caption track from the external host and cannot override it directly.

Absorb LMS

Absorb handles captions primarily through SCORM packages and direct video uploads. For SCORM-wrapped content, caption files are embedded in the package and must be extracted at the SCORM level (unzip the package, find the .srt or .vtt file in the package directory). For directly uploaded videos, Absorb's video player reads caption files from the video asset settings — access these through Admin → Content → edit the specific asset → Captions. Download the file, score it, upload the corrected version.

Absorb does not have a native auto-captioning feature as of 2026, so all captions in an Absorb library were either uploaded manually or imported via integration. This means the QA process is purely on the vendor-supplied files and the chain of custody is clearer than in platforms with both auto-generated and manual caption tracks.
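For the SCORM case, locating the caption files does not require unpacking the whole package by hand. A minimal sketch using Python's standard `zipfile` module, assuming the package is an ordinary .zip archive; the function names and the example paths in the comment are hypothetical.

```python
import zipfile

CAPTION_SUFFIXES = (".srt", ".vtt")


def find_caption_files(package):
    """List caption files inside a SCORM .zip without unpacking it."""
    with zipfile.ZipFile(package) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.lower().endswith(CAPTION_SUFFIXES)
        )


def extract_caption_files(package, dest_dir):
    """Extract only the caption files and return their archive paths."""
    names = find_caption_files(package)
    with zipfile.ZipFile(package) as archive:
        archive.extractall(dest_dir, members=names)
    return names


# Example call (hypothetical file names):
# extract_caption_files("hazcom_course_scorm.zip", "qa_workdir")
```

After correction, the caption file has to go back into the package at the same path before re-upload, because the SCORM content references it by that location.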

Cornerstone OnDemand

Cornerstone OnDemand's caption management is handled through the Vantage UI for learning content admins. For transcript-based captions: Learning → Content → select the video object → Transcripts. Download the caption file from this interface. For SCORM content: the process is the same as Absorb — extract from the SCORM package. See our Cornerstone OnDemand captions guide for the full admin workflow.

A Cornerstone-specific nuance: if you are using Cornerstone's Connect integration with Kaltura, the caption files may live in Kaltura rather than in Cornerstone directly. In that case, QA and correction should happen in Kaltura, and the updated caption track will sync back to Cornerstone via the integration.

Workday Learning

Workday Learning manages video content primarily through external hosting integrations (typically Kaltura) or direct content uploads. For directly uploaded content, caption files are managed in the content record — navigate to the Workday Learning administration interface, find the content item, and look for the captions/subtitles section. Workday Learning's caption management interface is less granular than dedicated LMS platforms, which means QA and correction often happen outside Workday and the corrected file is re-ingested via the same upload path the original was created through.

For Workday Learning environments using Kaltura for video hosting, follow the Kaltura workflow below and the corrections will sync to Workday automatically if the integration is correctly configured.

Kaltura

Kaltura has the most mature caption management interface of any platform in this list. Caption tracks are managed as separate assets associated with a media entry. To export a caption file for QA: in KMC (Kaltura Management Console) → select the media entry → Content tab → Caption Assets → download the .SRT or .VTT file. After correction, upload to the same location — Kaltura will version the caption asset and allow you to roll back if needed.

Kaltura also offers built-in machine captioning (REACH) and manual caption ordering through third-party services. If captions are sourced through REACH, the accuracy varies by REACH service level (machine captions vs. human-verified). QA should be applied regardless of the caption source — human-verified captions from REACH still need to be scored against your pass/fail threshold to confirm they meet the standard. See our Kaltura captions guide for the full caption asset workflow and REACH integration overview.

Panopto

Panopto stores captions as part of the video recording object. To export a caption file for QA: open the video in the editor interface → Captions tab → Export captions (available in .SRT and .VTT formats). This exports the current caption state, including any edits made in Panopto's inline caption editor. After scoring and correcting externally, import the corrected file via the same Captions tab → Import captions.

One Panopto-specific workflow issue: Panopto's auto-generated captions are often the starting point for a captioned session (particularly in lecture capture and webinar scenarios), and the auto-generated captions may not have been exported, corrected, and re-imported — they may be living only in Panopto's caption editor. Before running QA, verify that the caption file you are scoring is the one that will actually be served to learners, not a pre-correction draft. If in doubt, export from Panopto first and compare to what is displayed during playback.

WorkRamp

WorkRamp's caption handling is at an earlier maturity level than some of the other platforms in this list. Caption files are associated with content items in the library. Access through the admin panel → Content Library → select the asset → the subtitles/captions option appears under the media settings. WorkRamp accepts .SRT and .VTT formats. After QA and correction, re-upload via the same interface.

For WorkRamp environments using external video hosting (Loom, Wistia, Vimeo), captions are managed at the source platform. QA and correction should happen at the source, and the updated caption track will appear in WorkRamp automatically if the integration is configured correctly. See our WorkRamp captions guide for the full caption delivery workflow and the sales-readiness vocabulary requirements that are particularly relevant for WorkRamp's typical use case.

Systematic error triage: finding the root cause

Individual errors are corrected one at a time during editorial review. Systematic errors (errors that recur across multiple clips within the same vocabulary domain) require a triage process that identifies the root cause, so that the remediation prevents recurrence rather than just fixing the current instance.

The distinction between random errors and systematic errors is critical. A random error is an isolated event — one deletion in one clip where the speaker spoke unusually fast. A systematic error is a pattern — "Kubernetes" is decoded as "Cube Nettis" in every clip where that word appears. Random errors are corrected at the clip level. Systematic errors must be resolved at the model or glossary level, or they will persist indefinitely regardless of how many clips are individually corrected.

The five-step triage protocol

Step 1 — Classify the error

Determine the error type: substitution, insertion, deletion, or formatting. The type determines which root-cause hypotheses are worth investigating. A substitution almost always means a vocabulary gap; a deletion usually points to audio quality; an insertion usually points to audio artifacts such as room noise; a formatting error usually means policy configuration.

Step 2 — Check the glossary

For substitution errors, check whether the term that was incorrectly decoded is in the glossary. If it is not in the glossary, the root cause is a vocabulary gap — the term needs to be added with the correct canonical form and phonetic entry. If the term is already in the glossary and still being substituted, the phonetic entry may be wrong (the entry does not match the way the term is actually pronounced in your content), or the model may not be applying the glossary to the specific content type where the error appears.

For formatting errors, check the formatting policy settings: is the term in the capitalization exception list? Is the speaker ID format specified for this production team? If the policy is not configured, the model is applying its default rules, which may not match your requirements.
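Steps 1 and 2 amount to a small decision table. The sketch below is one way to write it down, assuming the four error types named above; the function name and the wording of each hypothesis are illustrative and not part of any protocol or vendor API.

```python
def root_cause_hypothesis(error_type, term_in_glossary=False):
    """Map an error type (plus glossary status) to the first hypothesis to test."""
    if error_type == "substitution":
        if not term_in_glossary:
            return "vocabulary gap: add term with canonical form and phonetic entry"
        return "glossary entry present: check phonetic entry and glossary coverage"
    if error_type == "deletion":
        return "audio quality: review the recording"
    if error_type == "insertion":
        return "audio artifacts: review the recording environment"
    if error_type == "formatting":
        return "formatting policy: check capitalization and speaker ID settings"
    raise ValueError(f"unknown error type: {error_type!r}")


print(root_cause_hypothesis("substitution", term_in_glossary=False))
```

The hypothesis is only a starting point; steps 3 and 4 decide whether it holds for a given term.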

Step 3 — Check the correction history

Search the correction log for prior instances of the same error on the same term. How many times has this error appeared? In which clips? Over what time period? A term that has been manually corrected six times in six consecutive clips is a systematic error that deserves a glossary fix, not a sixth manual correction. A term that has appeared wrong once is more likely a random event.

The correction log is the primary data source for this step. Without a correction log that records the term, the error type, the clip, and the date, it is impossible to distinguish systematic from random errors retrospectively. The log does not need to be sophisticated — a Google Sheet with five columns (date, clip ID, error type, term, canonical form) is enough to reveal patterns in monthly review.
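If the log is exported as rows, the pattern check can be automated in a few lines. This is a minimal sketch: the rows are made-up examples using the five columns named above, and the three-clip cutoff is an assumed threshold you would tune to your own volume.

```python
from collections import defaultdict

# Hypothetical correction-log rows (date, clip ID, error type, term, canonical form).
correction_log = [
    {"date": "2026-03-02", "clip_id": "ENG-101", "error_type": "substitution",
     "term": "Cube Nettis", "canonical_form": "Kubernetes"},
    {"date": "2026-03-09", "clip_id": "ENG-102", "error_type": "substitution",
     "term": "Cube Nettis", "canonical_form": "Kubernetes"},
    {"date": "2026-03-16", "clip_id": "ENG-104", "error_type": "substitution",
     "term": "Cooper Netties", "canonical_form": "Kubernetes"},
    {"date": "2026-03-16", "clip_id": "HR-007", "error_type": "substitution",
     "term": "Octa", "canonical_form": "Okta"},
]


def find_systematic_terms(log, min_clips=3):
    """Return canonical terms corrected in at least `min_clips` distinct clips."""
    clips_by_term = defaultdict(set)
    for row in log:
        clips_by_term[row["canonical_form"]].add(row["clip_id"])
    return {term: sorted(clips) for term, clips in clips_by_term.items()
            if len(clips) >= min_clips}


print(find_systematic_terms(correction_log))
```

Grouping by canonical form rather than by the misrecognized text catches a term that is decoded wrongly in several different ways.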

Step 4 — Identify the scope

Determine whether the error is confined to one department's content, one instructor's clips, one production environment, or is system-wide. Scope-limited errors indicate a local condition: a department that uses a specific vocabulary subset that is not in the global glossary, an instructor whose pronunciation of a technical term diverges from the standard, or a studio whose audio setup produces artifacts that trigger insertions.

A system-wide error — the same substitution appearing across all departments, all instructors, all production environments — indicates a missing term that is universally used in your organization. These are the highest-priority glossary additions because a single fix resolves errors across the entire content library going forward.
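The scope check can be expressed as a comparison between the departments where an error appears and the departments that produce content. The sketch assumes each log row carries a `department` field, which is a hypothetical addition to the five-column log; the same idea applies to instructors or production environments.

```python
def classify_scope(rows, all_departments):
    """Label an error pattern by how many content-producing departments it touches."""
    affected = {row["department"] for row in rows}
    if affected >= set(all_departments):
        return "system-wide"
    if len(affected) == 1:
        return "local: " + next(iter(affected))
    return "partial: " + ", ".join(sorted(affected))


departments = ["Engineering", "HR", "Sales"]
kubernetes_rows = [{"department": "Engineering"}, {"department": "Engineering"}]
print(classify_scope(kubernetes_rows, departments))
```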

Step 5 — Assign a remediation action

Based on the preceding four steps, assign a specific remediation action to each systematic error:

| Root cause | Remediation action | Who owns it |
| --- | --- | --- |
| Term missing from glossary | Add term with canonical form and phonetic entry | Caption specialist |
| Term in glossary, wrong phonetic entry | Update phonetic entry; test on next clip from that vocabulary domain | Caption specialist |
| Formatting policy not configured | Set capitalization exception or speaker ID rule | Caption specialist + LMS admin |
| Audio quality causing deletions | Escalate to production team for re-record or audio cleanup | L&D manager + content producer |
| Audio quality causing insertions | Report to vendor for noise-filtering parameter adjustment | Caption specialist |
| New vocabulary domain, no glossary seed | Seed glossary with 20–50 terms before next batch of that content type | Caption specialist + subject-matter expert from department |

The remediation action closes the triage loop. Without an assigned action, triage is documentation with no consequence — you know why the error happens, but the next clip on the same topic will still fail.

Building the systematic error log

The systematic error log is the artifact that turns individual QA results into an improvement system. It is distinct from the correction log (which records individual clip corrections) — the systematic error log aggregates patterns.

Recommended columns:

A systematic error log with 20–30 entries covers most of the recurring issues in a typical enterprise L&D vocabulary. The log is reviewed monthly, remediation actions are verified, and closed entries are archived. New patterns are added as they appear in the QA data. Over time, the log is a living picture of the vocabulary frontier — the terms where the caption model is still learning your organization's language.

QA roles and RACI

Caption QA involves four functional roles in most enterprise L&D teams: the L&D manager (or training operations lead), the content producer, the caption specialist, and the LMS administrator. In smaller teams, one person often holds multiple roles. The RACI framework ensures that each QA activity has one Responsible owner, one Accountable approver, and that Consulted and Informed parties are clearly mapped.

| QA Activity | L&D Manager | Content Producer | Caption Specialist | LMS Admin |
| --- | --- | --- | --- | --- |
| Define QA protocol and thresholds | A | I | C | I |
| Select QA sample (clips for spot-check) | C | R | A | I |
| Extract caption files from LMS | I | C | R | A |
| Score clips against DCMP protocol | C | I | R/A | I |
| Log errors by type | I | I | R/A | I |
| Run systematic error triage | C | I | R/A | I |
| Update glossary with corrections | I | I | R/A | I |
| Re-upload corrected caption files | I | C | R | A |
| Approve batch for LMS publication | A | C | R | I |
| Generate monthly QA report | A | I | R | I |
| Escalate systematic failures to vendor | R/A | I | C | I |
| Update accessibility statement accuracy claim | A | I | C | I |

R = Responsible (does the work) · A = Accountable (sign-off authority) · C = Consulted · I = Informed

Notes on role design

The caption specialist role is the operational core of the QA process. In organizations without a dedicated caption specialist, this role is typically absorbed by a content producer with extra training or by a vendor-side QA team. If the caption specialist role is vacant and QA is being done ad hoc by different people on different batches, the process is producing inconsistent results and the compliance documentation is unreliable. Caption QA is not a task that benefits from being distributed across the production team — it benefits from being owned by one person who is consistent in how they apply the DCMP protocol.

The L&D manager's role is protocol governance, not execution. The L&D manager sets the threshold, approves the batch for publication, and escalates to vendors. They do not need to score clips personally — they need to be confident that the person who does score clips is applying the protocol correctly and consistently. This means the L&D manager should review a sample of QA reports quarterly and ask "do these scores look right given what I know about the content?" rather than running spot-checks themselves.

LMS admin involvement is primarily about file management. The LMS admin's role in QA is to ensure that corrected caption files are correctly ingested and that the version in the LMS is the post-QA corrected version, not the pre-correction auto-generated version. In environments where caption files are versioned (Kaltura does this natively), the admin is also responsible for maintaining the version history so that it is clear which version of a caption file was serving learners at any given time.

Integrating QA into the production calendar

QA is most effective as a scheduled step in the production calendar rather than an ad-hoc activity that happens when someone has time. Recommended integration points:

The monthly QA report is particularly important if your organization is subject to ADA Title II, Section 508, or state-level accessibility laws. If you are ever asked by an enforcement agency or in litigation "how do you ensure your captions are accessible?", the monthly QA report is the answer — it shows that measurement happened on a defined protocol, at a defined frequency, with a defined threshold, and that the results are documented. Without that documentation, "we review the captions" is not a legally sufficient answer.

Tooling for caption QA

A functional caption QA process requires less specialized tooling than most teams expect. The marginal value of dedicated QA software is low once the process is established; the tools you already have are sufficient.

What you need

A caption file export path. This is platform-dependent (see the LMS-specific workflows above). If your LMS does not allow caption file export, you need to identify a workaround before you can run protocol-based QA at all. Browser developer tools (extracting the .VTT from network requests during playback) are a fallback for locked platforms. If your LMS makes caption export genuinely impossible, this is a vendor conversation about accessibility feature parity — locked caption files are an accessibility barrier, not just an operational inconvenience.

A text editor with word count. Any text editor that can count words works: Microsoft Word, Google Docs, VS Code with a word count extension, or a free online word counter. You need the word count to compute your denominator. Paste the caption text (stripped of timestamps and sequence numbers) and count.
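If stripping timestamps by hand becomes tedious, a short script can do it. This is a rough sketch for simple .SRT and .VTT files: it treats any line containing `-->` as a timing line and any all-digit line as a sequence number, so a caption line consisting only of a number would be dropped, and WebVTT extras such as NOTE blocks or inline tags are not handled.

```python
def caption_words(caption_text):
    """Return the caption words from SRT or WebVTT text, skipping cue metadata."""
    words = []
    for line in caption_text.splitlines():
        line = line.strip()
        if (not line or line.startswith("WEBVTT")
                or line.isdigit() or "-->" in line):
            continue  # blank line, file header, sequence number, or timing line
        words.extend(line.split())
    return words


sample_srt = """1
00:00:01,000 --> 00:00:04,000
Welcome to the safety
orientation.

2
00:00:04,500 --> 00:00:07,000
Please wear your badge.
"""

print(len(caption_words(sample_srt)))  # denominator for the accuracy score
```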

A diff comparison tool. When comparing the caption file against a reference transcript, a diff tool highlights discrepancies automatically rather than requiring you to read the two versions word by word. Any of the following work well: diffchecker.com (free, browser-based), the diff view in VS Code, Beyond Compare, or the Compare Documents feature in Microsoft Word. The diff tool is optional — experienced QA reviewers read side-by-side without needing a diff tool — but it reduces the error rate of the QA process itself, especially on longer passages.
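For teams comfortable with a script, Python's standard-library `difflib` can act as the diff tool and produce a first-pass error classification at the same time. The sketch below is an approximation, not a replacement for a reviewer: it lowercases both texts, so formatting errors are invisible to it, and counting a replaced span by its longer side is an assumption about how multi-word substitutions are scored.

```python
from difflib import SequenceMatcher


def classify_word_errors(reference, caption):
    """Align two texts word by word and count differences by error type."""
    ref_words = reference.lower().split()
    cap_words = caption.lower().split()
    counts = {"substitution": 0, "insertion": 0, "deletion": 0}
    matcher = SequenceMatcher(a=ref_words, b=cap_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":      # reference words decoded as different words
            counts["substitution"] += max(i2 - i1, j2 - j1)
        elif tag == "delete":     # reference words missing from the caption
            counts["deletion"] += i2 - i1
        elif tag == "insert":     # caption words that were never spoken
            counts["insertion"] += j2 - j1
    return counts


reference = "we deploy the service to kubernetes every friday"
caption = "we deploy service to cube nettis every friday"
print(classify_word_errors(reference, caption))
```

Here the dropped "the" is counted as a deletion and "kubernetes" decoded as "cube nettis" as a two-word substitution.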

A correction log. A Google Sheet or Excel workbook with the columns described in the systematic error triage section above. The correction log is the single most important operational artifact of the QA process — without it, systematic errors cannot be distinguished from random ones, and the QA process cannot improve the model.

Glossary interface access. QA-to-glossary routing is where the process becomes self-improving rather than merely self-documenting. If you are using GlossCap, glossary updates from QA corrections can be submitted directly through the glossary management interface. If you are using a different caption vendor, the mechanism for glossary updates varies — check whether the vendor has a batch glossary import function or requires individual term submissions. If the glossary update path is slow or cumbersome, systematic errors will accumulate faster than they are resolved.

What you do not need

Dedicated caption QA software. Tools exist that automate caption QA by comparing a .SRT file to a reference transcript and generating a score. These tools can be useful at high volumes (500+ clips per month), but for most enterprise L&D teams, they are overkill. The DCMP protocol is straightforward to apply manually on a 3-minute sample, and the manual process teaches reviewers to recognize error patterns in a way that automated scoring does not.

Third-party caption review services. Services that provide human-verified captions from a vendor (REACH from Kaltura, human-review tiers from Rev or 3Play) are different from QA — they are an accuracy improvement service, not a measurement service. You still need to run QA on human-verified captions to confirm they meet your threshold. Paying for human verification and assuming it equals QA is a common false equivalence — human verification improves accuracy; QA measures whether it improved enough.

A dedicated accessibility auditor for every batch. An accessibility auditor is the right resource for building a compliance program from scratch (see the 90-day compliance program guide), for responding to a formal accessibility complaint, or for conducting an LMS library audit (see the caption audit methodology). For ongoing production QA on a mature vocabulary domain, an in-house caption specialist with the DCMP protocol and a correction log is sufficient and substantially cheaper.

The one investment that returns the most

If you have limited budget for tooling improvements, invest it in the correction-to-glossary routing pipeline. The QA process generates the highest-signal vocabulary data in your entire content operation: a reviewer who correctly identifies a substitution error is telling the model exactly what term was spoken, exactly what phoneme sequence it was decoded as, and exactly what the canonical form should be. That is three dimensions of training signal in a single correction event.

If that correction data is stored in a spreadsheet that no one reads, you are generating the signal and discarding it. If it is routed to the glossary as a term-and-phonetic-entry update, it is permanently captured and makes every future session on the same vocabulary domain more accurate. The feedback loop post explains the compounding effect in detail — the short version is that at 200+ captioned hours, a team with a correction-routing pipeline is 3–5 percentage points more accurate than one without, on the same underlying content.

Common failure patterns: why teams plateau at 94–96%

The eight failure modes below are the most common reasons teams that are actively doing caption QA still fail to reach and maintain 99% accuracy. Each one is diagnosable and correctable — the goal of this list is to give you a quick checklist for root-cause analysis when your QA results plateau.

Failure mode 1: Sampling too thin

A 1-in-20 spot-check on a batch where 10% of the clips share an error cluster will usually miss the cluster entirely: in a 40-clip batch, for example, sampling two clips misses a four-clip cluster about 80% of the time. Teams that run light spot-checks on large batches are producing QA results that reflect the easiest content in the batch, not the average. The DCMP protocol specifies minimum sample sizes, and the 3-sample-per-clip protocol (opening, middle, end) reflects the distribution of error types across a clip: an opening-only spot-check consistently underweights the systematic errors that appear in technical, vocabulary-dense sections.

Diagnosis: Compare the QA score on sampled clips to the score when you run a full clip review on those same clips. If the sampling score is consistently higher than the full-clip score, your sample is not representative. Increase sample length or sample count.
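How likely a thin spot-check is to miss an error cluster depends on the batch size, the cluster size, and the number of clips sampled. The calculation below is a sketch that assumes clips are sampled uniformly at random without replacement; the function name is illustrative.

```python
from math import comb


def miss_probability(batch_size, cluster_size, sampled):
    """Chance that a random sample of clips contains none of the cluster clips."""
    return comb(batch_size - cluster_size, sampled) / comb(batch_size, sampled)


# 1-in-20 sampling with a 10% error cluster, at three batch sizes.
for batch in (20, 40, 100):
    p = miss_probability(batch, cluster_size=batch // 10, sampled=batch // 20)
    print(f"{batch} clips: cluster missed {p:.0%} of the time")
```

The miss rate falls as the batch grows, but under these assumptions it stays above 50% even for a 100-clip batch.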

Failure mode 2: Scoring at the paragraph level

The DCMP protocol counts every word. Teams that visually scan a caption file and mark it as "looks good" are not doing protocol-based QA — they are doing editorial review with a lower correction threshold than the DCMP standard requires. The difference between 98.5% and 99% on a 400-word passage is two errors. Visual scanning will miss one of those two errors at least half the time.

Diagnosis: Take a batch that visually scanned to "pass" and run the full word-count protocol on the same clips. If the measured score is lower than the visual estimate, your QA is understating the error rate. Enforce the word-count method for all protocol-based QA.
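The arithmetic behind the two-error gap is easy to check. The sketch assumes the score is computed as (words − errors) / words, the usual word-level accuracy formula; confirm it against the protocol definition you are using before relying on it.

```python
def accuracy_percent(word_count, error_count):
    """Word-level accuracy as a percentage."""
    return 100 * (word_count - error_count) / word_count


def max_errors_allowed(word_count, threshold_percent):
    """Largest error count that still meets the threshold."""
    return int(word_count * (100 - threshold_percent) / 100)


print(max_errors_allowed(400, 99))    # error budget at 99%
print(max_errors_allowed(400, 98.5))  # error budget at 98.5%
print(accuracy_percent(400, 6))
```

On a 400-word passage, the budget is four errors at 99% and six at 98.5%, which is the two-error difference described above.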

Failure mode 3: Not distinguishing error types

Treating substitution, insertion, deletion, and formatting errors as equivalent produces the wrong remediation strategy for at least three of the four types. If a QA report says "12 errors this month" without specifying what kind of errors, the report does not enable any remediation action. If all 12 are substitution errors on product names, the fix is one glossary update. If they are 3 substitutions, 6 insertions, and 3 formatting errors, the fix is three different remediation actions targeting three different root causes.

Diagnosis: Review the last 3 months of QA reports. If the error breakdown by type is missing, add it to the QA template. The error type distribution should be the first data point after the accuracy percentage.

Failure mode 4: No root-cause step

Correcting a clip without identifying why the error occurred means the same error will appear in the next clip on the same topic. This is the most common reason teams that are actively QA-ing still see the same errors reappear month after month — they are closing individual errors without closing the root cause. A QA process without a root-cause step is a ratchet that never advances: each batch produces errors, the errors are corrected, the next batch produces the same errors, and nothing improves.

Diagnosis: Pull any term that has appeared as a substitution error in three or more QA cycles. If it is still generating errors, the root cause has not been addressed. Run the five-step triage protocol on that term and identify the unresolved root cause.

Failure mode 5: Glossary not updated after QA

This is the most common failure mode and the one most directly responsible for the 94–96% accuracy plateau. The QA process generates substitution-error data — terms that are in the content but not correctly in the glossary. If that data does not route to the glossary as an update, every subsequent session on the same topic starts with the same vocabulary gap that caused the original failure. The feedback loop only runs if corrections feed back. If QA corrections are documented in a log that no one reads, the loop is broken at the last step.

Diagnosis: Count how many unique substitution errors appeared in your QA data last month. Count how many glossary entries were added last month. If the glossary addition rate is substantially lower than the substitution error rate, corrections are not routing to the glossary. Map the correction-to-glossary pathway and identify where it breaks.
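A more direct version of this check lists the terms that showed up as substitution errors but never reached the glossary. The sketch assumes you can export both lists as plain strings; the term names are made-up examples.

```python
def unrouted_terms(substitution_terms, glossary_terms):
    """Canonical terms seen as substitution errors but absent from the glossary."""
    glossary = {term.casefold() for term in glossary_terms}
    return sorted({term for term in substitution_terms
                   if term.casefold() not in glossary})


qa_substitutions = ["Kubernetes", "Kubernetes", "Okta", "Terraform"]
glossary = ["okta"]
print(unrouted_terms(qa_substitutions, glossary))
```

An empty result suggests corrections are reaching the glossary; a long list shows where the pathway breaks.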

Failure mode 6: New content types not triggering targeted QA

When a new course topic, a new department, a new instructor, or a new production team produces their first batch of captioned content, it should automatically trigger a Tier 2 targeted QA pass — not a Tier 1 spot-check. New content introduces new vocabulary that may not be in the glossary, new speakers with different pronunciation patterns, and new recording environments with different audio characteristics. Running only a Tier 1 spot-check on new-domain content means the first batch goes into the LMS with undetected systematic errors that will be discovered (by a learner complaint or an audit) rather than prevented by QA.

Diagnosis: Review the production calendar for the past 6 months. Identify every batch that introduced a new vocabulary domain. Check whether those batches received Tier 2 or Tier 1 QA. If Tier 1, pull the QA reports and compare against any subsequent corrections to those clips — you may find systematic errors that were missed at the initial QA pass.

Failure mode 7: Pass rate reported, not error breakdown

A monthly QA report that says "98.7% pass rate" is not useful for driving improvement. A report that says "substitution errors: 14 (11 product names missing from glossary, 3 pharmaceutical names with wrong phonetic entries); insertion errors: 2 (both from room-audio clips in the HR conference room production environment); formatting errors: 3 (speaker ID missing on two-speaker scenario clips)" is useful. The first report shows that something is off but provides no lever for improving it. The second report points to a specific remediation action for each error type, each of which will measurably reduce the error count in the next batch.

Diagnosis: Look at the last five QA reports. If the only number is the accuracy percentage, the reports are compliance documentation without operational value. Add error breakdown by type as a required field in the QA report template.

Failure mode 8: Confusing QA with manual review

If reviewers are correcting every word in every caption file before they are considered "reviewed," that is not QA — that is manual transcription. Manual transcription of every clip is a reasonable approach for very short-form content (30-second onboarding clips) or for compliance-critical content where a single error is disqualifying. But as a general workflow for a large content library, manual transcription of every clip is not scalable and is not what QA is designed to be.

The tell-tale sign of this failure mode is that the QA "accuracy rate" is always very close to 100% — because every error was corrected before scoring. The measured QA score on a manually corrected clip is not the accuracy rate of the captioning process; it is the accuracy rate of the manual correction process. The accuracy rate of the captioning process (which is the number that drives improvement decisions) is the score before correction. If you are only scoring after correction, you have no signal on where the model is underperforming.

Diagnosis: Check whether the QA process scores clips before or after editorial correction. If QA always runs on already-corrected clips, implement pre-correction QA as a separate step. The pre-correction score is the diagnostic input to the improvement system; the post-correction score is the compliance documentation output.

Connecting QA to the wider caption operations stack

Caption QA does not operate in isolation. It is one component of a four-part caption operations stack:

  1. Glossary architecture — the vocabulary model that determines the starting accuracy for new content. The glossary architecture guide covers term sourcing, taxonomy, ingestion, and the compound accuracy effect.
  2. Captioning process — the production workflow that generates caption files from audio input. This involves the choice of caption vendor, the LMS ingestion workflow, and the format requirements (SRT, VTT, TTML) for each platform.
  3. Caption QA — the measurement process described in this post. It produces the accuracy score, error breakdown, and root-cause hypotheses that feed the improvement system.
  4. Feedback loop — the correction routing system that translates QA output into glossary updates. The feedback loop guide describes how corrections compound accuracy over time, why the switching cost accumulates, and what the three phases of the compounding trajectory look like.

The QA process is the diagnostic layer of this stack. It tells you how the captioning process is performing against the standard, which part of the vocabulary model is causing failures, and whether the feedback loop is routing corrections effectively. Without QA, the other three components operate without feedback — you cannot improve what you do not measure.

The LMS caption audit methodology describes how to apply this QA process retrospectively to an existing library — a common requirement after an accessibility audit or DOJ complaint reveals that historical content needs to be reviewed. The audit post covers the sampling framework for large libraries, the prioritization model for remediation, and the 5-day sprint plan for getting a library to a defensible compliance posture from a standing start.

If you are benchmarking your current caption vendor against the 99% standard and are seeing consistent failures, the captioning vendor RFP playbook and the comparison pages for Rev, 3Play Media, and Verbit provide the accuracy benchmarks and evaluation criteria to inform a vendor change decision. The Whisper accuracy benchmarks by vertical give baseline expectations for what a glossary-enhanced model should achieve on specific content types before QA corrections are applied.

Frequently asked questions

How many clips should I QA per batch?

The right sample size depends on batch size and content category. For a batch of 20 clips in a mature vocabulary domain (content type you have been captioning for 6+ months, glossary well-seeded), a 10% spot-check — 2 clips, 3 samples per clip — is sufficient to detect systematic errors while maintaining throughput. For a batch of 5 clips in a new content domain, QA all 5. For compliance-classified content (OSHA, HIPAA, ADA) regardless of batch size, Tier 3 full review is recommended before the first publish. The general principle: new domains and compliance content get more QA intensity than familiar domains and soft-skills content.
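These rules of thumb can be written as a small helper. The sketch makes two assumptions beyond the text: every new-domain batch is reviewed in full (extrapolated from the 5-clip example), and the 10% spot-check is rounded up with a minimum of one clip.

```python
def clips_to_sample(batch_size, mature_domain, compliance_content):
    """Number of clips to QA in a batch under the rules of thumb above."""
    if compliance_content or not mature_domain:
        return batch_size                    # full review
    return max(1, (batch_size + 9) // 10)    # 10% spot-check, rounded up


print(clips_to_sample(20, mature_domain=True, compliance_content=False))
print(clips_to_sample(5, mature_domain=False, compliance_content=False))
```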

What is the minimum sample length for a DCMP-protocol spot-check?

The DCMP Captioning Key specifies 5-minute samples as the standard. For practical application in a production pipeline, a 3-minute sample is workable: at typical training-video speech rates (120–150 words per minute), it produces 360–450 words, which is enough to score at the word level with reasonable reliability. Below 2 minutes (fewer than 240–300 words), the sample is too small: a single multi-word substitution error (like "Kubernetes" → "Cube Nettis") shifts the score by roughly 0.7–0.8 percentage points on its own, making the measurement noisy. Two minutes is the hard minimum; three is preferred; five is the protocol standard.
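The relationship between sample length and measurement noise is simple arithmetic. The sketch below uses the 120–150 words-per-minute range quoted above as default parameters and assumes a two-word substitution counts as two errors.

```python
def sample_word_range(minutes, wpm_low=120, wpm_high=150):
    """Expected word count range for a sample of the given length."""
    return minutes * wpm_low, minutes * wpm_high


def score_shift(error_words, sample_words):
    """Percentage points lost to one error spanning `error_words` words."""
    return 100 * error_words / sample_words


print(sample_word_range(3))
print(round(score_shift(2, 240), 2))  # two-word error in a 240-word sample
print(round(score_shift(2, 450), 2))  # same error in a 450-word sample
```

The same error costs almost twice as much on a 240-word sample as on a 450-word one, which is why short samples produce unstable scores.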

Should I QA clips before they go into the LMS or after?

Pre-publication QA is strongly preferred — it is the only way to prevent non-compliant content from being served to learners in the first place. Post-publication QA (checking content that is already live) is an audit activity, not a production gate. In practice, many organizations do both: a pre-publication QA gate on new content batches and a periodic post-publication audit on the existing library (see the audit methodology guide). If you can only run QA in one place, run it pre-publication — the downstream remediation cost of fixing content that has been live, linked, and cited in accessibility documentation is substantially higher than catching failures at the gate.

What if a clip is hard to understand due to bad audio — does that affect the score?

Yes, and this is one of the most common complaints about DCMP-protocol QA. The protocol counts deletions and substitutions caused by bad audio the same as it counts vocabulary-gap substitutions — the accuracy percentage reflects the quality of the captions served to learners, not the quality of the audio. A learner who cannot hear the source audio and is relying on captions for access does not benefit from knowing that the captions were bad because the audio was bad. If a clip's audio quality is insufficient to support 99% caption accuracy, the right response is to fix the audio (re-record, noise-reduce, re-mix) rather than accept a failing caption score. If re-recording is not feasible, the clip should be flagged in the accessibility statement as not meeting the WCAG threshold and the limitation should be disclosed.

We have 2,000 captioned videos in our LMS — where do we start?

Start with the five-dimension audit framework, triaging first by monthly view count, compliance classification, and caption source. The 200 most-viewed clips that are compliance-classified (OSHA, HIPAA, ADA, EEO, etc.) are your highest-priority QA targets. If those clips were auto-captioned and have never been QA'd, start there. A 5-day sprint plan for getting the high-priority subset to 99% is described in the audit methodology guide. For the remaining 1,800 clips, a risk-stratified sampling approach (QA 10% per year, prioritizing by view count and compliance classification) allows you to work through the library systematically without the process consuming the entire L&D budget.
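If your LMS can export a clip inventory, the prioritization can be done with a sort. This sketch assumes hypothetical field names (`compliance`, `auto_captioned`, `qa_done`, `monthly_views`) and one possible ordering: compliance content first, then unreviewed auto-captions, then view count.

```python
def qa_queue(clips, top_n=200):
    """Order clips for QA and return the first `top_n`."""
    def sort_key(clip):
        unreviewed_auto = clip["auto_captioned"] and not clip["qa_done"]
        return (
            not clip["compliance"],   # compliance-classified clips first
            not unreviewed_auto,      # then auto-captions never QA'd
            -clip["monthly_views"],   # then most-viewed
        )
    return sorted(clips, key=sort_key)[:top_n]


inventory = [
    {"clip_id": "A", "compliance": True, "auto_captioned": True, "qa_done": False, "monthly_views": 50},
    {"clip_id": "B", "compliance": True, "auto_captioned": False, "qa_done": False, "monthly_views": 900},
    {"clip_id": "C", "compliance": False, "auto_captioned": True, "qa_done": False, "monthly_views": 5000},
    {"clip_id": "D", "compliance": True, "auto_captioned": True, "qa_done": False, "monthly_views": 400},
]
print([clip["clip_id"] for clip in qa_queue(inventory)])
```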

Can I use YouTube's accuracy report as a QA tool?

YouTube's auto-generated accuracy statistics are not a protocol-based QA tool. YouTube does not report word-level accuracy against the DCMP standard — it reports a model confidence score that is calibrated against YouTube's general-English benchmark, not your organization's training vocabulary. A YouTube "accuracy" rating of "good" on an engineering onboarding video typically corresponds to a DCMP-protocol score in the 86–92% range — passing YouTube's internal benchmark and failing the WCAG compliance threshold at the same time. Use the DCMP protocol on your actual caption files, not YouTube's confidence metric, for any compliance-relevant measurement.

How do I write a QA finding that triggers a vendor correction?

A vendor correction request should include: the clip identifier, the timestamp range of the sampled passage, the full list of errors by type (substitution/insertion/deletion/formatting), the correct text for each error, the measured accuracy score, and the threshold the clip failed to meet. A clear format: "Clip [ID] failed QA at [X]% against the DCMP 99% threshold. Errors: [list]. Please correct the caption file and return a revised version within [N business days]. Note that the following substitution errors indicate vocabulary gaps in the glossary: [terms]. Please confirm whether these terms have been added to the per-customer glossary." Framing substitution errors as glossary gaps in the vendor request is important — it signals that you expect a systemic fix, not just individual clip corrections.
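For teams sending these requests regularly, the format can be filled in from QA data. The sketch below is a plain string template with illustrative parameter names; placing the timestamp range in parentheses after the clip ID is an assumption about layout, since the quoted format does not say where it goes.

```python
def vendor_correction_request(clip_id, timestamp_range, score, errors,
                              glossary_terms, business_days):
    """Build a correction request; `errors` is a list of (type, wrong, correct)."""
    error_list = "; ".join(f"{kind}: '{wrong}' should be '{correct}'"
                           for kind, wrong, correct in errors)
    return (
        f"Clip {clip_id} (sample {timestamp_range}) failed QA at {score:.1f}% "
        f"against the DCMP 99% threshold. Errors: {error_list}. "
        f"Please correct the caption file and return a revised version within "
        f"{business_days} business days. Note that the following substitution "
        f"errors indicate vocabulary gaps in the glossary: "
        f"{', '.join(glossary_terms)}. Please confirm whether these terms have "
        f"been added to the per-customer glossary."
    )


message = vendor_correction_request(
    clip_id="ENG-101",
    timestamp_range="00:02:00-00:05:00",
    score=97.5,
    errors=[("substitution", "Cube Nettis", "Kubernetes")],
    glossary_terms=["Kubernetes"],
    business_days=5,
)
print(message)
```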

Run your first DCMP spot-check today

The DCMP spot-check protocol described in this post can be applied to any caption file in 20 minutes. Pick one clip from your LMS — ideally a compliance-training or product-training clip that has never been formally scored — download the .SRT or .VTT file, extract the text, and count the first 300 words against the audio. The score you get is the baseline. If it fails, the error breakdown tells you why. If it passes, you now have your first QA-protocol data point for your accessibility documentation.

GlossCap's glossary-enhanced captioning is designed to start above 91% accuracy and improve toward 99% as the feedback loop matures — so that QA is spending more time verifying passes than diagnosing failures. The Caption Mangle Scanner shows the before/after on a 4-minute sample from your specific vocabulary domain. See the pricing page for team plans that include glossary management and a correction-routing pipeline.
