Compliance Guidance · Published 2026-06-13

Are AI-generated auto-captions ADA and WCAG-compliant? The 2026 regulatory position, platform accuracy reality, and when human review is legally required

The question lands in L&D inboxes in many forms. "We enabled auto-captions on YouTube. Are we covered?" "Teams generates live captions during our all-hands. Doesn't that count?" "Our LMS shows 'captioned' next to every video. What else do we need to do?" "The vendor says their AI is 99% accurate. Can we rely on that?" Each of these questions rests on a premise that is factually incorrect, and each incorrect premise marks a gap between what an organization believes about its caption compliance and what an auditor or OCR investigator will find. The gap is not academic. ADA Title II requirements for web content, including training video in public-sector LMS platforms, became enforceable in April 2026. OCR investigation volumes are rising. And the most common compliance failure OCR is documenting is not the organization that never tried: it is the organization that tried, enabled auto-captions, and believed the problem was solved. Understanding why auto-captions are not automatically compliant, and under what specific conditions they can be, is now a core operational question for every L&D team that uses video in training.

The legal framework governing caption accuracy is less prescriptive than most compliance professionals expect and more demanding than most L&D operators realize. WCAG 2.1 Level AA Success Criterion 1.2.2 does not contain a percentage threshold. The criterion requires that "captions are provided for all prerecorded audio content in synchronized media," and the Understanding document describes those captions as needing to be accurate and synchronized. The 99% accuracy figure that auditors and OCR resolution agreements consistently cite comes from the DCMP Captioning Key, the operational standard that translates WCAG's "accurate" into a measurement protocol. Platform auto-captions on training content routinely score 76–89% on the DCMP protocol, depending on content type and vocabulary density. The distance between 76–89% and 99% is not small: a 90-minute technical training video at 85% accuracy contains approximately 150 wrong words per thousand, against the 10 per thousand that a 99% caption track allows. That is not a small margin of error that users can work around; it is a systematic failure to convey technical information that the learner with hearing disabilities came to the caption track to receive.
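To put that distance in absolute terms, the arithmetic can be sketched as follows. The speaking rate of 150 words per minute is an assumption chosen for illustration, not a figure from this post, and `expected_wrong_words` is a hypothetical helper name.

```python
def expected_wrong_words(minutes: float, accuracy_pct: float,
                         words_per_minute: float = 150.0) -> int:
    """Estimate how many caption words are wrong in a video.

    words_per_minute is an assumed average speaking rate, not a measured value.
    """
    if not 0.0 <= accuracy_pct <= 100.0:
        raise ValueError("accuracy_pct must be between 0 and 100")
    total_words = minutes * words_per_minute
    # Every point of accuracy below 100 is one wrong word per hundred spoken.
    return round(total_words * (100.0 - accuracy_pct) / 100.0)


if __name__ == "__main__":
    for accuracy in (85.0, 99.0):
        wrong = expected_wrong_words(90, accuracy)
        print(f"90-minute video at {accuracy:.0f}% accuracy: ~{wrong} wrong words")
```

Under that assumed speaking rate, the 90-minute video holds about 13,500 words, so an 85% caption track carries roughly 2,025 wrong words where a 99% track would carry about 135.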

The platform landscape makes this harder, not easier, to manage. Every major platform that L&D teams use — YouTube, Microsoft Teams, Zoom, Kaltura, Panopto, Cornerstone, Docebo, TalentLMS, Workday Learning — has some auto-caption feature or inherits captions from a hosting platform that does. The auto-caption toggle is easy to find and easy to enable. What the toggle doesn't tell you is the accuracy on your specific content, the measurement method used to establish any accuracy claim the vendor has made, or whether the platform's caption generation output would survive a DCMP-protocol audit on your engineering onboarding module, your HIPAA compliance training, or your safety briefing for forklift operators. The L&D operator who enables auto-captions is not wrong to think this is the right direction; the error is in stopping there, without verifying that the output meets the standard that the regulatory framework requires.

This post covers the complete operational picture: what WCAG 1.2.2 actually requires and what it does not say, where the 99% threshold comes from and how it is measured, the regulatory position of every major framework (WCAG, ADA Title II, ADA Title I, Section 508, European Accessibility Act) on auto-generated captions, platform-by-platform accuracy data by content type, LMS-by-LMS analysis of auto-caption systems and their compliance status, a content-type decision matrix, the five conditions under which auto-captions can be a compliant solution, what "compliance through process" actually requires in documentation terms, eight specific failure modes that generate exposure, and a seven-question FAQ on the decisions that come up most often when L&D teams are designing their caption programmes. The 99% accuracy post covers the DCMP measurement protocol in detail with a real side-by-side measurement; the US compliance matrix post covers which law applies to which organization type; this post focuses on the specific question of what auto-captions can and cannot do within that compliance landscape, and what you need to document if you choose to rely on them.

TL;DR — three things that matter about auto-caption compliance

WCAG 1.2.2 requires "accurate" captions but does not specify a numeric threshold — the 99% accuracy standard comes from the DCMP Captioning Key, which OCR and courts consistently treat as the operative benchmark. Auto-generated captions on technical training content routinely score 76–89% on the DCMP protocol, far below the 99% threshold regardless of which platform generates them. The gap between platform marketing claims (which reference corpus-level WER on general speech) and your actual training content (which has 5–15× the technical vocabulary density of those benchmark datasets) is the core problem. The 99% threshold is not arbitrary and it is not negotiable in enforcement: OCR resolution agreements from 2020–2025 have consistently required organizations to remediate captions that were found below 99% word accuracy on a DCMP-protocol sample.
Platform auto-captions (YouTube, Teams, Zoom, Otter, Rev AI) are legally sufficient ONLY when three conditions are simultaneously true: the content has minimal technical vocabulary, the accuracy on the specific content is independently verified at 99%+ using a documented DCMP-protocol process, and that verification is part of a recurring QA workflow. For most training content — compliance modules, technical onboarding, safety training, medical training, product training, HIPAA, OSHA — none of these conditions are met. The content type is wrong (technical vocabulary is pervasive), the verification has not been done (most teams have never run a DCMP-protocol sample), and the process does not exist (most teams have no documented QA cycle for auto-caption output). A platform that claims 99% accuracy is making a statement about a benchmark dataset, not about your content.
"We enabled auto-captions" and "we have live captions during the event" are not defensible compliance positions. The first fails because auto-generation accuracy for technical content is 76–89%, not 99%, and the organization cannot document otherwise. The second fails because live session captions — Zoom real-time captions, Teams live captions — are session-state data displayed during the event window. They are not embedded in the recording and cannot serve as the caption track for the post-event video stored in SharePoint, Google Drive, or Vimeo. The recording has no captions unless a separate SRT or VTT file is added after the event. Enabling live captions and captioning the recording are two different workflows, and they require two different actions.

What WCAG 1.2.2 actually requires (and what it does not say)

The first place the auto-caption compliance question breaks down is in how practitioners read the standard. Most L&D operators and accessibility coordinators have encountered a reference to WCAG 1.2.2 in an audit report, a vendor contract, or an accessibility policy template. Fewer have read the normative criterion text and its associated Understanding document carefully enough to separate what the standard mandates from what enforcement agencies have subsequently operationalized. The distinction matters because it is the source of genuine confusion about what auto-captions need to achieve to be legally sufficient.

The normative criterion text

Success Criterion 1.2.2 of WCAG 2.1 reads, in full: "Captions are provided for all prerecorded audio content in synchronized media, except when the media is a media alternative for text and is clearly labeled as such." That is the complete normative requirement. It specifies that captions must be provided, that they must cover prerecorded audio, and that the media must be synchronized. It does not specify a percentage accuracy threshold, a synchronization timing window, a speaker identification requirement, or a method for measuring accuracy. The normative criterion is an obligation without an operational specification.

The Understanding document's accuracy language

The WCAG Understanding document for SC 1.2.2 — which is informative rather than normative, meaning it provides guidance on intent rather than binding requirements — describes what compliant captions should be. It states that captions should represent "a verbatim account for formal speech" and provide "an accurate account of the speech, including speaker identification and description of relevant sound effects." The word "accurate" appears here, not in the normative criterion. This is the textual basis for the accuracy requirement that auditors enforce. The Understanding document also notes that captions must be "synchronized" with the audio, but does not specify a timing tolerance window.

The gap that enforcement fills

This is where buyers, vendors, and auditors diverge. Compliance teams ask "what number?" WCAG says "accurate." The WCAG standard in its normative text says captions must exist and must be synchronized. The informative Understanding document says they must be accurate. Neither document says "99%," specifies a measurement method, or defines what "synchronized" means in milliseconds. The operational specifications — 99% word accuracy measured by the DCMP Captioning Key, synchronization within two seconds of corresponding audio — come from outside the WCAG text itself. They come from the DCMP Captioning Key (the accuracy benchmark), from enforcement agency practice (OCR resolution agreements), and from technical implementation guidance from bodies like the Access Board. The gap between what WCAG says normatively ("provide captions") and what auditors enforce operationally ("provide captions at 99%+ accuracy within a ±2-second synchronization window") is real. Understanding that gap is what makes the compliance landscape navigable for operators who need to make production decisions.

The practical consequence: WCAG 1.2.2 does not prohibit auto-generated captions. It does not restrict the technology used to generate captions at all. What it requires — in combination with the enforcement interpretation — is that the captions provided are accurate. If an auto-caption pipeline produces output that is accurate, it is compliant output. The question is whether the auto-caption pipeline you are using actually produces accurate output on your actual content. For technical training content, the data consistently says it does not. For prerecorded training video with specialized vocabulary, domain terminology, organizational proper nouns, and regulatory citation language, auto-captions consistently fall short of the 99% threshold by a margin large enough to be operationally significant.

What WCAG 1.2.2 does not address

Several things that practitioners sometimes expect WCAG to address are absent from SC 1.2.2. WCAG does not specify what percentage of a speaker's words must be captioned — it requires captions for all audio, but does not set a floor on per-sentence accuracy. WCAG does not define the technology stack that must generate captions. WCAG does not distinguish between human-generated and AI-generated captions. WCAG does not specify a vendor, a file format, or a review process. What WCAG does establish, through the combined normative criterion and informative Understanding document, is that the captions provided must convey the same information as the audio track in a synchronized, accurate form. How that output is produced is implementation detail. Whether the output meets the standard is the compliance question — and that question is answered by measurement, not by the presence or absence of the auto-caption toggle.

Where the 99% threshold comes from

The 99% accuracy figure that audit reports cite, that OCR resolution agreements reference, and that the industry has converged on as the operative benchmark does not appear in WCAG, in the ADA, in Section 508, or in any federal statute. It comes from the DCMP Captioning Key — the operational standard published by the Described and Captioned Media Program, a federal program funded by the U.S. Department of Education under the Individuals with Disabilities Education Act (IDEA) Part D, maintained in collaboration with the National Association of the Deaf. The DCMP Captioning Key is the document that operationalizes "accurate" into a number and a measurement protocol.

The DCMP measurement methodology

The DCMP Captioning Key defines caption accuracy at the word level, measured on a sampled passage from the caption file. Accuracy is calculated as: (correct words / total words) × 100. Every word that appears in the audio must appear correctly in the caption track; every word that does not appear, or appears in an incorrect form, reduces the accuracy score. The measurement treats all word categories equally: technical terms, proper nouns, acronyms, organizational vocabulary, and regulatory citation names each count the same as a common English word. The four error categories are: substitutions (wrong word in place of correct word), insertions (extra word added that is not in the audio), deletions (word present in audio but missing from caption), and formatting errors (incorrect speaker identification, missing sound effect descriptions when essential to meaning, punctuation errors that change meaning). Each error occurrence reduces the accuracy score by one count.
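The word-level part of that calculation can be sketched in a few lines. This is a minimal illustration, not the DCMP protocol itself: it aligns a human-verified reference transcript against the caption text with a standard edit-distance alignment, counts substitutions, insertions, and deletions, and divides by the reference word count. The tokenization rule, the function names, and the choice to charge insertions against the reference length are all assumptions; formatting errors (speaker identification, sound effects, meaning-changing punctuation) are not counted here and would need a separate manual pass.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into word tokens (assumed tokenization rule)."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def count_word_errors(reference: List[str], hypothesis: List[str]) -> Dict[str, int]:
    """Count substitutions, insertions and deletions via edit-distance alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum edits to turn reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = reference[i - 1] == hypothesis[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + (0 if same else 1),  # match or substitution
                cost[i - 1][j] + 1,                       # deletion
                cost[i][j - 1] + 1,                       # insertion
            )
    # Walk back through the table to attribute each edit to a category.
    errors = {"substitutions": 0, "insertions": 0, "deletions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and reference[i - 1] == hypothesis[j - 1]
                and cost[i][j] == cost[i - 1][j - 1]):
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
            errors["substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            errors["deletions"] += 1
            i -= 1
        else:
            errors["insertions"] += 1
            j -= 1
    return errors


def word_accuracy(reference_text: str, caption_text: str) -> float:
    """Percent word accuracy of caption_text against a verified reference."""
    reference = tokenize(reference_text)
    if not reference:
        raise ValueError("reference transcript is empty")
    errors = count_word_errors(reference, tokenize(caption_text))
    correct = len(reference) - sum(errors.values())
    return max(0.0, correct / len(reference) * 100)


if __name__ == "__main__":
    spoken = "Run kubectl get pods to list the kubelet status."
    caption = "Run cube control get pods to list the cubelet status."
    print(f"{word_accuracy(spoken, caption):.1f}% word accuracy")
```

In the sample above, "kubectl" becomes "cube control" (one substitution plus one insertion) and "kubelet" becomes "cubelet" (one substitution): three errors against nine reference words, or about 66.7% on that sentence.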

Beyond word accuracy: synchronization, speaker identification, and sound effects

The DCMP Captioning Key specifies requirements beyond word-level accuracy. Synchronization: captions must appear within two seconds of the corresponding audio. A caption that appears three seconds after the speaker finishes the sentence fails the synchronization requirement even if every word is correct. Speaker identification: in content with multiple distinguishable speakers, each speaker's captions must be labeled. A training video where a host interviews a subject-matter expert requires speaker labels on each speaker's caption lines; auto-captions generated by platforms like YouTube and Teams typically do not add speaker labels to caption files even when speaker diarization is available in the platform's live transcription interface. Sound effect description: when sound effects are essential to the meaning of the content — a safety alarm sounding, a system notification chime, a procedural audio cue — the caption track must include a description. Most auto-caption systems do not generate sound effect descriptions.
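Two of these checks lend themselves to a rough automated first pass. The sketch below parses an SRT file, flags cues whose start time drifts more than two seconds from a hand-measured speech onset, and flags cues with no speaker label. The `spoken_starts` list (onset times a reviewer notes while spot-checking, one per cue), the label convention ("HOST:" or ">>" at the start of a cue), and all function names are assumptions for illustration; sound effect descriptions still need a human listener.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")
# Assumed label convention: "HOST:", "Dr. Lee:" or ">>" at the start of a cue.
SPEAKER_LABEL = re.compile(r"^\s*(?:>>|[A-Z][\w .'-]{0,30}:)")
SYNC_TOLERANCE_SECONDS = 2.0


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float
    text: str


def to_seconds(stamp: str) -> float:
    """Convert an HH:MM:SS,mmm (or HH:MM:SS.mmm) timestamp to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt_text: str) -> List[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing is None:
            continue
        start, end = lines[timing].split("-->")
        text = "\n".join(lines[timing + 1:])
        cues.append(Cue(to_seconds(start), to_seconds(end), text))
    return cues


def out_of_sync(cues: Sequence[Cue], spoken_starts: Sequence[float],
                tolerance: float = SYNC_TOLERANCE_SECONDS) -> List[int]:
    """Indexes of cues starting more than `tolerance` seconds from the speech."""
    return [i for i, (cue, spoken) in enumerate(zip(cues, spoken_starts))
            if abs(cue.start - spoken) > tolerance]


def unlabeled(cues: Sequence[Cue]) -> List[int]:
    """Indexes of cues that do not begin with a recognizable speaker label."""
    return [i for i, cue in enumerate(cues) if not SPEAKER_LABEL.match(cue.text)]


if __name__ == "__main__":
    sample = """1
00:00:01,000 --> 00:00:04,000
HOST: Welcome to the safety briefing.

2
00:00:09,500 --> 00:00:12,000
the forklift alarm means stop.
"""
    cues = parse_srt(sample)
    print("out of sync:", out_of_sync(cues, spoken_starts=[0.8, 6.0]))
    print("unlabeled:", unlabeled(cues))
```

The label check is a heuristic: it is only meaningful for multi-speaker content, and a flagged cue is a prompt for review, not a finding.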

Why enforcement uses 99% from DCMP

The DOJ and OCR have not adopted a different accuracy standard for enforcement — they have adopted the DCMP Captioning Key as the reference standard because it is the only operational specification in the field that translates WCAG's "accurate" into a measurable number with a documented measurement protocol. In ADA Title II and Section 504 OCR resolution agreements from 2020 through 2025, the pattern is consistent: when OCR samples caption files from an organization's video library and finds error rates that exceed 1% on the DCMP protocol (that is, accuracy below 99%), the resolution agreement requires remediation and the establishment of a documented accuracy verification process going forward. DOJ technical assistance guidance published in 2023 specifically states that AI-generated captions for "technical or specialized content" require accuracy verification and cannot be assumed compliant on the basis of generation alone. The guidance does not prohibit AI captions; it prohibits unverified AI captions in contexts where accuracy is not established for the specific content type.

The gap between DCMP and vendor accuracy claims

The accuracy numbers that platform vendors publish ("our AI achieves 99% accuracy," "our ASR engine scores 97% on standard benchmarks") are almost universally corpus-level figures measured on general speech datasets, not DCMP-protocol measurements against technical training content. LibriSpeech, the most commonly used ASR benchmark, is a corpus of audiobook readings of pre-1923 public domain literature. The vocabulary distribution of that corpus is entirely different from an engineering onboarding module, a HIPAA compliance training, or a safety briefing for chemical operators. A model that achieves 97% accuracy (a 3% word error rate) on LibriSpeech will achieve substantially lower accuracy on your technical training content because the vocabulary that drives accuracy down (low-frequency technical terms, organizational proper nouns, platform-specific terminology) is absent from the benchmark but present throughout your content. The gap between a vendor's benchmark accuracy and your actual content accuracy is consistently 8–20 percentage points for technical training content. For the accuracy benchmarks by vertical, including engineering, medical, compliance, and sales content, the pattern is consistent: benchmark accuracy overstates real-content accuracy, and the overstatement is largest in the content categories most common in formal L&D programmes.
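One rough way to see how far a given video sits from benchmark-style speech is to measure how dense its specialized vocabulary is. The sketch below is illustrative only: it assumes you maintain a glossary of single-word terms, and `technical_density` is a hypothetical metric for triaging which videos need review first, not part of the DCMP protocol or any vendor's method.

```python
import re
from typing import Iterable


def technical_density(transcript: str, glossary: Iterable[str]) -> float:
    """Glossary-term occurrences per 1,000 transcript words.

    Only single-word glossary terms are matched; multi-word terms such as
    "covered entity" would need phrase matching, which this sketch omits.
    """
    tokens = re.findall(r"[a-z0-9]+", transcript.lower())
    if not tokens:
        return 0.0
    terms = {term.lower() for term in glossary}
    hits = sum(1 for token in tokens if token in terms)
    return hits * 1000 / len(tokens)


if __name__ == "__main__":
    glossary = ["kubectl", "kubelet", "ConfigMap", "Fargate"]
    transcript = "Run kubectl and restart the kubelet on every worker node"
    print(f"{technical_density(transcript, glossary):.0f} glossary terms per 1,000 words")
```

Videos with a high density are the ones where a vendor's benchmark figure is least likely to carry over, so they are reasonable candidates for the first measurement samples.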

The regulatory frameworks and their position on auto-captions

Five regulatory frameworks govern caption compliance for L&D video in 2026: WCAG 2.1 Level AA, ADA Title II, ADA Title I, Section 508, and the European Accessibility Act. Each has a different covered entity scope, a different enforcement agency, and a somewhat different technical standard. None prohibits auto-generated captions. Each requires accurate captions, operationalized differently. The compliance question for auto-captions is identical across all five frameworks: does the auto-generated output meet the accuracy standard that the framework enforces? The US compliance matrix maps which framework applies to which organization type; this section covers what each framework says about auto-generated captions specifically.

WCAG 2.1 Level AA — Success Criteria 1.2.2 and 1.2.4

WCAG 2.1 Level AA applies to all web content, including video-on-demand in LMS platforms delivered via browser interfaces and embedded video hosted on platforms like YouTube, Vimeo, Kaltura, and Panopto. SC 1.2.2 covers prerecorded synchronized media; SC 1.2.4 covers live captions. WCAG does not restrict the technology used to generate captions — it only requires that the captions provided are accurate and synchronized. Auto-generated captions that meet the accuracy standard (operationalized by enforcement as 99%+ DCMP word accuracy) are WCAG-compliant output. Auto-generated captions that do not meet that standard are not. WCAG is technology-neutral but outcome-mandatory: the outcome required is accurate captions, regardless of whether AI, human captioners, or a hybrid workflow produced the output.

ADA Title II — DOJ 2024 final rule

The DOJ's 2024 final rule for ADA Title II requires WCAG 2.1 Level AA for web content at state and local government entities, with a compliance deadline that passed in April 2026 for most covered entities. Video content in browser-accessible LMS platforms is subject to SC 1.2.2. The DOJ has stated in technical assistance that AI-generated captions for "technical or specialized content" require accuracy verification before the organization can claim compliance. The rule does not prohibit AI captions — it does not specify the caption production method at all. What it prohibits is inaccurate captions. For public universities, public community colleges, state agencies, and local government bodies, ADA Title II is the operative framework for training video caption compliance. An organization that has enabled YouTube auto-captions on its training video library, has not verified accuracy, and has not documented a QA process is not compliant with ADA Title II regardless of how many videos show a "captioned" status in the LMS. The ADA Title II training video fix guide covers the remediation workflow in operational detail.

Section 508 — 2017 ICT Refresh

Section 508 requires WCAG 2.0 Level AA for federal agency information and communication technology, including training video. The 2017 ICT Refresh incorporated WCAG 2.0 AA by reference, making SC 1.2.2 the operative standard for federal agency training video. The Access Board's technical guidance for Section 508 describes captions that should be "accurate, synchronized, complete, and properly placed" — language that tracks the DCMP Captioning Key without explicitly citing it. No numeric threshold appears in Section 508 itself or in the Access Board technical guidance; the 99% DCMP standard is the industry reference that auditors apply when evaluating Section 508 caption compliance. For federal agencies and their contractors delivering training video, auto-generated captions face the same compliance question as under WCAG and ADA Title II: are they accurate on the specific content type, and can that accuracy be documented? For technical training content at federal agencies — IT security training, procurement compliance modules, agency-specific regulatory training — the answer is almost certainly no without a documented QA process.

ADA Title I — employment communications

ADA Title I covers private employers with fifteen or more employees. It does not require WCAG compliance per se; it requires effective communication for employees with disabilities across all employment contexts. EEOC guidance interprets effective communication for training video as requiring that employees with hearing disabilities can access the same information from training content as employees without hearing disabilities. For training video, that standard effectively requires accurate synchronized captions because captions are the standard effective communication accommodation for video content at scale. ADA Title I is the framework that covers training video at private companies — which is also the majority of the L&D profession's employer base. The employee communications captioning post covers the ADA Title I scope in detail, including the compliance gap for all-hands recordings, town halls, and executive video messages that most L&D programmes do not capture. For training video specifically, Title I enforcement is less structured than Title II OCR enforcement, but the effective communication obligation is real and EEOC investigations have resulted in remediation requirements for inaccurate caption tracks on technical training content.

European Accessibility Act (EAA) 2025

EU Directive 2019/882, the European Accessibility Act, became effective in June 2025 for new products and services. For L&D teams at organizations with EU operations or EU-based learners accessing training content, EAA creates an accessibility obligation for digital training content that includes video. The implementing technical standard is EN 301 549, which incorporates WCAG 2.1 Level AA requirements. Clause 7.1.2 of EN 301 549 covers caption synchronization; the directive does not include a numeric word-accuracy threshold separate from the WCAG requirements it incorporates. "Adequate quality" in EAA implementation guidance has been interpreted as requiring captions that convey the same information as the audio track, including specialized vocabulary, which tracks the DCMP Captioning Key's accuracy standard in practical effect. Organizations with EU operations should treat EAA as creating the same practical accuracy obligation as ADA Title II: auto-captions on technical training content require verification, and unverified auto-captions are not a compliant caption solution.

The pattern across all five frameworks: none prohibits AI-generated captions; all require accuracy; the operational accuracy standard consistently points to 99% DCMP word accuracy; and auto-generated captions on technical training content consistently fail that standard without a documented QA process. The regulatory landscape in 2026 is not hostile to AI in captioning — it is hostile to inaccurate captions, and auto-generated captions on training content are currently producing inaccurate output at scale.

Platform-by-platform auto-caption accuracy reality

The accuracy numbers below are drawn from GlossCap's testing on real training content, published research on ASR performance by content type, and documented DCMP-protocol measurements. They represent accuracy ranges on specific content types, not benchmark corpus performance. The variation within each range reflects differences in recording quality, speaker characteristics, content vocabulary density, and the specific ASR model version in use at the time of measurement. These numbers will change as platforms update their ASR models — changes that platforms do not typically announce to users and that are not reflected in published accuracy claims.

Platform auto-caption accuracy by content type (DCMP-protocol word accuracy)
Platform	General speech	Technical training	Proper-noun-heavy	Compliance content	Meets 99%?
YouTube auto-captions	91–95%	76–89%	71–84%	79–87%	No, never on technical content
Microsoft Teams auto-captions (Azure Cognitive Services)	88–92%	78–85%	74–82%	80–86%	No — and live captions are discarded from recording
Zoom auto-captions	85–91%	75–83%	72–81%	77–84%	No
Otter.ai	92–95%	80–88%	74–82%	81–87%	No on technical content
Rev AI (auto tier)	93–97%	82–89%	77–85%	82–89%	No on technical content; compare at /compare/rev-vs-glosscap
Whisper-default (no glossary)	95–97%	83–89%	79–86%	83–88%	No on technical content
GlossCap (per-customer glossary + Whisper)	97–99.4%	97–99.4%	97–99.2%	97–99.3%	Yes — with glossary on technical content

Reading the table

Several patterns are visible in this data. First, the gap between auto-captions and the 99% threshold is smallest on general speech, and still doesn't reliably reach 99% for any platform except GlossCap with per-customer glossary. Even the best general-speech auto-caption performance (a 97% ceiling for both Rev AI auto and Whisper-default) falls short of the 99% threshold on the DCMP protocol, which measures word-level accuracy including proper nouns and technical terms. Second, the gap is largest on technical training content, exactly the content type most common in formal L&D programs. Engineering onboarding, product certification, compliance training, safety training: these content types consistently score in the 75–89% range across all platforms. Third, proper-noun-heavy content (content with instructor names, company names, location names, product names, role titles) scores lower than even general technical content, because proper nouns are by definition outside the training distribution of general ASR models. Fourth, GlossCap with per-customer glossary achieves 97–99.4% on technical training content because glossary biasing in the decoder corrects the specific class of errors that drives technical content below threshold. The Whisper accuracy benchmarks by vertical post shows this data in more detail, broken down by industry vertical.

Why platforms overstate accuracy

The accuracy claims on platform marketing pages are real numbers from real measurements — they just measure something different from what you need them to measure. YouTube, Teams, Zoom, and the major ASR vendors measure accuracy on curated benchmark datasets designed to produce high scores. Those datasets contain professional-quality speech recordings, mainstream English vocabulary, and minimal domain-specific terminology. Your training content has none of those characteristics. The engineering onboarding module has "kubectl," "Fargate," "ConfigMap," "kubelet," "ingress controller." The HIPAA training has "PHI," "covered entity," "minimum necessary standard," "Business Associate Agreement," "Notice of Privacy Practices." The safety training has OSHA citation codes, chemical names, equipment model numbers, site-specific location names. None of these terms are in the benchmark dataset, and all of them fail at elevated rates in auto-caption output. The difference between 95% on LibriSpeech and 80% on your HIPAA training is not a failure of the model — it is the expected behavior of a model that was not trained on your domain vocabulary and has not been given your glossary.

The Teams live caption trap

Microsoft Teams deserves a separate note because it generates two entirely different types of "captions" that serve entirely different purposes. Live captions displayed during a Teams meeting — the real-time subtitles at the bottom of the meeting screen — are generated by Azure Cognitive Services at 88–92% accuracy on general speech. These live captions are displayed to meeting participants during the meeting and are then discarded. They are not embedded in the Teams recording. They cannot be exported as an SRT or VTT file after the meeting. They do not exist as a caption track after the meeting ends. When the meeting recording is posted to SharePoint or saved to a Teams channel, the recording has no captions unless a separate SRT file is uploaded. An organization that enabled live captions for every Teams meeting has not captioned any of its meeting recordings. This is one of the most widespread misunderstandings in the compliance landscape and one of the most thoroughly documented failure modes in OCR enforcement. The employee communications captioning post covers this failure mode in the context of all-hands meetings and town halls; it applies identically to any training delivered via Teams live meeting.
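A quick way to surface this gap in an exported recording archive is to look for recordings that have no caption file next to them. The sketch below is a minimal illustration that assumes recordings have been downloaded to a local folder and that a caption file, where one exists, is a sidecar with the same base name and an `.srt` or `.vtt` extension. The folder layout and the function name are hypothetical; a real SharePoint or LMS library would need its own API calls.

```python
from pathlib import Path
from typing import List

VIDEO_SUFFIXES = {".mp4", ".mov", ".mkv"}
CAPTION_SUFFIXES = (".srt", ".vtt")


def recordings_without_captions(folder: Path) -> List[Path]:
    """Return video files under `folder` that have no sidecar caption file."""
    missing = []
    for video in sorted(folder.rglob("*")):
        if not video.is_file() or video.suffix.lower() not in VIDEO_SUFFIXES:
            continue
        # A sidecar is assumed to share the video's name, e.g. town-hall.vtt.
        has_sidecar = any(video.with_suffix(suffix).exists()
                          for suffix in CAPTION_SUFFIXES)
        if not has_sidecar:
            missing.append(video)
    return missing


if __name__ == "__main__":
    for path in recordings_without_captions(Path(".")):
        print(f"no caption file: {path}")
```

A recording that passes this check only has a caption file present. Whether that file is accurate is a separate measurement.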

LMS machine-generated caption systems

Every major LMS platform either generates captions natively, inherits captions from embedded video hosting platforms, or both. The compliance risk profile differs significantly by platform. What is consistent across all platforms: the LMS "captioned" status field measures the presence of a caption file, not the accuracy of the caption content. An auto-generated caption file that scores 82% on a DCMP audit shows "captioned: yes" in the LMS in exactly the same way as a human-reviewed caption file that scores 99.4%. The status field is not a compliance indicator — it is a file presence indicator. Operators who treat "all videos show 'captioned'" as a compliance outcome are measuring the wrong thing.
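The difference between file presence and verified accuracy can be made explicit in whatever tracking sheet or database sits alongside the LMS. The sketch below is one hypothetical shape for such a record, not an LMS feature: the field names, the status labels, and the 365-day re-verification window are assumptions, while the 99% threshold is the DCMP figure discussed above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CaptionRecord:
    video_id: str
    has_caption_file: bool                     # what the LMS status field reports
    measured_accuracy: Optional[float] = None  # percent, from a sampled measurement
    verified_on: Optional[date] = None         # date of that measurement


def caption_status(record: CaptionRecord, today: date,
                   threshold: float = 99.0, max_age_days: int = 365) -> str:
    """Classify a video by caption evidence rather than by file presence."""
    if not record.has_caption_file:
        return "missing"
    if record.measured_accuracy is None or record.verified_on is None:
        return "unverified"  # captioned in the LMS, but accuracy never measured
    if record.measured_accuracy < threshold:
        return "below_threshold"
    if (today - record.verified_on).days > max_age_days:
        return "stale"  # assumed re-verification window has lapsed
    return "verified"


if __name__ == "__main__":
    today = date(2026, 6, 13)
    library = [
        CaptionRecord("safety-101", True),
        CaptionRecord("hipaa-basics", True, 82.0, date(2026, 5, 1)),
        CaptionRecord("onboarding", True, 99.4, date(2026, 5, 1)),
    ]
    for record in library:
        print(record.video_id, caption_status(record, today))
```

All three sample videos would show "captioned" in an LMS report; only the third has evidence behind it.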

Cornerstone OnDemand

Cornerstone OnDemand auto-generates captions on video upload using a third-party ASR provider integrated into the Cornerstone Extended Enterprise or Content Suite workflow. Auto-generated captions are enabled by default for video assets uploaded through the Cornerstone content management interface. The auto-generated captions appear as the default caption track and are not flagged in the learner-facing UI as machine-generated rather than human-reviewed. L&D admins must verify or replace auto-generated captions — the platform does not verify accuracy on the admin's behalf, does not report accuracy scores, and does not trigger a review workflow when captions are auto-generated. The practical risk for Cornerstone users is that every video uploaded through the standard workflow has auto-generated captions that are treated by the system as equivalent to compliant captions; without an admin review workflow, the entire Cornerstone catalogue may be technically non-compliant despite showing 100% captioning coverage in platform reporting. The Cornerstone compliance configuration best practice is: disable auto-caption generation, upload SRT or VTT files from a verified caption workflow, and configure the caption track as default-on for all content.

Kaltura REACH

Kaltura REACH is a two-tier caption service embedded in the Kaltura MediaSpace platform. The machine-generated tier (REACH MT) costs approximately $0.02–$0.08 per minute and produces captions at 85–93% accuracy on general content, with substantially lower accuracy on technical vocabulary and organizational proper nouns. The REACH Professional tier — human review of machine-generated transcripts — costs approximately $0.50–$1.25 per minute and produces captions that meet the WCAG accuracy standard when the human reviewer is qualified and follows the DCMP correction protocol. The REACH MT output is not sufficient for technical training content and should not be published as a compliant caption track without human review. The tiering in Kaltura REACH creates a compliance-relevant design decision: selecting the MT tier for cost savings means the organization is publishing non-compliant captions at a lower cost per minute. The compliance cost of REACH MT — which accumulates as compliance exposure in the video library — exceeds the per-minute cost savings when the cost of an OCR remediation process is factored in. Kaltura also inherits captions from embedded YouTube and Vimeo videos, which brings the same accuracy problems as direct YouTube captioning; the 3Play vs GlossCap comparison covers human-review vendor alternatives for Kaltura-hosted content. For Kaltura users with HIPAA training content, REACH MT is not a compliant solution regardless of Kaltura's overall REACH marketing claims.

Panopto ASR

Panopto's built-in ASR generates automatic captions through a site-level configuration setting. The Panopto ASR model is trained on general speech and achieves 83–88% accuracy on lecture and training content at baseline, which is below the 99% DCMP threshold for all technical and specialized content types. Panopto allows organizations to set a site default that auto-generates captions for all uploaded video: a convenient configuration that produces a library that looks compliant but is not. The recommended Panopto compliance configuration is the opposite of that auto-generate default: ASR disabled, with captions added by file upload (SRT or VTT from a verified caption workflow) or through a GlossCap integration that produces verified caption output before the SRT file is uploaded to Panopto. Panopto's ASR output can also be exported as a caption file and edited before publishing. This is the most realistic path to using Panopto ASR without creating systemic compliance exposure, but it requires building an edit-and-verify workflow into the content production process for every video, and that is infrastructure most L&D teams do not have. The university lecture capture captioning post covers the Panopto compliance configuration for higher-education institutions in detail.

TalentLMS

TalentLMS has no native ASR or auto-caption system. It relies on the video hosting platform for captions: embedded YouTube videos inherit YouTube auto-captions, embedded Vimeo videos inherit Vimeo auto-captions, and direct-uploaded video files can have SRT or VTT files attached at the course level. The compliance risk in TalentLMS is concentrated in the embedded YouTube pathway: when a course author embeds a YouTube video by URL, the YouTube auto-captions for that video become the de facto caption track for the TalentLMS course content. If the video is a technical training video with YouTube auto-captions at 76–89% accuracy, the TalentLMS course is non-compliant even though the LMS itself did nothing wrong. The remediation for embedded YouTube content is to upload the video directly to TalentLMS, or to Vimeo with a verified SRT file, rather than embedding the public YouTube URL with its auto-generated caption track. Because TalentLMS generates no captions of its own, a directly uploaded video carries only the caption file the L&D admin attaches, so the verified SRT file is the caption track learners see.

Docebo

Docebo's Central Repository model manages video assets at the asset level, separate from the course structure. Captions in Docebo are set at the video asset level — every course that includes a given asset inherits the caption configuration of that asset. Docebo does not generate captions natively. Embedded YouTube and Vimeo videos inherit platform captions. For YouTube embeds, the YouTube auto-caption track is the default unless the YouTube video was published with a manually uploaded SRT file. The Docebo compliance approach that actually works: host all video assets in Docebo's Central Repository with uploaded SRT files (not embedded YouTube URLs), manage glossary terms at the asset level, and establish a verification step in the content publishing workflow before assets are marked active in the repository. Docebo's asset-level caption model is actually favorable for a systematic caption compliance programme — the challenge is ensuring the initial caption file that goes into the Central Repository is verified, rather than relying on platform auto-captions inherited from the hosting platform.

Absorb LMS

Absorb LMS has no native ASR and does not generate captions. The "Show captions by default" setting in Absorb is a learner-experience delivery configuration — it controls whether the caption track is displayed by default when the player loads, not whether captions are generated. The compliance risk in Absorb is that the "Show captions by default" toggle creates the appearance of an accessibility configuration, which may lead administrators to treat it as a compliance measure. It is not. If the caption file uploaded to the Absorb video player is an auto-generated file from YouTube or a platform ASR system, the Absorb "show captions by default" setting means learners see non-compliant captions by default — which is a worse user experience than not showing captions, because the learner is looking at a caption track they believe is accurate but which contains systematic errors. Absorb's compliance configuration requires uploading verified SRT or VTT files for every video asset, not configuring the display toggle.

Workday Learning

Workday Learning's integration with Workday HCM means that training content is managed alongside HR data in a single system. Captions in Workday Learning are added at the video level, and Workday has no native ASR. The primary auto-caption risk in Workday Learning comes from embedded YouTube or external video links — when a learning admin links to an external YouTube video in a Workday Learning course, the YouTube auto-captions are what learners see. Workday Learning's architecture makes it relatively straightforward to use the SRT upload approach for hosted video content; the challenge is the common practice of linking to external video rather than hosting video inside Workday, which imports whatever caption configuration the external platform uses. For Workday Learning compliance, the safest configuration is hosting video content inside Workday or through an integrated media platform with verified caption files, rather than linking to external YouTube or Vimeo URLs with auto-generated captions.

The cross-LMS pattern: every LMS platform has a mechanism for uploading verified SRT or VTT caption files. Every LMS platform also has at least one pathway — embedded YouTube, built-in ASR, inherited platform captions — through which non-compliant auto-generated captions can appear in the training library. The compliance problem is not that the LMS platforms lack caption support; it is that the most convenient caption pathway (auto-generation or embedding with inherited captions) is also the non-compliant pathway. The LMS migration caption checklist covers how to audit and remediate caption configurations when moving between platforms.

Content-type decision matrix

The accuracy gap between auto-captions and the 99% threshold is not uniform across content types. It is smallest on conversational general-speech content and largest on the content types most common in formal L&D programs. The decision matrix below maps content type to auto-caption accuracy range, compliance status, and human review requirement. The accuracy ranges reflect DCMP-protocol word accuracy on the specific content type, not general-speech benchmark performance.

Auto-caption compliance status by content type
Content type	Auto-caption accuracy range (DCMP)	Meets 99% threshold?	Human review required?
General executive communication (all-hands, welcome videos)	88–93%	No — technical segments fail	Review recommended; required for recorded distribution
Compliance and regulatory training (HIPAA, OSHA, ADA, GDPR)	79–87%	Never	Yes — always
Technical / engineering training	76–85%	Never	Yes — always
Medical and clinical training	74–83%	Never	Yes — always
Sales enablement and product training	78–87%	Never (product names fail systematically)	Yes — always
Safety / EHS training	82–89%	Never	Yes — always
General soft-skills (no technical terms, no proper nouns)	90–95%	Verify per clip — never assume	Review recommended; document verification if skipping
Onboarding with company-specific vocabulary	81–88%	Never	Yes — always
Leadership and management training	85–92%	Rarely — contains role titles and process names	Review recommended
DEI and HR policy training	83–90%	Rarely — contains policy citation language	Review recommended

Reading the matrix

The "Never" entries in the "Meets 99% threshold?" column reflect a consistent finding: content types that involve specialized vocabulary — regulatory citations, technical terminology, product names, medical terminology, organizational proper nouns — consistently fail the 99% DCMP threshold regardless of which auto-caption platform generates the output. This is not a vendor-specific finding; it is a structural consequence of the mismatch between how ASR models are trained (on general conversational speech) and what technical training content contains (domain-specific vocabulary at 5–15× the density of those training corpora).

The soft-skills category shows 90–95% accuracy on a "no technical terms, no proper nouns" basis — but very few real soft-skills training videos actually meet those criteria. A leadership training video that references "the Agile methodology" has just added a proper noun. A communication skills module where the instructor introduces themselves as "Sarah Kowalski, Head of Learning Experience at Meridian Health" has added a personal name, a job title, and a company name. An unconscious bias training video that discusses "the Rooney Rule" has added a regulatory/policy proper noun. In practice, the content profile that would reliably score 90%+ without any proper nouns or domain terms is extremely rare in an organizational training library. The practical rule: verify per clip, not per content category, and document the verification.

The category misclassification problem

The content-type matrix creates an apparent solution: put technical content through human review and auto-caption general content. In practice, this tiered approach fails because the tier assignment requires knowing which content has technical vocabulary — and L&D libraries are not organized by vocabulary profile. They are organized by topic (onboarding, compliance, skills), by audience (engineering, sales, HR), and by format (video, course, job aid). None of these organizational schemes reliably predicts ASR failure profile. An "onboarding" video might be a general welcome message (90–93% auto-caption accuracy) or a system administration walkthrough (76–82%). A "leadership" video might be a conversational interview with the CEO (89–92%) or a detailed walkthrough of the organization's project management framework referencing tools, acronyms, and platform names (82–86%). Without a vocabulary analysis at the asset level — not the category level — a tiered auto-caption strategy will systematically misclassify content and produce compliance exposure in the "general content" tier. The glossary architecture post covers how to approach vocabulary analysis at scale across an L&D library.

The five conditions for compliant auto-caption use

Auto-captions are not inherently non-compliant. The regulatory frameworks do not prohibit them. What makes auto-captions compliant or non-compliant is whether the output meets the accuracy standard — and whether the organization can document that the output meets the accuracy standard. The five conditions below represent the complete set of requirements that must be simultaneously satisfied for auto-generated captions to be a defensible compliance solution. Any single condition not met means the auto-caption approach is not compliant for that content.

Condition 1: Content type is conversational speech with minimal specialized vocabulary

The content must have minimal technical terms, proper nouns, acronyms, regulatory citation names, product names, organizational names, and domain-specific terminology. In practice, this means conversational general-speech content with no organizational vocabulary — no instructor names, no company references, no tool names, no role titles. Very few videos in an organizational training library meet this criterion. The practical test: read a random 60-second segment of the transcript and count proper nouns and domain-specific terms. If there are more than two or three per minute, the content does not meet condition 1.
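
The practical test above can be roughed out in code. The sketch below is an illustration under stated assumptions, not a validated linguistic tool: it treats mid-sentence capitalized words, all-caps acronyms, and entries in a hypothetical team-maintained `glossary` as specialized terms, and it counts a run of adjacent terms (such as a full name) once. The function names and the heuristics are assumptions introduced here for illustration.

```python
import re
from typing import Set


def count_specialized_terms(transcript: str, glossary: Set[str]) -> int:
    """Count likely proper nouns, acronyms, and glossary terms in a transcript."""
    count = 0
    # Split on sentence-ending punctuation so sentence-initial capitals are ignored.
    for sentence in re.split(r"(?<=[.!?])\s+", transcript.strip()):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        previous_was_term = False
        for index, word in enumerate(words):
            is_acronym = len(word) >= 2 and word.isupper()
            is_mid_sentence_capital = (
                index > 0
                and word[0].isupper()
                and word.split("'")[0] != "I"  # skip the pronoun "I", "I'm", "I've"
            )
            is_term = is_acronym or is_mid_sentence_capital or word.lower() in glossary
            # A run of adjacent terms ("Sarah Kowalski") counts as one term.
            if is_term and not previous_was_term:
                count += 1
            previous_was_term = is_term
    return count


def specialized_terms_per_minute(
    transcript: str, duration_seconds: float, glossary: Set[str]
) -> float:
    """Convert the term count for a segment into a per-minute density."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return count_specialized_terms(transcript, glossary) / (duration_seconds / 60)
```

A density above roughly three terms per minute suggests the segment does not meet condition 1. The heuristic will miss lowercase domain terms that are not in the glossary, so a human reading of the transcript should still make the final call.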

Condition 2: Accuracy is independently measured at 99%+ using the DCMP word-accuracy protocol

The organization must have independently verified that the auto-generated captions on the specific content achieve 99%+ word accuracy using the DCMP Captioning Key measurement protocol. This means: selecting a sample of the caption file (typically 10–15% of the total running time, drawn from multiple segments), comparing the caption text to a manually transcribed reference transcript word by word, counting substitutions, insertions, deletions, and formatting errors, and calculating accuracy as (correct words / total words) × 100. Vendor accuracy claims, platform documentation, and internal estimates do not satisfy condition 2. A documented measurement on the actual content, using the DCMP protocol, is required. The caption QA methodology post covers the DCMP sampling protocol in operational detail.
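
The word-level comparison described above can be sketched as follows. This is a minimal illustration, not the DCMP Captioning Key procedure itself: it aligns the two word sequences with Python's standard-library `difflib.SequenceMatcher` (an approximation of a minimal edit alignment), applies the (correct words / total words) × 100 formula against the reference transcript, and does not model formatting errors, which a reviewer would need to count separately. The function names and the punctuation handling are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List

_PUNCTUATION = ".,;:!?\"'()[]"


def normalize(text: str) -> List[str]:
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (raw.strip(_PUNCTUATION).lower() for raw in text.split())
    return [word for word in words if word]


def word_accuracy(reference: str, caption: str) -> Dict[str, float]:
    """Score a caption sample against a manually transcribed reference."""
    ref, hyp = normalize(reference), normalize(caption)
    if not ref:
        raise ValueError("reference transcript is empty")
    counts = {"substitutions": 0, "insertions": 0, "deletions": 0}
    correct = 0
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_len, hyp_len = i2 - i1, j2 - j1
        if tag == "equal":
            correct += ref_len
        elif tag == "replace":
            # Pair words one-to-one as substitutions; the remainder is
            # either dropped reference words or extra caption words.
            counts["substitutions"] += min(ref_len, hyp_len)
            counts["deletions"] += max(ref_len - hyp_len, 0)
            counts["insertions"] += max(hyp_len - ref_len, 0)
        elif tag == "delete":
            counts["deletions"] += ref_len
        elif tag == "insert":
            counts["insertions"] += hyp_len
    return {
        "accuracy_percent": 100 * correct / len(ref),
        "total_words": len(ref),
        "correct_words": correct,
        **counts,
    }
```

For example, scoring the caption "the study of farmco kinetics covers drug absorption" against the reference "the study of pharmacokinetics covers drug absorption" gives 6 correct words out of 7, one substitution, and one insertion: a single mangled term takes this short sample to about 85.7%.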

Condition 3: The accuracy verification is documented

The measurement conducted under condition 2 must be recorded with sufficient detail to be produced in an audit or OCR investigation. Documentation must include: the content title and unique identifier (video URL or LMS asset ID), the sample segments used for measurement (timestamps), the reference transcript methodology (how the reference was created), the error count by category (substitutions, insertions, deletions, formatting errors), the calculated accuracy percentage, the date of measurement, and the name of the person who conducted the measurement. A note in a spreadsheet that says "checked — looks good" does not satisfy condition 3. A log entry with all of the above fields does. The caption programme governance and policy template includes a documentation template for this log.
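
One way to keep the log honest is to make the required fields explicit in the record structure itself. The sketch below is a hypothetical record type, not the template referenced above; the class name, field names, and the blank-field check are assumptions chosen to mirror the list of fields in this condition.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class CaptionVerificationRecord:
    content_title: str
    asset_id: str  # video URL or LMS asset ID
    sample_segments: Tuple[Tuple[str, str], ...]  # (start, end) timestamps
    reference_method: str  # how the reference transcript was created
    substitutions: int
    insertions: int
    deletions: int
    formatting_errors: int
    accuracy_percent: float
    measured_on: date
    measured_by: str

    def missing_fields(self) -> List[str]:
        """Return the names of text fields left blank and empty sample lists."""
        missing = []
        for field in fields(self):
            value = getattr(self, field.name)
            if isinstance(value, str) and not value.strip():
                missing.append(field.name)
            elif isinstance(value, tuple) and not value:
                missing.append(field.name)
        return missing
```

A record whose `missing_fields()` result is non-empty is the structured equivalent of "checked, looks good": it should be rejected before it reaches the log.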

Condition 4: Accuracy is reverified when content or ASR model updates

Caption accuracy is not a one-time measurement. Two events trigger a requirement to reverify: (a) the content is updated, and (b) the platform's ASR model is updated. Content updates introduce new vocabulary that was not present when the original measurement was conducted; the new vocabulary may score below threshold even if the original content passed. ASR model updates — which platforms deploy without notifying users, sometimes multiple times per year — may change the model's performance on specific vocabulary categories in either direction. Organizations cannot rely on a single accuracy measurement conducted at upload time as a permanent compliance certification. A recurring verification schedule — quarterly, or triggered by content or model update events — is required to satisfy condition 4. Platform ASR model update detection is a known challenge: no major LMS or video hosting platform notifies users when the underlying ASR model changes.
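
The two triggers and the recurring schedule can be expressed as a simple check. This is a sketch under assumptions: the function name, the reason labels, and the 90-day default (standing in for "quarterly") are illustrative, and `asr_model_changed_on` will often be unknown because platforms do not announce model changes, which is why the interval check is included as a backstop.

```python
from datetime import date, timedelta
from typing import List, Optional


def reverification_reasons(
    last_verified: date,
    today: date,
    content_updated_on: Optional[date] = None,
    asr_model_changed_on: Optional[date] = None,
    max_age_days: int = 90,  # assumed stand-in for a quarterly schedule
) -> List[str]:
    """Return the reasons, if any, that a caption file needs re-measuring."""
    reasons = []
    if content_updated_on is not None and content_updated_on > last_verified:
        reasons.append("content_updated")
    if asr_model_changed_on is not None and asr_model_changed_on > last_verified:
        reasons.append("asr_model_updated")
    if today - last_verified > timedelta(days=max_age_days):
        reasons.append("interval_elapsed")
    return reasons
```

An empty list means the last measurement still stands; any entry means the accuracy measurement from condition 2 should be repeated and logged again.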

Condition 5: An exception and remediation process exists

Even content that passes conditions 1–4 at one point in time may fail at a subsequent measurement. The organization must have a defined process for what happens when auto-caption accuracy falls below threshold: who is notified, what the remediation workflow is (human review, regeneration with updated glossary, manual correction), and what the timeline for remediation is. Without a defined remediation process, an organization that discovers a non-compliant caption file has no documented path to resolution. That leaves it in a discovery-without-remediation posture, which is worse from a compliance standpoint than having no measurement process at all, because it documents awareness of the problem without documenting action. The remediation process should be part of the caption programme policy document, not an ad hoc decision made each time a problem is found.

Why conditions 1 and 2 together eliminate most training content

In practice, the content that passes condition 1 (minimal specialized vocabulary) is rare in organizational training libraries — most training content has organizational vocabulary that breaks auto-caption accuracy. The content that passes condition 2 (measured at 99%+ on DCMP protocol) is even rarer, because platform auto-captions on training content consistently score 76–89% on DCMP measurement. The intersection of "content with minimal specialized vocabulary that still scores 99%+ on DCMP measurement" is a very small subset of any training library — roughly the conversational, general-speech content with no organizational vocabulary and ideal recording conditions. For everything else, auto-captions require either a glossary-biased AI workflow that raises initial accuracy to 97–99% (which GlossCap's per-customer glossary approach is designed to provide), or human review before publication. The caption feedback loop post covers how a glossary-biased workflow can reduce the human review burden by raising the starting point from 82% to 97%, making the review step a light editing task rather than a full transcription re-check.

What "compliance through process" actually requires

Some L&D operators approach the auto-caption compliance question with a procedural solution: if the technology cannot guarantee compliance, the organization will implement a process that verifies compliance before each video is published. This is a legitimate compliance strategy. It is also a heavier operational commitment than it appears on first inspection, and it fails more often than it succeeds in practice because the process documentation requirements are not well understood until the organization is in an OCR investigation.

The documentation burden equals the production burden

For teams that choose to rely on auto-captions as their caption source, the documentation effort required to make that reliance defensible is comparable to the effort required to process content through a human review workflow. The DCMP-protocol accuracy measurement takes 15–30 minutes per video for a trained reviewer, plus the time to create the reference transcript (another 30–60 minutes for a 10-minute clip). The documentation logging takes 10–15 minutes per video. That is 55–105 minutes per video, so for a library of 200 videos the total effort for auto-caption compliance verification is roughly 180–350 hours, comparable to running all 200 videos through a light human review step in a structured workflow. The practical implication: the argument for using auto-captions to save time over a human review workflow only holds if the organization is not going to do the DCMP verification. If the organization is going to do the DCMP verification, the time savings over human review nearly disappear, because the verification is nearly as labor-intensive as the review would have been.
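
The per-video figures above imply a library-level total in the low hundreds of hours. The sketch below simply sums the three quoted time ranges and scales them by library size; it assumes every video needs one reference transcript of about the size quoted for a 10-minute clip, and the names are illustrative.

```python
from typing import Dict, Tuple

# Minutes per video (low, high), taken from the estimates in the paragraph above.
PER_VIDEO_MINUTES: Dict[str, Tuple[int, int]] = {
    "dcmp_measurement": (15, 30),
    "reference_transcript": (30, 60),
    "documentation_log": (10, 15),
}


def library_effort_hours(video_count: int) -> Tuple[float, float]:
    """Return the (low, high) verification effort in hours for a library."""
    low_minutes = sum(low for low, _ in PER_VIDEO_MINUTES.values())
    high_minutes = sum(high for _, high in PER_VIDEO_MINUTES.values())
    return video_count * low_minutes / 60, video_count * high_minutes / 60


low, high = library_effort_hours(200)
print(f"{low:.0f} to {high:.0f} hours")  # 183 to 350 hours
```

Changing `video_count` or the per-step ranges makes it easy to test the same argument against a team's own library size and reviewer speed.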

What an audit actually asks for

An OCR investigation that uncovers auto-generated captions in an organization's video library will ask for: (1) the accuracy measurement methodology (how do you know the captions meet the accuracy standard?), (2) the sample of content examined and the sampling approach, (3) the accuracy score per content type, (4) the documentation of who conducted the measurement and when, (5) the remediation threshold (at what accuracy score is a video sent for human review?), and (6) the review cycle (how often is accuracy re-verified?). "We checked and it looked okay" is not a methodology. "The platform says it's 99% accurate" is not a methodology. "We spot-checked a few clips" without recording which clips, what the reference was, or what the scoring method was, is not a methodology. What counts as a documented methodology is a written procedure that describes each step in the measurement process, applied consistently to every video in the catalogue, with a dated log of results. The difference between a compliance programme and an undocumented hope is that log.

The practical tradeoff by content type

The cost-benefit calculus for compliance-through-process differs by content type. For technical training content — the 76–89% auto-caption accuracy range — the documentation effort is high, the verification will almost certainly reveal non-compliant accuracy, and the remediation will require human review anyway. Documenting that you found non-compliance and then remediated it is better than not measuring at all, but the total cost exceeds a workflow that processes the content through human review from the start. For soft-skills conversational content — where the 90–95% range creates at least a theoretical possibility of meeting the 99% threshold on specific clips — the documentation effort is more likely to result in compliant findings, and the cases where clips fail threshold will be fewer. For this content type, compliance-through-process is a reasonable approach if the documentation infrastructure exists. The overall practical answer: run technical content through a review workflow that produces compliant output, verify conversational content with documented measurements, and build the glossary infrastructure that raises AI output accuracy to the point where review is a light editing step rather than a full remediation. The 90-day caption compliance programme build post covers the operational infrastructure required to make this approach sustainable. The LMS caption audit methodology post covers how to assess the current state of an existing library before designing the remediation approach.

Eight failure modes: specific, operational, and common

These eight failure modes are not hypothetical scenarios. They are the patterns that appear repeatedly in OCR resolution agreements, in pre-OCR audit reports, and in the intake process when L&D teams contact GlossCap to address compliance gaps. Each represents a specific belief or practice that produces compliance exposure despite the organization's intention to be compliant.

1. "Live captions covered the all-hands event"

Zoom and Microsoft Teams generate real-time captions displayed during live sessions. Live captions are a session-time aid: they are displayed to participants while the session is active and are not, by themselves, a caption track attached to the recording. When the all-hands session ends and the recording is saved to SharePoint, Google Drive, a Teams channel, or Vimeo, do not assume the captions participants saw on screen travel with it. Unless a separate recording-transcription feature was enabled, the recording contains the audio and video but no caption track, and it has no captions until an SRT or VTT file is created and uploaded after the event. Where a transcription feature was enabled, the saved transcript is unreviewed ASR output with the same accuracy problems as any other auto-caption, so it still has to pass verification before it counts. Either way, an organization that enabled live captions for every all-hands event it has run for the past two years may not have a single compliantly captioned all-hands recording. The employee communications captioning post covers the post-event workflow that actually produces a captioned recording.

2. "YouTube auto-captions are WCAG-compliant"

YouTube's auto-captions are generated by Google's own speech recognition models and achieve 91–95% accuracy on general speech in ideal conditions. On technical training content, which includes the vocabulary categories most common in organizational training, YouTube auto-captions score 76–89% on the DCMP protocol. WCAG SC 1.2.2 requires accurate captions, and the enforcement standard for accuracy is 99%+ on the DCMP protocol. A YouTube auto-caption track on a HIPAA compliance training video that scores 83% has not met the WCAG SC 1.2.2 standard. The video shows "CC" on the YouTube player and "captioned" in any LMS that embeds it by URL. Neither indicator tells you anything about accuracy. The YouTube CC indicator means "a caption file is attached to this video." It does not mean "this caption file meets the WCAG 2.1 AA accuracy standard for the content type in this video." These are different statements, and confusing them is the source of a large portion of the auto-caption compliance misunderstanding in the L&D field.

3. "The LMS shows 'captioned' so the captions are compliant"

LMS captioning status fields (in Cornerstone, Kaltura, Docebo, TalentLMS, Absorb, and every other major LMS) report whether a caption file is attached to the video asset. They do not measure caption accuracy. They do not verify that the caption file contains accurate text for the audio. They do not distinguish between a machine-generated file at 82% DCMP accuracy and a human-reviewed file at 99.4% DCMP accuracy. "Captioned: Yes" in an LMS report means "a file with the .srt or .vtt extension is associated with this video asset." It is a file presence indicator, not a compliance indicator. An LMS caption coverage report that shows 100% caption coverage does not tell you anything about whether those captions are compliant. The organization that uses LMS caption coverage as a proxy for caption compliance is measuring the wrong thing, and will discover this when an OCR investigation samples the actual caption files and runs a DCMP-protocol measurement. The LMS audit methodology post covers how to move from coverage measurement to compliance measurement in each major LMS platform.
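
The difference between the two measurements is easy to see side by side. The sketch below assumes a hypothetical export in which each asset is a pair: whether a caption file is attached, and the documented DCMP accuracy if one has been measured (`None` otherwise). No LMS is claimed to produce this export; the shape and names are assumptions for illustration.

```python
from typing import Optional, Sequence, Tuple

Asset = Tuple[bool, Optional[float]]  # (has_caption_file, measured_accuracy_percent)


def coverage_vs_compliance(
    assets: Sequence[Asset], threshold: float = 99.0
) -> Tuple[float, float]:
    """Return (coverage %, verified-compliant %) for a list of video assets."""
    if not assets:
        raise ValueError("assets is empty")
    with_file = sum(1 for has_file, _ in assets if has_file)
    # An asset only counts as compliant with a file AND a measurement at threshold.
    verified = sum(
        1
        for has_file, accuracy in assets
        if has_file and accuracy is not None and accuracy >= threshold
    )
    return 100 * with_file / len(assets), 100 * verified / len(assets)


library = [(True, 99.4), (True, 82.0), (True, None), (True, 99.1)]
print(coverage_vs_compliance(library))  # (100.0, 50.0)
```

The sample library reports 100% coverage while only half of its assets have a documented measurement at or above threshold. An unmeasured asset is counted as not verified, which matches how an investigator would treat it.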

4. "We verified a few clips and they looked fine"

Informal spot-checks — watching a clip, reading the caption track while listening to the audio, and deciding "that looks fine" — are not an audit-defensible QA process for four reasons. First, they do not produce a quantitative accuracy score; "looks fine" cannot be converted into a DCMP percentage that an auditor can verify. Second, they are not systematic: which clips were checked? How were they selected? What percentage of the library? Without a sampling protocol, there is no way to determine whether the checked clips are representative of the full library. Third, they are not recorded: when were the clips checked? By whom? What were the specific segments? Without a log, the spot-check cannot be produced in a compliance investigation. Fourth, "looks fine" is a perceptual judgment that does not catch errors in technical terms that the reviewer is not familiar with: an L&D professional without a clinical background who watches a medical training video may not notice that "pharmacokinetics" was captioned as "farmco kinetics" because the mismatch is not obvious in rapid reading. What counts as a documented QA process is specific: defined methodology, defined sampling protocol, word-level scoring, dated log, identified reviewer. The caption QA methodology post covers the operational details of building this process.
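
A defined sampling protocol answers "which clips, and how were they selected?" before anyone presses play. The sketch below is one possible approach, not a DCMP-mandated procedure: it divides a video into fixed-length slots and draws a seeded random sample covering about a tenth of the running time, so the same segments can be regenerated for an auditor. The function name, the 60-second slot length, and the seeding scheme are assumptions.

```python
import random
from typing import List, Tuple


def sample_segments(
    duration_seconds: int,
    fraction: float = 0.10,
    segment_seconds: int = 60,
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Pick non-overlapping segments covering about `fraction` of the runtime."""
    total_slots = duration_seconds // segment_seconds
    if total_slots < 1:
        raise ValueError("video is shorter than one segment")
    needed = max(1, round(total_slots * fraction))
    # A seeded generator makes the selection reproducible; log the seed.
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(total_slots), needed))
    return [(slot * segment_seconds, (slot + 1) * segment_seconds) for slot in chosen]
```

Recording the seed and the returned timestamps in the verification log lets a later reviewer confirm exactly which passages were scored.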

5. "The platform guarantees 99% accuracy"

Platform SLAs and marketing materials that claim "99% accuracy" or "industry-leading accuracy" reference corpus-level metrics on benchmark datasets, not your content. The relevant questions to ask about any vendor accuracy claim are: (a) what dataset was the 99% measured on?, (b) what measurement methodology was used (WER on a corpus, or DCMP-protocol on sampled content)?, (c) does the measurement include content from your industry vertical?, and (d) does the vendor guarantee that accuracy on your specific content will meet 99% DCMP, with remediation if it does not? Without affirmative answers to all four questions, a "99% accuracy" marketing claim is not an enforceable accuracy representation for your content. The compliance obligation runs from your organization to the regulatory framework — not from the vendor to your organization. If an OCR investigation finds your captions below 99%, "our vendor claimed 99%" is not a defense against the compliance violation. It may support a contract dispute with the vendor, but it does not resolve the OCR investigation. A legally useful vendor accuracy commitment specifies the measurement methodology (DCMP Captioning Key, not WER on LibriSpeech), the content type scope (including your specific industry vertical and vocabulary profile), the sample size and sampling approach, and a remediation or re-delivery commitment when accuracy on your content falls below threshold. The Verbit vs GlossCap comparison covers how to evaluate accuracy commitments in vendor contracts.

6. "We don't have specialized vocabulary"

Every organizational training library has specialized vocabulary. Job titles, team names, location names, manager names, organizational unit names, process names, tool names, product names, and company-specific acronyms are present in virtually every video in every organizational training library, because organizations name things and training videos talk about those named things. The belief that "our content is simple" typically reflects unfamiliarity with how ASR models fail on organizational vocabulary rather than the actual absence of that vocabulary. An ASR model that achieves 95% accuracy on LibriSpeech will fail on "Priya Singh, Director of Enablement at GlobexCorp" because none of "Priya Singh," "Director of Enablement," or "GlobexCorp" are in its training distribution. The model will produce a plausible-sounding substitution that a reviewer unfamiliar with the person may not catch on a quick watch-through. At scale — across 300 videos with dozens of name and organizational term failures each — "we don't have specialized vocabulary" becomes an expensive mistake. The proper noun failure modes post covers the specific taxonomy of organizational vocabulary that breaks auto-caption accuracy, with examples from real training content.

7. "Soft-skills training doesn't need reviewed captions"

Even conversational soft-skills training — communication, leadership, DEI, team dynamics, coaching skills — contains the vocabulary categories that break auto-caption accuracy. Instructor names and credentials. Company-specific examples and case studies. Role titles and reporting structures. Organizational processes referenced as examples. Tool names mentioned in context. A soft-skills training video where the facilitator opens with "I'm Michael Okonkwo, and I've spent twelve years at companies like Meridian Advisors and Pacific Diagnostics working on team dynamics" has just introduced three proper nouns in the first sentence. A leadership development video that uses a real company reorganization as a case study has introduced the organizational vocabulary of that company. A DEI training that references "our ERG co-leads" and "the diversity council" has introduced organizational proper nouns. At the scale of a full soft-skills library, the cumulative error count from instructor names, company examples, and organizational vocabulary is significant. The 90–95% accuracy range for soft-skills content assumes ideally clean content with zero organizational vocabulary — which is not the same as "this is soft-skills content, therefore it is 90–95% accurate." The safest approach is to verify per clip, not assume per category.

8. "We're tiering to auto for general content and human review for compliance content"

A tiered auto-caption strategy — human review for compliance and technical content, auto-captions for general content — is structurally sound in concept and predictably problematic in execution. The problem is the tier-assignment step: to correctly assign a video to the "auto-caption eligible" tier, the organization needs to know that the video has minimal specialized vocabulary and that auto-captions on that video score 99%+ on the DCMP protocol. Knowing that requires doing the vocabulary analysis and the DCMP measurement — which is most of the effort of human review. L&D libraries are organized by topic and audience, not by ASR failure profile. The "general content" bucket in a course catalogue is not the same as the "content eligible for auto-captions" bucket. Without a vocabulary analysis at the asset level, every video assigned to the "auto-caption eligible" tier is an undocumented compliance risk. The tier assignment is where tiered strategies collapse: organizations assign tiers based on topic category (soft-skills = auto-caption, compliance = human review) rather than vocabulary analysis, systematically misclassify the content that falls between obvious categories, and discover the misclassification when learner complaints or an OCR investigation surfaces inaccurate captions in the "general content" tier. The solution is not to abandon tiering — it is to base tier assignment on a vocabulary analysis at the asset level, not on the content category label in the LMS. The caption programme governance template includes a tier-assignment criteria document based on vocabulary density, not content category.
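
Asset-level tier assignment can be reduced to a rule that ignores the catalogue category entirely. The sketch below is illustrative: the tier labels, the function name, and the default of three specialized terms per minute (borrowed from the rule of thumb under condition 1) are assumptions, and the inputs are expected to come from a vocabulary analysis and a documented DCMP measurement on the specific asset.

```python
from typing import Optional


def assign_tier(
    terms_per_minute: float,
    measured_accuracy: Optional[float],
    max_terms_per_minute: float = 3.0,  # assumed cut-off from condition 1
    threshold: float = 99.0,
) -> str:
    """Assign a caption tier from asset-level evidence, not category labels."""
    if measured_accuracy is None:
        # No documented measurement: the asset cannot be auto-caption eligible.
        return "human_review"
    if terms_per_minute > max_terms_per_minute:
        return "human_review"
    if measured_accuracy < threshold:
        return "human_review"
    return "auto_caption_eligible"
```

Note that the function has no parameter for topic or audience: a "soft-skills" video with dense organizational vocabulary or no measurement on file lands in the human review tier like any other asset.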

Seven questions L&D teams ask about auto-caption compliance

Does WCAG actually specify 99%? I've read the standard and it only says "accurate.": The normative text of WCAG 1.2.2 does not include a percentage. The criterion requires captions to be provided for prerecorded synchronized media; "accurate" appears in the Understanding document (the informative guidance), not in the normative criterion. The 99% figure comes from the DCMP Captioning Key, the industry methodology for operationalizing "accurate": it specifies word-level measurement on sampled passages with a four-category error taxonomy. OCR resolution agreements and federal court decisions involving caption accuracy have consistently referenced the 99% DCMP threshold as the applicable standard. The gap between WCAG's non-numeric "accurate" and enforcement's 99% is technically real but practically irrelevant: no enforcement action since 2015 has accepted below-99% auto-caption accuracy as "accurate" for technical or specialized content. The 99% figure is where WCAG's informative guidance and enforcement practice converge, and designing your programme below that threshold means designing it to fail an audit. The 99% accuracy explainer post covers the DCMP measurement protocol with worked examples of real training content scored under the protocol.
If we run a QA process on auto-captions and verify accuracy, are they WCAG-compliant?: Yes — with conditions. If the QA process uses a documented methodology (DCMP sampling, word-level scoring against a reference transcript), the accuracy on the specific content type is verified at 99%+, the verification is documented with a dated log recording the sample segments, error counts, and accuracy percentage, and the process is recurring (content updates and ASR model updates trigger re-verification), then auto-generated captions that pass the QA are compliant output. The technology that generated the initial transcript is not the compliance criterion — the final accuracy of the published caption track is. This is why a glossary-biased AI workflow like GlossCap's — which raises initial AI output accuracy to 97–99% on technical training content — reduces the human review burden without eliminating the compliance obligation. When glossary-biased AI output is independently verified at 99%+ on the specific content type, and that verification is documented, the captions are compliant regardless of whether they were generated by AI or by a human captioner. The question to ask at every point in the workflow is: "Can I document that this caption file meets 99% DCMP accuracy on this content?" If yes, the captions are compliant. If not, they are not. The caption QA methodology post provides the operational framework for building this verification process.
How does OCR actually investigate caption accuracy? What triggers a complaint?: OCR investigations typically begin with a complaint from a student, employee, or participant who encountered access barriers — inaccurate captions that prevented understanding of course content, absent captions on content they were required to complete, or a pattern of degraded access across the training library. OCR then requests: a sample of caption files from the video library (typically 10–20 videos selected by OCR, not by the organization), documentation of the accuracy verification methodology, evidence of remediation processes for previously identified inaccurate captions, and the organization's captioning policy or accessibility statement. In enforcement resolutions, OCR has consistently required organizations that were found to have below-threshold auto-generated captions to: (a) remediate identified inaccurate captions within 60–180 days, (b) establish and document an accuracy verification process for all future video content, (c) conduct and document retroactive accuracy verification of the full video library, and (d) submit progress reports to OCR at defined intervals. OCR has specifically required organizations to replace auto-generated captions with human-reviewed captions where DCMP-protocol sampling of the auto-generated files demonstrated accuracy below threshold. For organizations with auto-captions across a large library, the OCR investigation process typically results in a 12–24 month remediation timeline with significant resource commitment — substantially more expensive than a proactive caption programme that prevents the investigation. The ADA Title II fix guide covers the triage and prioritization approach for organizations facing imminent compliance review.
We have 3,000 videos with auto-generated captions. Do we need to remediate all of them?: Not necessarily all at once — but the obligation to remediate does exist and cannot be deferred indefinitely. OCR has accepted phased remediation plans that prioritize content using four criteria: (1) enrollment, with the highest-enrollment and most-recently-completed courses remediated first, since those represent the highest aggregate learner impact; (2) regulatory requirement, with required compliance and safety training prioritized over optional professional development content, since mandatory training represents the highest compliance obligation; (3) complaint history, with any content that has generated an accessibility complaint moved to the front of the remediation queue regardless of other priority criteria; and (4) content currency, with actively published content remediated before archived or retired content. A documented remediation plan with a timeline, a prioritization methodology, and a progress tracking mechanism is significantly better than no plan — it demonstrates good-faith effort and gives an OCR investigator a structured remediation commitment to evaluate, rather than a general intention to fix things. The remediation plan does not eliminate the compliance obligation for the backlog, but it demonstrates that the organization has assessed the scope, committed to a timeline, and is making measurable progress. Organizations with large backlogs should also evaluate whether a glossary-biased AI workflow can accelerate remediation by reducing the cost per video of producing compliant caption output. The LMS audit methodology post covers how to scope a remediation backlog and build the prioritization model.
What is the difference between AI auto-captions and AI-assisted captions, from a compliance perspective?: The distinction that matters for compliance is not the technology in the generation step — it is whether human review and documented verification are part of the workflow before the caption file is published. "Auto-captions" in the sense of generated-and-published-without-review represent both the generation method and the compliance failure simultaneously: the organization generates captions automatically and publishes them without verifying that they meet the accuracy standard. "AI-assisted captions" — where AI generates an initial transcript and a human reviewer corrects and approves before publication — describe a production workflow, not a compliance status; what matters is whether the human review produced output that meets 99% DCMP accuracy and whether that accuracy is documented. A third category is increasingly important: glossary-biased AI captioning, where the AI model is conditioned on the organization's specific vocabulary (product names, proper nouns, domain terminology, organizational names) before generating the transcript. Glossary-biased AI output typically scores 97–99% on technical training content without human review, because the most common error category — low-frequency domain vocabulary — is corrected by the glossary at the generation step. GlossCap's workflow uses per-customer glossary biasing to raise initial transcript accuracy to 97–99%; teams that have documented that their content achieves 99%+ at the AI output stage (through a DCMP verification) can publish directly; teams that cannot verify that threshold use a human review step to reach it. The compliance framework is indifferent to how the compliant output was produced — only to whether it is accurate, documented, and recurring. The glossary architecture post covers how to build the per-customer vocabulary database that makes glossary-biased captioning effective at scale.
Are there content types where auto-captions reliably meet the 99% threshold without human review?: In theory, yes. In practice, rarely. The content profile that could meet 99% without human review or glossary biasing has five characteristics: (1) conversational speech with no technical terms, no domain-specific terminology, no regulatory citation names, and no acronyms; (2) no proper nouns — no personal names, place names, company names, product names, tool names, or role titles; (3) a single native English speaker with clear diction and a mainstream American or British accent; (4) a quiet, acoustically controlled recording environment with no background noise, echo, or ambient audio; and (5) a platform that documents its accuracy verification methodology on content that matches this profile and can show you the measurement. Soft-skills training that meets all five criteria could produce auto-captions that score 99%+ on the DCMP protocol. In reality, most organizational soft-skills training fails criterion 2 in the first few minutes: the instructor introduces themselves by name (a proper noun), mentions the company (a proper noun), refers to a role title (a proper noun), or cites a methodology or framework by name (a proper noun). The practical answer is: test your specific content before assuming auto-captions meet threshold, document the test, and do not extrapolate from one video's accuracy to an entire content category's accuracy. The content-type decision matrix in this post provides a starting framework; the DCMP measurement on your actual content is the verification. If you find content that passes on independent measurement, document it thoroughly — you have found one of the rare cases where auto-captions can be a defensible compliance solution without additional steps.
What happens if a vendor's marketing material says their captions are "99% accurate" — does that cover us?: No. Vendor accuracy claims in marketing materials are not contractual accuracy guarantees and are not measured against your specific content type. A vendor claim of "99% accurate" is not an enforceable accuracy representation for your compliance purposes unless it specifies: (a) what content the 99% was measured on, (b) what measurement methodology was used (DCMP word-level vs. corpus WER vs. in-house benchmark), (c) whether your specific content type and industry vertical were included in the measurement, and (d) what the vendor's remediation commitment is when accuracy on your content falls below threshold. The compliance obligation runs from your organization to the regulatory framework (WCAG, ADA, Section 508), not from the vendor to your organization. If an OCR investigation finds your captions at 83% accuracy on a DCMP measurement, "our vendor claimed 99%" does not resolve the compliance violation. It may support a contract dispute with the vendor, but the compliance investigation is about what your organization's captions do and do not do, not about what your vendor claimed. A legally useful vendor accuracy commitment specifies: the measurement methodology (explicitly the DCMP Captioning Key, not WER on a corpus), the content type scope (including your specific industry vertical, stated vocabulary profile, and recording conditions), the sample size used for any claimed accuracy measurement, and a contractual remediation or re-delivery commitment when accuracy verification on your content shows results below the stated threshold. Absent those specifics, a vendor's "99% accurate" claim is a marketing statement, not a compliance warranty. The vendor evaluation frameworks in the Rev vs GlossCap comparison and the captioning RFP playbook cover how to structure the accuracy commitment question in vendor selection.
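The verification step that several of these answers rely on (word-level scoring of a sampled segment against a reference transcript, recorded in a dated log) can be sketched roughly as follows. This is an illustration under stated assumptions, not the DCMP protocol itself: it counts generic word-level substitutions, insertions, and deletions with a standard edit-distance calculation and does not reproduce the DCMP four-category error taxonomy. The function names and log fields are hypothetical.

```python
import re
from datetime import date


def tokenize(text):
    """Lowercase the text and split it into word tokens (punctuation ignored)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn the hypothesis into the reference (Levenshtein distance)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from caption
                            curr[j - 1] + 1,      # extra word in caption
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def score_sample(video_id, segment, reference_text, caption_text, threshold=0.99):
    """Score one sampled segment and return a dated log record."""
    ref = tokenize(reference_text)
    hyp = tokenize(caption_text)
    errors = word_errors(ref, hyp)
    accuracy = max(0.0, 1.0 - errors / len(ref)) if ref else 0.0
    return {
        "date": date.today().isoformat(),
        "video_id": video_id,          # hypothetical identifier
        "segment": segment,            # e.g. "00:00-01:00"
        "reference_words": len(ref),
        "errors": errors,
        "accuracy": round(accuracy, 4),
        "passes": accuracy >= threshold,
    }


if __name__ == "__main__":
    record = score_sample(
        "onboarding-001", "00:00-00:10",
        "I'm Michael Okonkwo and I work at Meridian Advisors today",
        "I'm Michael Okonko and I work at Meridian Advisers today",
    )
    print(record)
```

In this example two misspelled proper nouns in a ten-word segment are enough to bring the sample to 80%, which is why names and organizational vocabulary dominate the error count on otherwise clean content. Keeping one record per sampled segment gives the dated log of segments, error counts, and accuracy percentages that the QA answer above describes.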

See how GlossCap handles technical training vocabulary where auto-captions fail

GlossCap uses per-customer glossary biasing to raise AI caption accuracy to 97–99% on technical training content — the content types where YouTube, Teams, and platform auto-captions consistently score 76–89% on DCMP measurement. If your L&D team is using auto-generated captions on compliance training, engineering onboarding, medical or clinical content, or any video with organizational vocabulary, the gap between what you have and what WCAG requires is real and documentable. The GlossCap widget demo shows how glossary-conditioned caption output compares to standard auto-caption output on the same technical training clip — with the kind of accuracy difference that matters when the content contains the terms your learners need to get right. You can also explore the caption feedback loop workflow that allows your team to grow the glossary from production data so that accuracy improves with every video published, rather than starting from baseline with each new content type.