Technical Guidance · Published 2026-06-02

Live session captions vs recorded training video: why the accuracy requirements are completely different

Here is the scenario that catches nearly every L&D team off guard at least once. You host a 60-minute product-enablement session on Zoom. You have live captions enabled: the button is on, the text is scrolling, and you are in compliance during the session. The recording auto-generates a .vtt file from the live caption track. The next morning, someone on your team uploads the MP4 to Panopto or Kaltura or TalentLMS along with the .vtt. The module is live. The LMS shows a caption track. A compliance reviewer asks if captions are present, and the answer is yes.

The problem is that the Zoom live caption track has approximately 82% word accuracy. The archived training module is now prerecorded synchronised media under WCAG 2.1 AA. Prerecorded synchronised media requires captions at 99%+ accuracy under WCAG SC 1.2.2. You have captions. You are not compliant.

This is the most common caption compliance gap in L&D organisations. It exists because the two WCAG success criteria, SC 1.2.4 for live content and SC 1.2.2 for prerecorded content, have fundamentally different accuracy expectations, and live captioning technology cannot bridge that gap without a post-processing step. This post explains why the standards differ, how each major platform behaves in practice, what the accuracy data actually looks like, and how to build the two-stage workflow that closes the gap without creating a heavy manual burden on your team.

TL;DR

Two standards, two accuracy expectations: WCAG SC 1.2.4 (live captions) does not require 99%+ accuracy; the spec explicitly acknowledges that real-time captioning cannot match prerecorded accuracy. WCAG SC 1.2.2 (prerecorded captions) does require 99%+ accuracy on synchronized media. SC 1.2.2 is Level A and SC 1.2.4 is Level AA, so WCAG 2.1 AA conformance requires both.
The archival gap: When a live session recording is uploaded to an LMS or video library and made available on-demand, it becomes prerecorded synchronised media. SC 1.2.2 now applies — not SC 1.2.4. The live caption track (80–88% accuracy) no longer meets the applicable standard.
Platform accuracy in practice: Zoom, Teams, Webex, and Google Meet live captions run at 80–90% for standard English and 65–78% on technical vocabulary. Panopto and Mediasite generate captions post-upload (not during live sessions) at similar accuracy floors.
Custom vocabulary helps, but has limits: Zoom and Webex expose custom vocabulary lists that can add 5–12 accuracy percentage points on covered terms. Microsoft Teams requires Azure Speech Service configuration. Google Meet has no end-user vocabulary interface. All require pre-event setup that most teams skip.
The two-stage workflow: Stage 1 — enable live captions before the session starts (SC 1.2.4 compliance during the event). Stage 2 — run the recording through a glossary-corrected pipeline to reach 99%+ accuracy before LMS publication (SC 1.2.2 compliance for on-demand viewers). Both stages are required. Neither replaces the other.
The SLA: A reasonable policy is 24–48 hours between live session and on-demand availability, to allow Stage 2 processing. This is operationally achievable and significantly better than the current default in most organisations (publish same-day with uncorrected live captions, or no captions at all).

The two WCAG standards and what they actually require

WCAG 2.1 addresses captioning in two success criteria: SC 1.2.2 (Level A) and SC 1.2.4 (Level AA). Level AA conformance requires meeting every Level A and Level AA criterion, so both apply at the level required by ADA, Section 508, and most contractual accessibility commitments.

SC 1.2.2 — Captions (Prerecorded)

SC 1.2.2 states: "Captions are provided for all prerecorded audio content in synchronized media, except when the media is a media alternative for text and is clearly labeled as such." The intent is that any video with a synchronized audio track (a training course, a recorded webinar, a lecture capture, a product demo) must have captions that meet a specific standard of quality.

The WCAG 2.1 Understanding document for SC 1.2.2 specifies what "captions" means: the caption track must be accurate, synchronised with the audio, complete (covering all speech and auditorily significant non-speech sound), and must not be obtrusive to viewers who do not need them. On accuracy, the WCAG 2.1 supplemental guidance and the W3C's WCAG-EM interpretation have consistently held that auto-generated captions, even good ones, do not meet SC 1.2.2 unless they have been reviewed and corrected to eliminate errors. The success criterion itself does not state a numeric accuracy figure. The 99%+ threshold that practitioners cite comes from this interpretation: at under 99% word accuracy, material information is regularly missing or substituted, which fails the accuracy requirement.

What falls under SC 1.2.2 for a typical L&D team: all training courses published in the LMS, all recorded webinar or session content made available on-demand, all product-enablement recordings available for async viewing, all compliance module videos, all onboarding recordings. If it is a video with audio that a viewer can play back at their own schedule, it is prerecorded synchronized media and SC 1.2.2 applies.

SC 1.2.4 — Captions (Live)

SC 1.2.4 states: "Captions are provided for all live audio content in synchronized media." The intent is that viewers who cannot hear the audio of a live presentation — a live training session, a company all-hands, a conference talk, a live compliance briefing — must be able to follow the content via captions in real time.

The WCAG 2.1 understanding document for SC 1.2.4 explicitly acknowledges the accuracy difference: "It is not possible for live captions to meet all of the requirements of real-time text." The note further states that live captioning systems produce text at high speed, under real-time constraints, and that the standard is met by providing captions that cover speech with reasonable synchronisation, even at lower accuracy than prerecorded captions. The practical interpretation: SC 1.2.4 requires that captions be present and usable during the live event. It does not require 99%+ accuracy. CART (human stenography) can approach 98%+ even live, but ASR-based live captioning at 80–90% is considered sufficient to meet the standard.

What falls under SC 1.2.4 for a typical L&D team: the live session itself — the 60-minute Zoom training, the Teams all-hands, the Webex product launch, the Google Meet onboarding session, the BigBlueButton academic lecture while it is in progress. Only the live event. Not the recording. The moment the event ends and the recording is processed and published, SC 1.2.2 applies to everything that was SC 1.2.4 one hour ago.

Why both apply at Level AA and what that means

SC 1.2.2 is a Level A criterion and SC 1.2.4 is a Level AA criterion. Because Level AA conformance includes every Level A criterion, both must be met for compliance with WCAG 2.1 AA. This surprises some L&D professionals because they assume the different accuracy expectations mean only one of the two is binding. That is not the case. The requirement is met for live content by providing live captions at live-achievable accuracy, and met for prerecorded content by providing verified captions at prerecorded-achievable accuracy. The difference that matters in practice is what "captions" means operationally in each context. Providing a live caption track is sufficient for SC 1.2.4. Providing that same track as the permanent caption file on the archived recording is not sufficient for SC 1.2.2.

Why live caption accuracy cannot reach 99%, and why that limit is inherent

Understanding why live captions have an inherent accuracy ceiling helps explain why no platform setting or configuration fully closes the gap, and why the two-stage workflow is necessary rather than just a quality preference.

The latency constraint

Live captions must appear on screen within 2–4 seconds of speech for viewers to follow the session meaningfully. This real-time window is the binding constraint on accuracy. The automated speech recognition pipeline for prerecorded content runs in batch mode: the audio file is processed end-to-end, multiple acoustic models are applied, contextual re-scoring reads forward and backward in the transcript to resolve ambiguity, and glossary-biasing applies to every word across the full context of everything that came before and after. None of this is possible in a 2–4 second window. The live ASR pipeline operates on a rolling buffer of the last few seconds of audio, makes a best-guess hypothesis, and commits it as the caption — without the benefit of what will be said in the next sentence, which often resolves the current ambiguity.

This is not a solvable engineering problem. It is a causality constraint: you cannot know what the speaker is about to say before they say it. Prerecorded captioning has this information. Live captioning never will. The practical consequence is that the proper-noun disambiguation that glossary-biased captioning handles well in the batch pipeline (resolving "GlossCap" vs "gloss cap" vs "gross cap" based on all surrounding context) is much harder in the live pipeline because the disambiguation window is four seconds, not four minutes.

The speaker variability problem

Live training sessions have characteristics that reduce ASR accuracy beyond the baseline measured on benchmark audio. Multiple speakers shift mid-sentence or interrupt. Background noise from home offices, open-plan workspaces, or unmuted participants accumulates. Recording quality varies by participant equipment — the session host may have a high-quality headset; the SME presenter may be on a laptop microphone from a conference room. Speaker turns happen with no pause, and the ASR model must re-calibrate to the new speaker's voice within the live context window.

Recorded training video for the LMS is typically recorded under controlled conditions: a single speaker, a professional microphone, a quiet room, often a deliberate recording pace. The comparison is not fair to live captioning. Live sessions are inherently noisier, more variable, and harder to transcribe than polished training recordings — and the live caption track reflects that. When that track is used as the permanent caption file on the archived recording, the accuracy floor of a noisy-conditions multi-speaker session gets embedded into a module that will be replayed hundreds of times under SC 1.2.2 standards.

The technical vocabulary problem

Training content has an unusually high proper-noun density. The fifteen categories of words that break auto-captions are most concentrated in L&D content: product names, feature names, SDK symbols, regulatory citations, medical terms, OSHA standards, company-proprietary vocabulary. The accuracy degradation is not uniform: it is worst exactly where it matters most. A sentence that says "GlossCap uses Whisper-large with glossary-biased decoding to reach WCAG 2.1 AA accuracy" might caption as "gloss cap uses whisper large with glossary biased decoding to reach wcag two one aa accuracy". Every proper noun is split or uncapitalised, the hyphenated technical terms lose their hyphens, the version number is spelled out, and 6 of the sentence's 12 words come out wrong. That is not 99% accuracy. It is not even close. And this is the exact sentence type that training video produces at high density, because training video is about the company's products, tools, processes, and terminology.
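To make the example sentence above concrete, here is a minimal sketch of one common way to score caption accuracy: word error rate (WER), computed as word-level edit distance divided by the number of reference words. The function name and the strict, case-sensitive tokenisation are assumptions made for illustration; real scoring tools usually normalise case and punctuation first, which changes the number.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped
                curr[j - 1] + 1,     # extra word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = ("GlossCap uses Whisper-large with glossary-biased decoding "
             "to reach WCAG 2.1 AA accuracy")
hypothesis = ("gloss cap uses whisper large with glossary biased decoding "
              "to reach wcag two one aa accuracy")

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

Under this strict scoring the split terms count twice (one substitution plus one insertion), so the token-level error count is higher than a per-term count. Whichever convention is used, the sentence is far below a 99% target.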

The live ASR pipeline is particularly vulnerable because the 2–4 second context window makes it impossible to use the surrounding sentence to resolve whether "whisper" means the model name, the verb, or the ambient reference. The batch pipeline has the full paragraph as context. Glossary-biased decoding in the batch pipeline can resolve product names with high reliability. In the live pipeline, the same technique is applied with a fraction of the available context.

Platform-level accuracy data

The figures below reflect observed accuracy on training content specifically — not general speech benchmarks, which consistently overstate performance on L&D content because they are measured on news broadcasts and conversational speech with low proper-noun density. Training content accuracy is typically 8–15 percentage points below the headline figures for the same ASR model on general speech.

Platform	Accuracy on standard English speech	Accuracy on technical training vocabulary	Custom vocabulary support
Zoom AI Companion	82–88%	65–75%	Yes — up to 100 terms (account-level)
Microsoft Teams	85–90%	70–78%	Via Azure Speech Service (admin config)
Webex Meetings	80–88%	65–75%	Yes — Named Entity Vocabulary (CSV, admin)
Google Meet	82–88%	68–76%	No end-user interface
CART (human stenography)	95–98%	92–97%	N/A — human provides accuracy
Panopto ASR (post-upload)	80–87%	63–73%	Yes — word list (Enterprise tier)
Mediasite ASR (post-upload)	79–86%	62–72%	Limited — contact vendor

Two important notes on these figures. First, the custom vocabulary improvements (where available) assume the vocabulary list has been configured before the session. Most teams do not configure this as a standard pre-event step, and the out-of-box accuracy figures apply. Second, Panopto and Mediasite do not generate live captions during the session — their ASR runs post-upload. This means they have no SC 1.2.4 exposure (live captions are the responsibility of whatever live-session platform is being captured), but their post-upload accuracy figures still fall well short of the SC 1.2.2 standard without correction.

The archival gap: when live becomes recorded

The archival gap is the specific mechanism through which most L&D teams accumulate caption compliance failures without realising it. Understanding the exact path from live session to LMS module makes the gap visible — and fixable.

The journey of a Zoom training session

Consider a standard L&D delivery pattern: a 45-minute Zoom-hosted product-enablement session for a new cohort, recorded to the cloud for async replay by employees in other time zones.

Session starts. Zoom AI Companion captions are enabled. Captions are displayed to participants in real time. Word accuracy on the product-specific vocabulary is approximately 68–72%. For live accessibility (SC 1.2.4), this is compliant — captions are present, synchronised, and cover speech.
Session ends. Zoom generates a cloud recording: MP4 file, and a .vtt file produced from the live caption transcript. The .vtt file encodes the live caption output — 68–72% accurate on the technical vocabulary.
L&D team receives the recording. The recording link is shared via Zoom's cloud recording notification. The team downloads the MP4. Many teams also download the .vtt.
Upload to LMS. The MP4 is uploaded to Panopto, Kaltura, TalentLMS, or the LMS's native video host. The .vtt from Zoom is attached as the caption track. Or Panopto generates its own post-upload ASR (approximately 80% accuracy). Or the team uploads the MP4 with no caption track and plans to "fix it later."
Module goes live. The recording is now an on-demand LMS module. It will be replayed by employees who were not at the live session, employees returning to review specific segments, and employees completing the course as part of a compliance track.
SC 1.2.2 applies. This is no longer a live event. It is prerecorded synchronised media. WCAG SC 1.2.2 requires captions at 99%+ accuracy. The caption track — whether the Zoom-generated .vtt or Panopto's post-upload ASR — is at approximately 70–80% accuracy. The module is not compliant.

The team sees "captions present" in the LMS and considers the accessibility requirement met. From the WCAG standard's perspective, coverage is met (captions exist), but accuracy is not (captions do not meet the 99%+ threshold). This is the archival gap: not the absence of captions, but the presence of low-accuracy captions inherited from the live workflow applied to a pre-recorded use case.

Which live recordings fall into the archival gap?

The gap applies whenever a live session recording is published in any on-demand format. In L&D practice, this includes:

Zoom cloud recordings uploaded to the LMS or shared via a permanent link
Teams meeting recordings published to SharePoint or a course library
Webex recording links shared in the LMS as module content
Google Meet recordings saved to Google Drive and embedded in course material
BigBlueButton session recordings published via the BBB playback interface or exported and uploaded
Panopto recordings of live classroom sessions (Panopto Remote Recorder captures and processes the recording — SC 1.2.2 applies to the processed recording)
Mediasite lecture-capture recordings published via the Mediasite library
Town hall recordings archived for employees who could not attend
Webinar recordings turned into CPE or continuing-education modules
Product launch session recordings published in the internal enablement hub

The only live recordings that do not fall into the archival gap are those that are genuinely never archived — the session happens, no recording is kept, and no on-demand replay is offered. This is rare in L&D practice. The economic logic of live session recording (allow async replay, build the course library, reach employees in other time zones) means most live sessions become pre-recorded content within 24–48 hours.

The false-positive problem in compliance tracking

Many L&D teams track caption coverage with a "has captions: yes/no" field in their LMS or content management system. This field correctly identifies videos with no caption track (a clean compliance failure). It does not distinguish between a 99%+ accuracy corrected caption file (compliant under SC 1.2.2) and a 72% accuracy live-generated .vtt (not compliant under SC 1.2.2). Both show "has captions: yes." From the coverage metric, both look the same. From the WCAG standard, they are completely different.

The implication: if your compliance tracking system only measures coverage (captions present or absent), you may have 100% coverage and significant non-compliance. A caption compliance program needs to track two metrics — coverage (percentage of videos with any caption track) and accuracy compliance (percentage of videos with a WCAG 2.1 AA-grade caption track at 99%+ accuracy, documented). Only the second metric answers the audit question. The archival gap specifically inflates the coverage metric without improving the accuracy-compliance metric.
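The two metrics can be sketched in a few lines. This is an illustrative model only: the `Video` record, its field names, and the sample library are hypothetical, and `verified_accuracy` stands for whatever documented accuracy measurement your team records for a caption track.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Video:
    title: str
    has_captions: bool
    # Documented word accuracy of the caption track (0.0 to 1.0),
    # or None if the track has never been verified.
    verified_accuracy: Optional[float] = None


def caption_metrics(videos, threshold=0.99):
    """Return coverage and accuracy compliance as fractions of the library."""
    total = len(videos)
    if total == 0:
        return {"coverage": 0.0, "accuracy_compliance": 0.0}
    covered = sum(1 for v in videos if v.has_captions)
    compliant = sum(
        1 for v in videos
        if v.has_captions
        and v.verified_accuracy is not None
        and v.verified_accuracy >= threshold
    )
    return {
        "coverage": covered / total,
        "accuracy_compliance": compliant / total,
    }


library = [
    Video("Onboarding 101", True, 0.995),         # corrected caption file
    Video("Zoom enablement session", True, 0.72),  # live-generated .vtt
    Video("Town hall Q3", True, None),             # captions never verified
    Video("Legacy demo", False),                   # no caption track
]
metrics = caption_metrics(library)
print(metrics)
```

In this sample, three of four videos show "has captions: yes", but only one counts toward accuracy compliance. Treating an unverified track as non-compliant is a deliberate design choice here: an undocumented accuracy figure cannot answer an audit question.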

Platform-by-platform: what each tool produces and what needs to change

The specific workflows for addressing the archival gap differ by platform because each platform generates captions differently and produces different output formats. Here is the operational picture for each major live-session and lecture-capture tool used by L&D teams.

Zoom

Zoom's automated captioning is built into the AI Companion feature (formerly Zoom IQ). For accounts with AI Companion enabled, live captions are generated during the meeting and a .vtt transcript is produced as part of the cloud recording package.

Live caption configuration: In Zoom Admin settings, navigate to Account Management > Account Settings > Meeting > In Meeting (Advanced). Enable "Automated captions" and optionally "Full transcript." For custom vocabulary, navigate to Account Management > AI Companion > Automated captions > Custom vocabulary. Up to 100 terms can be entered as a comma-separated list. Terms are applied account-wide — you cannot configure per-meeting vocabulary without creating sub-accounts.

What Zoom produces: Cloud recording includes an MP4 file and a .vtt file. The .vtt reflects the live caption output: approximately 82–88% accuracy on standard speech and 65–75% on technical training vocabulary when no custom vocabulary is configured.

The archival gap in Zoom: The .vtt from a Zoom recording is frequently uploaded directly to an LMS as the permanent caption track. This is the exact archival-gap scenario. The .vtt file needs to pass through a corrected captioning pipeline before LMS attachment.

Recommended workflow: After downloading the Zoom cloud recording, run the MP4 through GlossCap with your company glossary loaded. GlossCap generates a corrected .vtt at 99%+ accuracy on your technical vocabulary. Use the corrected .vtt — not the Zoom-generated .vtt — when uploading to the LMS. Discard or archive the Zoom .vtt for internal reference only.
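As a rough illustration of what a glossary correction pass over a .vtt file involves, the sketch below applies a table of known mis-transcriptions to cue text while leaving the header and timing lines untouched. It is not GlossCap's implementation, and it is much weaker than a pipeline that re-decodes the audio with glossary biasing: it can only fix errors you have already listed, and it is blind to context. The function name and the corrections table are hypothetical.

```python
import re


def correct_cue_text(vtt_text: str, corrections: dict) -> str:
    """Replace known mis-transcriptions in cue text; keep timings as-is."""
    # Longest phrases first, so a longer match wins over a shorter one.
    ordered = sorted(corrections.items(), key=lambda kv: -len(kv[0]))
    patterns = [
        (re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE), right)
        for wrong, right in ordered
    ]
    out = []
    for line in vtt_text.split("\n"):
        # Header and cue timing lines are passed through unchanged.
        if line.startswith("WEBVTT") or "-->" in line:
            out.append(line)
            continue
        for pattern, right in patterns:
            # A function replacement avoids backslash interpretation.
            line = pattern.sub(lambda _m, r=right: r, line)
        out.append(line)
    return "\n".join(out)


live_vtt = (
    "WEBVTT\n"
    "\n"
    "00:00:01.000 --> 00:00:04.000\n"
    "gloss cap uses whisper large for decoding\n"
)
corrections = {"gloss cap": "GlossCap", "whisper large": "Whisper-large"}
corrected = correct_cue_text(live_vtt, corrections)
print(corrected)
```

A pass like this is best treated as a helper for human review rather than a route to 99%+ on its own, since a phrase such as "whisper large" is not always the model name.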

Microsoft Teams

Teams meeting captions are built into the Teams client. During a meeting, any participant can enable live captions via More > Language and speech > Turn on live captions. No additional license is required for basic captions. Teams Premium adds speaker attribution in the transcript and real-time translation.

Live caption configuration: Teams does not expose a custom vocabulary interface at the end-user or standard admin level. Enterprise tenants with Azure Speech Service integration can configure a pronunciation lexicon (a PLS file, XML-format) through the Azure Speech resource. Most Teams deployments do not have this configured, and the out-of-box live caption accuracy applies: approximately 85–90% on standard English, 70–78% on technical vocabulary.

What Teams produces: Teams recordings are saved to OneDrive or SharePoint (depending on tenant configuration). The recording includes the MP4 and a transcript file (.vtt) generated by Teams from the meeting audio. Teams also produces an in-meeting chat-style transcript in the Teams interface. The downloadable .vtt from SharePoint reflects the full-session transcription — accuracy is generally slightly higher than the rolling live-caption display because Teams does a second-pass on the recording segment after upload, but it does not approach 99%.

The archival gap in Teams: Teams recordings published to SharePoint or linked from the LMS carry the Teams-generated transcript as captions. If the recording is embedded in a Teams channel or SharePoint page and made available on-demand, SC 1.2.2 applies. The Teams transcript needs correction before on-demand publication.

Recommended workflow: Download the Teams recording (MP4 + .vtt) from SharePoint. Run the MP4 through a glossary-corrected captioning pipeline. Replace the Teams .vtt with the corrected output. Re-upload to SharePoint or LMS with the corrected caption track attached.

Webex

Webex Meetings provides built-in captions with one of the better custom vocabulary implementations among live-session platforms. The Named Entity Vocabulary (NEV) feature, available to enterprise administrators, allows uploading a CSV file of company-specific terms. Webex applies the NEV list to the AI captioning model — the improvement is meaningful for covered terms (approximately 8–12 accuracy points on the specific vocabulary in the list).

Live caption configuration: In Webex Control Hub (admin portal), navigate to Services > Meetings > Settings > Language Intelligence > Custom Vocabulary. Upload a CSV with term, alternative spellings (optional), and language. For end-user activation, the meeting host enables captions via the Participants panel > Closed captions. Webex also supports CART integration — an external caption provider can join the session via a URL and the CART output becomes the caption stream.

What Webex produces: Webex cloud recordings include MP4 and .vtt files. If NEV is configured and the vocabulary list covers the session's technical content, the .vtt accuracy will be higher than generic ASR output — but still below 99% without post-processing.

The archival gap in Webex: Identical to Zoom and Teams. The cloud recording .vtt needs correction before use as a permanent LMS caption track.

Google Meet

Google Meet provides live captions via Google Speech-to-Text, enabled with a single click in the bottom toolbar. Meet captions are clean and reliable for standard English but have no end-user vocabulary customisation — Google Workspace admins cannot configure phrase hints or custom vocabulary for Meet captions.

Live caption configuration: Bottom toolbar > CC button or keyboard shortcut. No configuration options at the user or admin level for vocabulary customisation in Meet specifically. Google Speech-to-Text speech adaptation (phrase hints) is available via the API for developers, but this does not connect to the Meet consumer product.

What Google Meet produces: Meet recordings saved to Google Drive include an MP4 and a separate caption file in .sbv format (SubViewer, which is similar to .vtt but uses different timestamp formatting). The caption file reflects the live transcript. The .sbv must be converted to .vtt for most LMS platforms. Accuracy: approximately 82–88% on standard English, 68–76% on technical training vocabulary.

The archival gap in Google Meet: The .sbv caption file needs both conversion to .vtt and correction before LMS attachment. The conversion step (sbv → vtt) is a minor workflow friction point, but converting the format alone does not close the gap: the caption content still needs to pass through a corrected captioning pipeline.

Recommended workflow: Download the Meet recording (MP4) and caption file (.sbv) from Google Drive. Run the MP4 through the corrected captioning pipeline (the .sbv can optionally be supplied as a rough transcript starting point to speed processing). Use the corrected .vtt for LMS upload. Convert .sbv to .vtt only if the live caption file is needed for any parallel use — the corrected .vtt supersedes it for the archived module.
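For teams that do need the live caption file in .vtt form, the conversion itself is small. The sketch below assumes the common SubViewer layout: blocks separated by blank lines, each starting with a `H:MM:SS.mmm,H:MM:SS.mmm` timing line followed by one or more text lines. The function names are illustrative, and the converter changes the format only; it does nothing for accuracy.

```python
import re

_SBV_TIME = re.compile(r"^(\d+):(\d{2}):(\d{2})\.(\d{3})$")


def _to_vtt_time(stamp: str) -> str:
    """Convert 'H:MM:SS.mmm' to the zero-padded 'HH:MM:SS.mmm' used in WebVTT."""
    match = _SBV_TIME.match(stamp.strip())
    if not match:
        raise ValueError(f"Unrecognised SBV timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    return f"{int(hours):02d}:{minutes}:{seconds}.{millis}"


def sbv_to_vtt(sbv_text: str) -> str:
    """Convert SubViewer caption text to WebVTT text."""
    normalised = sbv_text.replace("\r\n", "\n").strip()
    out = ["WEBVTT", ""]
    for block in re.split(r"\n\s*\n", normalised):
        if not block.strip():
            continue
        lines = block.split("\n")
        start, end = lines[0].split(",")
        out.append(f"{_to_vtt_time(start)} --> {_to_vtt_time(end)}")
        out.extend(lines[1:])  # cue text, possibly multi-line
        out.append("")         # blank line terminates the cue
    return "\n".join(out)


sample_sbv = (
    "0:00:01.000,0:00:03.500\n"
    "Welcome to the session.\n"
    "\n"
    "0:00:03.600,0:00:06.000\n"
    "Today we cover gloss cap.\n"
)
vtt = sbv_to_vtt(sample_sbv)
print(vtt)
```

Note that the mis-captioned "gloss cap" survives the conversion unchanged, which is the point: the converted file still carries the live accuracy floor.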

BigBlueButton

BigBlueButton (BBB) is an open-source webconferencing platform widely used in higher education, particularly in Canvas LMS environments and Moodle deployments. BBB takes a different approach to live captioning than commercial platforms: rather than building ASR-based captions into the platform, BBB's native design assumes CART — an external caption provider joins the session via a specific role, and their real-time typed output becomes the caption stream for all participants.

Live caption options in BBB: The original CART-first design achieves 95–98% live accuracy but requires a trained stenographer at $100–200/hour. BBB 2.5+ added an automated subtitle plugin via the BBB-notes integration (using an external STT API). Some institutions also route BBB audio through Kaldi ASR or Google Speech-to-Text via custom integrations. Accuracy on the automated path: similar to commercial platforms, 80–88% on standard English.

Academic deployment considerations: For university lecture captions, the dominant pattern is automated captions for routine sessions and CART for formal accessibility accommodations (when an enrolled student has a documented disability accommodation requiring CART). The CART sessions achieve SC 1.2.4 compliance at near-archive quality. The automated sessions achieve SC 1.2.4 compliance at the same 80–88% accuracy floor as commercial platforms.

What BBB produces: BBB playback recordings include a subtitles track derived from the caption session. For CART sessions, the subtitle quality is very high. For automated sessions, the subtitle quality reflects the ASR accuracy floor. For LMS integration (BBB recordings embedded in Canvas or Moodle), the same archival gap applies — the recording is now prerecorded content and SC 1.2.2 applies.

Panopto

Panopto is a lecture-capture and video management platform used primarily in higher education and enterprise L&D. Panopto's captioning is structurally different from the live-session platforms: Panopto does not provide real-time live captions during a recording session. Captions are generated post-upload via Panopto's ASR module.

What this means for SC 1.2.4: If a live lecture or training session is captured via Panopto Remote Recorder (or Panopto's Mac/Windows recorder), there are no live captions during the session. SC 1.2.4 compliance during the live event requires that the live-session platform — Zoom, Teams, Webex, BBB, or the lecture hall's Zoom Rooms integration — provides live captions independently. Panopto's post-processing pipeline does not serve the live accessibility requirement.

Post-upload captioning: After a recording is uploaded to Panopto, the ASR module runs automatically (if configured) or can be triggered manually. Panopto Enterprise supports a custom word list for ASR biasing. The post-processing pipeline runs at approximately 80–87% accuracy on standard English and 63–73% on technical training vocabulary. This is somewhat lower than the live-session platforms because Panopto's ASR module is optimised for asynchronous processing speed rather than quality — it is not equivalent to a dedicated batch-mode corrected captioning workflow.

Editing in Panopto: Panopto's transcript editor allows word-by-word correction directly in the browser. For a one-hour lecture with 65% accuracy on technical terms, this means manually correcting several hundred errors. The editing interface is competent but the volume of corrections required for technical content makes manual editing unsustainable as a routine workflow.
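The "several hundred errors" figure can be sanity-checked with rough arithmetic. The sketch below is an estimate under stated assumptions, not measured data: a speaking rate of 150 words per minute, 10% of words being technical vocabulary, and 85% accuracy on the remaining words (within the Panopto range quoted above) are all illustrative inputs. Only the 65% technical-term accuracy comes from the paragraph above.

```python
def expected_corrections(minutes, words_per_minute, technical_share,
                         technical_accuracy, general_accuracy):
    """Estimate (technical_errors, general_errors) for one recording."""
    total_words = minutes * words_per_minute
    technical_words = total_words * technical_share
    general_words = total_words - technical_words
    technical_errors = round(technical_words * (1 - technical_accuracy))
    general_errors = round(general_words * (1 - general_accuracy))
    return technical_errors, general_errors


technical_errors, general_errors = expected_corrections(
    minutes=60,
    words_per_minute=150,     # assumed speaking rate
    technical_share=0.10,     # assumed share of technical vocabulary
    technical_accuracy=0.65,  # figure from the text
    general_accuracy=0.85,    # assumed, within the quoted 80-87% range
)
print(technical_errors, general_errors)
```

Under these assumptions the technical terms alone account for roughly three hundred corrections in a one-hour lecture, before counting errors in ordinary speech.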

Recommended approach: For content where SC 1.2.2 compliance is required, bypass Panopto's ASR module (or run it for rough reference only) and run the recording through a dedicated corrected captioning pipeline with your company or subject-specific glossary. Import the corrected .vtt into Panopto's transcript editor as a replacement caption file, or upload via the Panopto captions API.

Mediasite

Mediasite is a lecture-capture and enterprise video management platform similar in architecture to Panopto. Like Panopto, Mediasite generates captions post-upload rather than in real time. Mediasite's ASR module accuracy is comparable to Panopto — approximately 79–86% on standard English, 62–72% on technical training vocabulary.

Mediasite captioning workflow: Mediasite Lecture Capture or Mediasite Producer creates a recording. After upload, the ASR module transcribes the audio. The transcript is available for editing in the Mediasite presentation editor. Mediasite supports caption file import (.srt, .vtt) for replacing ASR output with an externally corrected caption file.

The Mediasite use case: Mediasite is common in higher education (lecture hall capture) and in healthcare training environments. For healthcare training content — where drug names, procedure codes, anatomy terms, and institutional vocabulary appear at high density — the gap between Mediasite's ASR accuracy (62–72% on technical vocabulary) and the SC 1.2.2 requirement (99%+) is particularly acute. A pharmacology lecture where every drug name is incorrectly captioned is not providing accurate captions in any meaningful sense of the word.

Recommended approach: Same as Panopto. Run the Mediasite recording through a glossary-corrected external pipeline and import the resulting caption file via Mediasite's import interface.

Custom vocabulary in live captioning: what it does and what it does not do

Custom vocabulary support in live captioning platforms is real and useful — it is not a marketing feature. Configuring a vocabulary list before a session can meaningfully reduce the proper-noun error rate on covered terms. But it has limitations that prevent it from closing the archival gap on its own.

How custom vocabulary works in live ASR

Most commercial live ASR systems support phrase biasing or boosting: a weighted vocabulary list is loaded into the decoding stage, making the recogniser more likely to produce the listed term when the audio phonetically matches or approximately matches it. The effect is directional (it shifts probability toward the listed term) rather than absolute. If the speaker says "GlossCap" and "GlossCap" is in the vocabulary list, the ASR is more likely to output "GlossCap" than it would be without the list. It is not guaranteed. The strength of the bias depends on implementation.

A well-curated vocabulary list typically adds 5–12 percentage points of accuracy on covered terms. If the unbiased model would produce 68% accuracy on a 100-word technical passage, a vocabulary-biased model might produce 75–78% on the same passage. This is meaningful: it is the difference between "gloss cap whisper large glossary bias decoding" and "GlossCap Whisper-large glossary-biased decoding" for covered terms. It does not produce 99% accuracy.
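The directional nature of biasing can be shown with a toy rescoring example. This is a deliberately simplified sketch, not how any particular platform implements it: the candidate transcripts, their scores, and the flat per-term boost are all made up for illustration.

```python
def pick_hypothesis(candidates, boosted_terms, boost=2.0):
    """Pick the candidate transcript with the best biased score.

    candidates: list of (text, score) pairs, higher score is better.
    boosted_terms: vocabulary list; each term found adds `boost`.
    """
    def biased_score(item):
        text, score = item
        hits = sum(1 for term in boosted_terms if term in text)
        return score + boost * hits

    return max(candidates, key=biased_score)[0]


vocab = ["GlossCap", "Whisper-large"]

# Clear audio: the correct transcript is a close second without biasing.
clear = [
    ("gloss cap uses whisper large", -4.0),
    ("GlossCap uses Whisper-large", -6.5),
]
without_list = pick_hypothesis(clear, [])
with_list = pick_hypothesis(clear, vocab)

# Noisy audio: the evidence favours a wrong transcript so strongly
# that the same boost is not enough to overturn it.
noisy = [
    ("gross cap uses whisper large", -3.0),
    ("GlossCap uses Whisper-large", -9.0),
]
noisy_with_list = pick_hypothesis(noisy, vocab)

print(without_list)
print(with_list)
print(noisy_with_list)
```

The vocabulary list rescues the clear-audio case and fails the noisy one, which mirrors the behaviour described above: biasing raises the odds of the listed term without guaranteeing it.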

Zoom custom vocabulary: practical configuration

Zoom's custom vocabulary is configured in Admin Settings > AI Companion > Automated captions > Custom vocabulary. Terms are entered as a comma-separated or newline-separated list. Zoom processes the list phonetically and applies it to the AI Companion caption model. Limitations: the list is capped at 100 terms and applies account-wide to every meeting, with no per-meeting vocabulary. You cannot configure "sales team vocabulary" for sales meetings and "engineering vocabulary" for engineering meetings without separate sub-accounts.

Best practice for Zoom vocabulary: populate the 100 slots with your organisation's highest-failure-rate terms — product names, SDK symbols, proprietary tool names, executive names, company-specific acronyms. These are the terms that appear in every training session and that the generic model never produces correctly. Leave general vocabulary (common technical words that the model handles adequately without biasing) off the list.
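
One way to choose the 100 terms is to rank them by how often reviewers had to correct them in past caption tracks. The Python sketch below assumes you keep such a correction log; the log format and the function name `select_vocabulary` are hypothetical, and the sample terms are placeholders.

```python
from collections import Counter


def select_vocabulary(corrections, limit=100):
    """Return the most frequently corrected terms, most frequent first.

    corrections: iterable of terms, one entry per time a reviewer had to
    fix that term in a caption track. Ties are broken alphabetically.
    """
    counts = Counter(corrections)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0].lower()))
    return [term for term, _ in ranked[:limit]]


corrections = ["GlossCap", "Kaltura", "GlossCap", "TalentLMS", "Kaltura", "GlossCap"]
top_terms = select_vocabulary(corrections, limit=2)

# Newline-separated text, ready to paste into the vocabulary field.
vocabulary_text = "\n".join(top_terms)
```

Ranking by observed corrections rather than raw frequency keeps the limited slots for terms the generic model actually gets wrong.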

Webex Named Entity Vocabulary: the most flexible option

Webex's NEV is the most mature custom vocabulary feature among the major live-session platforms. The CSV format allows specifying multiple alternate spellings or pronunciations per term, which is useful for terms with non-obvious phonetic representations ("psych-o-log-ee" for "psychology" in a non-native speaker's accent, or "gloss-cap" vs "glos-cap"). The list is not capped at 100 terms — larger lists are supported for enterprise accounts. NEV applies per-language, which matters for multilingual training organisations. The vocabulary is configured at the Control Hub admin level and applied to all Webex Meetings sessions on the account.

The accuracy improvement with a well-configured NEV list is approximately 8–12 points on covered terms — slightly better than Zoom's implementation, likely because the multi-spelling format provides the biasing system more information to work with.

Microsoft Teams: the gap and workaround

Teams does not provide a custom vocabulary interface for standard enterprise customers. This is a genuine gap relative to Zoom and Webex. The Azure Speech Service approach — configuring a pronunciation lexicon (PLS file) through the Azure portal — requires: an Azure subscription linked to the Microsoft 365 tenant, Azure Speech Service resource creation, PLS file preparation (XML format, quite technical), and linkage of the Speech Service resource to Teams. Most L&D teams do not manage Azure infrastructure directly and will need IT collaboration to configure this. In practice, most Teams deployments run with no vocabulary customisation and the out-of-box accuracy applies.

For Teams-heavy organisations where live captioning accuracy is a significant concern, the workaround is to invest in post-processing quality — configure the Teams recording workflow to always route through a corrected captioning pipeline before LMS publication, rather than trying to improve live accuracy at the source.

What custom vocabulary cannot fix

Even with optimal vocabulary configuration, live captioning cannot produce 99%+ accuracy on technical training content. The reasons are structural:

Terms not in the vocabulary list. A 100-term list covers the highest-frequency proper nouns. A typical software-product training video contains 400–600 unique product-adjacent terms (feature names, module names, field labels, error codes, API parameters). The vocabulary list covers the top 100. The remaining 300–500 are transcribed without biasing.
New content has new vocabulary. A custom vocabulary list is pre-configured. A product launch training session introduces product names that do not exist in the vocabulary list because the product was announced that day. Live captions on a product launch session will fail on the exact vocabulary that matters most for that session.
Contextual disambiguation is limited. Vocabulary biasing increases the probability that a phonetically matched term is produced. It does not use surrounding sentence context to disambiguate homophone pairs or near-homophones. The batch pipeline uses full document context to make these calls. The live pipeline cannot.
The ceiling for ASR-based live captioning. Industry practitioners working on live captioning systems estimate the practical accuracy ceiling for ASR-based live captions at approximately 92–94% under optimal conditions (single speaker, professional microphone, pre-loaded vocabulary, standard vocabulary density). For technical training content at realistic conditions, 80–88% is the typical observed range. No current ASR-based live captioning product consistently delivers 99%+ on L&D content without human review.

The implication: custom vocabulary is worth configuring as part of your live-session pre-event checklist (it improves the live session experience for attendees), but it is not an alternative to the post-processing step for archive compliance.

Building the two-stage captioning workflow

The two-stage workflow closes the archival gap without creating a manual review burden on the L&D team for every session. The key insight is that the two stages have different objectives and different tooling: Stage 1 serves live attendees (SC 1.2.4 compliance), and Stage 2 serves on-demand viewers (SC 1.2.2 compliance). They run sequentially, and neither can substitute for the other.

Stage 1: Live accessibility (before and during the session)

The objective of Stage 1 is to ensure every participant who joins the live session has access to real-time captions. This is the SC 1.2.4 requirement. The checklist for Stage 1:

Pre-event vocabulary configuration (once, per platform): If your platform supports custom vocabulary (Zoom, Webex), ensure the vocabulary list is configured with your organisation's highest-frequency technical terms. This is a one-time setup per platform, not a per-session task. Review and update quarterly or when a major product launches with new vocabulary.
Enable captions before the session starts: For Zoom, verify that AI Companion automated captions are enabled in account settings. For Teams, set captions to start automatically or ensure the meeting host enables them before the first speaker. For Webex, pre-enable captions in the meeting settings. Do not leave caption activation as an attendee's responsibility — many attendees who need captions for accessibility will not navigate the settings UI under time pressure.
Test the caption display: In the 5-minute window before the session starts, have the host or a tech producer join as a participant and verify that captions are rendering correctly. Test with a few product names from your current vocabulary list to confirm the vocabulary biasing is active.
CART for formal accommodations: If any attendee has a documented disability accommodation requiring CART quality (95–98% accuracy, named attendee), arrange CART through a provider for that session. CART is not required for all sessions — it is required for sessions where an accommodation specifies it. The per-session cost is $100–200/hour. Document the CART arrangement in the accommodation file.
Enable cloud recording: Confirm that the session is being recorded to cloud (not local). Cloud recordings produce the cleanest MP4 and are available to the post-processing pipeline without manual file transfer.

When Stage 1 is complete, the live session is accessible to all participants (SC 1.2.4 compliant) and a cloud recording is being generated for Stage 2.

Stage 2: Archive accuracy (post-event, before LMS publication)

The objective of Stage 2 is to produce a WCAG SC 1.2.2-compliant caption track for the recorded session before it is published as on-demand content. The checklist for Stage 2:

Download the cloud recording: Download the MP4 from Zoom, Teams SharePoint, Webex, or Google Drive. Note: Do not use the platform's auto-generated caption file (.vtt, .sbv) as the final caption track. Archive it for internal reference if useful, but it is not the SC 1.2.2 caption track.
Run through glossary-corrected captioning pipeline: Submit the MP4 to GlossCap with your company glossary loaded. The glossary should include your full term set — not just the 100 terms in the live-platform vocabulary list. GlossCap applies Whisper-large with glossary-biased decoding to the full recording, using the complete glossary context and the full recording duration as the disambiguation window. The output is a corrected .vtt at 99%+ accuracy on vocabulary covered by the glossary.
Review flagged segments (if any): GlossCap flags segments where confidence is below threshold — typically novel proper nouns not in the glossary, cross-talk segments, or audio quality drops. Review these segments (usually 2–5% of the total runtime for a well-configured glossary) and confirm or correct the transcription. This step is much faster than manual review of the full transcript: you are reviewing only flagged segments, not every line.
Upload corrected caption track to LMS: Replace the platform-generated caption track with the corrected .vtt. For Panopto and Mediasite, import via the transcript editor or captions API. For Kaltura, TalentLMS, and most LMS platforms, attach the .vtt as the caption track during video upload or via the media properties panel after upload.
Document the compliance record: Log the session, the platform, the recording date, the caption processing completion date, and the accuracy documentation (GlossCap provides a confidence score on the generated caption track). This documentation is the WCAG 2.1 AA SC 1.2.2 conformance record for this piece of content. Store it in your caption audit trail alongside the video metadata.
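
The review and documentation steps above can be sketched in Python. This is an illustration only: the `Cue` structure, the 0.85 threshold, the record fields, and the sample values are assumptions, not GlossCap's actual output format.

```python
import json
from dataclasses import dataclass


@dataclass
class Cue:
    start: float       # seconds from the start of the recording
    end: float
    text: str
    confidence: float  # hypothetical 0-1 score attached by the captioning tool


def flagged_cues(cues, threshold=0.85):
    """Cues whose confidence falls below the review threshold."""
    return [cue for cue in cues if cue.confidence < threshold]


def flagged_share(cues, threshold=0.85):
    """Fraction of captioned runtime that needs human review."""
    total = sum(cue.end - cue.start for cue in cues)
    if total == 0:
        return 0.0
    flagged = sum(cue.end - cue.start for cue in flagged_cues(cues, threshold))
    return flagged / total


def compliance_record(session, platform, recorded_on, processed_on, cues):
    """One audit-trail entry, serialised as a single JSON line."""
    if not cues:
        raise ValueError("caption track has no cues")
    mean_confidence = sum(cue.confidence for cue in cues) / len(cues)
    return json.dumps({
        "session": session,
        "platform": platform,
        "recorded_on": recorded_on,
        "captions_processed_on": processed_on,
        "mean_confidence": round(mean_confidence, 3),
        "flagged_share": round(flagged_share(cues), 3),
        "criterion": "WCAG 2.1 AA SC 1.2.2",
    })


cues = [
    Cue(0.0, 8.0, "Welcome to the product enablement session.", 0.97),
    Cue(8.0, 18.0, "Open the GlossCap dashboard and load your glossary.", 0.95),
    Cue(18.0, 20.0, "Then enable the new sync option.", 0.60),
]
to_review = flagged_cues(cues)
record = compliance_record("Product enablement", "Zoom", "2026-06-01", "2026-06-02", cues)
```

Appending one such line per published recording to a log file gives a simple audit trail that can sit alongside the video metadata.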

Stage 2 processing time is typically 0.5–2× real-time for machine processing, plus 15–30 minutes for human review of flagged segments on a 60-minute session. A practical SLA of 24–48 hours between live session and on-demand publication is achievable for most L&D teams with this workflow.
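
As a quick check on that SLA, the arithmetic for a 60-minute session can be written out in Python. The function name is illustrative, and the inputs are simply the ranges quoted above.

```python
def stage2_minutes(session_minutes, speed_factor, review_minutes):
    """Machine processing time plus human review time, in minutes.

    speed_factor: processing time as a multiple of real time (0.5 to 2).
    review_minutes: time spent reviewing flagged segments.
    """
    return session_minutes * speed_factor + review_minutes


best_case = stage2_minutes(60, 0.5, 15)   # fast processing, light review
worst_case = stage2_minutes(60, 2.0, 30)  # slow processing, heavier review
```

Under these figures the work itself takes between 45 minutes and two and a half hours, so most of a 24–48 hour SLA is queueing and scheduling slack rather than processing time.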

When the SLA matters most

The 24–48 hour publication SLA is appropriate for most on-demand content. Two situations require faster handling:

Formal ADA accommodations: If an employee has a documented accommodation requiring caption access to specific content within a shorter window (for example, an employee needs to complete a compliance module before a deadline and requires accurate captions to do so), the accommodation's terms govern — not the default SLA. Work with HR and legal to define an expedited processing path for accommodation-driven requests.
Live broadcast with simultaneous archive: Some organisations broadcast live sessions while also making the recording available on-demand immediately after the session ends (simulcast model). In this case, there is no gap between live and archive — the recording is on-demand from the moment the live session ends. This model requires either CART (which produces a near-SC 1.2.2-grade live transcript that can serve as the archive caption after minor review) or accepting that the archive will be temporarily non-compliant until Stage 2 processing completes. A practical policy for this model: publish the recording as "in progress — captions will be available within 24 hours" if the archive must go live immediately, rather than publishing with the non-compliant live caption track as the permanent caption file.

Integrating the two-stage workflow into your production process

The two-stage workflow works best when it is embedded in the standard post-session production checklist rather than treated as an exception process. The L&D production role (instructional designer, media producer, L&D coordinator) should have a documented checklist that includes Stage 2 as a standard step alongside editing and LMS metadata entry:

Download cloud recording from [platform]
Edit to remove pre-session dead time and post-session tail
Submit to GlossCap with [course-name] glossary
Review flagged segments and confirm transcript
Export corrected .vtt
Upload MP4 + corrected .vtt to LMS
Log to caption compliance tracker (session, date, confidence score)
Publish module
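
If your production tracker is scriptable, the checklist above can double as a publish gate. A minimal Python sketch, with step identifiers invented for illustration:

```python
# Steps that must be complete before "Publish module", in checklist order.
CHECKLIST = [
    "download_recording",
    "edit_dead_time",
    "submit_for_captioning",
    "review_flagged_segments",
    "export_corrected_vtt",
    "upload_to_lms",
    "log_compliance_record",
]


def missing_steps(completed):
    """Checklist steps not yet marked complete, in checklist order."""
    return [step for step in CHECKLIST if step not in completed]


def can_publish(completed):
    """True only when every pre-publication step is done."""
    return not missing_steps(completed)


done_so_far = {"download_recording", "edit_dead_time", "upload_to_lms"}
blocked_by = missing_steps(done_so_far)
```

Here the module stays unpublished because the captioning steps are missing, even though the video itself is already uploaded.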

Treating the captioning step as one line in the existing production checklist — rather than a separate accessibility workflow — is the most reliable way to prevent the step from being skipped under time pressure. The goal is that caption correction is as automatic as editing out the dead time at the start of the recording.

Compliance implications by regulatory framework

The dual-standard structure of WCAG (SC 1.2.4 for live, SC 1.2.2 for prerecorded) maps onto the regulatory frameworks that govern L&D content differently depending on the organisation's type and the content's audience.

ADA Title II (state/local government, public universities)

ADA Title II requires effective communication for people with disabilities in programs and activities of covered entities. For training video content at public universities and state/local government agencies, both SC 1.2.4 (live class sessions, synchronous training) and SC 1.2.2 (recorded lectures, on-demand training) apply. The ADA Title II digital accessibility rule, with its April 2026 compliance deadline, explicitly references WCAG 2.1 AA as the technical standard for web content accessibility, which includes both SC 1.2.2 and SC 1.2.4.

For public universities, the primary risk surface is the LMS library of archived lecture recordings and course videos. These are SC 1.2.2 content. Live class sessions are SC 1.2.4. Universities that provide live captions during class but do not correct the archived recordings before posting them to the LMS are compliant for the live session and non-compliant for the archived content — a compliance gap that grows with every semester of recordings added to the library.

Section 508 (federal agencies and contractors)

Section 508 requires that federal electronic and information technology, including training and educational content, be accessible to people with disabilities. The technical standard for Section 508 is WCAG 2.0 Level AA (more recent interpretations cite WCAG 2.1 AA by reference). For video content, the same dual-standard logic applies: live training sessions require SC 1.2.4 compliance; archived training content requires SC 1.2.2 compliance. The Section 508 compliance matrix for training teams covers the specific application.

ADA Title I (private employers, 15+ employees)

ADA Title I requires reasonable accommodations for employees with disabilities. A hearing-impaired employee who needs accurate captions to access training content required for their job (onboarding, compliance training, product enablement) can request a caption accommodation. The accommodation standard is that the captions must be accurate enough for the employee to access the training content effectively — not just technically present. Providing a 70% accuracy caption track on a compliance training module that an employee must complete is not a reasonable accommodation. The SC 1.2.2 standard (99%+) is the appropriate accuracy target for accommodating caption requests under ADA Title I.

EAA (European Accessibility Act, B2C digital products in the EU)

The European Accessibility Act, enforceable since June 2025, applies to digital products and services sold in the EU. For software products that include training video content (onboarding flows, help videos, product tours), the EAA requires WCAG 2.1 AA compliance. Training content for customers in the EU must meet SC 1.2.2 for prerecorded video. Customer-facing training video is the highest-risk surface for EU-based or EU-selling organisations.

Contractual accessibility requirements

Many enterprise contracts now include accessibility warranty clauses requiring that the software product (and its support materials, including training video) meet WCAG 2.1 AA. A VPAT (Voluntary Product Accessibility Template) claim of WCAG 2.1 AA compliance that covers training video must accurately represent the caption quality of that video. If training video is listed as WCAG 2.1 AA conformant in a VPAT but the caption tracks are live-generated .vtt files at 72% accuracy, the VPAT claim is inaccurate and creates contractual liability.

The Panopto and Mediasite case in depth: the lecture-capture specific problem

Lecture-capture platforms present a slightly different version of the live-vs-recorded problem because they do not produce live captions at all — and this creates a double-exposure for institutions that rely on them.

The lecture-capture gap

When a professor delivers a lecture in a Panopto-captured classroom, there are no captions visible during the lecture for students attending in person or joining via the room's live stream. The captions are generated post-upload — which means the only caption track is an archived recording caption track. This simplifies the WCAG analysis (there is no SC 1.2.4 question — there is no live caption to configure) but concentrates the compliance requirement entirely on SC 1.2.2.

The problem is that Panopto's post-upload ASR is calibrated for throughput, not accuracy. For an institution running 2,000 lecture recordings per semester, Panopto's automated captioning gets a first-pass transcript on all 2,000 recordings without human intervention — but at 63–73% accuracy on technical vocabulary. For a pharmacology lecture, a programming course, a legal theory seminar, or an engineering design review, that accuracy level means hundreds of errors per recording, on exactly the vocabulary that defines the course's subject matter. For hearing-impaired students relying on captions to access course content, a 65% accuracy caption track on a biochemistry lecture is not an accommodation — it is noise with occasional recognisable words.

The institutional-scale challenge

The structural challenge for universities using Panopto or Mediasite at scale is that the accuracy problem is not limited to a few recordings — it applies to the entire library of automatically captioned content. A university that has 10 years of Panopto recordings with ASR captions has a back catalogue of SC 1.2.2 non-compliance that grew while the institution believed it was meeting the captioning standard.

The triage approach for institutional back-catalogue remediation: prioritise by compliance urgency (active course content for enrolled students with disability accommodations first, then active course content generally, then archived recordings with historical access only) and by accuracy impact (courses with high technical vocabulary density — STEM, medical, legal — have the largest accuracy gap and the highest accessibility impact). The caption compliance program post covers the triage framework in detail.
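
The two triage criteria can be expressed as a sort key. In the Python sketch below, treating compliance urgency as the primary key and vocabulary density as the tie-breaker is an assumption, as are the category labels and the sample course titles.

```python
from dataclasses import dataclass

# Lower rank = remediate sooner.
URGENCY_RANK = {"accommodation": 0, "active": 1, "archived": 2}
DENSITY_RANK = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Recording:
    title: str
    urgency: str             # "accommodation", "active", or "archived"
    vocabulary_density: str  # "high", "medium", or "low"


def triage(recordings):
    """Order recordings by compliance urgency, then by accuracy impact."""
    return sorted(
        recordings,
        key=lambda r: (URGENCY_RANK[r.urgency], DENSITY_RANK[r.vocabulary_density]),
    )


backlog = [
    Recording("Art history survey", "archived", "low"),
    Recording("Pharmacology 201", "active", "high"),
    Recording("Intro statistics", "accommodation", "medium"),
    Recording("Contract law seminar", "active", "medium"),
]
queue = [recording.title for recording in triage(backlog)]
```

The course with an active accommodation comes first regardless of subject, and vocabulary density only orders recordings within the same urgency tier.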

The hybrid case: Zoom + Panopto in the same workflow

Many institutions run a hybrid workflow: the live class session happens on Zoom (with Zoom live captions), and Panopto captures the Zoom recording for the LMS library. This creates three caption tracks:

Zoom live captions during the session (SC 1.2.4 — ~82–88% accuracy, compliant for live access)
Zoom cloud recording .vtt (auto-generated from live captions — ~82–88% accuracy, NOT compliant for SC 1.2.2 archive use)
Panopto post-upload ASR (run on the Zoom recording after Panopto ingestion — ~80–87% accuracy, NOT compliant for SC 1.2.2)

In the worst case, the Zoom .vtt and the Panopto ASR both exist and neither is compliant. In a common case, only one of the two exists and is used as the archive caption track despite being non-compliant. The solution is the same in either case: run the Panopto-ingested recording through a corrected captioning pipeline with a course-specific glossary, import the corrected .vtt into Panopto, and discard the prior caption tracks.

The AI-generated training video edge case

One category of training video has no "live" phase at all but creates a specific captioning challenge that is worth addressing in the context of this post: AI-avatar training videos produced with tools like Synthesia, HeyGen, Descript, or Pictory.

AI avatar videos are born recorded — they are created by rendering text-to-speech (TTS) audio against a synthetic avatar, exported as an MP4, and uploaded directly to the LMS without a live session phase. They fall entirely under SC 1.2.2 from creation. The complication: TTS-generated voice has prosody, pacing, and acoustic characteristics that differ from natural human speech in ways that degrade ASR accuracy. TTS voice is typically flatter in intonation, uses consistent pacing without the natural hesitations and speed variations of live speech, and may produce some phonemes with characteristic artifacts of the synthesis model. ASR systems trained on natural human speech — including Whisper and the commercial ASR APIs used by Panopto and Zoom — show accuracy degradation of approximately 5–8 points on TTS audio relative to human-narrated audio with equivalent vocabulary content.

The practical consequence: if you generate a Synthesia training video about your SaaS product, the TTS voice reads the script cleanly and the video looks professional, but the ASR pipeline that generates the caption track is producing output at lower accuracy than it would on a human-narrated equivalent. You have pre-recorded content (SC 1.2.2 applies) that is harder than average for auto-caption tools to handle. Using the caption track that Synthesia generates from its TTS script is the right approach — Synthesia has access to the original script text and can produce a near-perfect caption file. Using a post-upload ASR on a Synthesia export is the wrong approach, for exactly this reason.
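
The reason a script-derived track can be near-perfect is that the text is known exactly and only the timings need attaching. The Python sketch below shows the idea by writing WebVTT from script segments. It does not represent Synthesia's export mechanism; the segment timings are assumed to be available from the TTS render, and the function names are illustrative.

```python
def vtt_timestamp(seconds):
    """Format seconds as a WebVTT timestamp (hh:mm:ss.mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def script_to_vtt(segments):
    """Build WebVTT text from (start_seconds, end_seconds, text) tuples."""
    lines = ["WEBVTT", ""]
    for start, end, text in segments:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text)
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)


segments = [
    (0.0, 2.5, "Welcome to GlossCap."),
    (2.5, 5.0, "Open the dashboard."),
]
vtt_text = script_to_vtt(segments)
```

Because the caption text is copied from the script rather than recognised from audio, the vocabulary errors that ASR introduces never enter the file.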

FAQ

If we record a live session but upload it to the LMS with no caption track, are we non-compliant?

Yes, for SC 1.2.2. The absence of captions on prerecorded synchronised media is a clear failure of SC 1.2.2. The fact that captions were present during the live session (SC 1.2.4) does not substitute for captions on the archived recording. The live session and the archived recording are two separate pieces of content under two separate WCAG success criteria. Compliance with SC 1.2.4 during the live event does not carry over to the recording.

If we upload the .vtt from our Zoom recording to the LMS, are we compliant?

Coverage: yes (a caption track is present). Accuracy compliance: almost certainly not. The Zoom live-generated .vtt has approximately 82–88% accuracy on standard English and 65–75% on technical training vocabulary. WCAG SC 1.2.2 requires 99%+ accuracy. You have met the coverage criterion and failed the accuracy criterion. A compliance audit that examines the actual caption track (not just whether one exists) will find the discrepancy. The .vtt needs to pass through a corrected captioning pipeline before it is suitable as the permanent archive caption track.

What is the difference between CART and AI captioning for live sessions?

CART (Communication Access Realtime Translation) is live captioning provided by a trained human stenographer using specialised equipment (steno machine + CAT software) or voice writing, connected to the session via a captioner login. CART produces 95–98% accuracy in real time, including on technical vocabulary, because the CART provider can be briefed on subject matter in advance. Cost: $100–200/hour, typically with a 2-hour minimum booking. AI live captioning is ASR-based, included in most platform licenses (Zoom AI Companion, Teams, Webex) at no per-session charge, and produces 80–90% accuracy on standard English with lower accuracy on technical vocabulary. For routine training sessions, AI captioning meets SC 1.2.4. For formal disability accommodations, legal proceedings, executive communications, or any session where a named individual requires near-verbatim accuracy live, CART is the appropriate choice.

Can we meet 99% accuracy on live captions without post-processing?

Not reliably with current ASR technology at scale. CART delivered by a trained stenographer approaches 98% live, but at a cost that makes it impractical as a universal live captioning approach for all L&D sessions. ASR-based live captioning has an observed practical ceiling of approximately 92–94% under optimal conditions (single speaker, professional microphone, pre-loaded vocabulary, standard vocabulary density) and produces 80–88% on typical L&D content at typical recording conditions. No currently available ASR-based live captioning product consistently produces 99%+ on technical training content without human review. The practical answer for L&D is: accept live captioning at its achievable accuracy for live accessibility, and require post-processing to reach the 99%+ threshold for archived content.

What policy language should we use to govern live-vs-recorded captions?

Recommended policy language for a captioning policy document covering both standards: "All live training sessions hosted via [platform list] shall have automated live captions enabled prior to session start. Sessions requiring formal accessibility accommodations shall use CART captioning as specified in the accommodation. All recordings of live training sessions shall be processed through the organisation's captioning quality workflow (achieving WCAG 2.1 AA SC 1.2.2 compliant accuracy of 99%+) before publication to the LMS or any on-demand channel. Recordings shall not be published as on-demand content without a compliant caption track. The target SLA from session end to on-demand availability is [24/48] hours." This language covers SC 1.2.4 (live captions enabled), SC 1.2.2 (recordings corrected to 99%+), the CART accommodation path, and the publication SLA in a single policy section.

Does WCAG 2.2 change anything about live vs recorded captions?

WCAG 2.2 (October 2023) did not modify SC 1.2.2 or SC 1.2.4. Both success criteria remain substantively unchanged from WCAG 2.1. The accuracy expectations, scope definitions, and the distinction between live and prerecorded content are the same in both versions. If you are targeting WCAG 2.1 AA compliance (the standard referenced for ADA purposes and by more recent Section 508 interpretations), the guidance in this post applies directly. WCAG 2.2 conformance requires meeting the 2.1 criteria it carries forward (SC 4.1.1 Parsing was removed) plus the new 2.2 criteria; the captioning standards are not among the changes.

How do we handle a recording that is both a live event archive AND new course content added to by an instructional designer?

Treat each distinct media segment by its origin. A course that includes (a) a recorded live webinar segment and (b) separately recorded instructional video segments has two SC 1.2.2 compliance requirements — one for each segment, each needing its own corrected caption track. The corrected caption tracks can be embedded in a single .vtt for the full course if the LMS supports chapter-level captioning, or treated as separate video assets each with their own caption file. The key point: the live-webinar segment does not inherit compliance from the instructional-designer-recorded segments. Each segment needs its own SC 1.2.2-grade caption track.
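
If the segments are stitched into one video with a single caption file, each segment's corrected cues need shifting by the running time of the segments before it. A minimal Python sketch, assuming cues are held as (start, end, text) tuples in seconds relative to their own segment:

```python
def merge_segment_cues(segments):
    """Combine per-segment cues into one timeline.

    segments: list of (duration_seconds, cues) in playback order, where
    cues is a list of (start, end, text) relative to the segment start.
    """
    merged = []
    offset = 0.0
    for duration, cues in segments:
        for start, end, text in cues:
            merged.append((start + offset, end + offset, text))
        offset += duration  # the next segment starts where this one ends
    return merged


webinar = (1800.0, [(0.0, 4.0, "Welcome to the live webinar.")])
designer_module = (600.0, [(1.0, 3.0, "This module covers setup.")])
timeline = merge_segment_cues([webinar, designer_module])
```

Each segment's track is corrected on its own first; the merge only moves timestamps and does not change caption text.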

Are AI-powered captions in newer platform versions (like Microsoft Copilot) different from standard ASR live captions?

Microsoft Teams with Copilot does not fundamentally change the live captioning accuracy picture. Copilot adds meeting summarisation, action item extraction, and question-answering on meeting content — these are post-meeting analysis features that operate on the transcript, not real-time improvements to the live caption accuracy during the meeting. The live caption accuracy in a Teams meeting with Copilot is approximately the same as without Copilot. Similarly, Zoom AI Companion's expanded features (meeting summaries, action items, chat assistance) do not change the accuracy of the live captioning stream. The underlying ASR model for real-time live captions is constrained by the same latency and context-window limitations regardless of what AI analysis features are added on top of it.

Close the archival gap with GlossCap

GlossCap is built specifically for L&D teams that produce live training sessions and need those recordings to meet WCAG 2.1 AA SC 1.2.2 accuracy in the LMS. Upload the Zoom, Teams, Webex, or Google Meet recording MP4, load your company glossary (from Notion, Confluence, Google Docs, or a pasted term list), and GlossCap returns a corrected .vtt at 99%+ accuracy on your technical vocabulary — ready to attach to the LMS module. The same pipeline works for Panopto and Mediasite recordings that need archive-quality captions. No manual line-by-line editing. No custom vocabulary cap at 100 terms. The full glossary context across the full recording, not a four-second live window.
