Live Training Operations · Published 2026-06-15
Live training captioning playbook: CART providers, real-time ASR, the vILT platform workflow, and the post-event recording that needs its own caption track
Every asynchronous training video an L&D team produces has a consistent caption workflow: the video is finished, a caption file is generated or commissioned, the caption file and video are uploaded to the LMS together. There are quality problems in this workflow — accuracy on technical vocabulary, timing synchronization, LMS delivery failures — but the workflow itself has a single deliverable (the video with a caption sidecar) and a known sequence. The existing GlossCap blog corpus covers this workflow in depth across every major LMS platform, content type, and compliance framework.
Live instructor-led training (ILT) and virtual instructor-led training (vILT) have a fundamentally different caption architecture. The audience is present in real time, which means the caption requirement exists during the session — before any post-production step is possible. The session will also likely be recorded, creating a second compliance obligation: the recording becomes a prerecorded video, and prerecorded video under WCAG 2.1 SC 1.2.2 requires accurate captions before it is made available to the async audience. The live session and the recording are two legally distinct media assets with two distinct caption requirements, and the captioning approach that serves one does not automatically serve the other.
This creates a set of operational problems that L&D teams routinely underestimate. CART (Communication Access Realtime Translation) providers must be booked in advance — two to five business days for standard rates, twenty-four hours for emergency rates. Real-time ASR in Zoom, Teams, Google Meet, and BigBlueButton provides instant availability but at accuracy levels of 72–90% on technical training vocabulary — below the 99% WCAG threshold and often below the "effective communication" standard for ADA live-event compliance. The multi-speaker Q&A portion of any ILT session is the hardest scenario for any captioning approach: audience members without microphones, rapid turn-taking, speaker identification failures, and the compounding effect of live acoustic environments. And the post-event recording — the step L&D teams most commonly get wrong — requires a new caption track derived from the session audio, not the real-time caption stream, because the real-time caption timestamps do not align with the recording timestamp.
This guide covers the complete operational picture for live training captioning: the three approaches and when to use each, the CART provider sourcing and briefing protocol, platform-by-platform real-time ASR accuracy and setup for Zoom, Teams, Google Meet, Webex, BigBlueButton, and Adobe Connect, the multi-speaker Q&A failure modes, the post-event recording workflow as a separate compliance deliverable, vILT platform LMS integration for the recording, in-room ILT display and audio setup, the compliance framework across ADA Title I, ADA Title II, Section 508, WCAG 2.1 SC 1.2.4, the EAA, and AODA, glossary pre-loading for CART providers, eight failure modes, and a seven-question FAQ.
TL;DR — six things every L&D team running live training needs to know
- Live session captions and post-event recording captions are two separate compliance deliverables. The real-time caption stream (CART or ASR) serves attendees during the session. The recording requires its own caption track, because real-time caption timestamps are tied to the session clock and do not map onto the recording timecode. Most vILT programmes fail on the recording, not the live session, because teams assume the live stream automatically captions the recording.
- CART must be booked in advance. CART provider standard booking lead time is two to five business days. Emergency rates (24-hour booking) cost a 50–100% premium. Sessions that rely on ad-hoc CART availability will encounter provider unavailability. The operational requirement is to include CART booking as a step in the session scheduling workflow, not as a day-before afterthought.
- Real-time ASR in Zoom, Teams, and Google Meet does not meet 99% WCAG accuracy on technical training content. Platform ASR accuracy on technical L&D vocabulary ranges from 72–85%, depending on the platform and content type. This is below the 99% threshold for prerecorded video and below best-practice standards for live session accessibility. ASR is a reasonable Tier 2 fallback when CART is unavailable — not a primary compliance strategy.
- The multi-speaker Q&A problem requires a protocol, not just a caption approach. Unmic'd audience questions, simultaneous speaking, and rapid speaker transitions degrade both CART and ASR accuracy. The fix is procedural — all speakers identify themselves, questions are repeated by the moderator, and unmic'd speakers are asked to use a microphone or chat. These protocols must be scripted into the session facilitation plan, not improvised.
- CART provider briefing two to three business days before the session is the single highest-impact action you can take. Providing the CART provider with speaker names, a topic outline, a vocabulary list with phonetic spellings for technical terms, and an acronym expansion list allows them to pre-load your organizational terminology into their CAT software. On a technical DevOps training session, a briefed CART provider reaches 97–99%+ accuracy on domain vocabulary where an unbriefed one reaches 85–92%, a gap of roughly 5–14 percentage points.
- Microsoft Teams with Azure Custom Speech is the highest-accuracy real-time ASR option for enterprise users. Teams Live Captions through Azure Cognitive Services, when configured with a Custom Speech model for your organization's vocabulary, achieves 83–90% accuracy on technical training content — the best baseline among platform-native ASR options. This requires IT admin Azure subscription access and a one-time custom vocabulary file upload, but it is the most practical path to improving ASR quality without CART costs for every session.
What makes live training captioning different from async video
The real-time constraint eliminates post-production correction
Every caption quality intervention in async video captioning happens after the fact: the audio is processed, the transcript is generated, the glossary is applied, the timing is verified, the QA pass is run, and the corrected SRT file is uploaded to the LMS before the video is made accessible. None of these steps are available during a live session. When a CART provider misspells a term during a Zoom vILT session, the participant who reads that error in real time has already seen an incorrect caption. There is no correction loop that applies before the audience sees the output.
This means the quality gate moves upstream. For async video, quality is enforced in the production workflow — QA methodology applied to the caption file before LMS upload. For live training, quality is enforced in the preparation workflow — the CART briefing, the vocabulary pre-load, the session facilitation protocol, and the platform audio setup. Getting these right before the session is the only opportunity to influence live caption quality. The fundamental accuracy difference between live and prerecorded caption contexts is precisely this: live captioning has no correction window; prerecorded captioning has a correction window that can be as long as the organisation chooses to allow.
Two compliance deliverables, not one
An ILT or vILT session that is recorded creates two distinct caption obligations:
Deliverable 1: Real-time captions for live attendees. Present during the session, serving participants who are deaf or hard of hearing, or who benefit from reading along. Governed by ADA Title I "effective communication" requirements for employer-to-employee training, ADA Title II for public entities, Section 508 for federal agencies, and WCAG 2.1 SC 1.2.4 (Captions — Live) for organisations targeting WCAG AA conformance. This deliverable disappears when the session ends — it was never stored as a file; it was a real-time stream to participants' screens.
Deliverable 2: Caption file for the recording. After the session, the recording must carry a caption track before it is made available to the async audience. The recording is now a prerecorded video, and prerecorded video under WCAG 2.1 SC 1.2.2 (Captions — Prerecorded) requires captions that meet the 99% accuracy standard — the same requirement as any other training video in the LMS. This deliverable must be created after the session, using the session audio as input, and uploaded to the LMS as a sidecar before the recording is made accessible.
The error most L&D teams make is treating these as one deliverable. They book a CART provider, the CART provider generates a real-time stream during the session, and then — the common assumption — the CART transcript becomes the recording caption file automatically. It does not. The real-time CART stream is time-stamped to the clock time of the session (e.g., timestamps beginning at 14:00:00 for a 2pm session). The recording begins at 00:00:00. The timestamps do not align. A separate alignment and SRT-export step is required, which must be performed by the CART provider or the L&D team after the session. Organisations that skip this step either have no captions on their vILT recordings, or they have auto-captions generated by LMS ASR — which reintroduces the accuracy problem that CART was used to solve.
The provider-booking dependency
Async video captioning is on-demand: you upload audio, the system processes it, and you have a draft caption file in minutes to hours. Live session captioning with CART is schedule-dependent: you need to find an available CART provider, book them, and brief them in advance. Standard CART provider booking requires two to five business days of lead time. For recurring training programmes — weekly onboarding sessions, monthly all-hands Q&A calls, quarterly leadership training series — this means building CART booking into the session scheduling workflow at the programme level, not as a per-session logistics afterthought.
This scheduling dependency creates a fragility that async video captioning does not have. If a session is rescheduled on short notice, the CART provider may not be available at the new time. If the CART provider has a day-of emergency, a replacement must be found within hours. The operational answer is to establish a preferred CART vendor relationship with multiple available providers, not to rely on marketplace sourcing per session. The captioning RFP playbook applies to CART provider selection as much as to post-production vendors — and the contract provisions for CART include backup-provider clauses and cancellation protocols that matter specifically because live sessions cannot be cancelled at the last minute the way a video captioning order can.
The multi-speaker environment
Async training video is almost always single-speaker: one narrator records the module. The caption workflow is optimised for single-speaker audio. Live training is structurally multi-speaker: an instructor or facilitator presents, participants ask questions verbally, panel discussions involve four or more contributors who may talk over one another, and breakout-session debriefs involve multiple team members reporting out. Each additional speaker adds complexity to both CART and ASR captioning: speaker identification requires more cognitive bandwidth from a CART operator, and ASR speaker diarization degrades with rapid speaker transitions and overlapping speech.
The multi-speaker problem is compounded by the microphone architecture of live training. In a vILT session, most participants are on their own audio devices with varying microphone quality; in an in-room ILT session, the room may have one or two microphones serving an audience of thirty people. Audience members who ask questions without a microphone — the most common case in in-room training — produce audio that is too quiet for accurate ASR and too muffled for CART operators to hear clearly via a remote audio feed.
The three captioning approaches: CART, real-time ASR, and hybrid
CART (Communication Access Realtime Translation)
CART is human stenographic captioning delivered in real time. A trained CART provider (a stenographer certified in CART captioning) listens to the session and types using steno software at speeds of 225–280 words per minute, which is fast enough to keep up with natural speech in real time. The typed steno output is converted by CAT (Computer-Aided Transcription) software into readable text and streamed to the session participants' displays: on a dedicated screen for in-room ILT, or via a URL feed (StreamText or a similar platform) for vILT participants.
CART accuracy on well-briefed sessions: 98–99%+. The "well-briefed" qualifier is critical — an unbriefed CART provider working with a technical DevOps training module or a compliance training session heavy in regulatory acronyms will produce 85–92% accuracy on domain vocabulary. Briefed with the organizational vocabulary list (see glossary section), accuracy on those same terms rises to 97–99%+. General conversational vocabulary is typically 99%+ regardless of briefing. The accuracy gap between CART and real-time ASR is most pronounced on domain vocabulary — the terms that matter most in L&D content.
CART cost structure (US market, 2025–2026): remote CART providers (the CART operator works from their own location, receiving the audio feed via phone or Zoom) typically charge $100–$175 per hour with a one-hour minimum. Onsite CART providers (the operator is physically present in the training room) charge $175–$350 per hour plus travel expenses. Half-day and full-day rates are available from most agencies. Emergency/same-day rates typically carry a 50–100% premium over standard rates. For recurring programmes with weekly or monthly sessions, annual contracts with CART agencies provide predictable pricing and guaranteed availability.
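For budgeting, the rate structure above reduces to simple arithmetic. The sketch below is an estimate under stated assumptions: time beyond the one-hour minimum is billed pro rata (many providers bill in fixed increments instead), the emergency premium applies to the hourly charge only, and the function name and parameters are illustrative rather than any vendor's pricing model.

```python
def cart_cost(duration_hours, hourly_rate, emergency_premium=0.0,
              minimum_hours=1.0, travel=0.0):
    """Estimate a CART booking cost in dollars.

    emergency_premium is a fraction (0.5 means 50% above the standard rate).
    Assumption: hours beyond the minimum are billed pro rata.
    """
    billable_hours = max(duration_hours, minimum_hours)
    return billable_hours * hourly_rate * (1 + emergency_premium) + travel

# A 90-minute remote session at the low and high ends of the quoted range.
standard_low = cart_cost(1.5, 100)
standard_high = cart_cost(1.5, 175)
# The same session booked at emergency rates with a 100% premium.
emergency_high = cart_cost(1.5, 175, emergency_premium=1.0)

print(f"Standard: ${standard_low:.2f} to ${standard_high:.2f}")
print(f"Emergency worst case: ${emergency_high:.2f}")
```

Multiplying the standard figure by the number of sessions in a series gives a rough annual baseline to compare against an agency contract quote.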
CART provider types: individual CART providers (freelancers listed on NCRA's member directory, ACPCART.com, or their own websites), CART agencies (which aggregate individual providers and handle scheduling, backup, and billing — better for enterprise volume and reliability), and full-service captioning vendors (Rev, 3Play Media, Verbit, AI-Media, and others that offer CART services alongside post-production captioning — useful for organizations that want a single vendor for both live and async captioning).
Real-time ASR (platform-native automatic speech recognition)
Every major vILT platform (Zoom, Teams, Google Meet, Webex, BigBlueButton, Adobe Connect) offers real-time ASR captions natively, at no per-session cost beyond the base platform license. The accuracy on standard conversational speech is typically 82–92%, depending on the platform and audio quality. The accuracy on technical L&D vocabulary — regulatory acronyms, product names, clinical terminology, engineering identifiers — drops to 65–85% without vocabulary customization.
The Whisper accuracy benchmarks by vertical cover the baseline ASR accuracy that underlies most platform captioning (Whisper is the engine behind many platform caption systems). Live ASR has additional accuracy challenges beyond the static benchmark: real-time processing has less context window than post-processing (the model processes in shorter chunks with less surrounding context to disambiguate), audio quality from live microphones is typically lower than studio-recorded audio, and multi-speaker environments introduce speaker-confusion errors that rarely appear in single-speaker post-production captioning.
The compliance status of auto-captions for WCAG and ADA covers the regulatory position in detail: platform ASR, at current accuracy levels, does not meet the 99% WCAG 2.1 AA accuracy threshold for prerecorded video under SC 1.2.2, and does not meet best-practice standards for live session accessibility under SC 1.2.4. Using platform ASR as the primary and only captioning approach for a live training session that includes deaf or hard-of-hearing participants is an ADA risk for both Title I and Title II organizations.
Platform ASR is nonetheless valuable as a Tier 2 fallback (when CART is unavailable), as a supplementary display for hearing participants who benefit from reading along, and as the starting point for post-event recording caption processing. The key is not to rely on it as the compliance solution for live sessions where deaf or hard-of-hearing participants are present.
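This guide quotes accuracy percentages without fixing a measurement method. One common convention, assumed here, is word-level accuracy: one minus the word error rate, where errors are the substitutions, insertions, and deletions needed to turn the caption text into a verified reference transcript. The sketch below computes it with a standard edit-distance table; the function name and sample sentences are illustrative.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, comparing caption text against a verified transcript.

    Uses word-level Levenshtein distance (substitutions, insertions, deletions).
    Case is ignored; punctuation is not stripped in this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 1 - prev[-1] / len(ref)

reference = "run kubectl get pods to list the running pods"
asr_output = "run cube control get pods to list the running pods"
accuracy = word_accuracy(reference, asr_output)
print(f"{accuracy:.1%}")
```

A single misheard technical term costs two word errors in this nine-word sentence, which is why domain vocabulary has an outsized effect on measured accuracy for short technical segments.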
Hybrid: CART for live session, ASR for supplementary display, CART transcript for recording
The hybrid approach uses a CART provider to deliver the primary real-time caption stream, platform ASR as a secondary display (available to all participants without requiring the CART stream URL), and the CART operator's post-session transcript as the source for the recording caption file. This approach provides CART-level accuracy to live session participants who access the CART stream, ASR-level accessibility to participants who use the platform's native caption display, and a CART-quality caption source for the post-event recording without requiring a separate post-production captioning step.
The hybrid approach adds operational complexity — participants need to know to access the CART stream URL rather than relying on the platform's native captions, and the session coordinator must manage two caption systems simultaneously. But it is the most effective approach for technically demanding training content where post-event recording accuracy is important, because the CART transcript (after timestamp alignment) is already at 98–99% accuracy before any glossary correction step.
| Approach | Live accuracy (technical content) | Cost | Lead time | Recording caption source |
|---|---|---|---|---|
| CART (remote) | 97–99%+ (briefed) | $100–$175/hour | 2–5 business days | CART transcript (alignment required) |
| CART (onsite) | 98–99%+ (briefed) | $175–$350/hour + travel | 2–5 business days | CART transcript (alignment required) |
| Real-time ASR (Zoom) | 72–82% (technical) | Included in Zoom license | None (instant) | ASR output or re-process recording audio |
| Real-time ASR (Teams) | 75–85% (standard); 83–90% (Custom Speech) | Included in Teams license; Azure Custom Speech adds cost | None (instant) | Teams transcript or re-process recording audio |
| Real-time ASR (Google Meet) | 72–80% (technical) | Included in Google Workspace | None (instant) | Re-process recording audio |
| Hybrid (CART + ASR) | 98–99%+ (CART stream) | CART cost + platform license | 2–5 business days (CART) | CART transcript (alignment required) |
CART provider sourcing, booking, briefing, and day-of protocol
Sourcing: where to find CART providers
The four sourcing channels for CART providers are: individual freelance providers, CART agencies, full-service captioning vendors, and the GSA schedule (for federal agencies).
Individual freelance providers are typically listed on NCRA's (National Court Reporters Association) member directory, ACPCART.com (the Association of Certified Practitioners in Captioning and CART), and their own professional websites. Individual providers offer the most direct working relationship and often develop deep familiarity with a client's vocabulary over time. The limitation is scheduling availability — a single provider may not be available for recurring weekly sessions, and illness or personal emergency on the day of the session leaves no backup.
CART agencies aggregate multiple individual providers, handle scheduling across their roster, and maintain backup coverage when a primary provider is unavailable. Agencies typically charge a coordination fee (10–25% above individual provider rates) but provide scheduling reliability that individual providers cannot. For recurring training programmes with defined session calendars, a CART agency relationship is the operationally sound choice. Key questions when evaluating CART agencies: average provider accuracy on technical content, backup coverage protocol for day-of cancellations, turnaround for post-session transcript delivery, and glossary pre-loading process.
Full-service captioning vendors (companies that provide both post-production captioning and CART services) are useful for organizations that want consolidated billing and a single vendor relationship for live and async captioning. Rev, 3Play Media, Verbit, and AI-Media all offer CART services in addition to their post-production offerings. The advantage is a unified glossary architecture — the vocabulary you maintain for post-production captioning can be applied to CART briefings from the same vendor. The limitation is that full-service vendors sometimes outsource their CART provision to third-party providers, adding a coordination layer that can reduce briefing-to-provider fidelity. Verify whether the vendor's CART provision uses in-house staff or a contractor network, and confirm the briefing-to-provider handoff process.
The GSA Multiple Award Schedule (which absorbed the former IT Schedule 70) includes CART and captioning services, which federal agencies can use for competitive procurement. State and local government agencies subject to ADA Title II can reference the GSA pricing as a benchmark even when they are not GSA-eligible.
Booking: timeline and lead-time requirements
Standard booking lead time for CART providers is two to five business days. This lead time covers: provider scheduling confirmation, receipt of the session briefing materials, CAT software vocabulary pre-loading, and any coordination for special requirements (remote audio setup, StreamText platform configuration, multi-language needs).
For emergency booking (24–48 hours), most CART agencies maintain an emergency rate schedule at 50–100% above standard rates. Plan for emergency bookings in three situations: the booked CART provider is ill on the day and the agency's backup coverage has to step in, a session was added to the calendar without following the standard scheduling workflow, or an accommodation request arrives close to the session date.
Building CART booking into the session scheduling workflow: the most reliable approach is to establish a programme-level CART reservation for recurring sessions — booking the CART provider for all sessions in a series at the start of the series, with a cancellation window (typically 48–72 hours) for individual sessions that are rescheduled. This eliminates per-session sourcing friction and ensures provider availability for the entire programme calendar.
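To make the lead times concrete in a scheduling workflow, the deadlines can be derived from the session date. This is a minimal sketch that counts Monday to Friday as business days and ignores public holidays; the helper name is hypothetical, and the five-day and three-day values are the conservative ends of the booking and briefing ranges this guide gives.

```python
from datetime import date, timedelta

def subtract_business_days(day: date, n: int) -> date:
    """Step back n weekdays (Mon-Fri) from day. Public holidays are ignored."""
    current = day
    remaining = n
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current

session_date = date(2026, 6, 15)  # a Monday
booking_deadline = subtract_business_days(session_date, 5)
briefing_deadline = subtract_business_days(session_date, 3)

print("Book CART by:     ", booking_deadline.isoformat())
print("Send briefing by: ", briefing_deadline.isoformat())
```

Adding these two dates as calendar tasks when the session is first scheduled is one way to keep CART booking from becoming a day-before afterthought.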
Briefing: what to provide and when
The CART provider briefing is the highest-leverage preparation step. A well-briefed CART provider on a technical training session achieves 97–99%+ accuracy on domain vocabulary; an unbriefed provider on the same session achieves 85–92%. The briefing materials should be delivered two to three business days before the session (not the morning of the session — the provider needs time to pre-load vocabulary into their CAT software before the session date).
Required briefing materials:
Speaker names and titles. Full names (including pronunciation of any non-English names), job titles, organization names. CART operators can pre-load speaker names into their CAT macros to reduce errors when a name is repeated frequently during a session.
Session topic outline. A one-page overview of the session structure: main topics, subtopics, and approximate timing for each segment. This helps the CART provider anticipate vocabulary domains (a compliance training session will use different domain vocabulary than a DevOps technical training) and prepare transitions between speakers and topics.
Vocabulary list with phonetic spellings. This is the most impactful briefing element. The list should include: product names, SDK identifiers, regulatory acronyms, clinical terminology, organizational proper nouns, and any specialized vocabulary that the session instructor uses frequently. For each term, include the phonetic spelling if the pronunciation is not obvious from the spelling. Examples: "OAuth" (pronounced "oh-auth"), "SCIM" (pronounced "skim"), "kubectl" (pronounced "kube-control"), "HIPAA" (pronounced "hip-uh"), "Panopto" (pronounced "puh-nop-toe"). The phonetic spelling is what the CART operator's CAT software uses to match steno strokes to the correct term output.
Acronym expansion list. Every acronym used in the session, with its full expansion and the context in which it appears. "LMS" expands to "Learning Management System"; "SCORM" expands to "Sharable Content Object Reference Model"; "DCMP" expands to "Described and Captioned Media Program." CART operators cannot expand acronyms automatically without this list — they will either type the acronym as-is or guess the expansion incorrectly.
Presentation materials. If the session uses a slide deck, share it. CART operators use slide content as context to anticipate what vocabulary will appear next — a slide with "WCAG 2.1 AA SC 1.2.2" on it tells the operator to expect those identifiers in the next minute of audio. If the slide deck is confidential, share only the outline or a non-confidential version.
The organizational glossary architecture post covers how to build and maintain the vocabulary list that feeds into CART briefings — the same glossary that drives accuracy improvement in post-production captioning is the source material for the CART briefing list.
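One way to keep the vocabulary list and acronym expansions in a single maintainable source is a small structured glossary that renders to a plain-text briefing sheet. The sketch below is illustrative only: the `GlossaryEntry` fields and the sheet layout are assumptions, not a format any CART provider requires, so confirm the preferred format with the provider.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GlossaryEntry:
    term: str
    phonetic: Optional[str] = None   # how the term is spoken in the session
    expansion: Optional[str] = None  # full form, for acronyms

def briefing_sheet(entries) -> str:
    """Render glossary entries as an alphabetised plain-text briefing sheet."""
    lines = ["VOCABULARY (term | spoken as | expansion)"]
    for entry in sorted(entries, key=lambda e: e.term.lower()):
        parts = [entry.term]
        if entry.phonetic:
            parts.append(f'spoken "{entry.phonetic}"')
        if entry.expansion:
            parts.append(f"= {entry.expansion}")
        lines.append(" | ".join(parts))
    return "\n".join(lines)

entries = [
    GlossaryEntry("SCIM", phonetic="skim"),
    GlossaryEntry("kubectl", phonetic="kube-control"),
    GlossaryEntry("LMS", expansion="Learning Management System"),
    GlossaryEntry("HIPAA", phonetic="hip-uh"),
]
sheet = briefing_sheet(entries)
print(sheet)
```

Because the same entries can feed post-production glossary correction, keeping them in one structure avoids maintaining separate lists for live and async captioning.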
Day-of protocol: audio setup, stream delivery, and coordinator role
The day-of setup for remote CART involves four components: audio feed quality, CART stream delivery, session coordinator communication, and CART display for participants.
Audio feed for remote CART. The most reliable audio feed for a remote CART provider is a dedicated phone bridge: a separate phone line that the CART operator joins, receiving the session audio via phone (not the vILT platform audio). Phone bridges have more consistent audio quality than Zoom or Teams microphone audio, which varies with participant hardware. For vILT sessions where a phone bridge is impractical, the CART provider should be added to the session as a silent participant with audio-only access, and a room speaker (or platform audio) should be directed to the provider. The CART provider should test the audio feed 10–15 minutes before the session starts.
CART stream delivery. For vILT sessions: the CART provider sends participants a StreamText URL (or equivalent platform) where the real-time caption stream is accessible in a browser. This URL should be distributed to participants in the session invitation and repeated in the session chat at the start. For in-room ILT: the CART stream can be displayed on a dedicated monitor positioned so captioned content is visible to the audience (not the same projector screen as the presentation) or delivered to individual devices (tablets, laptops) for participants who need the accommodation.
Session coordinator communication. The coordinator (whoever is managing the session logistics) should maintain a direct private channel (chat, Slack, text message) with the CART provider during the session. This channel is used for: session start confirmation, speaker changes (alert the CART provider when a new speaker is taking over), breaks (so the CART operator can pause), Q&A windows (so they can prepare for multi-speaker input), and any technical issues (audio drops, platform changes).
Post-session deliverable SLA. Confirm the CART provider's post-session transcript delivery SLA before the session. Standard turnaround for the aligned transcript file (formatted for post-event recording SRT export) is 24–48 hours. If the recording will be published the same day as the session, negotiate a same-day transcript delivery option at session booking.
Real-time ASR by platform: accuracy, setup, and configuration
Zoom: AI Companion and native live captions
Zoom provides real-time captions through two mechanisms: the "Closed Captioning" feature (which can be manually typed by a designated host participant or auto-generated by Zoom's AI) and Zoom AI Companion (automatic transcription and summary, available with add-on license). Zoom's real-time ASR for captions uses a proprietary speech recognition model. Accuracy on standard conversational speech: 83–88%. Accuracy on technical L&D vocabulary: 72–82%.
Enabling real-time captions in Zoom: an account administrator must enable "Automated Captions" in the Zoom Admin portal (Account Settings → Meetings → "Automated Captions"). Once enabled, meeting hosts can toggle captions on during the session. Participants can show or hide the caption display in their Zoom client. The caption display in Zoom shows approximately 2–3 lines of real-time text at the bottom of the video area.
Custom vocabulary in Zoom: Zoom's native ASR does not support a real-time custom vocabulary model (as of mid-2026). The practical options for improving Zoom live caption accuracy on technical content are: (1) use a third-party real-time ASR integration via Zoom Apps (Rev.ai's Zoom integration supports glossary configuration for live captions); (2) accept baseline ASR accuracy for the live session and re-process the recording through a higher-accuracy workflow for the post-event caption file.
Zoom recording and caption extraction: Zoom Cloud recordings store a transcript (.vtt file) when AI Companion or full transcript recording is enabled. This .vtt file is accessible from the Zoom portal's recording page. The .vtt transcript reflects the real-time ASR output — accuracy is the same as the live session. For the post-event recording, this .vtt file provides a starting point, but it will require accuracy correction (glossary application, error review) before it meets the 99% WCAG standard. See the Zoom caption workflow for training videos for the full LMS delivery process.
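The glossary-application part of that correction can be partly automated before human review. The sketch below is one possible approach, not Zoom's or any vendor's tooling: it replaces known misrecognitions in cue text while leaving the `WEBVTT` header and timing lines untouched. The correction pairs are hypothetical examples, and blanket replacements can introduce new errors, so the error-review pass is still required.

```python
import re

def apply_glossary(vtt_text: str, corrections: dict) -> str:
    """Replace known misrecognitions in cue text of a WebVTT transcript.

    corrections maps the wrong phrase (matched case-insensitively on word
    boundaries) to the correct term. Timing lines and the header are skipped.
    """
    patterns = [
        (re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE), right)
        for wrong, right in corrections.items()
    ]
    fixed_lines = []
    for line in vtt_text.splitlines():
        if "-->" in line or line.startswith("WEBVTT"):
            fixed_lines.append(line)
            continue
        for pattern, right in patterns:
            # A function replacement keeps backslashes in `right` literal.
            line = pattern.sub(lambda _match: right, line)
        fixed_lines.append(line)
    return "\n".join(fixed_lines)

vtt = (
    "WEBVTT\n\n"
    "00:00:01.000 --> 00:00:04.000\n"
    "Run cube control get pods to list them.\n\n"
    "00:00:04.500 --> 00:00:07.000\n"
    "Oh auth tokens protect hippa data.\n"
)
corrections = {"cube control": "kubectl", "oh auth": "OAuth", "hippa": "HIPAA"}
fixed = apply_glossary(vtt, corrections)
print(fixed)
```

Multi-word keys such as "cube control" are safer than single common words, which may be legitimate in other sentences.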
Microsoft Teams: Live Captions and Azure Custom Speech
Teams provides real-time Live Captions (in-meeting caption display) and Meeting Transcription (full searchable transcript). Both use Azure Cognitive Services speech recognition. Teams is the highest-accuracy platform-native ASR option for enterprise users, partly because Azure Custom Speech allows vocabulary customization.
Accuracy benchmarks: Teams Live Captions on standard conversational speech: 87–92%. On technical L&D vocabulary (engineering, compliance, medical) without vocabulary customization: 75–85%. With Azure Custom Speech model configured: 83–90% on technical content — the best baseline among platform-native options.
Enabling Live Captions: Teams Meeting hosts enable captions from the meeting controls (More actions → Language and speech → Turn on live captions). Meeting participants can turn on their own caption display. Language detection is automatic. Transcription (full transcript for the entire meeting, not just real-time display) must be enabled separately by the meeting organizer.
Azure Custom Speech configuration: this is the IT admin step that significantly improves Teams ASR accuracy on domain vocabulary. The process: an IT admin with Azure subscription access creates an Azure AI Speech resource, uploads a custom vocabulary file (a text file containing organizational terms, proper nouns, and phrases — one per line), and links the Azure Speech resource to the Microsoft 365 tenant. Custom Speech models process the same session audio but with the custom vocabulary as a weighting bias, improving recognition of terminology that appears infrequently in the general training corpus. For L&D teams with consistent domain vocabulary, the Custom Speech configuration is a one-time IT setup that improves all subsequent Teams sessions.
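Preparing the vocabulary file itself is straightforward. The sketch below builds the one-term-per-line text described above from a raw term list, collapsing stray whitespace and dropping blanks and case-insensitive duplicates. The function name is hypothetical, and any limits on file size, encoding, or phrase format are not covered here and should be checked against the service's current documentation.

```python
def vocabulary_file_text(terms) -> str:
    """Return one cleaned term per line, without blanks or duplicates.

    Duplicates are detected case-insensitively; the first spelling wins.
    """
    seen = set()
    lines = []
    for term in terms:
        cleaned = " ".join(term.split())  # collapse internal whitespace
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            lines.append(cleaned)
    return "\n".join(lines) + "\n"

raw_terms = ["kubectl", "  SCIM   provisioning ", "Kubectl", "", "GlossCap"]
vocab_text = vocabulary_file_text(raw_terms)
print(vocab_text, end="")

# To save for upload (the UTF-8 encoding and file name are assumptions):
# from pathlib import Path
# Path("custom_vocabulary.txt").write_text(vocab_text, encoding="utf-8")
```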
Teams transcript for recording captions: the Teams Meeting transcript (.vtt) is stored in the organizer's OneDrive or SharePoint (depending on configuration). The recording itself is stored in OneDrive/SharePoint. When a Teams recording is downloaded and uploaded to an LMS, the .vtt transcript file does not automatically come with it — it must be separately downloaded from the Teams recording page and uploaded to the LMS as a caption sidecar. The transcript accuracy in the downloaded .vtt reflects the live session ASR quality, which requires post-processing correction for technical content. See the Microsoft Teams caption workflow for training videos for the full LMS sidecar delivery process.
Google Meet: real-time captions and transcript
Google Meet provides real-time captions via Google's proprietary speech recognition (the same engine used in Google Voice, YouTube auto-captions, and Google Assistant). Captions are enabled by the participant (CC button in Meet) or can be turned on by the meeting host. Accuracy on standard conversation: 82–88%. Accuracy on technical L&D vocabulary: 72–80% — the lowest baseline among major vILT platforms, reflecting less vocabulary customization capability.
Custom vocabulary: Google Meet does not support real-time custom vocabulary configuration (as of 2026). There is no equivalent to Azure Custom Speech for Meet. Accuracy improvement for technical content requires post-processing the recording audio.
Transcript and recording: Google Workspace Business Plus and above include meeting transcription saved to Google Drive. The transcript reflects the real-time ASR quality. Google Meet recordings are saved to Google Drive in .mp4 format. The transcript is a separate Google Doc, not a standard .srt or .vtt file — it must be converted (typically using a script or export tool) before it can be uploaded as an LMS caption sidecar. See the Google Meet caption workflow for training videos for the conversion and delivery process.
Webex: Webex Assistant and Live Transcription
Webex provides real-time captions via Webex Assistant (Cisco's AI engine, using Azure Cognitive Services for speech recognition in many regions). Accuracy on standard conversation: 83–90%. Accuracy on technical L&D vocabulary: 72–81%. Vocabulary customization: Webex Assistant supports limited vocabulary customization via Webex Meetings settings ("Custom Words" feature, accessible from the Webex Admin portal), effective for named entities and common domain terms — functionally similar to Azure Custom Speech but with less granular control.
Webex recording captions: Webex Cloud recordings include a generated transcript accessible from the Webex portal. The transcript file format is .vtt and can be downloaded from the recording page. When the recording is downloaded for LMS upload, the .vtt file must be separately downloaded and uploaded as a sidecar. For Webex-to-LMS workflows, the recording and caption file are two separate downloads from the Webex portal. See the Webex caption workflow for training for the full delivery process.
BigBlueButton: Whisper integration and CART participant mode
BigBlueButton (BBB) is widely used in university LMS environments (Moodle, Canvas) for virtual classroom delivery. BBB supports two real-time caption modes: automated captioning via Whisper integration (available in BBB 3.0+) and CART participant mode (a designated participant joins the session and types captions using a captioning interface visible to other participants).
Whisper-based BBB captions: accuracy on standard speech reflects the Whisper model performance documented in the Whisper accuracy benchmarks by vertical. BBB Whisper integration uses the base or small Whisper model (depending on server configuration), which produces 75–88% accuracy on technical content. Server administrators can configure the model size (larger models produce better accuracy at higher compute cost). Custom vocabulary in BBB Whisper is not natively supported in standard BBB installations — server-level customization is possible for technical operators.
BBB CART participant mode: a CART provider joins the session as a participant, selects the "Closed Captioning" role, and types captions in the BBB captioning interface. The typed captions are displayed in real time to other participants. Accuracy equals the CART provider's performance — same as any CART session. This is the most practical approach for BBB sessions requiring CART-level accessibility, since remote CART providers are familiar with the BBB captioning participant mode.
BBB recordings: BBB recordings are generated on the BBB server as playback files (not .mp4 by default) with associated caption files when captioning was active. The recording format depends on server configuration. For LMS integration: BBB recordings in Canvas typically use the Canvas Media feature or a Kaltura integration; in Moodle, BBB recordings link directly in the course page. Caption extraction from BBB recordings for LMS sidecar delivery requires a download/conversion step that varies by institution setup. See the BigBlueButton caption workflow for institution-specific configurations.
Adobe Connect: CC pod and AI-generated transcript
Adobe Connect provides two captioning paths: the Closed Captioning pod (which allows a CART provider to join as a participant and type captions visible in the pod to all attendees) and Adobe Connect Rooms' AI-generated transcript (introduced in 2024, using Azure Cognitive Services). The CC pod approach is the CART-equivalent for Adobe Connect — the CART provider joins as a named participant with captioner role and types in the CC pod interface. Accuracy equals CART provider performance.
Adobe Connect AI captions: accuracy on technical content: 74–83%. Limited vocabulary customization. The AI-generated transcript is accessible after the session from the Adobe Connect session summary. Adobe Connect recordings (.mp4 download from the Connect portal) must have caption files separately exported from the session and uploaded to the LMS. See the Adobe Connect caption workflow for the full process.
The multi-speaker Q&A problem and the protocols that fix it
Why Q&A is the hardest captioning scenario
The structured presentation portion of a live training session — one instructor presenting to the group from a prepared script — is the easiest captioning scenario: single speaker, prepared vocabulary, predictable pace. Q&A and discussion portions introduce every compounding failure condition: multiple speakers, unscripted vocabulary, rapid turn-taking, variable audio quality, unmic'd participants, and occasional simultaneous speaking.
For CART providers, Q&A requires sustained high-speed attention to speaker transitions that are harder to follow than uninterrupted presentation. The CART operator must identify who is speaking (often without a visual on remote CART), anticipate the register shift (from technical presentation to conversational question), and handle vocabulary that was not in the pre-briefed list (impromptu questions often introduce new proper nouns and organizational references that the CART provider has not pre-loaded).
For real-time ASR, Q&A compounds accuracy loss in three ways: speaker diarization errors increase with rapid transitions, audio quality from individual participant microphones varies widely, and the ASR model's sentence-boundary detection is less reliable when speakers interrupt each other or trail off mid-sentence. The accuracy drop from presentation-mode to Q&A mode in Zoom or Teams ASR is typically 5–12 percentage points — a session running at 80% accuracy during structured presentation may drop to 68–75% during live Q&A.
The unmic'd audience member problem
In an in-room ILT session, audience questions are the most common caption failure point. A participant raises their hand and asks a question without a microphone, and the CART provider (remote) or ASR (picking up room audio via the presenter's microphone) receives muffled, low-volume audio. ASR accuracy on unmic'd room audio is typically 50–70%, well below any compliance threshold. Remote CART providers cannot lip-read unmic'd speakers they cannot see, so the question may be missing entirely from the CART transcript.
The protocol that resolves this: require the session moderator to repeat every audience question before answering it. "The question is: [restate the question verbatim]." This is a facilitation practice that serves multiple functions — it confirms the moderator understood the question, it gives the room a second chance to hear the question, and it provides the CART provider and ASR with a clear, mic'd, single-speaker audio of the question content. This repetition practice should be scripted into the facilitator's opening instructions ("I'll be repeating each question for clarity before we answer it").
The vILT multi-participant Q&A problem
In a vILT session, every participant has their own microphone — which appears to solve the unmic'd problem. But participant microphone quality varies dramatically (laptop microphone vs external USB microphone vs phone speakerphone), and unmuted participant background audio (keyboard typing, ambient noise, children, pets) introduces additional audio confusion that degrades both CART and ASR accuracy. Most vILT facilitators ask participants to remain muted unless speaking — which helps but means the moderator must unmute participants for questions, adding a transition step where audio may drop.
For vILT Q&A sessions: the "raise hand and type question in chat" protocol is valuable not as a replacement for verbal Q&A but as a parallel channel. Participants type their questions in chat, which creates a written record that is automatically accessible in the session transcript. The moderator reads the chat question aloud before answering — which provides the CART provider or ASR with a clear, moderator-voiced version of the question. This approach also creates an accessibility benefit for participants who may not hear or caption-read a question submitted only verbally.
Panel discussions: the hardest live scenario
Panel discussions — four or more speakers who may speak consecutively or simultaneously — are the most challenging live training captioning scenario. ASR speaker diarization (identifying which speaker is producing which text) degrades significantly when more than two speakers are active, and completely breaks down when speakers overlap. CART providers can handle panel discussions when panel members are mic'd and identified, but rapid simultaneous speaking taxes any CART operator's capacity.
Panel-specific protocols: require panel members to identify themselves before speaking ("As [Name], I'd say..."), use a moderator-controlled turn-taking structure rather than open discussion, provide the CART provider with a panel seating chart or participant order (even for remote video panels), and limit simultaneous speaking through explicit facilitation. For CART sessions with panels of four or more people, brief the CART provider on the panel format in advance — a multi-speaker panel requires different CAT setup than a single-presenter session.
Speaker identification in the CART transcript and post-event recording
The CART provider's transcript typically includes speaker attribution — who said what — when the provider can identify speakers from the briefing materials and audio. This speaker attribution carries forward into the aligned transcript file used for the post-event recording caption track. Speaker-attributed captions in the recording (displayed as "[Speaker Name]: transcript text" or differentiated by position on screen) significantly improve the accessibility value of the recording for deaf and hard-of-hearing viewers, who rely on caption context to track multi-speaker conversations.
For ASR-captioned recordings, speaker diarization from Teams and Zoom assigns text to named participants when speaker identification is working correctly. In downloaded transcript files (.vtt), speaker names are typically embedded as cue annotations. Verify that speaker attribution in the downloaded .vtt is accurate before uploading to the LMS — diarization errors in the downloaded file are common when participant audio quality varied significantly during the session.
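One way to spot-check speaker attribution before upload is to count cues per speaker in the downloaded file. The sketch below assumes speakers are marked with WebVTT voice tags (`<v Name>`); some platforms instead write "Name: text" in the cue body, which this sketch does not handle. The function name and sample speakers are hypothetical.

```python
import re
from collections import Counter

# WebVTT voice span start tag, e.g. <v Dana Ruiz> or <v.loud Dana Ruiz>
VOICE_TAG = re.compile(r"<v(?:\.[^\s>]+)?\s+([^>]+)>")
# Cue timing line, e.g. 00:00:01.000 --> 00:00:04.000 (hours are optional)
TIMING_LINE = re.compile(r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+")


def speaker_counts(vtt_text):
    """Count cues per speaker; cues without a voice tag are counted under None."""
    counts = Counter()
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        timing_index = next(
            (i for i, line in enumerate(lines) if TIMING_LINE.match(line)), None
        )
        if timing_index is None:
            continue  # header, NOTE, or STYLE block rather than a cue
        payload = " ".join(lines[timing_index + 1:])
        match = VOICE_TAG.search(payload)
        counts[match.group(1).strip() if match else None] += 1
    return counts


sample = """WEBVTT

00:00:01.000 --> 00:00:04.000
<v Dana Ruiz>Welcome to the session.</v>

00:00:04.500 --> 00:00:07.000
<v Sam Lee>Can you repeat the deadline?</v>

00:00:07.200 --> 00:00:09.000
The deadline is Friday.
"""

for speaker, total in speaker_counts(sample).items():
    print(speaker or "(no speaker tag)", total)
```

A large share of untagged cues, or a participant with far more cues than they actually spoke, is a signal to review the attribution manually.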
The post-event recording: why it needs its own caption track
The timestamp misalignment problem
The most fundamental reason the live session caption stream cannot directly become the recording caption file is timestamp misalignment. Real-time captions (whether CART or ASR) are timestamped to the clock time of the session. A Zoom session that starts at 2:00:00 PM EDT has its first caption event timestamped at approximately 14:00:05 (in CART transcript files) or at a Zoom-internal clock equivalent. The recording's video timecode begins at 00:00:00.000 when the recording starts. These two time references do not share an origin: the CART transcript timestamp of 14:00:05 does not, on its own, tell you where to place that caption in a video file that begins at 00:00:00.
Converting a CART transcript to a recording-synchronized caption file requires a one-time offset calculation: subtract the recording's clock start time from every CART timestamp. If the session began at 14:00:00 and the recording started at 14:02:17 clock time (two minutes and seventeen seconds after the session began), a caption timestamped 14:05:00 lands at 00:02:43 in the recording, and any caption that ends before 14:02:17 falls outside the recording and is dropped. (If the transcript is instead timestamped relative to the session start, the equivalent adjustment is to subtract 02:17.) The CART provider can perform this alignment when delivering the post-session transcript, or the L&D team can perform it using a transcript editor. The key operational requirement is knowing the exact recording start time, which must be noted during the session (or extracted from the video file metadata).
The recording start time and session start time frequently differ because: hosts start recordings mid-session (after housekeeping and introductions), participants join a running session that has been recorded from the start, technical issues cause a re-start of recording, or Zoom/Teams start recording automatically at a pre-set delay. The coordinator should log the exact recording start time during the session (or designate a specific moment like "I'm starting the recording now at 2:03 PM") to make the alignment step reliable.
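The offset step can be sketched as follows. This assumes the CART transcript carries wall-clock timestamps in HH:MM:SS form, that the session does not cross midnight, and that the logged recording start time is accurate; the function names and cue text are hypothetical.

```python
from datetime import datetime, timedelta


def parse_clock(value):
    """Parse an HH:MM:SS wall-clock time (the date part is irrelevant here)."""
    return datetime.strptime(value, "%H:%M:%S")


def srt_timestamp(delta):
    """Format a non-negative timedelta as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = delta // timedelta(milliseconds=1)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


def align_to_recording(cues, recording_start):
    """Shift clock-time cues onto the recording timeline and return SRT text.

    cues: iterable of (start_clock, end_clock, text) tuples.
    Cues that end before the recording started are dropped.
    """
    reference = parse_clock(recording_start)
    zero = timedelta(0)
    blocks = []
    for begin, end, text in cues:
        rel_begin = parse_clock(begin) - reference
        rel_end = parse_clock(end) - reference
        if rel_end <= zero:
            continue  # spoken before the recording began
        rel_begin = max(rel_begin, zero)  # clamp cues that straddle the start
        index = len(blocks) + 1
        blocks.append(
            f"{index}\n{srt_timestamp(rel_begin)} --> {srt_timestamp(rel_end)}\n{text}\n"
        )
    return "\n".join(blocks)


cart_cues = [
    ("14:00:05", "14:00:09", "Welcome, everyone."),
    ("14:02:20", "14:02:24", "Let's begin module one."),
    ("14:05:00", "14:05:03", "Open the compliance dashboard."),
]
aligned_srt = align_to_recording(cart_cues, recording_start="14:02:17")
print(aligned_srt)
```

With a recording start of 14:02:17, the first cue is dropped and the 14:05:00 cue lands at 00:02:43 in the recording. The output should still be spot-checked against the video, since a logged start time that is off by a few seconds shifts every caption by the same amount.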
ASR transcript from the live session vs re-processing the recording audio
For sessions captioned with CART, the post-event transcript is a clean document: the CART operator's text output, aligned and formatted as SRT. For sessions that used platform ASR only, the post-event workflow has two options:
Option 1: Download and use the platform's ASR transcript (.vtt from Zoom, .vtt from Teams, Google Doc from Meet). This preserves the real-time ASR output as the recording caption file. Accuracy reflects the live session ASR quality — 72–85% on technical content — which does not meet the 99% WCAG threshold for prerecorded video. This option is operationally simple but compliance-incomplete for technical training content.
Option 2: Re-process the recording audio through a higher-accuracy captioning workflow. After the session, download the recording file (.mp4 from Zoom, .mp4 from Teams), submit it to a post-production captioning workflow with organizational glossary applied. This produces a new caption file (.srt or .vtt) at the post-production accuracy level — 97–99%+ with glossary correction for domain vocabulary. This is the recommended path for any technically demanding training content where the 99% WCAG threshold applies to the recording. The compliance requirement for the prerecorded recording is clear: the recording is not exempt from WCAG SC 1.2.2 because it was originally a live session.
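To compare the two options on a sample of the recording, a reviewer can score a caption excerpt against a hand-corrected reference. The sketch below treats accuracy as one minus the word error rate (word-level edit distance divided by reference length). That definition is an assumption, since the accuracy figures above do not specify a measurement method, and the sketch ignores punctuation and formatting. Names and sample sentences are hypothetical.

```python
def word_accuracy(reference, hypothesis):
    """Return 1 - WER, using word-level Levenshtein distance (case-insensitive)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    errors = previous[-1]
    return max(0.0, 1 - errors / len(ref))


def meets_threshold(reference, hypothesis, threshold=0.99):
    """True when the measured accuracy reaches the chosen threshold."""
    return word_accuracy(reference, hypothesis) >= threshold


reference = "submit the SOX attestation form before the quarterly audit begins"
asr_output = "submit the socks attestation form before the quarterly audit begins"
print(round(word_accuracy(reference, asr_output), 3))
print(meets_threshold(reference, asr_output))
```

One wrong word in a ten-word excerpt scores 0.9, which shows how quickly misrecognized domain terms pull a transcript below a 99% target.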
The post-event re-processing option adds approximately $0.50–$3.00 per recorded minute (depending on the captioning vendor and accuracy level — the same cost structure as post-production captioning vendors). For a 60-minute vILT session, this adds $30–$180 to the session cost. The alternative — publishing a vILT recording with 72–85% ASR accuracy on technical vocabulary — creates an ongoing ADA and WCAG compliance exposure for every subsequent viewer of the recording, which over a six-month archive window may represent dozens or hundreds of learner hours under non-compliant captions.
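The cost arithmetic above is simple enough to script when budgeting a session series. This sketch uses the per-minute range quoted in this section as default values; the function name is hypothetical, and actual vendor rates will differ.

```python
def reprocessing_cost_range(recorded_minutes, low_rate=0.50, high_rate=3.00):
    """Return (low, high) re-processing cost in dollars for a recording."""
    if recorded_minutes < 0:
        raise ValueError("recorded_minutes must be non-negative")
    return (recorded_minutes * low_rate, recorded_minutes * high_rate)


single_session = reprocessing_cost_range(60)
print(single_session)  # (30.0, 180.0)

# A hypothetical series of twelve 60-minute sessions.
series = reprocessing_cost_range(12 * 60)
print(series)  # (360.0, 2160.0)
```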
The dual-deliverable SLA
From an operational standpoint, the post-event recording workflow requires a defined SLA that specifies when the recording will be made accessible and when the caption file will be ready. Common approaches:
24-hour publish window. The recording is published to the LMS within 24 hours of the session. The caption file (CART-aligned transcript or re-processed ASR) is ready within the same 24 hours. This requires the CART provider's transcript delivery SLA to be ≤24 hours, which is achievable for standard sessions.
Same-day publish. For sessions that must be available immediately (compliance training with a deadline, recorded all-hands that employees need to review by end of day), the recording should be published with the best available caption file — platform ASR transcript — and updated with the corrected CART-aligned or re-processed file within 24 hours. The LMS should support caption file replacement without requiring the video to be re-uploaded (most do — see the LMS caption ingestion workflow for platform-specific replacement procedures).
Pending-review hold. The recording is not published until the corrected caption file is ready. Operationally simpler, but may delay access for learners who attended the session and want to review the recording. Appropriate for high-stakes compliance training where caption accuracy is more important than same-day access.
Recording caption file formats and LMS upload
The post-event recording caption file should be delivered as .srt or .vtt, depending on the LMS. The caption format guide covers the format compatibility matrix across LMS platforms. For the majority of L&D LMS environments, .srt is the most widely compatible format, accepted natively by TalentLMS, Docebo, Absorb, Cornerstone OnDemand, and most other major platforms. Kaltura and Panopto accept .srt for sidecar upload. Workday Learning uses .vtt in some configurations. When in doubt, .srt is the safer choice for recording caption delivery.
vILT-specific workflow: platform recording delivery and LMS integration
Where recordings go after the session
Each vILT platform has its own recording storage architecture, which determines the manual steps required to move the recording and its caption file into the LMS. Understanding the storage location and file format for each platform is prerequisite to designing a reliable post-event caption delivery workflow.
Zoom: recordings go to Zoom Cloud (when Cloud Recording is enabled) or to the host's local computer (Local Recording). Zoom Cloud recordings are accessible from the Zoom portal under "Recordings." The recording (.mp4) and transcript (.vtt, if AI Companion was active) are separate files in the Zoom portal, and both must be downloaded and uploaded to the LMS separately. Alternatively, the Zoom recording can be shared via the Zoom portal sharing link and embedded in the LMS via iframe, though this approach does not reliably deliver the caption track to the learner's player in most LMS configurations. Direct download + LMS upload is the most reliable path for WCAG-compliant caption delivery.
Microsoft Teams: recordings go to the meeting organizer's OneDrive (by default) or to a SharePoint site (when configured). The recording appears in OneDrive with a .mp4 extension. The meeting transcript is stored in the same OneDrive folder as a separate file (.docx or .vtt depending on export format). When downloading for LMS upload: download both the .mp4 and the transcript file. In SharePoint-based storage configurations, IT admins control access to the recording — L&D teams should confirm their access to download recordings from SharePoint before the session series begins. The Teams caption workflow covers the LMS-specific upload process.
Google Meet: recordings go to the meeting organizer's Google Drive (when recording is enabled — available in Google Workspace Business Standard and above). The recording is a .mp4 file in Google Drive. The meeting transcript (when enabled) is a Google Doc in the same Drive folder — it must be exported as plain text and converted to .srt format before LMS upload. There is no native .srt export from Google Meet transcripts; conversion requires a third-party tool or a script. See the Google Meet caption workflow for conversion options.
Webex: recordings go to Webex Cloud (Webex Cloud Recording) or to the host's local computer. Webex Cloud recordings are accessible from the Webex portal under "Recordings." The recording (.mp4) and transcript (.vtt) are available for download from the portal. Webex's .vtt transcript includes speaker attribution when Webex Assistant identified speakers during the session. Direct download + LMS upload is the recommended workflow.
BigBlueButton: recordings are generated on the BBB server and are initially in the BBB playback format (a web-based playback page, not a standalone .mp4). For LMS delivery: BBB recordings must be exported or re-rendered as .mp4 files, which requires server-side action by a BBB administrator or the use of a third-party BBB recording export tool. Universities running BBB through Canvas or Moodle typically have LMS plugins that handle recording delivery within the LMS course page directly — but caption file delivery within these plugins varies by institution configuration. BBB Whisper caption files can be exported as .vtt when the BBB server is configured to support caption file extraction.
Adobe Connect: recordings go to the Adobe Connect Content Library (for Connect-hosted sessions) or to the host's local computer (local recording). The recording is available as .mp4 download from the Content Library. Adobe Connect's AI transcript is available from the session summary page. For LMS delivery: download the .mp4 and the transcript file separately. Adobe Connect has native LMS integrations (Moodle, Canvas, TalentLMS, Workday) via SCORM or LTI — recording delivery through these integrations varies by LMS setup. See the Adobe Connect caption workflow for integration-specific guidance.
LMS integration for vILT recordings: platform-by-platform
Cornerstone OnDemand: supports vILT module types with recording link and caption sidecar upload. The Cornerstone video player (built on Kaltura in some configurations) accepts .srt upload at the video-player level. For Zoom-sourced recordings: download Zoom .mp4 + CART-aligned .srt or re-processed caption file → upload both to Cornerstone's vILT recording field. Cornerstone's caption replacement workflow (replacing the ASR-quality caption file with the corrected file after post-processing) is done through the Content module settings, not the vILT session settings — confirm with your Cornerstone admin which caption field drives the player display. The LMS caption audit methodology covers how to verify caption display in the Cornerstone player.
TalentLMS: accepts video upload with .srt sidecar (the .srt must have the same filename as the .mp4, uploaded together in the course builder). For vILT recordings: download the .mp4 from Zoom/Teams/Meet → upload to TalentLMS course → add the .srt caption sidecar in the course unit settings. TalentLMS does not automatically fetch captions from the recording platform — the .srt must be explicitly uploaded. The LMS migration checklist covers how caption sidecar data is handled during TalentLMS course migrations.
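A pre-upload check for the filename-matching requirement can be sketched in a few lines. The function name and file names are hypothetical, and the check only compares names; it does not validate the caption file's contents.

```python
from pathlib import Path


def sidecar_matches(video_path, caption_path):
    """True when the .srt caption shares its base filename with the .mp4 video."""
    video = Path(video_path)
    caption = Path(caption_path)
    return (
        video.suffix.lower() == ".mp4"
        and caption.suffix.lower() == ".srt"
        and video.stem == caption.stem
    )


print(sidecar_matches("onboarding-session-03.mp4", "onboarding-session-03.srt"))
print(sidecar_matches("onboarding-session-03.mp4", "onboarding-session-03-final.srt"))
```

Running this over a download folder before upload catches the common case where a corrected caption file was saved with a suffix such as "-final" or "-v2".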
Docebo: uses an integrated video player (or Kaltura integration for enterprise customers). For vILT recordings: the recording is typically uploaded to the Docebo media library, and the caption file is added as a sidecar in the media library settings. Docebo's Kaltura integration handles caption delivery at the Kaltura level — uploading .srt to Kaltura is the preferred workflow for Kaltura-integrated Docebo instances. The vILT course record in Docebo links to the recording but does not independently manage the caption file — caption delivery is handled by the underlying video player configuration.
Workday Learning: supports video upload with .vtt sidecar for caption delivery. For vILT recordings: download the recording from the session platform → upload .mp4 to Workday Learning content library → add .vtt caption file in the content settings. Workday Learning's caption player requires .vtt format; .srt files must be converted. For Teams recordings: the Teams transcript .vtt can be used after accuracy correction (or after re-processing) as the Workday Learning caption file. The path dependency for Workday .mp4 content packages documented in the LMS caption ingestion workflow applies to vILT recordings uploaded as standalone content.
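The .srt to .vtt conversion mentioned above is mostly mechanical: add the WEBVTT header and switch the millisecond separator in timing lines from a comma to a period. A minimal sketch, assuming a well-formed SRT input with no styling or positioning extensions (the function name is hypothetical):

```python
import re

# SRT timestamps use a comma before the milliseconds: 00:00:03,000
SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text):
    """Convert basic SRT text to WebVTT text.

    Numeric SRT cue indexes are kept; WebVTT permits them as cue identifiers.
    """
    body = srt_text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    converted = []
    for line in body.split("\n"):
        if "-->" in line:
            line = SRT_TIMESTAMP.sub(r"\1.\2", line)  # only touch timing lines
        converted.append(line)
    return "WEBVTT\n\n" + "\n".join(converted) + "\n"


srt_sample = (
    "1\n00:00:03,000 --> 00:00:07,000\nLet's begin module one.\n\n"
    "2\n00:02:43,000 --> 00:02:46,000\nOpen the compliance dashboard.\n"
)
print(srt_to_vtt(srt_sample))
```

Restricting the substitution to timing lines avoids altering caption text that happens to contain a time-like string.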
Kaltura: the cleanest vILT recording delivery platform. Kaltura accepts .srt and .vtt upload at the media entry level. Zoom, Teams, and Webex all have Kaltura integrations that can automatically transfer recordings to Kaltura — but caption files are not automatically transferred in most integration configurations (they must be uploaded separately or transferred via API). Kaltura's caption editor allows post-upload correction of ASR-generated captions directly in the Kaltura UI, which is useful for a quick review-and-correct workflow before the recording is published to learners.
Panopto: has direct integration with Zoom and Teams that can automatically import recordings to Panopto. Panopto's ASR (machine-generated captions) runs on imported recordings and can be the starting point for caption correction. Panopto's caption editor allows manual correction of the machine-generated captions. Uploading a CART-aligned .srt to replace the machine-generated captions in Panopto: go to the recording's caption settings, select "Upload Captions," and upload the .srt file — Panopto replaces the machine-generated track with the uploaded file. This is the workflow for using the CART-aligned transcript as the Panopto caption track after a vILT session.
In-room ILT: display setup, audio feed, and backup protocol
Display options for in-room CART
In-room ILT caption display has three standard configurations: dedicated secondary screen, shared projector display, and individual-device delivery (tablets or laptops for participants who need the accommodation).
Dedicated secondary screen is the best-practice setup for in-room CART. A separate monitor or TV display, positioned so it is readable from any seat in the room, shows the CART stream continuously without competing with the presentation slides. Placement: ideally below or beside the main presentation screen, not behind the presenter. Font size: minimum 32pt for a display readable at 10–12 feet; larger for deeper rooms. Background: high contrast (white text on dark background or black text on white background). The StreamText display can be configured for size, contrast, and scroll speed — these settings should be tested in the training room before the session.
Shared projector display is operationally simpler (no additional hardware) but creates a conflict when presentation slides change: every slide transition interrupts the caption display. For sessions heavy on slide-based content, the shared projector approach forces participants to choose between watching the presentation and reading captions — which defeats the accessibility purpose. Use the shared projector approach only for short sessions with minimal slide content, or for sessions where a secondary display is not feasible.
Individual-device delivery is the appropriate setup when the in-room accommodation request is for a specific participant. The participant accesses the StreamText URL (or equivalent) on their personal device (laptop, tablet, or smartphone). This approach is invisible to the rest of the room and personalized to the individual's accessibility needs. For ongoing accommodation requests (an employee who regularly attends in-room training with a CART requirement), a device management configuration that pre-configures the CART access URL is more reliable than requiring the participant to set it up each session.
Audio feed for remote CART in in-room sessions
Remote CART providers working in-room sessions face an audio quality challenge that remote vILT providers do not: the room microphone setup is optimized for the room audience, not for the CART provider's audio feed. The room presenter's lapel or handheld microphone is typically routed to in-room speakers; a separate audio channel for the remote CART provider must be explicitly configured.
The cleanest solution is a dedicated phone bridge: the remote CART provider dials into a conference call that receives the presenter's microphone audio via a direct feed from the room's audio system (through an audio interface or direct-out connection from the mixing board). The CART provider hears clean, close-mic audio independent of room acoustics. This requires audio system access that may need to be arranged with the AV team in advance.
If a dedicated phone bridge is not feasible: the remote CART provider joins a Zoom or Teams call running silently in the background, with the room's computer microphone or a USB microphone positioned near the presenter. Audio quality via this method is lower than a direct phone bridge — acceptable for single-presenter sessions with good room acoustics, but unreliable in larger rooms or with audience Q&A.
Backup protocol for CART provider unavailability
The most common operational failure in live training captioning is CART provider unavailability on the day of the session: illness, technical failure, or scheduling error. The captioning compliance program should include a written backup protocol that all session coordinators know and can execute within 30 minutes of a CART failure notification.
Tier 1: Immediate agency replacement. If the CART provider was booked through an agency, the coordinator contacts the agency for an emergency replacement. Standard agency response time for emergency replacement is 1–3 hours — which may not cover a session starting in the next 30 minutes. Agencies with dedicated emergency lines and bench capacity can sometimes provide replacement within 30–60 minutes. This tier is viable only if the agency relationship was established in advance with explicit emergency protocols.
Tier 2: Platform ASR fallback with session hold. Enable the platform's native ASR captions, notify any hearing-impaired participants of the change, and proceed with the session. Document the CART failure and the fallback used. Commit to delivering a high-accuracy caption file for the recording within 24–48 hours. This tier is appropriate when proceeding with the session is necessary (compliance training deadline, large audience) and the hearing-impaired participants accept the ASR fallback with the post-event corrected recording as the accessibility accommodation.
Tier 3: Session postponement. If no replacement CART provider can be secured in time, the available ASR cannot adequately serve the session, and hearing-impaired participants cannot access the content without CART, the session should be rescheduled. Postponement is the highest-disruption option, but it is the only correct option when the legal obligation is ADA Title II (public entities) or Section 508 (federal agencies) and the hearing-impaired participant cannot be served by the fallback tier. Document the reason for postponement in the accommodation request record.
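The three tiers can be summarized as a small decision helper for a coordinator runbook. This is a deliberately simplified sketch: the names and the two yes/no inputs are hypothetical, and a real decision also depends on the applicable legal obligation and the accommodation record.

```python
from enum import Enum


class Fallback(Enum):
    AGENCY_REPLACEMENT = "Tier 1: emergency agency replacement"
    PLATFORM_ASR = "Tier 2: platform ASR now, corrected recording captions after"
    POSTPONE = "Tier 3: postpone the session"


def choose_fallback(replacement_available_in_time, participants_accept_asr):
    """Pick the lowest-disruption tier that still serves the participants."""
    if replacement_available_in_time:
        return Fallback.AGENCY_REPLACEMENT
    if participants_accept_asr:
        return Fallback.PLATFORM_ASR
    return Fallback.POSTPONE


print(choose_fallback(False, True).value)
print(choose_fallback(False, False).value)
```

Whichever tier is chosen, the CART failure and the fallback used should still be documented as described above.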
Compliance framework: ADA, Section 508, WCAG 2.1 SC 1.2.4, EAA, and AODA
ADA Title I: employer-to-employee live training
Under ADA Title I (42 U.S.C. §§ 12111–12117) and its implementing regulations (29 C.F.R. Part 1630), employers must provide equal access to training for employees with disabilities, including those who are deaf or hard of hearing. The specific standard for live training is "effective communication": not a numerical accuracy percentage, but a functional standard that communication with a hearing-impaired employee must be "as effective as communication with others." The "effective communication" standard comes from ADA Title II regulations (28 C.F.R. § 35.160) but has been applied by the EEOC and courts to Title I employment training contexts.
In practice, "effective communication" for live training means that a hearing-impaired employee who is assigned a mandatory training session must have access to training content equivalent to what hearing employees receive. If hearing employees receive the full live audio while the hearing-impaired employee receives captions that miss 20–30% of the technical vocabulary, the hearing-impaired employee is not receiving equivalent access. The best-practice operationalization is CART (or equivalent 98%+ accuracy captioning) for any live training session that a hearing-impaired employee is attending. The compliance matrix covers which law applies to which organization type in detail.
ADA Title I requires accommodation upon request — an employer is obligated to provide CART captioning when a hearing-impaired employee requests it as a reasonable accommodation. The CART booking lead time is relevant here: if an employee requests CART three days before a session, the employer must make reasonable efforts to accommodate, which typically means booking CART on an emergency basis. If emergency CART is not available within the accommodation lead time, the session should be offered at a rescheduled date when CART can be provided. The accessibility coordinator role includes managing the accommodation request intake and CART booking workflow.
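The lead-time reasoning above can be sketched as a small triage function. This is an illustrative sketch only: the two thresholds are assumptions taken from the lead times quoted in this playbook (two business days for standard booking, 24 hours for emergency booking), the function names are hypothetical, and public holidays are ignored.

```python
from datetime import datetime, timedelta

# Assumed thresholds; replace with your CART agency's actual terms.
STANDARD_MIN_BUSINESS_DAYS = 2
EMERGENCY_MIN_HOURS = 24


def business_days_between(start: datetime, end: datetime) -> int:
    """Count weekdays after start's date, up to and including end's date."""
    days = 0
    current = start.date()
    while current < end.date():
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def booking_tier(requested_at: datetime, session_start: datetime) -> str:
    """Classify a CART request as 'standard', 'emergency', or 'reschedule'."""
    if session_start <= requested_at:
        return "reschedule"
    if business_days_between(requested_at, session_start) >= STANDARD_MIN_BUSINESS_DAYS:
        return "standard"
    hours_left = (session_start - requested_at).total_seconds() / 3600
    if hours_left >= EMERGENCY_MIN_HOURS:
        return "emergency"
    return "reschedule"


# A request made on Friday morning for a Monday session leaves one business day.
print(booking_tier(datetime(2026, 6, 12, 9, 0), datetime(2026, 6, 15, 10, 0)))
```

A "reschedule" result here only means the request falls inside the assumed emergency window; whether emergency CART is actually available still has to be confirmed with the provider.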
ADA Title II: public entities and mandatory WCAG 2.1 AA
State and local government agencies, public school districts, and public universities are subject to ADA Title II (42 U.S.C. §§ 12131–12165). As of April 24, 2026, Title II regulations published by DOJ (28 C.F.R. Part 35) require that web and digital content for ADA Title II entities meet WCAG 2.1 Level AA. WCAG 2.1 AA includes SC 1.2.4 (Captions — Live), which requires real-time captions for live audio content in synchronized media at the "equivalent access" standard.
For public universities that deliver vILT training sessions to students and staff, every recorded vILT session made available after the fact must meet WCAG SC 1.2.2 (Captions (Prerecorded)) at 99% accuracy, and the live session must meet SC 1.2.4 at the effective communication standard. Public university compliance programmes typically combine CART for lecture capture and high-enrollment live sessions with ASR fallback for lower-enrollment sessions, plus a post-event correction workflow for recordings. See the university captioning buyer guide and the university lecture capture workflow for implementation guidance specific to public universities.
Section 508: federal agencies and live webcasts
Section 508 of the Rehabilitation Act (29 U.S.C. § 794d) requires federal agencies and their contractors to make electronic and information technology, including training content, accessible to employees and members of the public with disabilities. The Section 508 ICT Standards and Guidelines (36 C.F.R. Part 1194, updated 2017) explicitly cover live webcasts and synchronized media. The standards incorporate the WCAG 2.0 Level A and AA success criteria by reference, including SC 1.2.4 for live captions, which carries over unchanged into WCAG 2.1.
Federal agencies operating vILT training programmes must provide real-time captions for all live training sessions where deaf or hard-of-hearing employees, contractors, or members of the public are participants. The Section 508 captioning requirements page covers the regulatory specifics, including the OCIO/Accessibility Program Office review requirements for federal training content.
WCAG 2.1 SC 1.2.4: Captions (Live)
WCAG 2.1 SC 1.2.4 requires that "captions are provided for all live audio content in synchronized media." At Level AA (the standard required by ADA Title II and referenced in Section 508), this means live training sessions delivered via vILT platforms must have real-time captions. WCAG 2.1 does not specify a numerical accuracy percentage for live captions in SC 1.2.4 (unlike SC 1.2.2 for prerecorded content, where the 99% DCMP standard is widely applied). The understanding documents for SC 1.2.4 describe captions that "match the audio" and provide "full and equal access."
The practical interpretation applied by accessibility evaluators and DOJ guidance is that live captions must be accurate enough to provide a hearing-impaired viewer with equivalent access to the audio content. For technical training content, platform ASR at 72–85% accuracy does not meet this standard — approximately 15–28% of technical vocabulary is missed, which means a hearing-impaired learner receives meaningfully less content than a hearing learner. CART at 98–99%+ accuracy meets the SC 1.2.4 standard for technical training content.
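To make percentages like these concrete, here is one common way to score a caption stream against a corrected reference transcript: word accuracy defined as 1 minus word error rate (WER). This is a sketch under assumptions. The article does not say how its accuracy figures were measured, and the function names and sample sentences are illustrative. The comparison is case-insensitive and splits on whitespace only, so punctuation should be stripped beforehand.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word dropped by the captions
                            curr[j - 1] + 1,      # extra word in the captions
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, floored at zero."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


reference = "rotate the scim token before the soar playbook runs"
captions = "rotate the skim token before the sore playbook runs"
print(f"{word_accuracy(reference, captions):.1%}")  # 2 errors in 9 words -> 77.8%
```

In this example both errors fall on domain terms, which is why an overall score can understate how much technical content a learner actually loses.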
EAA and EN 301 549: European live training sessions
The European Accessibility Act (EU Directive 2019/882, in force June 28, 2025 for most member states) covers digital services including online training platforms provided to consumers and employees by EU-registered entities. EN 301 549 (the harmonised European accessibility standard referenced by EAA) incorporates WCAG 2.1 Level AA requirements, including SC 1.2.4 for live audio content.
EU-based L&D teams delivering vILT training in EU member states must provide real-time captions meeting EN 301 549 / WCAG SC 1.2.4 requirements. The EAA captioning requirements post covers which organizations are in scope, the member-state enforcement timeline, and what EN 301 549 compliance requires for training content. For EU-based teams delivering multilingual live training, the caption language requirement follows the session language — a German-language vILT session requires German-language live captions. The multilingual caption workflow covers how to configure CART and ASR for non-English training sessions.
AODA: Ontario, Canada employer requirements
The Accessibility for Ontarians with Disabilities Act (AODA) Integrated Accessibility Standards Regulation (IASR) requires Ontario employers with 50 or more employees to provide training in accessible formats to employees with disabilities. AODA does not specify WCAG 2.1 AA as the live training standard, but the IASR requires that training is provided "in an accessible format or with appropriate communication supports" upon request. CART is the appropriate communication support for deaf or hard-of-hearing employees attending mandatory training sessions. The AODA accessibility plan template covers how to document the live training accommodation process in an AODA Accessibility Plan.
Glossary architecture for live captioning: CART pre-loading and ASR vocabulary configuration
The organizational glossary as a live captioning input
The organizational glossary that drives post-production caption accuracy — the same glossary documented in the glossary architecture post and maintained through the caption feedback loop — is also the source material for live session captioning quality. The difference is the delivery mechanism: for post-production captioning, the glossary is applied algorithmically to the ASR output or submitted to the vendor as a term list; for live CART captioning, the glossary is pre-loaded into the CART provider's CAT software before the session.
Maintaining a single organizational glossary that feeds both workflows — rather than separate glossary documents for post-production and CART — is the most efficient architecture. The glossary file (CSV or RTF format) contains all organizational terms with phonetic spellings, context notes, and vertical tags (engineering, compliance, medical, HR). The post-production workflow applies the glossary algorithmically; the CART briefing process extracts the session-relevant subset (e.g., engineering terms for a DevOps session, compliance terms for a regulatory training session) and delivers that subset to the CART provider.
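A minimal sketch of the subset extraction, assuming a hypothetical CSV layout with `term`, `phonetic`, `context`, and `verticals` columns, where `verticals` is a semicolon-separated list of tags. The column names, the sample rows, and the phonetic spellings are illustrative, not a format the article prescribes.

```python
import csv
import io

# Hypothetical glossary rows; a real file would be read from disk.
SAMPLE_GLOSSARY_CSV = """term,phonetic,context,verticals
SCIM,skim,System for Cross-domain Identity Management,engineering
SIEM,sim,Security information and event management,engineering;compliance
AODA,ay-oh-dee-ay,Accessibility for Ontarians with Disabilities Act,compliance;hr
"""


def load_glossary(csv_text: str) -> list[dict]:
    """Parse glossary CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def session_subset(glossary: list[dict], vertical: str) -> list[dict]:
    """Return the rows tagged with the given vertical (case-insensitive)."""
    wanted = vertical.strip().lower()
    return [
        row for row in glossary
        if wanted in {tag.strip().lower() for tag in row["verticals"].split(";")}
    ]


glossary = load_glossary(SAMPLE_GLOSSARY_CSV)
print([row["term"] for row in session_subset(glossary, "engineering")])
# ['SCIM', 'SIEM']
```

Keeping the tags in one column means a term shared by two verticals lives in a single row, so a correction to its phonetic note reaches both workflows.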
Pre-loading glossary into CART provider CAT software
Modern CAT (Computer-Aided Transcription) software used by CART providers, such as Case Catalyst (most common in the US), Eclipse, and DigitalCAT, supports custom vocabulary dictionaries. Pre-loading organizational vocabulary into these dictionaries allows the CAT software to autocomplete known terminology when the steno strokes are ambiguous, reducing the cognitive load on the CART operator and improving consistency on technical terms.
Pre-loading protocol: deliver the session vocabulary subset (extracted from the organizational glossary) to the CART provider as a plain text file or RTF, with one term per line and phonetic notes for non-obvious pronunciations. Include context sentences for terms that could be confused with common English words (e.g., "SCIM" in an IT context = "System for Cross-domain Identity Management," not the verb "to skim"). Delivery timeline: two to three business days before the session, which gives the CART provider time to import the vocabulary file, review unfamiliar terms, and create dictionary entries for them.
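The one-term-per-line briefing file could be generated along these lines. This is a sketch, not a format any CAT package requires: the `say:` and `context:` labels, the pipe separator, and the function name are assumptions, so confirm the preferred layout with your CART provider.

```python
def render_briefing(entries: list[dict]) -> str:
    """Render glossary entries as plain text, one term per line.

    Each entry needs a 'term' key; 'phonetic' and 'context' are optional
    and are skipped when missing or empty.
    """
    lines = []
    for entry in entries:
        parts = [entry["term"]]
        if entry.get("phonetic"):
            parts.append(f"say: {entry['phonetic']}")
        if entry.get("context"):
            parts.append(f"context: {entry['context']}")
        lines.append(" | ".join(parts))
    return "\n".join(lines) + "\n"


briefing = render_briefing([
    {"term": "SCIM", "phonetic": "skim",
     "context": "System for Cross-domain Identity Management, not the verb to skim"},
    {"term": "SOAR"},
])
print(briefing, end="")
```

Terms without notes still appear on their own line, so the provider sees the full list even when only some entries need disambiguation.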
For recurring training programmes using the same CART provider or agency, the vocabulary pre-loading is a one-time setup that accumulates over time. A DevOps training programme that works with the same CART agency over six months will build a CAT dictionary that covers the full organizational vocabulary, dramatically reducing briefing effort for each new session. This is the live-session equivalent of the compounding accuracy effect in post-production captioning — the investment in vocabulary quality compounds over successive sessions.
Real-time ASR vocabulary configuration by platform
Microsoft Teams / Azure Custom Speech: the most flexible vocabulary configuration option among vILT platforms. The process: create an Azure AI Speech resource in Azure Portal → upload a custom vocabulary file (plain text, one term per line; Azure supports pronunciation variants and phonetic strings) → link the Azure Speech resource to the Microsoft 365 tenant via the Teams admin portal. The Custom Speech model then processes all Teams session audio with the custom vocabulary weighting, improving recognition of domain terminology in every session without per-session configuration. IT admin setup time: 2–4 hours. Ongoing maintenance: update the custom vocabulary file when organizational vocabulary changes (new product names, regulatory terms, organizational identifiers).
Zoom: no native real-time custom vocabulary support. Custom vocabulary improvement requires a third-party ASR integration (e.g., Rev.ai via Zoom Apps, which supports glossary configuration for live sessions). For organizations running Zoom vILT at scale, the Rev.ai integration is worth evaluating if live session caption accuracy on technical content is a recurring compliance concern.
Webex: the "Custom Words" feature in Webex Assistant settings supports adding organizational terms to the recognition vocabulary. Scope is limited (most implementations allow hundreds of terms, not thousands) but effective for the highest-frequency domain-specific terms — product names, acronyms, regulatory identifiers that appear in every session. Access: Webex Meetings Admin portal → Settings → Webex Assistant → Custom Words.
Google Meet, BigBlueButton, Adobe Connect: no native real-time custom vocabulary configuration. Post-processing the recording audio through a glossary-aware captioning workflow is the primary accuracy improvement mechanism for these platforms.
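Where a platform caps the custom vocabulary at a few hundred terms, as described for Webex above, the glossary has to be trimmed to the terms that appear most often. A rough sketch, assuming past session transcripts are available as plain text; the function name, the sample data, and any particular limit are hypothetical.

```python
import re
from collections import Counter


def top_terms(glossary_terms: list[str], transcripts: list[str], limit: int) -> list[str]:
    """Return up to `limit` glossary terms, most frequent first.

    Matching is case-insensitive on whole words; terms that never
    appear in the transcripts are left out. Ties sort alphabetically.
    """
    counts = Counter()
    for term in glossary_terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = sum(len(pattern.findall(text)) for text in transcripts)
    ranked = sorted(glossary_terms, key=lambda t: (-counts[t], t.lower()))
    return [term for term in ranked if counts[term] > 0][:limit]


past_sessions = [
    "The SIEM forwards alerts to SOAR. The SIEM rule matched.",
    "Check the siem dashboard.",
]
print(top_terms(["SIEM", "SOAR", "CVE"], past_sessions, limit=2))
# ['SIEM', 'SOAR']
```

The word-boundary match assumes terms start and end with letters or digits; a term such as "C++" would need a different pattern.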
The glossary coverage gap for live sessions
One structural challenge in live session captioning is that the vocabulary in Q&A and discussion portions is less predictable than in structured presentation content. A presenter working from a script uses vocabulary that can be pre-briefed. An audience member asking an impromptu question may introduce organizational terms, product names, or acronyms that were not in the pre-briefing materials — especially if the questioner is from a different team or vertical than the primary session audience.
The practical response to this gap is to make the pre-briefing glossary as comprehensive as possible (include all organizational vocabulary, not just the session-specific subset), to use the moderator-repeats-questions protocol (which gives the CART provider a mic'd, moderator-voiced version of any unexpected vocabulary), and to treat the Q&A portion of the transcript as requiring additional post-event review when it is used as the recording caption source. The glossary-biased captioning post covers how glossary coverage depth affects accuracy on unexpected domain vocabulary in post-production contexts — the same principles apply to CART pre-briefing depth.
Eight failure modes
- Failure mode 1: CART booked day-before or day-of
- The session is scheduled three weeks out, but CART booking is treated as a day-before logistics step. The CART provider availability lookup on the day before the session returns "unavailable" for all providers, and the session proceeds with platform ASR as the only caption source. The fundamental fix is to make CART booking a step in the session creation workflow — it happens the same day the session is put on the calendar, not the day before. Programme managers who treat CART as a last-minute add-on consistently encounter availability failures for peak scheduling periods (Monday mornings, end-of-quarter compliance training marathons, onboarding weeks for large cohort hires).
- Failure mode 2: CART provider unbriefed on technical vocabulary
- The CART provider is booked correctly and arrives at the session on time, but receives no briefing materials — no vocabulary list, no acronym expansions, no speaker names. The session is a technical security operations training with 40+ domain terms (SIEM, SOAR, CVE identifiers, specific vendor product names). The CART provider's output is 85–90% accurate on general conversational content but misses 25–35% of domain-specific vocabulary. Hearing-impaired participants receive a caption stream that is missing the technical content they most need. The fix: vocabulary pre-loading is a checklist item in the session scheduling confirmation, not an optional add-on. The session coordinator sends briefing materials no later than two business days before the session.
- Failure mode 3: post-event recording published without a caption file
- A well-captioned vILT session is recorded in Zoom. The coordinator downloads the recording .mp4 and uploads it to the LMS within two hours of the session ending. The CART-aligned transcript has not arrived yet (the CART provider has a 24-hour delivery SLA). The coordinator uploads the recording without a caption file, marks it available in the LMS, and moves on. The CART transcript arrives the next day and is never uploaded — the coordinator assumes the LMS has captions because "the session had CART." The recording remains without captions for the remainder of its LMS life. The fix: LMS publishing workflow must include a caption file as a required field — the recording is not published as "accessible" until the caption file is attached.
- Failure mode 4: ASR transcript from live session used as recording caption file without accuracy correction
- A Teams vILT session is recorded, the transcript is automatically saved to the organizer's OneDrive, and the coordinator downloads the .vtt transcript and uploads it to the LMS alongside the .mp4 recording. The transcript reflects the live session ASR quality — 78% accuracy on technical compliance training content. The recording is now accessible to the async audience with 78% caption accuracy, meeting WCAG SC 1.2.2's requirement for "captions" but not meeting the 99% accuracy standard that makes those captions meaningful for technical vocabulary. The fix: the download-and-upload transcript workflow is a starting point for caption file delivery, not a compliance endpoint. The downloaded ASR transcript must be reviewed and corrected — or re-processed through a higher-accuracy workflow — before the recording is published as WCAG-compliant.
- Failure mode 5: timestamp misalignment between CART transcript and recording
- The CART provider delivers the post-session transcript as a .srt file. The coordinator uploads it to the LMS alongside the recording. The captions are visibly off — they appear 2 minutes and 30 seconds late relative to the video. No one notices immediately because the coordinator does not verify the caption-video sync in the LMS player before publishing. Hearing-impaired learners who access the recording encounter captions that reference content from two and a half minutes earlier in the video. The fix: the recording start time must be logged during the session (exact time the recording was started), communicated to the CART provider with the post-session transcript request, and the CART provider must apply the offset before delivering the aligned .srt. The coordinator must verify sync in the LMS player on a spot-check basis before publishing.
- Failure mode 6: Q&A captions fail because no moderator-repeats protocol
- A 90-minute leadership training session uses CART for the 60-minute structured presentation portion, achieving 98%+ accuracy. The 30-minute Q&A with audience questions from a 40-person in-room group uses the same CART setup, but no moderator-repeats protocol. Questions from audience members without microphones reach the CART provider as muffled, low-volume audio. Seven of nineteen questions are entirely missing from the CART transcript. A deaf participant who attends the session receives 60 minutes of excellent live caption access and 30 minutes of effectively absent access during the Q&A. The fix: the moderator-repeats-questions protocol is scripted into every live training session facilitation plan — not improvised, not optional, not dependent on the individual facilitator knowing to do it.
- Failure mode 7: CART stream URL not distributed to participants
- The CART provider is booked and briefed. The CART stream is live during the session via StreamText. No one tells participants where to access it. The session coordinator knows the StreamText URL but did not include it in the session invitation, did not post it in the meeting chat, and did not announce it at the session opening. A hearing-impaired participant who specifically requested CART as an accommodation attends the session, sees no caption feed in their Zoom window (because the CART is on StreamText, not in the Zoom caption display), and messages the coordinator — who finds the URL and posts it in chat 22 minutes into the session. The fix: the CART stream URL is included in the session invitation as a standard field (alongside dial-in numbers and meeting links) and read aloud at session open. A dedicated caption-access line in the session invitation template ensures no session goes without this disclosure.
- Failure mode 8: caption files dropped during recording migration and never re-attached
- A library of 200 vILT recordings is migrated from Zoom Cloud to Panopto as part of an LMS platform transition. The .mp4 files transfer successfully. The caption sidecar files — individually uploaded to the original LMS — are not included in the migration export because the migration tool exported video files, not video + caption package bundles. The 200 recordings in Panopto have Panopto's auto-generated captions (72–82% accuracy on the technical training content) instead of the CART-aligned or manually corrected files that were in the original LMS. The fix: the LMS migration caption checklist includes a pre-migration audit of caption sidecar file locations and a post-migration verification step that confirms the correct caption file is attached to each recording in the destination LMS.
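The offset correction described in failure mode 5 is mechanical once the recording start time is known. A minimal sketch, assuming the CART transcript is a standard .srt file and the offset is expressed in milliseconds; the function name is illustrative. Only timing lines (those containing `-->`) are changed, and timestamps that would fall before the start of the recording are clamped to zero rather than removed, so cues from before the recording started still need manual cleanup.

```python
import re

# Matches SRT timestamps such as 00:02:31,000
TIMESTAMP = re.compile(r"(\d{2,}):(\d{2}):(\d{2}),(\d{3})")


def shift_srt(srt_text: str, offset_ms: int) -> str:
    """Shift every cue in an SRT document by offset_ms.

    Use a negative offset when captions appear late relative to the video.
    """
    def shift(match: re.Match) -> str:
        h, m, s, ms = (int(group) for group in match.groups())
        total = max(0, ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms)
        h, rest = divmod(total, 3_600_000)
        m, rest = divmod(rest, 60_000)
        s, ms = divmod(rest, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    shifted_lines = []
    for line in srt_text.splitlines():
        if "-->" in line:  # leave cue numbers and caption text untouched
            line = TIMESTAMP.sub(shift, line)
        shifted_lines.append(line)
    return "\n".join(shifted_lines)


late = "1\n00:02:31,000 --> 00:02:34,500\nWelcome to the session.\n"
# Captions run 2 minutes 30 seconds late, so move them 150,000 ms earlier.
print(shift_srt(late, -150_000))
```

Shifting the file does not replace the spot check in the LMS player; it only removes the constant offset, not drift or missing cues.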
Seven-question FAQ
- Does every live training session legally require CART captioning?
No — the legal obligation is to provide "effective communication" for employees or participants with hearing disabilities who request accommodation, not to provide CART by default for every session. Under ADA Title I, the employer must provide reasonable accommodation (including CART) when a hearing-impaired employee requests it with reasonable advance notice. Under ADA Title II, public entities must proactively ensure that training programmes are accessible and must respond to specific accommodation requests. Section 508 requires federal agencies to plan for accessibility in advance, which in practice means having a CART booking process in place for any live training session open to deaf or hard-of-hearing employees or members of the public.
In practice, the most defensible approach for L&D teams running recurring training programmes is to have CART available for any session attended by a known hearing-impaired participant, and to have a process for booking CART within 24–48 hours when an accommodation request is received close to a session date. Teams that proactively caption all live training sessions with CART avoid the accommodation request intake and booking scramble, and may find the programme-level CART cost lower than the per-session emergency booking cost for reactive accommodation.
- What is the difference between CART and an ASL interpreter, and when does each apply?
CART and ASL (American Sign Language) interpreting are both accommodations for hearing-impaired participants, but they serve different populations and communication needs. CART provides real-time text captions readable by anyone who can read English — appropriate for participants who are deaf or hard of hearing and whose primary communication modality is English text (late-deafened adults, participants with moderate-to-severe hearing loss who grew up with English as their primary language). ASL interpreting provides visual signed language — appropriate for participants whose primary language is ASL (typically those who are prelingually deaf and grew up in Deaf culture).
The correct accommodation is determined by the participant's preference, not the L&D team's assumption. When an employee with hearing loss requests an accommodation for a live training session, the intake process should ask which type of accommodation they need — CART, ASL interpreter, or both. Some participants use both (CART as a backup for interpreter fatigue in long sessions, or interpreter for interactive portions and CART for technical reference). The accommodation request intake form should offer both options and defer to the participant's stated preference.
- How do we handle a vILT session with participants in multiple languages — does CART apply?
CART in the session's primary language serves hearing-impaired participants who read in that language. For multilingual vILT sessions where some participants are hearing-impaired and others are non-native speakers who rely on captions for language support, the caption language should match the session language. If the session is conducted in English, English CART captions serve both hearing-impaired participants and non-native speakers who prefer English text. If the session is conducted bilingually (e.g., English presenter with simultaneous French interpretation for Quebec-based participants), separate caption streams may be needed for each language.
Live translation captioning — CART in one language translated to captions in another in real time — is an advanced configuration that requires a CART provider certified in the source language and a translation pipeline. This is operationally complex and expensive; most multilingual L&D teams instead provide CART in the session's primary language and rely on the post-event recording multilingual caption workflow for participants who need captions in a different language after the session. See the multilingual caption workflow post for the post-event translation pipeline.
- Can platform AI transcription (Zoom AI Companion, Teams Copilot) replace CART for an L&D compliance training session?
No. Zoom AI Companion and Microsoft Copilot in Teams generate meeting summaries and action items from AI-processed transcripts — they are productivity tools, not caption compliance tools. Their output is not synchronized to video timecode (which captions require), not designed for real-time participant display (which live captions require), and not tested against WCAG or ADA accuracy standards. Using an AI meeting summary as the caption file for a vILT recording would produce a document that summarizes the session rather than transcribing it word-for-word with timing — which is not what SC 1.2.2 requires.
The relevant platform features for caption compliance are: Zoom Automated Captions (real-time display during session), Teams Live Captions (real-time display during session), Zoom transcript (.vtt from Zoom Cloud recording), and Teams Meeting Transcription (.vtt from Teams). These are the features that produce time-stamped, word-for-word transcripts suitable for captioning — though all require accuracy correction for technical training content before they meet WCAG standards.
- How do we handle CART captioning for in-house L&D sessions conducted by subject-matter experts who speak quickly or use heavy jargon?
Fast-speaking and jargon-heavy subject-matter experts are a common challenge for CART providers. The briefing protocol is the primary mitigation: a detailed vocabulary list, acronym expansion sheet, and advance review of the SME's speaking style (via a recording of a previous session, if available) helps the CART provider anticipate the pace and terminology. CART providers certified in technical content (some specialize in legal, medical, or STEM content) may have better baseline familiarity with domain jargon than generalist providers.
Additional steps for fast-speaking SMEs: request that the SME slow their pace by 10–15% for session segments where technical density is highest (this is a reasonable facilitation adjustment, not an unreasonable constraint on the SME); use the Q&A window to allow catch-up time in the caption stream; have the coordinator monitor the StreamText display during the session and alert the CART provider (via the private coordinator channel) when vocabulary errors are visible. After the session, the CART provider can review and correct errors in their transcript before delivering the post-session SRT.
- What is the practical difference between WCAG SC 1.2.2 (captions for prerecorded content) and SC 1.2.4 (captions for live audio) for L&D compliance purposes?
SC 1.2.2 requires captions for prerecorded synchronized media (video with audio) — this applies to the vILT recording after the session ends. SC 1.2.4 requires captions for live audio content in synchronized media — this applies to the session while it is happening. The distinction creates two separate compliance windows for the same training content: the live session (SC 1.2.4) and the post-event recording (SC 1.2.2).
SC 1.2.4 does not specify a numerical accuracy percentage; it requires captions that "match the audio" sufficiently to provide "full and equal access." SC 1.2.2 understanding documents and legal guidance (DOJ, OCR enforcement letters) consistently reference the 99% DCMP accuracy threshold as the operative standard. The practical implication: live session captions must be accurate enough to provide equivalent access (typically interpreted as ≥98% for technical content), while the post-event recording must meet the 99% threshold before publication to the async audience. This difference explains why the post-event recording cannot simply inherit a platform ASR caption stream from the live session: ASR-level accuracy (72–85%) meets neither the SC 1.2.4 standard for complex content nor the SC 1.2.2 standard for the recording.
- Our vILT sessions are only 30 minutes and rarely have hearing-impaired participants — is CART still operationally necessary?
The legal answer is: the obligation is triggered by the participant's presence, not the session length. A 30-minute compliance training session with one known hearing-impaired participant requires the same "effective communication" accommodation as a four-hour leadership development programme. The duration of the session does not reduce the compliance obligation.
The operational answer is: the scalable approach for short recurring sessions is to establish a standing CART relationship with an agency that provides the service at a session volume rate, rather than sourcing per session. At $100–$175/hour for remote CART with a one-hour minimum, a 30-minute session costs the same as a 60-minute session under most CART pricing — which makes the per-session cost of CART $100–$175 for short sessions. An annual CART contract for a programme with 50 recurring 30-minute sessions might negotiate down to $60–$90/session at volume. For programmes with very low hearing-impaired participant frequency, the reactive accommodation model (CART booked when an accommodation request is received) may be operationally appropriate, with a confirmed emergency booking protocol in place for requests received close to session dates.
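The pricing comparison above works out as follows. This is a worked sketch using only the figures quoted in this answer ($100–$175/hour, a one-hour minimum, 50 sessions, and a hypothetical negotiated $60–$90 per session); it assumes time beyond the minimum is billed pro rata, which varies by provider.

```python
def session_cost(session_minutes: float, hourly_rate: float,
                 minimum_hours: float = 1.0) -> float:
    """Cost of one session when the provider bills a minimum number of hours."""
    billed_hours = max(minimum_hours, session_minutes / 60)
    return billed_hours * hourly_rate


def annual_cost(sessions: int, per_session: float) -> float:
    """Total yearly cost at a flat per-session price."""
    return sessions * per_session


SESSIONS_PER_YEAR = 50

# Per-session booking: a 30-minute session is billed as a full hour.
low = annual_cost(SESSIONS_PER_YEAR, session_cost(30, 100))    # 5000.0
high = annual_cost(SESSIONS_PER_YEAR, session_cost(30, 175))   # 8750.0

# Volume contract at the negotiated per-session rates.
contract_low = annual_cost(SESSIONS_PER_YEAR, 60)    # 3000
contract_high = annual_cost(SESSIONS_PER_YEAR, 90)   # 4500

print(f"Per-session booking: ${low:,.0f} to ${high:,.0f} per year")
print(f"Volume contract:     ${contract_low:,.0f} to ${contract_high:,.0f} per year")
```

On these assumptions the contract saves roughly $2,000 to $4,250 a year for a 50-session programme, before counting any emergency-rate surcharges avoided.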
Achieve 99% caption accuracy on your vILT recordings
Live training sessions produce recordings that must meet the same 99% WCAG 2.1 AA accuracy standard as any other training video in your LMS — and most vILT recording workflows leave those recordings with 72–85% ASR-quality captions. GlossCap closes the gap. Upload your vILT recording .mp4 (from Zoom, Teams, Google Meet, Webex, BigBlueButton, or Adobe Connect) and GlossCap applies your organizational glossary to produce a WCAG-compliant caption file for LMS upload: 99% accuracy on the technical L&D vocabulary that matters in your training content, regardless of which platform hosted the original session. Start with a free accuracy spot-check on one of your existing vILT recordings.