Remote Workforce Operations · Published 2026-06-06
Captioning async training video for remote and hybrid teams: Loom, Zoom recordings, Microsoft Teams, and the home-office audio problem
The post-2020 shift to remote and hybrid work changed who produces training video, where they record it, and how it is delivered, but it did not change the legal obligation to caption it. WCAG 2.1 SC 1.2.2 (a Level A criterion, and therefore part of AA conformance) requires synchronized captions for all prerecorded audio-video content, regardless of whether the employees watching it work in an office, from a bedroom, or on a rotating hybrid schedule. ADA Title I requires employers to provide effective communication to employees with hearing disabilities, and "effective communication" for a training video in an LMS is a synchronized caption track, not a PDF transcript available on request. The shift to remote work changed the distribution of caption compliance risk in two ways that L&D and training-operations teams frequently underestimate. First, the volume of captionable video exploded: people who used to present in conference rooms now record Loom walkthroughs instead, so the number of people producing training recordings multiplied. Second, the audio quality of that new production volume degraded substantially, because home offices, laptop built-in microphones, and residential HVAC systems are not controlled recording environments. The combination of more video and worse audio creates a caption operation that is both larger and harder than the one the team managed before March 2020.
The home-office audio problem is measurable and specific. In a controlled office recording environment with a dedicated USB condenser microphone and acoustic treatment, Whisper-large achieves a baseline word error rate of 3–5% on clean English speech — a 95–97% accuracy floor that a domain glossary can push to 99%+ on technical vocabulary. In a typical home office recording — laptop built-in microphone, residential HVAC noise at 40–50 dB, occasional street traffic, kitchen sounds through an open door — that same Whisper-large model achieves a baseline WER of 12–18% on identical content. The signal-to-noise ratio degradation from home recording conditions costs 7–12 percentage points of baseline accuracy before a single domain-specific term is encountered. For training video with high technical vocabulary density — engineering onboarding, product training, compliance modules — the glossary cannot fully compensate for a degraded audio source. The vertical accuracy benchmarks document baseline WER by content type in controlled conditions; the home-office degradation is additive on top of the vertical baseline. An engineering onboarding module recorded in a home office may start at 78–83% baseline accuracy rather than the controlled-environment baseline of 88–93% — meaning the glossary has to close a 17–22% accuracy gap rather than a 7–12% gap, and some of that gap cannot be closed by vocabulary injection because it comes from acoustic noise masking the phoneme signal entirely.
The production architecture also changed. Pre-2020, training video was typically produced by a small group of L&D professionals using dedicated recording equipment, uploaded to a central LMS, and captioned through a defined workflow with a named owner. Post-2020, the production is distributed: a sales enablement manager records a Loom walkthrough of the new pricing deck; a software engineer records a Zoom session explaining the API changes; a compliance officer records a Teams meeting covering the updated policy; a product manager records a Vimeo video of the new feature demo for the customer success team. None of these producers are L&D professionals. None of them are thinking about caption compliance at the time of recording. None of their recordings go through the caption workflow automatically, because the caption workflow was designed for the small-group professional production model, not for the twenty-person distributed production model that emerged when every knowledge worker became a content creator. The result is a distributed production backlog: dozens of async training recordings published to the LMS each quarter without captions, each one a silent compliance gap, none of them visible to the caption operation until an audit or a complaint surfaces the problem.
This post is the operational guide to captioning async training video in a remote and hybrid workforce. It covers the home-office audio problem in detail — what causes it, how to measure it, and what to do about it at the source versus at the caption layer. It covers platform-specific caption workflows for the six platforms that dominate async training video production in distributed teams: Loom, Zoom cloud recordings, Microsoft Teams recordings, Microsoft Stream, Vimeo Business, and Wistia. It covers glossary architecture for multi-speaker async libraries, the WCAG and ADA compliance framework for remote employee training, the eight failure modes that cause async caption operations to break silently, and a seven-question FAQ on the operational decisions that come up most often in distributed L&D environments. The companion posts — live versus recorded caption accuracy, glossary-biased decoding for engineering vocabulary, and caption QA methodology — cover the underlying accuracy infrastructure. This post focuses on what changes specifically when that infrastructure is applied to the home-office recording environment and the distributed production model of remote and hybrid teams.
TL;DR — three things that matter about async remote caption operations
- Home-office audio degrades ASR baseline accuracy by 7–12 percentage points before any domain vocabulary is encountered. The degradation comes from four sources: ambient noise floor (residential HVAC, street traffic, household sounds), consumer microphone directional response (laptop built-in mics capture 360° ambient rather than the speaker's voice), room acoustics (hard surfaces, no acoustic treatment, flutter echo at typical room dimensions), and recording compression artifacts (Zoom, Loom, and Teams all compress audio for transmission at 32–48 kbps, which introduces artifacts that Whisper's decoder was not trained on). A domain glossary closes the vocabulary-accuracy gap; it does not close the acoustic-degradation gap. Audio remediation at the source — recording guidance, external microphone, acoustic panels — is the highest-leverage intervention for async training libraries produced by distributed teams. When source remediation is not feasible, audio pre-processing (noise reduction before transcription submission) is the second-best option.
- Six platforms, six different caption workflows, and no single platform automatically produces WCAG-compliant captions. Loom's auto-transcript uses a general-purpose speech-to-text model with no domain-vocabulary support; SRT export requires a Business plan and a manual download step. Zoom cloud recording auto-captions require explicit enablement by the account admin and produce general-model transcripts; downloading the VTT transcript file from the recording detail page is the correct starting point for LMS upload. Microsoft Teams recordings go to OneDrive/SharePoint and are transcribed by Azure Speech Services without custom vocabulary unless an E5 license with Custom Speech is configured. Microsoft Stream's VTT upload is the correct path for validated captions on SharePoint-hosted video. Vimeo and Wistia both support SRT/VTT upload with accurate playback, but neither platform's auto-captions use domain vocabulary by default. Understanding the specific caption architecture of each platform (where files are stored, what format they require, where the LMS integration breaks) is the prerequisite for a caption operation that doesn't leave gaps.
- The distributed production model requires a different workflow trigger than the centralized production model. A caption workflow designed for the centralized model — L&D submits video, vendor captions, L&D approves, uploads to LMS — will miss every async recording created outside the L&D team unless the workflow trigger is changed. The correct trigger for a distributed async library is publication-event-based, not manual: any new video published to the LMS (or uploaded to the team's Loom, Zoom, Teams, or Vimeo account) should automatically generate a caption job, not wait for a human to remember to submit one. Implementing this trigger — through LMS webhook integrations, Zapier/Make automations, or Loom/Zoom webhook notifications — is the operational change that converts a caption operation from reactive (processing complaints) to proactive (preventing gaps).
The home-office audio problem
Home-office recording environments differ from professional recording environments across four dimensions that matter for ASR accuracy: ambient noise level, microphone characteristics, room acoustics, and recording-system compression. Each dimension affects Whisper's transcription accuracy independently; in combination they produce a measurable and predictable accuracy degradation that any remote caption operation has to plan for.
Ambient noise: what home recordings carry that office recordings don't
A professional recording studio maintains an ambient noise floor below 25 dB(A). A typical open-plan office runs 45–55 dB(A). A home office varies between 35 and 65 dB(A) depending on HVAC system, proximity to traffic, and household activity. The critical measure for ASR accuracy is not the absolute noise level but the signal-to-noise ratio: the level difference between the speaker's voice and the background noise captured by the microphone. At a signal-to-noise ratio above 20 dB — which is achievable with a good desk microphone in a reasonably quiet home office — Whisper's accuracy approaches the controlled-environment baseline. At 10–15 dB SNR — which is typical for laptop built-in microphone recordings in a home office with active HVAC — accuracy degrades by approximately 5–8 percentage points. Below 10 dB SNR — recordings from a laptop in a room with audible street noise, HVAC, or household activity — accuracy degrades by 10–15 percentage points or more, and the degradation is concentrated on lower-frequency consonants and unstressed syllables, which is exactly where proper nouns and technical acronyms tend to lose their distinguishing phonemes.
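The SNR thresholds above can be checked on a real recording by comparing a stretch of speech against a stretch of room noise from the same file. The following is a minimal sketch, assuming the audio has already been decoded to float samples and that a noise-only segment (for example, the silence before the speaker starts) can be identified by hand. The function names are illustrative, and the 15–20 dB band is labeled as unquantified because the text above gives no figure for it.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech, noise):
    """Estimate SNR in dB from a speech segment and a noise-only segment.

    The speech segment also contains the room noise, so the noise power
    is subtracted from it before taking the ratio.
    """
    p_noise = max(rms(noise) ** 2, 1e-12)
    p_speech = max(rms(speech) ** 2 - p_noise, 1e-12)
    return 10 * math.log10(p_speech / p_noise)


def expected_degradation(snr):
    """Map an SNR value to the accuracy band quoted in the paragraph above."""
    if snr > 20:
        return "near controlled-environment baseline"
    if snr > 15:
        # Assumption: the text quotes no figure for this band.
        return "between baseline and 5-8 points below (not quantified)"
    if snr >= 10:
        return "roughly 5-8 points below baseline"
    return "10-15 points or more below baseline"
```

Running `expected_degradation(snr_db(speech, noise))` on a ten-second sample from each recording is usually enough to tell whether a file belongs in the laptop-microphone band or the desk-microphone band before any transcription is attempted.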
The specific noise sources that appear most frequently in home-office recordings and their approximate levels at the recording position:
- Residential HVAC: 40–50 dB(A) measured at the grille, 35–45 dB(A) at a typical recording position across the room. Constant broadband noise that competes with speaker vocals across the 300–3000 Hz range most relevant to speech intelligibility.
- Street traffic through windows: 45–65 dB(A) at a recording position in a ground-floor or street-facing room, with variable peaks from individual vehicles or buses. The variability is a particular problem for ASR: a consistent noise floor is more manageable than intermittent noise bursts that mask syllables unpredictably.
- Keyboard typing: 30–45 dB(A) at the microphone for laptop keyboards, louder for mechanical keyboards. The clicking transients are in the 2–5 kHz range, which overlaps with fricative consonants (s, f, sh, th) that are critical for distinguishing word-final endings — the difference between "captions" and "caption," "scripts" and "script."
- Household activity (adjacent rooms): 45–65 dB(A) depending on source — dishwashers, televisions, children. Unlike HVAC, these noise sources are not constant: they appear and disappear mid-recording without warning, creating segments where the ASR accuracy drops precipitously for 5–30 seconds before recovering.
- Refrigerator compressor cycling: 40–50 dB(A) in a kitchen-adjacent home office, with a characteristic cycling hum that is unlikely to be well represented as speech interference in Whisper's training data. The low-frequency hum overlaps the lower range of the speaker's voice and can suppress the model's confidence on the phonemes it masks.
Whisper's noise robustness was benchmarked primarily on conversational audio from podcasts, YouTube, and interview recordings — environments where the ambient noise profile is controlled or where the noise is consistent enough that the model can adapt within the first few seconds. Home-office recordings with variable, intermittent noise sources fall outside the distribution that Whisper's training robustness was designed for. The model's accuracy on noise-burst segments (a truck passing mid-sentence) is significantly worse than its accuracy on the same segment in a constant-noise environment.
Consumer versus professional microphone characteristics
The microphone characteristic that matters most for ASR accuracy is not frequency response or dynamic range — it is polar pattern. Professional condenser microphones used in recording studios have cardioid or hypercardioid polar patterns, meaning they are most sensitive to audio from directly in front of the microphone capsule and attenuate sounds from behind and to the sides. A laptop built-in microphone has an approximately omnidirectional polar pattern: it captures audio from all directions with roughly equal sensitivity. The speaker's voice and the ambient noise in the room are captured at similar levels, producing a lower effective SNR than the cardioid microphone would achieve even in the same room.
The practical consequence: the same speaker, in the same room, with the same ambient noise, produces recordings with measurably different SNR depending on the microphone. A cardioid USB desk microphone positioned 15–20 cm from the speaker's mouth typically achieves 20–25 dB SNR in a typical home office. A laptop built-in microphone on the same desk in the same room typically achieves 10–15 dB SNR — a 10 dB SNR difference that translates directly to 5–8 percentage points of baseline ASR accuracy. The microphone choice is the single most controllable variable in home-office recording quality, and it is the intervention that produces the largest accuracy improvement per dollar spent: a $50 cardioid USB microphone closes more of the home-office accuracy gap than any post-processing or model upgrade can.
The second microphone characteristic that affects async training libraries is speaker-position consistency. Professional recording setups keep the microphone at a consistent distance from the speaker. A laptop built-in microphone stays fixed in the laptop while the speaker moves: when the speaker leans forward to demonstrate something on-screen, their mouth gets closer to the microphone and the recording level spikes; when they lean back, it gets quieter. This level variability across a session makes normalization more difficult and creates volume-dependent accuracy variation within a single recording. A section where the speaker is demonstrating a UI workflow from two feet back produces 3–5 dB less signal level than the sections where they are speaking directly to the camera, and that 3–5 dB reduction can push a 12 dB SNR section below the 10 dB threshold where accuracy degradation accelerates.
Room acoustics and reverb artifacts
Hard-surfaced home rooms — hardwood floors, plaster walls, glass windows — produce flutter echo at the dimensions typical of home offices (3–4 meter room depth with reflective surfaces at both ends). Flutter echo appears in the recording as a rapid series of reflections following each syllable, creating a characteristic reverberant quality that adds low-level copies of each phoneme following the direct-path signal by 15–50 milliseconds. Whisper's decoder handles light reverb reasonably well; it handles heavy reverb — recordings with a reverberation time (RT60) above 0.4 seconds — much less well, because the overlapping reflections create acoustic masking of the onset and offset of phonemes that are critical for consonant discrimination.
Room acoustic treatment (acoustic foam panels, bookshelf diffusion, soft furnishings) reduces RT60. A furnished room with books, a couch, and soft flooring typically has RT60 of 0.2–0.35 seconds. An unfurnished home office (bare walls, hardwood floor, no soft furnishings) may have RT60 of 0.5–0.8 seconds. The accuracy impact is approximately 2–4 percentage points on the baseline. Unlike the noise-floor and microphone problems, the room acoustics problem is not practical to fix in post-processing: dereverberation algorithms exist, but they are computationally expensive and introduce their own artifacts. The practical recommendation for room acoustics is preventive rather than remedial: treat the recording room with soft furnishings, or position the microphone close enough to the speaker that the direct-path signal dominates the reflected-path signal by at least 6 dB.
Recording compression artifacts
Loom, Zoom, and Microsoft Teams all compress audio during capture and storage. Loom uses a variable-bitrate AAC codec targeting 32–48 kbps for the audio track of screen recordings. Zoom cloud recordings encode audio at 32 kbps AAC in mono for standard meeting recordings. Microsoft Teams recordings use 32–48 kbps AAC in mono depending on the meeting configuration. These compression levels are appropriate for intelligible communication — a listener can understand a training video recorded at 32 kbps AAC — but they introduce codec artifacts that Whisper's training corpus did not specifically prepare for. AAC at 32 kbps removes frequency content above approximately 8 kHz and introduces pre-ringing artifacts at onset transients. Fricatives (s, f, sh, v, th) are the most affected phoneme class: their characteristic broadband noise pattern is modified by the codec in ways that reduce Whisper's confidence on these phonemes, contributing approximately 1–2 percentage points of additional WER on top of the acoustic environment degradation.
The compression artifact problem is worst for the combination of laptop microphone capture and platform compression: the recording starts with limited high-frequency content (laptop mic rolloff above 8 kHz), the platform compresses at 32 kbps (further attenuating the already-limited high-frequency signal), and Whisper attempts to transcribe the result with degraded fricative and sibilant representation. For Loom recordings specifically, the screen-recording pipeline adds an additional video-mux step that can introduce slight audio-video synchronization drift in longer recordings — typically less than 50ms but occasionally more — which doesn't affect transcription accuracy but does affect caption timing precision if the SRT file is generated from the Loom platform's internal transcript rather than from a clean audio extraction.
Measured accuracy impact by platform and environment
The following accuracy estimates are based on Whisper-large-v2 with no domain vocabulary (general model) on 10-minute training video samples with moderate technical vocabulary density (approximately 8% proper-noun rate). These figures represent baselines; domain glossary injection raises accuracy on specific vocabulary terms while the ambient-noise degradation floors remain:
| Recording environment | Microphone | Platform | Baseline WER | Effective accuracy |
|---|---|---|---|---|
| Controlled studio | Professional condenser (cardioid) | Direct file upload | 3–5% | 95–97% |
| Quiet home office | USB cardioid desk mic | Any (lossless export) | 5–8% | 92–95% |
| Quiet home office | Laptop built-in mic | Loom/Zoom/Teams | 12–16% | 84–88% |
| Active home office (HVAC on) | USB cardioid desk mic | Loom/Zoom/Teams | 9–13% | 87–91% |
| Active home office (HVAC on) | Laptop built-in mic | Loom/Zoom/Teams | 17–22% | 78–83% |
| Noisy home office (street/household noise) | Laptop built-in mic | Loom/Zoom/Teams | 22–30% | 70–78% |
WCAG 2.1 AA does not itself set a numeric accuracy threshold; the 99% accuracy benchmark commonly applied to it corresponds to a 1–5% WER depending on measurement protocol. The gap between the home-office baseline (12–22% WER in typical conditions) and the compliance target (1–5% WER) is 7–21 percentage points (12 − 5 at the narrow end, 22 − 1 at the wide end). A domain glossary with comprehensive vocabulary coverage can close 8–12 percentage points of the gap on domain-specific terms, but only on terms that appear in the glossary. The remaining gap, driven by ambient noise masking and acoustic degradation, must be closed through either source audio remediation or human review. The 99% accuracy benchmark post covers how the DCMP protocol measures this gap and why WER alone understates the compliance risk on technical vocabulary.
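The gap arithmetic can be worked per table row rather than for the library as a whole. The sketch below copies the WER ranges from the table above and compares each against the 1–5% target; the dictionary keys and the function name are illustrative, not part of any tool.

```python
# Baseline WER ranges (percent) copied from the table above.
BASELINE_WER = {
    "studio / condenser": (3, 5),
    "quiet home / USB cardioid": (5, 8),
    "quiet home / laptop mic": (12, 16),
    "active home / USB cardioid": (9, 13),
    "active home / laptop mic": (17, 22),
    "noisy home / laptop mic": (22, 30),
}

TARGET_WER = (1, 5)  # compliance target range quoted in the text


def gap_to_target(baseline, target=TARGET_WER):
    """Return (smallest, largest) gap in percentage points.

    The smallest gap pairs the best baseline with the loosest target;
    the largest gap pairs the worst baseline with the strictest target.
    """
    b_lo, b_hi = baseline
    t_lo, t_hi = target
    return (max(b_lo - t_hi, 0), b_hi - t_lo)


for name, wer_range in BASELINE_WER.items():
    lo, hi = gap_to_target(wer_range)
    print(f"{name}: {lo}-{hi} points to close")
```

Reading the output row by row makes the planning question concrete: a row whose smallest gap already exceeds what a glossary can close needs audio remediation or human review regardless of vocabulary coverage.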
Remediating home-office audio: source versus caption layer
There are two points at which home-office audio quality can be improved: before the recording is made (source remediation) and before the recording is submitted for captioning (pre-processing remediation). Source remediation produces the largest accuracy gains and is the correct long-term investment for any team with recurring recording needs. Pre-processing remediation is the fallback for existing libraries and for team members who cannot or will not implement source changes.
Source remediation: the recording guidance playbook
The highest-ROI intervention for a distributed async training library is a recording guidance document that every team member receives when they are given access to the platform (Loom, Zoom, Vimeo) used for async training production. The guidance should be specific and opinionated: "record from your bedroom with the door closed and a USB microphone" is more actionable than "record in a quiet environment." The following recommendations are listed in order of accuracy impact:
- Microphone (highest impact): A cardioid USB desk microphone positioned 15–20 cm from the speaker, slightly off-axis to reduce plosive artifacts. Specific models that balance cost and accuracy improvement: Blue Snowball ($50), Audio-Technica ATR2100x ($80), Rode NT-USB Mini ($100). Any of these outperforms a laptop built-in microphone substantially. The key characteristic: cardioid pattern + close placement + desk stand or boom arm to maintain consistent distance during the recording.
- Noise control (second highest impact): Turn off HVAC during recording where possible (even a 15-minute window is enough for a typical async training module). Close windows and interior doors. Put phones on silent. If a pet is in the room, move it. The primary sources of variable noise that cause the largest accuracy drops are manageable with environmental awareness; the goal is not a silent room but the elimination of unpredictable noise bursts that Whisper cannot adapt to.
- Room treatment (third highest impact): If the recording room is a hard-walled home office, position the recording setup in the corner of the room with the most soft furnishings behind and to the sides. A bookshelf behind the speaker absorbs reflections effectively. A fabric couch or acoustic curtains on a window wall reduce RT60 meaningfully. The goal is RT60 below 0.35 seconds — achievable in most furnished rooms without dedicated acoustic panels.
- Recording level (moderate impact): Set the recording level so that the speaker's voice peaks at approximately −6 to −3 dBFS during normal speech. Too quiet (below −20 dBFS average) means the noise floor is proportionally louder; too loud (above −3 dBFS with clipping) introduces digital artifacts that degrade transcription accuracy. Both Loom and Zoom have recording level meters in their interfaces; use them before starting.
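The recording-level guidance in the last bullet can be turned into a quick check on a test clip. This is a minimal sketch, assuming the clip has been decoded to float samples in the range −1.0 to 1.0; the thresholds are the ones quoted above, and the function names are illustrative.

```python
import math


def to_dbfs(amplitude):
    """Convert a linear amplitude (1.0 = full scale) to dBFS."""
    return 20 * math.log10(max(amplitude, 1e-12))


def check_level(samples):
    """Return (peak_dbfs, average_dbfs, verdict) for a test clip."""
    peak = to_dbfs(max(abs(s) for s in samples))
    average = to_dbfs(math.sqrt(sum(s * s for s in samples) / len(samples)))
    if peak > -3:
        verdict = "too loud: peaks above -3 dBFS risk clipping"
    elif average < -20:
        verdict = "too quiet: the noise floor is proportionally louder"
    else:
        verdict = "ok"
    return peak, average, verdict
```

A clip that passes this check is not guaranteed to transcribe well, since level says nothing about noise or reverb, but a clip that fails it is worth re-recording before anything else is tried.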
Pre-processing remediation: noise reduction before transcription
When source remediation is not feasible — for existing recordings already in the library, or for team members who will not change their recording setup — noise reduction applied to the audio before transcription submission can recover 3–7 percentage points of baseline accuracy. The key tools and their tradeoffs:
- RNNoise (open source): A recurrent neural network noise suppressor originally developed for real-time communication such as WebRTC calls. Effective for stationary broadband noise (HVAC, fan noise, refrigerator hum), the noise sources that dominate home-office recordings. Limited effectiveness against impulsive noise (keyboard clicks, intermittent household sounds). CPU-efficient; can process a 10-minute audio file in under 30 seconds on standard hardware. Available in ffmpeg as the arnndn filter, which makes it scriptable for batch processing of a large async library.
- Facebook Demucs / HTDemucs (open source): A source-separation model, built for separating music stems, whose vocals stem can be used to separate speech from ambient noise. More computationally expensive than RNNoise (10–30× processing time) but more effective on complex noise profiles including keyboard noise and variable household sounds. Not practical for batch processing of a large library without GPU access; best used for high-priority recordings with severe noise problems that a domain glossary cannot fix.
- Adobe Podcast Enhance (SaaS): Adobe's AI audio enhancement API effectively removes the common home-office noise profiles. Produces perceptually clean audio with minimal speech artifact, which translates to 4–8 percentage points of ASR accuracy improvement in noise-heavy recordings. Requires submission to Adobe's API (data-handling review required for confidential training content). Priced per hour of audio, which makes it practical for a back-catalogue remediation sprint but not for ongoing production without budget allocation.
- Krisp / NVIDIA RTX Voice (real-time noise suppression): These tools suppress noise during recording rather than in post-processing. Krisp runs as a virtual audio device that applications like Zoom and Teams can use as their microphone input. They are most effective on the common home-office noise profiles, and they are transparent to Whisper's transcription pipeline because the recording already contains cleaned audio rather than a compressed noisy recording with post-processing artifacts layered on top. The limitation: they only help recordings made after they are installed; an existing recording has to be re-made to benefit.
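For the RNNoise option, batch processing can be scripted around ffmpeg's `arnndn` filter. The sketch below is one way to do it, assuming ffmpeg is installed with that filter available and that an RNNoise model file has been downloaded separately; the model path, the directory layout, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def denoise_command(src, dst, model):
    """Build an ffmpeg command that denoises the audio track of one recording."""
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-vn",                       # drop the video stream, keep audio only
        "-af", f"arnndn=m={model}",  # RNNoise-based suppression (model path is an assumption)
        "-ac", "1", "-ar", "16000",  # mono 16 kHz, the format Whisper works on internally
        str(dst),
    ]


def denoise_library(src_dir, dst_dir, model, run=subprocess.run):
    """Denoise every MP4 in src_dir into a WAV of the same name in dst_dir.

    The run parameter exists so the loop can be dry-run or tested
    without ffmpeg installed.
    """
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sorted(Path(src_dir).glob("*.mp4")):
        dst = dst_dir / (src.stem + ".wav")
        run(denoise_command(src, dst, model), check=True)
        outputs.append(dst)
    return outputs
```

A call such as `denoise_library("recordings", "recordings_clean", "models/speech.rnnn")` would process a folder in one pass. Model paths containing colons or backslashes need escaping in ffmpeg filter syntax, which this sketch does not handle.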
For an existing async training library with a mix of audio quality levels, the practical approach is to triage before processing: run a quick accuracy assessment on a sample from each recording (10 minutes of audio, check the resulting WER), categorize into high (WER below 10%), medium (WER 10–20%), and low (WER above 20%) quality tiers, and apply noise reduction selectively to the medium and low tiers where the processing cost is justified by the accuracy gain. The LMS caption audit methodology covers the inventory and triage process for existing video libraries.
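The triage step can be sketched as a word error rate calculation followed by a tier lookup. This assumes a human-corrected reference transcript exists for the sample, which is an extra step the paragraph above implies but does not spell out. The normalization here (lowercasing and whitespace splitting) is deliberately crude, and the function names are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Standard dynamic-programming edit distance, one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / max(len(ref), 1)


def quality_tier(wer):
    """Tiers from the text: high below 10%, medium 10-20%, low above 20%."""
    if wer < 0.10:
        return "high"
    if wer <= 0.20:
        return "medium"
    return "low"
```

For example, `quality_tier(word_error_rate(corrected_sample, asr_sample))` returns the tier for one recording, and only the medium and low results would be queued for noise reduction.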
The vocabulary accuracy gap that noise reduction doesn't close
Noise reduction improves the baseline acoustic accuracy — it reduces the WER on clean general-vocabulary speech. It does not improve vocabulary-specific accuracy for domain terms that are outside Whisper's training vocabulary. A denoised recording of an engineer explaining a CI/CD pipeline still requires a domain glossary to correctly transcribe "Kubernetes," "Helm chart," "ArgoCD," and "Terraform apply" — because those terms are rare in the general ASR training corpus regardless of audio quality. The correct model is: noise reduction handles the acoustic problem (bringing the recording closer to the conditions Whisper was trained on), and domain glossary injection handles the vocabulary problem (biasing the decoder toward the terms specific to the training content). The two interventions are additive. A denoised recording with a domain glossary achieves substantially better accuracy than either intervention alone on recordings with both acoustic degradation and technical vocabulary density. The glossary-biased decoding post covers how glossary injection works and the vocabulary term selection methodology in detail.
Loom caption workflow
Loom is the dominant platform for async screen recording in remote teams: product walkthroughs, process documentation, design reviews, onboarding for specific tools. It is not a traditional LMS platform; it is a hosting platform with a shareable link, and training content lives in Loom until it is either embedded in an LMS or downloaded and re-uploaded to a dedicated LMS. The caption workflow depends on whether the Loom video is being used in-place (shared as a Loom link) or as a source file for LMS upload.
Loom's auto-transcript and its limitations
Loom generates an automatic transcript for every video using a general-purpose speech-to-text model. The transcript is searchable in Loom's interface and is used for Loom's in-platform caption display. The accuracy of the Loom auto-transcript is approximately equivalent to Whisper-medium on general English speech — adequate for conversational walkthroughs, problematic for technical content with domain-specific vocabulary. Loom does not support custom vocabulary, custom language models, or glossary injection at the platform level. Every Loom recording is transcribed with the same general model regardless of the content's vocabulary density. For a product manager walking through a new feature in plain English, the Loom auto-transcript may achieve 90–95% accuracy. For a software engineer explaining a new API endpoint with SDK function names, library names, and configuration syntax, the Loom auto-transcript may achieve 75–85% accuracy on the technical terms specifically.
The Loom auto-transcript is also generated immediately on upload, before any post-processing or noise reduction can be applied. If the recording has home-office audio degradation, the transcript is generated against the raw audio. There is no path to re-generate the Loom transcript after audio correction without re-uploading the recording.
Editing the Loom transcript
Loom Business and Enterprise plans allow transcript editing in the Loom interface. The editor shows the transcript text synchronized with video playback; clicking a word in the transcript jumps playback to that point. The correction workflow: play through the recording, pause at errors, correct in the editor. The edited transcript is stored in Loom and updates the in-platform caption display immediately on save. This is the correct path for correcting domain-specific vocabulary errors in a Loom caption without going through an external captioning workflow — it is faster than exporting, correcting in an SRT editor, and re-uploading.
The limitation of in-Loom editing: it cannot add speaker labels or formatting (line breaks, paragraph structure) beyond what the automatic transcript produces. For short walkthroughs (under 5 minutes), in-Loom editing is the most efficient path. For longer recordings (10+ minutes) or recordings with substantial accuracy problems, export and re-upload is faster because external SRT editors allow find-and-replace for systematic errors (all instances of "cubernetes" → "Kubernetes") that in-Loom editing requires fixing one at a time.
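The find-and-replace pass on an exported SRT can be scripted so the same correction list is reused across recordings. The following is a minimal sketch: the correction dictionary is an example, and the function only touches caption text lines, leaving cue numbers and timestamps untouched.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def fix_terms(srt_text, corrections):
    """Apply whole-word, case-insensitive corrections to caption text lines."""
    fixed = []
    for line in srt_text.splitlines():
        # Skip cue numbers and timing lines so they are never altered.
        if line.strip().isdigit() or TIMING.match(line):
            fixed.append(line)
            continue
        for wrong, right in corrections.items():
            pattern = rf"\b{re.escape(wrong)}\b"
            # A function replacement avoids backslash handling in `right`.
            line = re.sub(pattern, lambda _m, r=right: r, line, flags=re.IGNORECASE)
        fixed.append(line)
    return "\n".join(fixed) + "\n"


CORRECTIONS = {"cubernetes": "Kubernetes", "argo cd": "ArgoCD"}  # example entries
```

Keeping the correction list in version control alongside the glossary means each systematic error only has to be discovered once.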
SRT export and LMS upload
Loom Business and Enterprise plans support SRT export from the video's detail page (the three-dot menu → "Download transcript" → "SRT"). The SRT file is generated from Loom's internal transcript, including any corrections made via the in-platform editor. The timestamp precision of the exported SRT is approximately 0.5–1.0 second, which is adequate for most training content but may not meet DCMP precision standards for content requiring exact word-level synchronization.
The SRT export workflow for LMS upload:
- Review and correct the Loom transcript in-platform (or download the SRT, correct in an external editor, and re-upload — see below).
- Download the Loom video file (Business plan or above: "Download video" → original quality; this produces an MP4 with the original audio track, without the additional compression applied to the Loom stream).
- Upload the MP4 to the LMS.
- Upload the corrected SRT file to the LMS as the caption track for the uploaded video asset.
- Set the caption track language code (typically en-US or en-GB), label ("Captions"), and default-on status in the LMS caption track settings.
The common mistake: uploading the Loom embed link to the LMS rather than the MP4 file. Loom embeds carry the Loom auto-captions when the Loom player is used. When the LMS wraps a Loom embed in its own player, the Loom captions may not display — the LMS sees an embed, not a video asset, and has no caption track to serve. The LMS cannot apply its WCAG-compliant caption display settings to an embedded third-party player. Always download and re-upload the video file for any training content that will be delivered through an LMS, rather than embedding the Loom link.
Loom webhook for automated caption triggering
Loom's developer API includes a video.created webhook event that fires when a new Loom recording is uploaded. For teams that want to move from reactive (manually submitted caption jobs) to proactive (automatically triggered caption jobs), a Loom webhook → captioning workflow automation is the correct architectural intervention. The webhook payload includes the video ID, the creator's email, and the Loom video URL; a downstream automation can use the Loom API to download the audio or video file, submit it to the captioning workflow, and upload the corrected SRT back to Loom once the job is complete. Implementing this automation requires Loom Business or Enterprise API access and a captioning vendor that supports programmatic job submission — which is the difference between a caption operation that closes the distributed-production gap and one that continues to miss recordings from non-L&D producers.
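As a rough sketch of the first stage of that automation, the handler below turns a webhook body into a caption job record. The JSON field names (`event`, `data`, `video_id`, `creator_email`, `video_url`) and the `CaptionJob` type are assumptions made for illustration; the text above only states which pieces of information the payload carries, so check the actual payload schema before relying on any of these keys.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaptionJob:
    video_id: str
    creator_email: str
    video_url: str
    glossary_id: str


def job_from_webhook(body: str, glossary_id: str = "master-glossary") -> Optional[CaptionJob]:
    """Build a caption job from a webhook body, or return None for other events.

    All payload keys used here are hypothetical placeholders.
    """
    event = json.loads(body)
    if event.get("event") != "video.created":
        return None
    data = event.get("data", {})
    required = ("video_id", "creator_email", "video_url")
    if any(not data.get(key) for key in required):
        # Incomplete payload: skip rather than submit a broken job.
        return None
    return CaptionJob(
        video_id=data["video_id"],
        creator_email=data["creator_email"],
        video_url=data["video_url"],
        glossary_id=glossary_id,
    )
```

The returned job would then be handed to whatever submission call the captioning vendor exposes; that step is vendor-specific and is deliberately left out of the sketch.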
Zoom recording caption workflow
Zoom is used for two distinct types of training video in remote and hybrid teams: live training sessions (webinars, instructor-led training sessions) that are recorded for async replay, and purpose-built async recordings where the presenter records to their local device or Zoom cloud with no live audience. Both produce Zoom recordings that require captioning, but they have different audio characteristics and different production contexts.
Zoom cloud recording versus local recording
Zoom offers two recording paths: cloud recording (saved to the Zoom cloud and accessible from the Zoom web portal) and local recording (saved to the local device as an MP4). The caption workflow differs:
Cloud recordings: Zoom can generate automatic transcriptions for cloud recordings if the "Automated captions" or "Cloud recording transcription" feature is enabled at the account level (Settings → Recording → Cloud recording transcription). Enabling this feature causes Zoom to generate a VTT transcript file alongside the MP4 recording in the Zoom cloud. The transcript is powered by a third-party ASR service and achieves approximately 85–90% accuracy on general English speech with no domain vocabulary support. The VTT file is downloadable from the recording detail page in the Zoom web portal (Recording → select recording → "Audio transcript" → Download).
Local recordings: Zoom does not automatically generate captions for local recordings. The MP4 file must be submitted to an external captioning workflow. The audio quality of local Zoom recordings is typically better than cloud recordings because local recordings are not compressed through Zoom's cloud recording pipeline — the raw audio from the selected input device is encoded directly to the local file, whereas cloud recordings undergo an additional server-side transcoding step. If the team has the option of cloud versus local recording for async training content, local recording produces higher-quality source audio for captioning.
Zoom audio quality considerations
Zoom's meeting audio undergoes a proprietary noise suppression and echo cancellation pipeline that runs in real time during the call. This pipeline is beneficial for intelligibility during live meetings but can introduce artifacts in the recorded audio that affect ASR accuracy. Specifically: Zoom's background noise suppression (the "Suppress persistent background noise" setting) uses a noise reduction algorithm that can occasionally remove or attenuate speech segments that it misclassifies as background noise, particularly for speakers with softer voices or accents that the noise suppression model was not trained on. The "High" setting for background noise suppression is the most aggressive and the most likely to produce artifacts, and the default "Auto" setting applies suppression as Zoom judges necessary, which makes its effect on a given recording harder to predict; the "Low" setting is recommended for training recordings where transcription accuracy is a priority.
Zoom also applies echo cancellation that can cause issues when the recording includes screen-sharing audio (system audio through speakers while the microphone is active). If the training recording includes a demo with sound effects, narrated software walkthroughs, or any system audio, the echo cancellation may suppress portions of the system audio or create artifacts where the microphone picks up the speaker audio played through speakers simultaneously with the microphone input. For training recordings that include system audio, using Zoom's "Share computer sound" option (which routes system audio directly to the recording rather than through the microphone) produces better transcription accuracy on the system audio segments.
Downloading the transcript from Zoom and uploading to LMS
For cloud recordings with automatic transcription enabled, the transcript download is available from the Zoom web portal recording detail page. Zoom delivers this file as VTT, not SRT, typically named "{meeting_topic}_{date}.vtt". Both SRT and VTT are supported by most LMS platforms, but confirm the destination LMS's accepted formats before uploading. If the LMS requires SRT (SAP Litmos and some Cornerstone configurations accept only SRT), use an online converter or ffmpeg to convert the VTT to SRT before upload.
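Where an online converter is unsuitable for confidential recordings and ffmpeg is not installed, the VTT-to-SRT conversion can be approximated locally. The sketch below is a minimal converter under stated assumptions: it handles plain cues only, drops cue settings and NOTE/STYLE blocks, and leaves any inline voice tags in the text as-is. The function name `vtt_to_srt` is illustrative.

```python
import re

# WebVTT timing line; the hours field is optional in WebVTT but required in SRT.
VTT_TIMING = re.compile(
    r"^(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})"
)


def vtt_to_srt(vtt_text: str) -> str:
    """Convert simple WebVTT text to SRT: renumber cues, use comma milliseconds."""
    blocks = re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n").strip())
    cues = []
    for block in blocks:
        lines = block.split("\n")
        # Blocks without a timing line (WEBVTT header, NOTE, STYLE) are skipped.
        for i, line in enumerate(lines):
            m = VTT_TIMING.match(line.strip())
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = m.groups()
                start = f"{int(h1 or 0):02d}:{m1}:{s1},{ms1}"
                end = f"{int(h2 or 0):02d}:{m2}:{s2},{ms2}"
                cues.append((start, end, "\n".join(lines[i + 1:])))
                break
    return "\n\n".join(
        f"{n}\n{start} --> {end}\n{text}" for n, (start, end, text) in enumerate(cues, 1)
    ) + "\n"
```

Spot-check the first and last cues of the converted file against the video before upload, since a malformed source cue is silently skipped by this sketch.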
The Zoom VTT includes speaker labels from the meeting participant list if "Identify speakers in CC/transcript" is enabled in Zoom settings. Speaker labels in the VTT carry into the LMS caption display, which can be valuable for multi-presenter recordings but can also introduce errors if Zoom's speaker identification is imprecise (a common issue when multiple participants speak from the same location or when participant names are not consistently set). Review the speaker labels during the accuracy spot-check and correct any mis-labeled speaker segments before LMS upload.
For recordings without automatic transcription, or for recordings where the automatic transcript accuracy is below the 99% WCAG standard, the workflow is: download the Zoom cloud recording MP4 → submit to external captioning workflow → receive corrected SRT/VTT → upload to LMS with track metadata (language code, label, default-on). The vendor RFP playbook covers how to evaluate captioning vendors on accuracy, turnaround, and glossary support for exactly this workflow.
Zoom-to-LMS delivery paths
Zoom recordings can reach learners through several delivery paths, each with different caption-handling characteristics:
- Direct Zoom share link: Learners click a link to the Zoom recording page. Captions display in the Zoom player using the VTT transcript. This delivery path is not LMS-trackable (no completion data) and is not recommended for compliance training.
- Embedded Zoom recording in LMS: The Zoom recording page can be embedded in an LMS content object via iframe. Caption display depends on the Zoom player within the iframe. LMS players do not have access to the Zoom caption track. This delivery path has the same caption control limitations as Loom embed — the LMS cannot serve its own caption track because the content is served by Zoom's player. Not recommended for WCAG-compliance-critical content.
- MP4 download and LMS upload: The preferred path for compliance training. Download the MP4 from Zoom → upload to LMS → upload validated SRT/VTT caption file → set track metadata. The LMS serves the caption file with full control over language code, label, default-on status, and player caption display settings. This is the only delivery path that meets WCAG 2.1 AA SC 1.2.2 for employer-controlled caption compliance.
Microsoft Teams and Stream caption workflow
Microsoft Teams recordings and Microsoft Stream represent the dominant video infrastructure for remote and hybrid teams in Microsoft 365 environments. The Teams recording → Stream storage → SharePoint/LMS delivery path is the standard enterprise async video workflow, and it has more caption-related moving parts than any other platform combination in this guide.
Teams recording → Stream storage
Teams meeting recordings (both scheduled meetings and ad-hoc calls with recording enabled) are stored in OneDrive (for meetings initiated by an individual) or SharePoint (for channel meetings). Stream is Microsoft's video portal built into Microsoft 365; it provides a video player with caption support, transcript display, and search-within-video functionality. All Teams recordings in Microsoft 365 (post-2021 architecture, where recordings no longer go to the old Classic Stream) are stored as MP4 files in OneDrive or SharePoint and played through the Stream player when accessed from within Microsoft 365 applications.
Teams generates automatic captions for recordings through Azure Speech Services. The auto-transcript is generated asynchronously after the recording ends, typically within 1–4 hours depending on recording length and service load. The auto-transcript appears in Stream as a searchable transcript and is used for Stream's in-player caption display. Like Loom and Zoom, the Teams/Stream auto-captions use Azure Speech's general model without custom vocabulary by default. The accuracy is approximately 85–90% on general business English and degrades on domain-specific vocabulary in the same pattern as other platforms — adequate for general conversation, problematic for technical training content.
Microsoft Stream VTT upload workflow
Microsoft Stream supports uploading a custom VTT caption file to replace or supplement the auto-generated transcript. The upload path: Open the video in Stream → Edit (pencil icon or "More options" → "Update video details") → Captions tab → Add captions → Upload a .vtt file. The uploaded VTT file is stored alongside the video in SharePoint and served by the Stream player as the caption track. If both an auto-generated transcript and an uploaded VTT file exist for the same video, Stream displays the uploaded file as the default caption track (it takes precedence over the auto-transcript). Uploaded VTT files can be set to specific language codes and are displayed in the Stream player's language-selection menu.
The VTT format requirements for Stream: standard WebVTT format with WEBVTT header, cue identifiers optional, timestamps in hh:mm:ss.mmm format, line breaks between cues. Stream does not support SRT directly for caption upload — SRT files must be converted to VTT before upload. The ffmpeg conversion command: ffmpeg -i input.srt output.vtt. Alternatively, online converters are available, but data-handling considerations apply for confidential training content.
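A quick pre-upload check against the format requirements listed above can catch the most common failure, which is an SRT file renamed to .vtt. The sketch below is a minimal checker, not a full WebVTT validator: `check_vtt` is an illustrative name, and it only tests the header, the hh:mm:ss.mmm timestamp shape, and blank-line separation between cues.

```python
import re

# Timing line in the hh:mm:ss.mmm form described above.
CUE_TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")


def check_vtt(vtt_text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    lines = vtt_text.replace("\r\n", "\n").split("\n")
    if not lines[0].lstrip("\ufeff").startswith("WEBVTT"):
        problems.append("missing WEBVTT header on the first line")
    cue_count = 0
    for idx, line in enumerate(lines):
        if "-->" not in line:
            continue
        cue_count += 1
        if not CUE_TIMING.match(line.strip()):
            problems.append(f"line {idx + 1}: timestamp not in hh:mm:ss.mmm format")
        # The line before a timing line is either blank or a cue identifier;
        # if it is an identifier, the line before that must be blank.
        if idx >= 2 and lines[idx - 1].strip() and lines[idx - 2].strip():
            problems.append(f"line {idx + 1}: cue is not separated by a blank line")
    if cue_count == 0:
        problems.append("no cues found")
    return problems
```

An empty result does not guarantee that the destination player will accept the file; it only rules out the structural problems listed in the paragraph above.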
One important limitation: the Stream VTT upload is per-video at the SharePoint item level. If the same recording is stored in multiple SharePoint locations (due to channel-meeting recording routing), the VTT must be uploaded separately for each copy. Teams channel meeting recordings are stored in the SharePoint document library for the channel; personal meeting recordings are stored in the recorder's OneDrive. A recording of a training session that is later shared to multiple Teams channels creates multiple SharePoint copies, each requiring a separate VTT upload for the caption track to appear in all delivery locations. This is a structural limitation of the Teams/SharePoint storage model that has no automated workaround — the operations team must track which copies exist and ensure each has a validated caption track.
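Because each copy needs its own caption file, the tracking task reduces to an inventory keyed by recording. The sketch below assumes the operations team maintains that inventory by hand or exports it from a sweep; the `RecordingCopy` type, its field names, and the example locations are hypothetical and do not correspond to any SharePoint API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingCopy:
    recording_id: str        # internal identifier for the training session
    location: str            # where this copy lives, e.g. a library path
    has_validated_vtt: bool  # True once a corrected VTT has been uploaded


def copies_missing_captions(copies: list[RecordingCopy]) -> dict[str, list[str]]:
    """Group the locations that still lack a validated caption file by recording."""
    missing = defaultdict(list)
    for copy in copies:
        if not copy.has_validated_vtt:
            missing[copy.recording_id].append(copy.location)
    return dict(missing)
```

A recording only counts as captioned when it no longer appears in this result at all, which is a stricter test than checking the original copy alone.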
Microsoft 365 Custom Speech and vocabulary support
Azure Custom Speech Service, available as part of Microsoft Azure Cognitive Services, allows creation of custom language models and pronunciation dictionaries for Azure Speech transcription. In theory, this should enable domain vocabulary support for Teams auto-transcription. In practice, the integration path between Custom Speech and Teams/Stream transcription requires Microsoft 365 E5 licensing with Teams Premium and enterprise-level Azure subscription configuration. For most SMB and mid-market organizations on E3 licensing, custom vocabulary for Teams auto-transcription is not available without a significant licensing upgrade.
The practical workaround for organizations on E3 or below: treat Teams auto-captions as a first-draft transcript, download the VTT from Stream, submit it to an external captioning workflow with domain glossary support, and upload the corrected VTT back to Stream. This is a three-step process (auto-generate → externally correct → re-upload) that achieves 99%+ accuracy on technical vocabulary without requiring the E5 licensing upgrade. The volume overhead (one additional upload step per recording) is manageable with a systematic workflow but breaks down for teams producing 20+ recordings per week without a defined caption operations role.
Viva Learning and SharePoint video delivery
Microsoft Viva Learning surfaces content from SharePoint to learners in the Microsoft 365 employee experience platform. Training videos hosted in SharePoint (whether from Teams recordings or direct SharePoint uploads) are accessible through Viva Learning. The caption track for Viva Learning delivery is served from the Stream player's caption configuration — meaning the VTT file uploaded to the SharePoint-hosted Stream video appears as the caption track in Viva Learning as well. There is no separate caption upload path for Viva Learning; it inherits from Stream. This is a simplification relative to traditional LMS delivery (one platform to configure instead of two), but it means that caption quality in Viva Learning is entirely dependent on the Stream VTT upload workflow being executed correctly for every video asset.
Microsoft Teams captions for recorded training: the compliance workflow end-to-end
The full end-to-end caption compliance workflow for Microsoft Teams training recordings in a standard Microsoft 365 E3 environment:
- Recording is made in Teams → stored in OneDrive (personal meeting) or SharePoint (channel meeting).
- Teams/Stream generates auto-transcript asynchronously (1–4 hours after recording ends).
- Caption operations receives notification of new recording (manual monitoring, SharePoint event webhook, or automation via Power Automate).
- Caption operations downloads the auto-VTT from Stream (Edit video → Captions → Download), submits to external captioning workflow with domain glossary.
- External captioning workflow returns corrected VTT with 99%+ accuracy on domain vocabulary.
- Caption operations uploads corrected VTT to Stream (Edit video → Captions → Upload VTT), sets language code en-US, confirms caption display in Stream player.
- If the recording is also accessible via Viva Learning, confirm caption display in Viva Learning (the Stream VTT update propagates automatically in most tenants within 15–60 minutes).
The critical delay in this workflow is step 3 — notification of a new recording. If the caption operations role is monitoring a dedicated SharePoint library that all training recordings are routed to, this step is a daily or weekly review of new uploads. If training recordings are distributed across individual OneDrive libraries (because personal meeting recordings do not automatically route to a central SharePoint location), step 3 requires either a Power Automate flow that monitors OneDrive event webhooks for new MP4 files and notifies the caption operations team, or a manual weekly sweep of known recording libraries. The structural difference between a distributed-individual and a centralized-library architecture is the most common source of uncaptioned Teams recordings in production Microsoft 365 environments.
Vimeo and Wistia caption workflows
Vimeo and Wistia serve different use cases in the remote training video ecosystem. Vimeo is used primarily for semi-public async content — company all-hands recordings, product demo libraries, customer-facing training in access-controlled portals. Wistia is used primarily for marketing-facing and customer education video, but is also adopted by L&D teams for its analytics and lead capture features on training content. Both platforms support SRT/VTT caption upload with consistent caption delivery, and both are substantially simpler to caption correctly than Teams/Stream because they do not have the multi-copy SharePoint routing problem.
Vimeo caption workflow
Vimeo Business and higher plans support automated caption generation using a third-party ASR service. Vimeo does not publicly disclose the provider, and it appears to have varied across plan tiers (Verbit for some; a generic Whisper-based service for others). Vimeo auto-captions have accuracy roughly equivalent to Zoom's auto-transcription on general English: 85–92% on clean general speech, significantly lower on technical vocabulary. The auto-captions are a starting point, not a compliance-ready artifact.
The Vimeo SRT/VTT upload workflow: Video Manager → select video → Captions → Add caption file → select language → upload .srt or .vtt file → Save. Vimeo accepts both SRT and VTT formats. The uploaded caption file is served by the Vimeo player for both direct Vimeo viewers and embedded Vimeo players (Vimeo embeds carry the caption track regardless of the page the embed appears on, which is the correct behavior for LMS embeds). Vimeo's player has a "CC" button for learner caption toggle; the caption display default (on or off) is configurable in the video settings.
One Vimeo-specific consideration: Vimeo's privacy settings affect whether the video and its captions are accessible to external users. Videos set to "Only me", or restricted to a domain allow-list, will not play (and so will not display captions) for users who are not authenticated to the correct Vimeo account or viewing from an allowed domain. For training videos delivered through a Vimeo embed in an LMS, confirm that the Vimeo video's privacy settings allow the domain of the LMS to display the video and its captions. A common error: the Vimeo video is domain-restricted to the company's main website domain, and the LMS subdomain (learn.companyname.com) is not in the allow-list, so the embedded player shows a permission error to learners rather than the training video.
Wistia caption workflow
Wistia supports SRT upload via the video's Advanced settings (Video Details → Advanced → Captions → Upload caption file). Wistia also offers auto-captioning powered by an ASR provider for Business plan accounts. Like other platform auto-captions, Wistia's auto-captions do not support domain vocabulary customization and achieve general-English accuracy levels.
Wistia's distinctive feature for training content is its Turnstile and heat map analytics: the platform can track exactly which viewers watched which segments, enabling L&D teams to identify learners who skipped through a training module rather than watching it fully. For training content with compliance certification requirements (where a learner must watch the full module to be certified), Wistia's engagement analytics provide the documentation that course completion alone doesn't. The caption workflow for Wistia follows the same SRT-upload pattern as Vimeo; the compliance documentation value comes from Wistia's analytics layer on top of the captioned content.
Wistia embeds carry captions in the Wistia player. For LMS delivery of Wistia-hosted content, the same guidance as for Vimeo embeds applies: the embed carries the Wistia caption track, and the LMS cannot apply its own caption layer to a Wistia embed. For strict WCAG compliance delivery through an LMS player with LMS-controlled caption settings, download the MP4 from Wistia (available on Business plan) and upload it directly to the LMS with the SRT file as the caption track.
Glossary architecture for distributed async training libraries
The glossary architecture for a centralized training video library (one or two L&D professionals producing all content in a consistent recording environment) is well-established: build a per-company vocabulary model around the three to five categories of domain terms most likely to cause ASR errors, seed it with 60–150 terms, and update quarterly. The customer glossary architecture post covers this in full. Remote and hybrid async libraries present three structural differences that require adaptation of this architecture.
Speaker diversity and accent variation
A centralized training library produced by two L&D professionals involves two speaker voice profiles. Whisper's decoder can implicitly adapt to these profiles within a few minutes of a recording, biasing toward the phoneme patterns characteristic of those speakers. A distributed async library produced by twenty subject-matter experts across five time zones involves twenty voice profiles, five or more regional accents, and variable recording equipment setups. Each recording is essentially a fresh acoustic context for Whisper.
Speaker diversity affects glossary performance in a specific way: glossary terms are injected into Whisper's decoding process as lexical constraints that boost the probability of specific token sequences. These constraints are most effective when the acoustic evidence for the term is strong (clear pronunciation, high SNR). When the acoustic evidence is weak (unfamiliar accent, poor recording quality, fast speech rate), the glossary constraint competes with phonetically plausible alternatives and may not win. A speaker who pronounces "Kubernetes" as "koo-bur-neh-tees" (Spanish-accented) produces phonemic evidence that maps less cleanly to the glossary term's pronunciation model than a speaker who pronounces it as "kyoo-bur-NEE-tees" (standard American pronunciation). The glossary improves accuracy on both speakers, but the improvement is larger for the speaker whose pronunciation aligns with the acoustic model the term was phonemized under.
The architectural response to speaker diversity: build the glossary with phoneme variant entries for terms that commonly have accent-dependent pronunciation variation. For a distributed team with significant accent diversity, each glossary entry should include two to three pronunciation variants rather than one. Most captioning platforms that support glossary injection allow phoneme-level pronunciation guidance (IPA or platform-specific phoneme notation) per term — use it for the 20–30 terms with the highest mispronunciation rate in the library's accuracy spot-checks.
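One way to keep pronunciation variants auditable is to store them alongside each term and flag high-priority terms that still have fewer than two. The structure below is a sketch under assumptions: `GlossaryEntry` and `needs_more_variants` are illustrative names, the variants are stored as informal respellings like the ones quoted above, and a real vendor integration would need them translated into that vendor's IPA or phoneme notation.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryEntry:
    term: str
    category: str
    # Informal respellings here; a vendor may require IPA or its own notation.
    pronunciations: list[str] = field(default_factory=list)


def needs_more_variants(
    entries: list[GlossaryEntry], priority_terms: list[str], minimum: int = 2
) -> list[str]:
    """Return priority terms that are missing or have fewer than `minimum` variants."""
    by_term = {entry.term: entry for entry in entries}
    flagged = []
    for term in priority_terms:
        entry = by_term.get(term)
        if entry is None or len(entry.pronunciations) < minimum:
            flagged.append(term)
    return flagged


kubernetes = GlossaryEntry(
    term="Kubernetes",
    category="infrastructure",
    pronunciations=["kyoo-bur-NEE-tees", "koo-bur-neh-tees"],
)
```

Feeding the function the 20–30 terms with the highest mispronunciation rate from the spot-checks gives a short worklist of entries that still need variants.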
Vocabulary consistency across producers
A distributed async library produced by twenty subject-matter experts will have vocabulary inconsistency that a centralized library doesn't: the engineering team calls the deployment process "pushing to prod"; the DevOps team calls it "deploying to production"; the documentation team uses "release to the production environment." These are the same concept expressed in three different ways, and a glossary built around "pushing to prod" will not help the accuracy of recordings that use the DevOps team's phrasing. For a distributed async library, the glossary must cover the vocabulary as speakers actually use it, not as a single L&D writer would standardize it.
The practical approach: build the initial glossary from text sources (internal documentation, Slack messages, product documentation, existing training scripts) and then run accuracy spot-checks on recordings from each major contributor group. The spot-check results will surface producer-specific vocabulary that the documentation-based glossary doesn't cover. Add these terms to the glossary as a second pass, explicitly noting which producer groups use which terms. The glossary becomes a de facto vocabulary audit for the organization's internal terminology, which has secondary value beyond caption accuracy.
Glossary update cadence for product-dependent content
Remote and hybrid teams produce training video at higher frequency than centralized production models, partly because the barrier to recording is lower (open Loom, start recording) and partly because the distributed expertise means more people producing content in their domain. Higher-frequency production means faster obsolescence of domain vocabulary: a product that ships a new feature quarterly produces quarterly vocabulary additions, and the glossary that was accurate for last quarter's recordings may miss 5–10 new terms in this quarter's recordings. The quarterly glossary sweep cadence that works for a centralized production model may need to be a monthly sweep for a high-frequency distributed production library. For organizations with weekly product releases, the glossary should be updated as part of the release process — the same week that new feature names and SDK changes are published internally, they should be added to the captioning glossary. The sales enablement captioning post covers the 48-hour SLA pattern for glossary updates tied to product release cycles.
Multi-platform glossary management
A distributed async library may use Loom for engineering walkthroughs, Zoom for recorded training sessions, Teams for compliance meeting recordings, and Vimeo for polished product demos — four platforms, four caption workflows, and potentially four separate glossary configurations if each platform has its own captioning vendor integration. The architectural goal is a single source-of-truth glossary that is consumed by all captioning workflows regardless of the source platform. In practice: maintain the master glossary in a shared document (Notion, Confluence, or a versioned CSV in a shared drive), and update each captioning vendor's vocabulary model from that master document on the quarterly (or monthly) sweep schedule. If different recording types use different captioning vendors (some teams prefer Loom's internal workflow; others submit to an external vendor), each vendor's vocabulary model should be synchronized with the same master glossary. Glossary divergence across vendors — where the Loom-internal model has been updated but the external vendor's model hasn't — is the most common source of inconsistent domain-term accuracy across a distributed library.
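If the master glossary lives in a versioned CSV, divergence can be detected by comparing it against each vendor's exported term list on every sweep. The sketch below assumes a CSV with a `term` column and assumes each vendor's current vocabulary can be exported as a set of strings; the column name, vendor keys, and function names are all hypothetical.

```python
import csv
import io


def load_master_terms(csv_text: str) -> set[str]:
    """Read the master glossary CSV (assumed to have a 'term' column)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row.get("term") or "").strip() for row in reader} - {""}


def glossary_drift(
    master: set[str], vendor_terms: dict[str, set[str]]
) -> dict[str, dict[str, list[str]]]:
    """For each vendor, list terms missing from it and terms it has that the master lacks."""
    return {
        vendor: {
            "missing": sorted(master - terms),
            "extra": sorted(terms - master),
        }
        for vendor, terms in vendor_terms.items()
    }
```

An "extra" term is not necessarily an error: it may be a producer-specific addition that should be promoted into the master document rather than deleted from the vendor model.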
WCAG and ADA compliance for remote training video
The compliance obligations for training video do not have a remote-work exception. The shift to remote and hybrid work changed the production context but not the legal framework. WCAG 2.1 AA SC 1.2.2 ("Captions (Prerecorded)") requires synchronized captions for all prerecorded audio-video content — the delivery environment (LMS, Vimeo, Loom) does not affect the obligation. ADA Title I requires employers to provide effective communication to qualified individuals with disabilities, including for training and professional development materials. The shift to async video delivery makes this obligation broader, not narrower: there are more training videos than before, and each one is a separate captioning compliance obligation.
ADA Title I and remote work accommodation
ADA Title I covers employers with 15 or more employees. For employees with hearing disabilities — whether in-office, remote, or hybrid — employers must provide reasonable accommodations that enable effective participation in training and professional development. Before 2020, the accommodation for a hearing-disabled employee in an uncaptioned video training context was often a human interpreter or a transcript. In a fully async remote environment where training is delivered through a library of recorded videos without a live equivalent, the "interpreter on request" accommodation model fails: the employee cannot participate in the same training at the same time as their colleagues, which constitutes inequitable access to professional development. The DoJ's position in recent enforcement guidance is that synchronized captions — not transcripts, not on-request accommodations — are the expected format for prerecorded video training content made available to all employees, regardless of their disability status or remote/in-office status.
The practical implication: every training video in the async library that is accessible to employees, whether it is mandatory compliance training, optional skill development, or informational product updates, requires synchronized captions that meet WCAG SC 1.2.2. The scope is broader than the compliance-mandatory training content that most organizations prioritize. An employee with hearing loss who cannot effectively access the optional product training videos that their hearing colleagues are using to get up to speed on new features has a documented inequitable professional-development access situation regardless of whether the content is technically "mandatory."
WCAG SC 1.2.2 in the remote delivery context
WCAG SC 1.2.2 requires "Captions are provided for all prerecorded audio content in synchronized media, except when the media is a media alternative for text and is clearly labeled as such." The key requirements: synchronized (captions must appear at the same time as the corresponding audio, not as a separately linked transcript), for all prerecorded content (not just mandatory training, not just new content), in synchronized media (video with an audio track, which covers Loom recordings, Zoom recordings, Teams recordings, and all other platforms discussed in this guide). The "media alternative for text" exception applies only to videos that are direct visual equivalents of existing text content — a video recording of a product specification document being read aloud, where the text document is clearly the primary content. It does not apply to training videos that explain concepts in ways that are not fully captured in a text document.
WCAG SC 1.2.2 does not itself state a numeric accuracy threshold; the accuracy standard applied in this guide for SC 1.2.2 compliance is 99% word accuracy measured at the sentence level using the DCMP Captioning Key protocol. The 99% accuracy benchmark post covers what this means in practice: specifically, that vendor "99%" accuracy claims often use a different measurement methodology (corpus-level WER rather than sentence-level DCMP scoring) that overstates accuracy on the technical vocabulary that matters most. For remote async training libraries with home-office audio quality, achieving 99% sentence-level accuracy requires both noise remediation and domain glossary injection; neither alone is sufficient for recordings with significant technical vocabulary density.
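The gap between corpus-level and sentence-level measurement is easy to see with a small calculation. The sketch below uses word-level edit distance and a simplified per-sentence score; it illustrates the averaging effect only and is not an implementation of the DCMP Captioning Key protocol, whose scoring rules are not reproduced here. All function names are illustrative.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_accuracy(pairs: list[tuple[str, str]]) -> float:
    """1 - WER pooled over all sentences, the figure vendors often quote."""
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return 1 - errors / words


def sentence_accuracies(pairs: list[tuple[str, str]]) -> list[float]:
    """Simplified per-sentence accuracy, so dense technical sentences are not averaged away."""
    return [1 - word_errors(ref, hyp) / len(ref.split()) for ref, hyp in pairs]
```

With 29 error-free ten-word sentences and one ten-word technical sentence containing two misrecognized terms, the pooled figure is 1 - 2/300, about 99.3%, while the technical sentence on its own scores 80%. The pooled number clears a 99% bar even though the sentence carrying the domain vocabulary does not.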
Transcript versus captions: the delivery mode difference
A common misunderstanding in remote workforce caption compliance: "we have a transcript, so we meet WCAG." WCAG SC 1.2.2 requires synchronized captions, not a transcript. A transcript is a text document that a learner reads separately from the video. It can serve as the text alternative for audio-only content under SC 1.2.1 (Audio-only and Video-only (Prerecorded)), and a full descriptive transcript can serve as the media alternative under SC 1.2.3 (Audio Description or Media Alternative (Prerecorded)), but it does not satisfy SC 1.2.2 for prerecorded synchronized media. The distinction matters for remote async libraries specifically because Zoom, Teams, and Loom all automatically generate downloadable transcripts, which some L&D teams treat as equivalent to captions. They are not. A transcript downloadable from the Zoom recording page is a separate text file. It satisfies the documentation requirement for "what was said in this meeting" but it does not provide synchronized caption access for a hearing-disabled employee watching the training video through the LMS. The learner must watch the video and simultaneously read a separate document, which is not equivalent access. Synchronized captions embedded in the video player, appearing on-screen in time with the audio, are the required format.
Documentation for remote caption compliance
Organizations subject to ADA Title I or Section 508 should maintain a compliance record for their async video training library that documents: the caption status of each asset (captioned / not captioned / in queue), the accuracy level of each captioned asset (last accuracy check date and score), the captioning vendor and glossary configuration used, and any remediation backlog with scheduled completion dates. For remote workforces where the async library is growing faster than it was pre-2020, the documentation becomes more important because the denominator (total captionable assets) is larger and growing faster. The compliance program build post covers the full documentation framework; the LMS audit methodology post covers how to inventory and triage a large existing library.
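The record described above maps naturally onto one row per asset. The sketch below shows one possible shape, assuming the record is kept in code or exported from a spreadsheet; the `AssetRecord` fields, the status strings, and the `summarize` helper are illustrative choices rather than a prescribed schema, and the 0.99 default mirrors the accuracy target used throughout this guide.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AssetRecord:
    asset_id: str
    caption_status: str                   # "captioned", "not_captioned", or "in_queue"
    accuracy: Optional[float] = None      # last measured accuracy, 0.0 to 1.0
    last_checked: Optional[date] = None   # date of the last accuracy check
    vendor: Optional[str] = None
    glossary_version: Optional[str] = None
    remediation_due: Optional[date] = None


def summarize(records: list[AssetRecord], threshold: float = 0.99) -> dict:
    """Summarize caption status, share of assets meeting the threshold, and backlog order."""
    compliant = [
        r for r in records
        if r.caption_status == "captioned"
        and r.accuracy is not None
        and r.accuracy >= threshold
    ]
    # Backlog sorted by due date; assets with no due date sort last.
    backlog = sorted(
        (r for r in records if r.caption_status != "captioned"),
        key=lambda r: r.remediation_due or date.max,
    )
    return {
        "total": len(records),
        "by_status": dict(Counter(r.caption_status for r in records)),
        "compliant_share": len(compliant) / len(records) if records else 0.0,
        "backlog_ids": [r.asset_id for r in backlog],
    }
```

Reporting the compliant share against the total asset count, rather than against captioned assets only, keeps the growing denominator visible in the record.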
Eight failure modes
1. Treating platform auto-captions as WCAG-compliant
This is the most widespread failure mode in remote async caption operations. Loom, Zoom, Microsoft Teams, Vimeo, and Wistia all generate automatic captions. These captions are visible to learners, and they appear to function as captions: text appears in the player in time with the audio. They are not compliance-ready. General-model auto-captions on technical training content achieve 78–90% accuracy, which is 9–21 percentage points below the 99% accuracy benchmark applied to SC 1.2.2 captions (WCAG itself sets no numeric threshold; 99% is the working benchmark used throughout this guide). An organization that enables platform auto-captions and considers the caption obligation discharged has a systematic accuracy failure across its entire async library. The failure is invisible in normal operation (learners can see captions, the caption indicator shows "on," no error is reported) until an accessibility audit or an Office for Civil Rights (OCR) complaint surfaces the accuracy measurement. The correct frame: platform auto-captions are a first-draft starting point for the captioning workflow, not a compliance-ready output.
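To make the accuracy gap concrete, word-level accuracy can be measured as one minus the word error rate (WER) between a human-verified reference transcript and the auto-caption text. The sketch below uses a standard word-level edit distance; the function names and the simple lowercase-and-split normalization are assumptions for illustration, and a production check would follow whatever normalization rules the QA protocol specifies.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def accuracy_gap(reference: str, hypothesis: str, target: float = 0.99):
    """Return (accuracy, shortfall against the target benchmark)."""
    accuracy = 1.0 - word_error_rate(reference, hypothesis)
    return accuracy, target - accuracy


reference = "deploy the kubernetes cluster to the staging namespace right now"
auto_caption = "deploy the cubernetes cluster to the staging namespace right now"
accuracy, gap = accuracy_gap(reference, auto_caption)
print(f"accuracy={accuracy:.0%}, gap to target={gap:.0%}")
```

One mis-transcribed term in ten words is already 90% accuracy, which is why a handful of domain-vocabulary errors per minute is enough to put a recording well below the benchmark.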
2. Missing uncaptioned recordings from non-L&D producers
The distributed production model means that engineers, managers, sales team members, and product managers are all creating training recordings in Loom, Zoom, and Teams without a captioning workflow being triggered. The L&D team's caption workflow is designed for content that L&D creates; it has no visibility into content that others create and share directly to the LMS or to a Vimeo/Wistia video portal. The failure is invisible until the next compliance audit surfaces 150 uncaptioned recordings from the last six months of distributed production. The fix requires moving the caption workflow trigger from "L&D submits video" to "any video published to the LMS or shared as training content triggers a caption job" — which requires either API automation (Loom/Zoom webhooks, SharePoint event flows) or a mandatory content review step before any video can be published to the LMS.
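The change of trigger, from "L&D submits video" to "any video published as training content," can be sketched as a platform-neutral event handler. Everything here is hypothetical: the event fields (`video_id`, `source`, `destination`), the destination labels, and the `CaptionQueue` class are placeholders for whatever the Loom, Zoom, or SharePoint integration actually delivers, not a real webhook payload format.

```python
from dataclasses import dataclass, field

# Hypothetical labels for places where a video counts as training content.
TRAINING_DESTINATIONS = {"lms", "video_portal", "onboarding_doc", "team_channel"}


@dataclass
class CaptionQueue:
    jobs: list = field(default_factory=list)
    _seen: set = field(default_factory=set)

    def handle_publish_event(self, event: dict) -> bool:
        """Queue a caption job for any video published as training content,
        regardless of which team produced it. Returns True if a job was queued."""
        if event.get("destination") not in TRAINING_DESTINATIONS:
            return False                      # not shared as training content
        video_id = event["video_id"]
        if video_id in self._seen:
            return False                      # duplicate delivery of the same event
        self._seen.add(video_id)
        self.jobs.append({"video_id": video_id, "source": event.get("source", "unknown")})
        return True
```

The handler never inspects who produced the recording, which is the design point: the trigger is the act of publishing, so a recording from an engineer or a sales manager enters the same queue as one from L&D.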
3. Home-office audio causing systematic accuracy failure on specific speakers
A glossary-tuned caption operation may achieve 99% accuracy on recordings from the L&D team using professional microphones in treated recording spaces and 82–87% accuracy on recordings from specific remote team members using laptop microphones in reverberant rooms. The failure is systematic and speaker-specific — not random noise but a predictable accuracy gap for a subset of producers. It is invisible in the aggregate accuracy metrics if accuracy is measured across the whole library rather than per producer. The fix requires per-producer audio quality assessment (the pre-processing triage described in the audio remediation section) and targeted intervention: recording guidance for the affected producers, or a mandatory audio pre-processing step for recordings below a threshold SNR before they enter the captioning workflow.
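A short sketch shows how an aggregate metric hides a speaker-specific gap. The input format, a list of `(producer, words_checked, word_errors)` tuples from spot-checks, and the producer labels are assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_producer(checks):
    """checks: list of (producer, words_checked, word_errors) spot-check results."""
    totals = defaultdict(lambda: [0, 0])
    for producer, words, errors in checks:
        totals[producer][0] += words
        totals[producer][1] += errors
    return {producer: 1 - errors / words for producer, (words, errors) in totals.items()}


def library_accuracy(checks):
    """Single aggregate accuracy across every spot-check in the library."""
    words = sum(w for _, w, _ in checks)
    errors = sum(e for _, _, e in checks)
    return 1 - errors / words


checks = [
    ("l_and_d_studio", 9000, 90),     # 99% accurate
    ("remote_engineer", 1000, 150),   # 85% accurate
]
print(f"library: {library_accuracy(checks):.1%}")
for producer, accuracy in accuracy_by_producer(checks).items():
    print(f"{producer}: {accuracy:.1%}")
```

With these illustrative numbers the library-wide figure is 97.6%, which looks like a near miss, while the per-producer breakdown shows one producer at 85% whose recordings need audio intervention rather than more glossary terms.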
4. Zoom or Loom embed in LMS without a separate caption track
The convenience path for distributing async training video is to copy the Zoom cloud recording link or the Loom share link into the LMS as a web content object. The video plays in the LMS frame. The platform-provided captions appear in the platform player. The failure: the LMS cannot enforce caption-display defaults (e.g., "captions on by default for all learners") on an embedded third-party player, cannot track individual learner caption usage, and cannot serve a validated corrected caption file through the LMS player's caption engine. When a learner reports a caption problem, the LMS administrator has no ability to fix it — the caption is controlled by Loom or Zoom, not the LMS. When an auditor asks for documentation of the caption compliance status of embedded content, the LMS provides no caption-status data because it has no caption track for embedded content. Download the MP4, upload it to the LMS, upload the SRT as a separate track — this is the only path that gives the LMS full caption control.
5. Teams recording not routed to central SharePoint library
Teams personal meeting recordings go to the meeting organizer's OneDrive, not to a shared SharePoint library. If the caption operations workflow monitors only a central SharePoint location (the standard setup for most Microsoft 365 caption workflows), all personal meeting recordings are invisible to the workflow. A training session recorded as a personal Teams meeting — the most common recording type for one-on-one onboarding sessions, small-group workshops, and ad-hoc training calls — never triggers a caption job. Over a quarter, this produces dozens to hundreds of uncaptioned personal meeting recordings distributed across individual OneDrives. The fix requires either a policy requiring meeting organizers to move recordings to the central SharePoint location after recording (enforced by procedure, not by the platform), or a Power Automate flow that monitors OneDrive event webhooks for new MP4 files and routes them to the caption workflow automatically.
6. SRT timing drift in Loom recordings with screen-sharing segments
Loom recordings that include screen sharing generate audio-video synchronization data from the screen recording pipeline, which operates at a different frame rate than the video capture pipeline. For recordings over 15 minutes with multiple screen-switching events, the Loom internal transcript may have timing drift of 2–5 seconds in segments after screen-sharing transitions. The caption text is correct but the timing is off — captions appear 2–5 seconds late relative to the corresponding audio, which is perceptually jarring and fails DCMP timing precision standards. The failure is not visible in the Loom player (which uses the internal timing) but becomes visible when the SRT is exported and loaded into an LMS player that respects the SRT timestamps strictly. The fix: review SRT timing in a dedicated SRT editing tool (Subtitle Edit, Aegisub) after export and before LMS upload, and correct any segments where timing drift exceeds 1 second.
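For recordings where the drift is a roughly constant offset after a known transition point, the correction can be scripted rather than done cue by cue. The sketch below is a simplification under that assumption: it shifts every SRT cue starting at or after `start_ms` by `offset_ms` (negative to pull late captions earlier). The function names are illustrative, and drift that varies from segment to segment still needs review in an SRT editor.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_ms(h, m, s, ms):
    """Convert SRT timestamp parts (HH, MM, SS, mmm) to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def from_ms(total):
    """Format milliseconds as an SRT timestamp, clamping at zero."""
    total = max(total, 0)
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def shift_cues_after(srt_text, start_ms, offset_ms):
    """Shift every cue whose start time is at or after start_ms by offset_ms."""
    out = []
    for line in srt_text.splitlines():
        if "-->" in line:
            stamps = TIMESTAMP.findall(line)
            if len(stamps) == 2 and to_ms(*stamps[0]) >= start_ms:
                begin, end = (from_ms(to_ms(*s) + offset_ms) for s in stamps)
                line = f"{begin} --> {end}"
        out.append(line)
    return "\n".join(out)


srt = (
    "1\n00:00:05,000 --> 00:00:08,000\nWelcome.\n\n"
    "2\n00:16:02,500 --> 00:16:05,000\nNow the dashboard.\n"
)
# Pull cues after the 15-minute mark 2.5 seconds earlier.
print(shift_cues_after(srt, start_ms=15 * 60 * 1000, offset_ms=-2500))
```

Cues before the transition point are left untouched, so the script can be run once per drifting segment with a different offset each time.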
7. Caption track language metadata not set in platform or LMS
A VTT file uploaded to Microsoft Stream or Vimeo with no language code set is stored as an unidentified caption track. In multilingual Microsoft 365 tenants, the Stream player selects the caption language based on the learner's Microsoft 365 display language setting — if the caption track has no language code, the player may not display it for learners whose display language doesn't match the undefined track language. Similarly, a Vimeo caption track uploaded without a language code appears in the player menu as "Unknown Language" rather than "English" — which reduces learner click-through on the caption toggle for learners who want to read in their native language but are unsure what language the unidentified track contains. Always set the language code (en-US, fr-FR, de-DE, etc.) when uploading a caption file to any platform, regardless of whether the organization currently has learners who need multiple-language caption tracks. The language metadata is required infrastructure for future multilingual caption delivery and costs nothing to set correctly now.
8. Accuracy spot-check not covering home-office audio recordings
A caption QA process calibrated for professional studio recordings will pass home-office audio recordings at the wrong accuracy threshold. If the QA spot-check samples ten 5-minute segments from across the library and those ten segments happen to be from the L&D team's studio recordings, the accuracy measurement will be 99%+ and the QA process will pass the library as compliant. The home-office recordings from distributed producers — which may have 82–88% accuracy — are invisible in the QA results. The fix requires stratified sampling: the QA spot-check should explicitly sample recordings from distributed producers (non-L&D employees using Loom, Zoom, or Teams) at a rate proportional to their share of the library's total content volume. If 60% of the library's videos were produced by non-L&D employees since the shift to remote work, 60% of the QA sample should come from that group. The QA methodology post covers stratified sampling procedures and the error-type taxonomy that helps identify whether accuracy failures are acoustic (home-office audio) or vocabulary (missing glossary terms).
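Proportional stratified sampling is straightforward to script. In this sketch the library is assumed to be a dictionary mapping a producer group to its list of recording IDs; the group names and the function name are hypothetical, and the fixed seed is there only so a QA run can be reproduced.

```python
import random


def stratified_sample(library, sample_size, seed=0):
    """Sample recordings from each producer group in proportion to its share of the library.

    library: dict mapping producer group -> list of recording ids.
    Quotas are rounded per group, so the total can differ slightly from sample_size.
    """
    rng = random.Random(seed)
    total = sum(len(recordings) for recordings in library.values())
    if total == 0:
        return {}
    sample = {}
    for group, recordings in library.items():
        if not recordings:
            sample[group] = []
            continue
        quota = round(sample_size * len(recordings) / total)
        quota = min(max(quota, 1), len(recordings))   # at least one per non-empty group
        sample[group] = rng.sample(recordings, quota)
    return sample


library = {
    "l_and_d": [f"studio-{i}" for i in range(40)],
    "distributed": [f"remote-{i}" for i in range(60)],
}
picked = stratified_sample(library, sample_size=10)
print({group: len(ids) for group, ids in picked.items()})
```

With a 40/60 split and a sample of ten, four segments come from L&D and six from distributed producers, so the home-office recordings can no longer be missed by chance.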
FAQ: async captioning for remote and hybrid teams
Do we need to caption internal Loom walkthroughs that are only shared as links, not published to the LMS?
Yes, if those Loom walkthroughs are training or professional development content accessible to employees. The ADA Title I obligation for effective communication applies to training and professional development materials regardless of the delivery mechanism — the obligation is triggered by the content's purpose (it is training that employees are expected to engage with) and the audience (employees, who are covered by ADA Title I), not by whether it is published to a formal LMS. A Loom library shared via a team Slack channel as "watch this walkthrough before your first day" is training content. A Loom recording of a new feature demo shared in a product channel as "watch this to understand the new release" is professional development content. Both require captions for employees with hearing disabilities under ADA Title I. The practical scoping question for most teams is not "which Loom videos legally require captions" but "what is our threshold for captioning, given the volume of Loom content being produced?" A reasonable threshold: any Loom video that is formally shared as training or onboarding content (shared to a team channel or onboarding document rather than in a one-on-one message) should be captioned. Casual one-on-one Loom messages (equivalent to a Slack message with a video attachment) are in a grayer compliance zone — most organizations treat them as informal communication rather than training content and apply judgment rather than a mandatory caption policy.
We have a large backlog of Zoom recordings from the last two years that were never captioned. Where do we start?
Start with a coverage and triage inventory, not with captioning everything at once. The inventory step: pull a list of all Zoom cloud recordings from the Zoom admin portal (Reports → Usage → Cloud Recordings) for the past 24 months, sort by recording length, and cross-reference against the LMS to identify which recordings were published as training content versus casual meeting recordings. The caption compliance obligation applies to training-purpose recordings; informal meeting recordings that were never published to the LMS are lower priority. From the training-purpose recording list, apply the compliance tier framework from the LMS audit methodology: Tier 1 (mandatory compliance training, safety training, required onboarding) → Tier 2 (role-specific development content) → Tier 3 (optional development content). Caption Tier 1 content first, ordered by last access date with the most recently accessed content first, because those are the videos learners are actively engaging with. For a 200-video backlog, a realistic captioning throughput with an external vendor on 24-hour turnaround is 40–50 videos per week, meaning a 4–5 week sprint to clear the whole backlog if the submission and review workflow is systematized, and proportionally less for the Tier 1 subset alone. Budget for the sprint before starting: a 200-video backlog of 30-minute training recordings is 6,000 minutes (100 hours) of audio, and professional captioning at $1–2 per minute comes to $6,000–$12,000. If that exceeds the available budget, prioritize by compliance tier and surface the remaining backlog to the accessibility program as a documented known gap with a remediation schedule.
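The sprint arithmetic and the triage order can be checked with a few lines of code. This is a planning sketch only: the function names and the record fields (`tier`, `last_accessed`) are assumptions, and the rates and throughput are the illustrative figures from the answer above, not vendor quotes.

```python
import math
from datetime import date


def backlog_estimate(videos, minutes_per_video, rate_low, rate_high, weekly_low, weekly_high):
    """Estimate audio hours, cost range, and sprint length for a captioning backlog."""
    total_minutes = videos * minutes_per_video
    return {
        "audio_hours": total_minutes / 60,
        "cost_range": (total_minutes * rate_low, total_minutes * rate_high),
        "weeks_range": (math.ceil(videos / weekly_high), math.ceil(videos / weekly_low)),
    }


def triage_order(recordings):
    """Sort by compliance tier (1 first), then most recently accessed first."""
    return sorted(recordings, key=lambda r: (r["tier"], -r["last_accessed"].toordinal()))


print(backlog_estimate(200, 30, rate_low=1.0, rate_high=2.0, weekly_low=40, weekly_high=50))

recordings = [
    {"id": "role-training", "tier": 2, "last_accessed": date(2026, 5, 1)},
    {"id": "safety-old", "tier": 1, "last_accessed": date(2026, 1, 1)},
    {"id": "safety-recent", "tier": 1, "last_accessed": date(2026, 4, 1)},
]
print([r["id"] for r in triage_order(recordings)])
```

Running the estimate for only the Tier 1 count gives the length of the first sprint, which is usually the number a budget owner needs first.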
Our Microsoft Teams auto-captions look accurate enough. Do we still need to run them through an external captioning workflow?
It depends on the content's vocabulary density. To find out, run an accuracy spot-check on a representative 10-minute segment using the DCMP protocol (methodology here) and measure the word-level accuracy specifically on proper nouns, product names, regulatory terms, and technical vocabulary. If the spot-check shows accuracy above 98% on those specific terms, Teams auto-captions may be adequate for your content. If it shows accuracy below 95% on domain-specific terms, which is the typical result for technical training content in engineering, product, compliance, or healthcare verticals, the auto-captions are not WCAG-compliant and require correction. A result between 95% and 98% is borderline and warrants a second spot-check on a different segment before deciding. The likely result: Teams auto-captions achieve 88–93% word-level accuracy on general business-English speech but 72–85% accuracy on content with significant technical vocabulary density. For a software company's engineering onboarding recordings, or a healthcare organization's clinical training recordings, Teams auto-captions will consistently fail the domain-vocabulary accuracy check. For a content type with low technical vocabulary density (general management training, soft-skills content, HR policy recordings), Teams auto-captions may pass the accuracy threshold. The shortcut to deciding: count the domain-specific proper nouns in the first 5 minutes of the recording. If there are more than 10 terms that are specific to your company's domain, assume Teams auto-captions will fail the accuracy check and plan for an external correction step.
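The five-minute shortcut can be approximated in code when a timestamped transcript and a glossary are available. The sketch assumes segments arrive as `(start_seconds, text)` pairs and that the glossary is a plain list of terms; both shapes, and the function names, are hypothetical. Note the built-in limitation: if the transcript is itself an auto-caption, mis-transcribed terms will not match, so the count is a lower bound.

```python
import re


def domain_terms_in_window(segments, glossary, window_seconds=300):
    """Return the distinct glossary terms that appear in segments starting inside the window.

    segments: list of (start_seconds, text) pairs.
    """
    text = " ".join(t for start, t in segments if start < window_seconds).lower()
    found = set()
    for term in glossary:
        # Lookarounds instead of \b so terms ending in symbols (e.g. "C++") still match.
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            found.add(term)
    return found


def needs_external_correction(segments, glossary, threshold=10):
    """Apply the rule of thumb: more than `threshold` distinct domain terms in the window."""
    return len(domain_terms_in_window(segments, glossary)) > threshold
```

A recording that crosses the threshold goes straight to the external correction step; one that does not still gets the normal spot-check rather than an automatic pass.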
Can we use AI-generated captions from a tool like ChatGPT or Claude to correct auto-caption errors, rather than going through a captioning vendor?
This approach works for light correction of occasional errors but breaks down at scale and for systematic accuracy failures. The specific problem: AI text correction tools (including Claude, ChatGPT, and similar) correct text based on context, not based on the original audio. If a caption contains a domain-specific error — "Kubernetes" transcribed as "Cubernetes" — an AI text corrector may change it to the correct term if the surrounding context makes the term clear. But if the error is a proper noun appearing without context ("our SDK now supports Chronos" transcribed as "our SDK now supports Crona"), the AI corrector doesn't know whether "Chronos" is the correct term because it has no information about your specific product names. The AI can only correct errors it can infer from text context; errors caused by missing domain vocabulary are precisely the errors where the AI has no text context to infer from. The correct tools for this: a captioning vendor with domain glossary support (which closes the vocabulary accuracy gap at the ASR level) and a human review step for high-priority content (which catches remaining errors including the context-ambiguous ones). AI text correction is a useful supplemental step after glossary-corrected transcription to fix formatting and punctuation — it should not replace the domain-vocabulary correction step.
Our team uses Vimeo for hosting but the LMS doesn't accept Vimeo embeds well. What is the recommended delivery path?
Download the MP4 from Vimeo and upload it directly to the LMS as a native video asset. The Vimeo Business plan allows full HD MP4 download ("Download" → "Original" or highest quality available). The downloaded MP4 carries the original audio track at the same quality as the file that was uploaded to Vimeo: Vimeo re-encodes for streaming delivery, but the "original" download is the pre-encoding source file. Download the validated SRT file from the Vimeo video's caption settings at the same time. Upload the MP4 to the LMS, then upload the SRT as the caption track for that video asset. Set the language code (en-US or appropriate locale), label ("Captions" or "English"), and default-on setting in the LMS. This gives the LMS full caption control: the caption track appears in the LMS player's caption UI, the default-on setting works correctly for the organization's accessibility policy, and the LMS can track individual learner caption usage if the platform supports usage analytics. The tradeoff: you now have two copies of the video (Vimeo and LMS), which doubles storage costs and would require keeping caption corrections synchronized if learners accessed both. If the LMS copy is the delivery copy (what learners access), the Vimeo copy can remain as the production master without needing to be updated for caption corrections, because the corrected SRT travels with the LMS copy, not the Vimeo copy. Establish a naming convention that makes the relationship between the Vimeo master and the LMS copy explicit so that future re-uploads don't accidentally overwrite the captioned LMS copy with an uncaptioned source file.
We are hiring engineers who will create Loom walkthroughs as part of onboarding new team members. How do we build captioning into the production workflow without slowing them down?
The workflow design goal for distributed technical producers is to minimize the captioning friction at the time of recording — because engineers who feel the captioning process slows them down will find ways around it, producing uncaptioned recordings by default. The recommended workflow architecture for a Loom-based engineering onboarding library: (1) configure a Loom webhook that fires on every new recording created in the team's Loom workspace and submits the recording to the captioning workflow automatically — the engineer records, the caption job starts within minutes without any action from the engineer; (2) the captioning workflow (external vendor with domain glossary) returns the corrected SRT within 24 hours; (3) an automated integration uploads the corrected SRT to the Loom video and, if the video is linked in the LMS, uploads it to the LMS caption track as well; (4) the engineer receives a notification that the caption is available and a prompt to review it (not to caption it — to review the output). This workflow imposes zero friction on the recording step. The only ask of the engineer is a 3-minute review of the corrected caption before the video is published to new hires. For engineers who want to move faster, the review step can be optional for videos below a certain vocabulary density (detected automatically by measuring the per-video glossary term match rate) — if the captioning system finds no glossary terms in the recording, it's likely low-technical-vocabulary content that the auto-transcript handles accurately enough for review to be skipped. This design treats captioning as infrastructure rather than friction, which is the right framing for getting distributed technical producers to comply consistently.
Does audio quality from home-office recordings differ enough between speakers that we need different glossary configurations per producer?
The glossary configuration should not be per-producer — it should be per-content-domain, applied uniformly across the library. What does vary per-producer is the audio pre-processing that runs before the glossary-injected transcription: a producer whose recordings consistently score below 10 dB SNR on the audio quality assessment should have their recordings run through noise reduction before submission to the captioning workflow; a producer whose recordings score above 15 dB SNR can submit directly. The glossary is a vocabulary tool, not an acoustic tool — it doesn't close the SNR gap. Applying a more aggressive noise reduction step to low-SNR recordings before glossary injection gives the glossary the clean acoustic signal it needs to actually improve domain-term accuracy. The practical operationalization: maintain a producer profile list in the caption operations documentation with each regular producer's typical audio quality category (based on the first accuracy spot-check on their recordings), and route new submissions through the appropriate pre-processing tier before vendor submission. This doesn't require individual glossary configurations — just a pre-processing gate that flags low-SNR recordings for noise reduction before they enter the same glossary-injected captioning workflow that all other recordings use.
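The pre-processing gate can be sketched as a simple routing function on measured SNR. Two assumptions are worth flagging: the SNR estimate below is a rough one, comparing the RMS level of a speech passage against a noise-only passage from the same recording, and the answer above does not say what happens between 10 and 15 dB, so the `manual_review` route for that band is a placeholder policy, not a recommendation from the source.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of audio samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def estimate_snr_db(speech_samples, noise_samples):
    """Rough SNR in dB: speech-passage RMS against noise-only-passage RMS."""
    return 20 * math.log10(rms(speech_samples) / rms(noise_samples))


def preprocessing_route(snr_db, low=10.0, high=15.0):
    """Route a recording before it enters the shared glossary-injected workflow."""
    if snr_db < low:
        return "noise_reduction"
    if snr_db > high:
        return "direct"
    return "manual_review"   # assumption: the 10-15 dB band is not defined in the text


for snr in (8.0, 12.5, 18.0):
    print(snr, "->", preprocessing_route(snr))
```

Every route ends in the same glossary-injected captioning workflow; only the step in front of it changes, which keeps the glossary configuration uniform across producers.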
Close the home-office audio gap in your async training library
The distributed production model that remote and hybrid work created produces more training video, in more places, at lower audio quality, than any centralized caption workflow was designed to handle. The vocabulary gap — where Loom, Zoom, and Teams auto-captions hit 78–88% accuracy on technical training content instead of the 99% WCAG standard — requires domain glossary injection to close. GlossCap applies your per-company vocabulary model to every recording regardless of which platform produced it: Loom walkthroughs, Zoom cloud recordings, Microsoft Teams meeting recordings, Vimeo-hosted content, or direct LMS uploads. The same glossary — your product names, SDK symbols, regulatory terms, internal acronyms — improves accuracy across every recording in the distributed library, not just the ones produced by L&D.
For teams with home-office audio quality problems, GlossCap's pre-processing pipeline applies noise reduction before glossary injection — treating the acoustic degradation and the vocabulary gap as two separate problems with two separate solutions, applied together. The result: recordings that start at 78–83% baseline accuracy reach 97–99% on the domain-specific terms that matter most for training compliance. The accuracy benchmarks by vertical document the baseline and post-glossary accuracy by content type; the embed widget demonstrates what glossary-corrected captions look like on a sample clip before you commit to a subscription. Compare vendor options at Rev vs GlossCap, 3Play vs GlossCap, or Verbit vs GlossCap — all three are evaluated specifically on home-office audio handling and domain vocabulary support.