AI Video Captioning · Published 2026-06-15
Captioning AI-generated training video: the TTS→STT accuracy paradox, timing drift in synthetic speech, and the caption workflow for Synthesia, HeyGen, Descript, and Lumen5
L&D teams adopted AI-generated video platforms for an obvious reason: Synthesia, HeyGen, Descript, and Lumen5 reduce the cost of producing a 10-minute training module from a filming day to under an hour. A narrator, studio, camera operator, and post-production editor disappear from the workflow. The script goes in; a professional-looking video with an AI avatar reading it aloud comes out. Update the script, re-render, new version in 20 minutes. For training operations teams producing dozens of modules per quarter, the production leverage is real.
What L&D teams did not fully anticipate is that the same feature that eliminates the human actor also creates a systematic captioning problem. Every AI video platform in the L&D stack uses text-to-speech (TTS) synthesis to generate the avatar's voice. When those videos reach an LMS — where they must carry WCAG-compliant captions under ADA Title I, ADA Title II (for public entities, enforced since April 2026), and the European Accessibility Act (in force since June 2025) — the captioning workflow encounters a round-trip accuracy problem that has no parallel in traditionally recorded video: TTS synthetic voices, when processed by the speech-to-text (STT) models used for auto-captioning, produce systematically lower accuracy than natural human speech. The paradox is precise: eliminating the human actor eliminates exactly the acoustic signal that makes speech recognition reliable.
Two other failure modes specific to AI-generated video compound this problem. First, timing drift: AI platforms that derive captions from the script text produce accurate caption text but must synchronize it to the rendered TTS audio using estimated timing, and that synchronization accumulates error as videos grow longer. A 20-minute Synthesia compliance training module often has caption timing drift that exceeds the ±2-second tolerance that accessibility audits apply to WCAG 2.1 SC 1.2.2, even when every word is correct. Second, the LMS export workflow: the dominant workflow for publishing Synthesia or HeyGen video to an LMS is to export the MP4 and upload it, rather than exporting and uploading both the MP4 and the SRT. When only the MP4 reaches the LMS, either no captions appear, or the LMS auto-captions the video by running STT on the TTS audio, triggering the accuracy paradox.
This post provides the complete technical and operational picture for L&D teams using AI-generated video. It covers the TTS→STT round-trip paradox in detail — why synthetic voice is harder to transcribe accurately than human voice, the five acoustic differences that drive accuracy loss on the technical vocabulary that matters most in L&D content, and the timing-drift mechanics in AI-generated video. It then covers each platform — Synthesia, HeyGen, Descript, and Lumen5 — with their caption architecture, known failure modes, export workflow, and correct WCAG-compliant caption production path. It also covers the LMS-specific delivery considerations that determine whether a correct caption file actually reaches the learner's player, and closes with eight failure modes, a seven-question FAQ, and the workflow for achieving 99% accuracy on AI-generated training video using a glossary-biased captioning approach.
TL;DR — six things every L&D team using AI-generated video needs to know
- AI avatar voices are systematically harder to transcribe than human voices. TTS synthetic speech differs acoustically from natural human speech in five dimensions that STT models weight heavily. Round-trip accuracy (script → TTS → STT → text) on technical L&D vocabulary is 69–84% across engineering, compliance, and medical content, well below the 99% WCAG 2.1 AA threshold. Soft-skills content fares better (91–95%) but still fails WCAG.
- Timing drift is a separate failure from accuracy. Script-derived captions, such as Synthesia's, contain the correct words but are synchronized to estimated TTS positions rather than to the rendered audio. For videos under 3 minutes, drift typically stays within ±0.5 seconds. For videos over 10 minutes, drift regularly exceeds the ±2-second limit that audits apply under WCAG 2.1 SC 1.2.2, producing compliant caption text with non-compliant timing.
- HeyGen's built-in captions are not script-derived. Unlike Synthesia, HeyGen generates captions by running STT on its own TTS audio output. This means HeyGen's built-in captions are subject to the full TTS→STT accuracy paradox before any LMS involvement. For technical training content, HeyGen's built-in captions will not meet 99% accuracy without a glossary-correction step.
- Descript Overdub is the hidden TTS→STT risk in human-narrated video. When an L&D editor uses Descript's Overdub feature to replace, fix, or add narration without re-recording, those Overdub audio segments carry the same TTS→STT accuracy risk when the exported MP4 is re-captioned by an LMS. Descript's own caption file is grounded in the typed text (accurate), but LMS auto-captioning of the exported MP4 does not use that ground-truth.
- The correct workflow keeps the script text in the caption chain. All four platforms produce or can produce captions grounded in the script text — which is always accurate because it is the exact text the avatar was asked to say. The failure occurs when that ground-truth caption file is lost at the LMS export or upload step. The solution is a five-step workflow: export both video and SRT → apply timing correction → apply organizational glossary → deliver SRT sidecar to LMS → verify in the learner player.
- ADA Title I, ADA Title II, and EAA apply to AI-generated training video identically to human-recorded video. The production method is not a legally significant factor. If a hearing-impaired employee is assigned a Synthesia onboarding module, the employer's obligation is identical to assigning a studio-recorded training video. The 2026 ADA Title II deadline that captured universities and public entities covers AI-generated video assigned to students and employees of those entities.
The AI video production wave in L&D
Why L&D teams adopt AI video platforms
The traditional training video production pipeline — script, talent, studio booking, recording session, editing, review cycles, final render — requires between 20 and 60 hours of effort per finished hour of training video for a well-resourced L&D team. At organisations where training video must be updated frequently (product launches, compliance updates, policy changes, onboarding for new roles), this production overhead becomes the binding constraint. Teams fall behind on updates, publish outdated content, or simply skip video for content that would benefit from it.
AI video platforms address this constraint directly. Synthesia's proposition — write a script, choose an AI avatar, get a rendered training video — reduces a 20-hour production cycle to under an hour for a 10-minute module. HeyGen adds face-cloning and voice-cloning features so organisations can create custom avatar presenters without a studio session. Descript's transcription-based editing model allows non-editors to update training video by editing text, with Overdub AI voice patching the audio automatically. Lumen5 converts text articles, blog posts, or knowledge-base entries into video directly, enabling scale-production of microlearning content from existing written materials. The production leverage is real and measurable.
Adoption has moved well past the early-adopter phase. Synthesia reports over 50,000 companies using the platform, including 60% of the Fortune 100, specifically for L&D and corporate training applications. HeyGen's growth in L&D accelerated through 2025 as the platform added LMS-specific export presets and multi-language voice features. Descript is the standard tool for many L&D teams that started out producing podcasts. Lumen5 is widely used in customer-education and compliance teams that need to convert existing text documents into video at scale.
The caption assumption that doesn't hold
When L&D teams evaluate AI video platforms, caption compliance typically enters the conversation as a checkbox. Synthesia's marketing confirms captions are available. HeyGen's caption toggle appears in the video editor. Descript produces a transcript (which looks like captions). Lumen5 shows text on screen. The assumption — reasonable from the product surface — is that caption compliance is handled. In practice, this assumption breaks in at least three ways for each platform, and the failure mode differs depending on the platform's caption architecture.
The deeper structural problem is that AI video platforms optimise for the video production workflow, not the caption-to-LMS workflow. Captions in the platform editor and captions in the learner's LMS player are two different artifacts delivered through two different workflows, and the gap between them is where compliance fails. This post maps that gap for each platform and provides the workflow needed to close it.
The TTS→STT round-trip paradox: why AI avatar voice is harder to transcribe
Defining the round-trip
The TTS→STT round-trip is the path from known text to spoken audio to transcribed text when AI-generated voice is involved. In a Synthesia training module, the path is: (1) L&D author writes the script (accurate, typed text); (2) Synthesia's TTS engine synthesizes audio in which the AI avatar speaks the script; (3) the rendered MP4 reaches the LMS; (4) the LMS auto-captions the MP4 by running the audio through an STT model to recover the spoken text. Steps 1 and 4 should produce the same output — the script text. In practice, step 4 produces a systematically degraded version of step 1, and the degradation is not random. It concentrates on the phoneme sequences that are least common in the STT model's training data and least stable in the TTS model's synthesis — which is precisely where technical vocabulary lives.
The paradox is this: the accurate text exists at step 1. Captioning a Synthesia video could be trivial — use the script as the caption file. But the workflow that determines what the learner sees in the LMS player is not the script-extraction path; it is the audio-transcription path. And the audio-transcription path degrades exactly the vocabulary that L&D content exists to teach.
Why this problem didn't exist before AI video
When a human narrator records a training module in a studio, the audio is a natural human voice. STT models — including Whisper, which underlies most LMS auto-captioning — were trained primarily on natural human speech: broadcast audio, podcasts, meetings, lectures, phone calls, and read-aloud text by human speakers. The STT model's acoustic representations of phonemes are tuned to the natural variance in how human speakers produce speech: the irregular prosody, the co-articulation between adjacent sounds, the speaker-specific formant patterns, and the presence of breathing, disfluencies, and natural pacing variation.
When the recording is of a human reading a training script, even in a careful professional narration, the STT model processes audio that is within its training distribution. Accuracy is high: typically 83–93% on technical training content before glossary correction, depending on the vocabulary density. The Whisper accuracy benchmarks by vertical show where baseline STT accuracy sits before any correction step. Human-narrated training video sits in a range that glossary correction can lift to 99%.
AI avatar video replaces the human narrator with TTS synthesis. This does not simply swap one audio source for another at equivalent quality. TTS synthesis produces audio that sounds natural to human listeners — rhythm, intonation, and voice quality are compelling — but differs systematically from natural human speech in the acoustic dimensions that STT models weight most heavily. The result is that the STT model processes audio that looks like speech but sits outside its core training distribution in the dimensions it uses to discriminate among phonemes. Accuracy on unusual phoneme sequences — which is exactly where product names, regulatory acronyms, and technical identifiers live — falls substantially.
The range of the accuracy gap
The accuracy gap between human voice and AI avatar TTS voice is not uniform. It concentrates on technical vocabulary and grows with vocabulary specificity. Soft-skills content — communication skills, leadership development, interpersonal conflict training — contains vocabulary within the STT model's high-frequency range. Round-trip TTS→STT accuracy on this content is typically 91–95%, compared to 95–98% for human-narrated equivalents. Both are below the 99% WCAG 2.1 AA threshold, but the gap is manageable with a focused correction step.
Technical L&D content produces a sharper gap. Engineering and DevOps onboarding video contains dense concentrations of product names, SDK identifiers, API terminology, and acronyms. Round-trip TTS→STT accuracy on this content type is typically 73–82%, compared to 87–93% for the same content narrated by a human speaker. Compliance and regulatory training — which contains specific law citations, agency names, procedural terminology, and regulatory identifiers — runs 76–84% TTS→STT vs 84–92% human. Medical training is the lowest-performing category: 69–78% TTS→STT on pharmacology, procedure, and anatomy vocabulary, compared to 83–90% human. All of these are far below the 99% threshold, and the gap to threshold is large enough that auto-captioning without correction cannot produce compliant output on AI-generated technical training video.
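Because the script is the ground truth, round-trip accuracy on any single module can be measured directly by comparing the script with whatever the LMS auto-captioner produced. The sketch below is a minimal word-level measure (one minus word error rate), assuming simple lowercase tokenization. It is not the DCMP protocol, and the function names and sample sentences are illustrative only.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens, keeping digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_accuracy(script: str, transcript: str) -> float:
    """1 - WER, measured against the script as reference, floored at 0."""
    ref, hyp = tokenize(script), tokenize(transcript)
    if not ref:
        raise ValueError("script is empty")
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))

# Hypothetical example: two technical terms are split by the STT pass.
script = "Rotate the OAuth token before the Kubernetes pod restarts"
stt_output = "Rotate the oh auth token before the Kuba netes pod restarts"
print(f"{word_accuracy(script, stt_output):.1%}")
```

Note how two mis-heard terms in a nine-word sentence cost four edits: each split term counts as one substitution plus one insertion, which is why dense technical vocabulary drags the percentage down so quickly.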
The five acoustic differences that drive accuracy loss
1. Prosodic regularity
Natural human speech is prosodically irregular in a structured way. Stress patterns, speech rate, and intonation contours vary within a speaker session and across speakers in ways that provide the STT model with prosodic boundary information — signals that indicate where words begin and end, where phrases group, and where semantic emphasis falls. These signals are not encoded in the lexical sequence; they are carried acoustically in the f0 contours, duration patterns, and intensity variations of the speech signal.
TTS synthesis produces prosodic patterns that are generated by the TTS model's prosody module. These patterns sound natural to human listeners — the TTS voice doesn't sound monotone — but they are generated by a model rather than produced through the irregular articulatory dynamics of a real speaker. The TTS prosody module has learned plausible prosodic patterns, but the specific prosodic envelope in any given utterance is a model generation, not a real-speaker production. STT models calibrated on real-speaker prosodic signals encounter TTS prosodic patterns that sit systematically outside the variance range of their training data. The result is reduced reliability on prosodic boundary detection, which cascades into reduced accuracy on word segmentation — the step that precedes acoustic phoneme decoding.
2. Formant precision
Vowel identity in human speech is carried by formant frequencies — the resonance peaks of the vocal tract at specific frequencies. Real human speakers produce vowels with natural variance: the same speaker producing the same vowel across multiple utterances will place the formant at slightly different frequencies each time, because real articulatory movements are not mechanically precise. STT models are trained on this natural variance and have learned to classify vowels within formant-frequency distributions that span the range of real-speaker variation.
Modern neural TTS synthesis produces formants with high consistency — the same vowel in the same synthetic voice will land at nearly identical formant positions across utterances, because the TTS model has learned a fixed-parameter mapping from phoneme to acoustic output. This precision is outside the variance distribution the STT model expects. For common vowels in common phonetic environments, the STT model still classifies correctly because the formant is within the high-probability region of its learned distribution. For uncommon phoneme sequences — including those in low-frequency vocabulary — the formant-position consistency of TTS synthesis can interact with the STT model's uncertainty boundaries in ways that produce incorrect classification. This is a second acoustic mechanism contributing to the TTS→STT accuracy gap on technical vocabulary.
3. Absence of disfluency signals
Natural human speech contains disfluencies — ums, uhs, false starts, self-corrections, breathing pauses, and filler words — that appear to be noise but serve a prosodic boundary function. STT models trained on natural speech have learned to use these disfluency signals as word-boundary and clause-boundary markers. The absence of disfluencies in TTS audio is not simply the absence of noise; it removes an acoustic signal class that the STT model uses for temporal alignment of the decoded word sequence to the audio timeline.
In practice, this contributes to the timing accuracy issue that affects all STT systems processing TTS audio: word-boundary placement is less reliable when the prosodic boundary signals are absent. For caption timing purposes, this means the word timestamps generated by STT on TTS audio are less precise than word timestamps generated by STT on human speech. Less precise word timestamps produce lower-quality caption timing synchronization, which is a WCAG concern separate from word-level accuracy.
4. Co-articulation differences
Adjacent phonemes in natural speech influence each other through co-articulation: the articulation of each phoneme is modified by the articulatory positions required for the preceding and following phonemes. This creates a continuous acoustic signal in which phoneme boundaries are not discrete and phoneme realizations are context-dependent. STT models are trained on this co-articulation-blended signal and model phoneme recognition as a sequence problem (using CTC or attention-based architectures) that exploits these context dependencies.
TTS synthesis — particularly neural TTS — models co-articulation in its acoustic output, and modern TTS voices sound natural in part because co-articulation effects are present. However, the co-articulation patterns generated by a TTS model differ systematically from those produced by a real speaker for the same phoneme sequence. These differences are subtle enough to be inaudible to a human listener but fall within the range that the STT model's phoneme-discrimination boundaries are sensitive to. For common, high-frequency phoneme sequences, the differences are small enough that classification is unaffected. For rare phoneme sequences — technical vocabulary — the differences can push the acoustic signal across the STT model's classification boundary, producing an incorrect phoneme assignment.
5. Cross-speaker variance patterns
STT models are trained on audio from many different speakers and have learned to handle speaker variation as a source of acoustic diversity. A word produced by a male adult speaker with a Southern US accent is acoustically different from the same word produced by a female adult speaker with a British accent, and the STT model classifies both correctly because it has seen both in training. This cross-speaker generalization is a strength of modern STT models on natural speech.
A TTS voice is a single acoustic persona — a fixed set of model parameters that produce a consistent acoustic signal. The STT model's cross-speaker variance training does not help with TTS audio because TTS audio doesn't have the kind of variance the training was designed to handle. More concretely: a Synthesia avatar named "Alex" produces the same acoustic signal for the same phoneme sequence across every video. If that signal happens to sit near a classification boundary for a specific phoneme sequence (as uncommon phoneme combinations in technical vocabulary are more likely to do), every occurrence of that phoneme sequence in every Alex-narrated video will be misclassified. The error is not random; it is systematic for that specific (avatar, vocabulary item) combination. A team using the same Synthesia avatar for all 50 of their compliance training modules will see the same STT error on "FCPA" in all 50 modules.
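Systematic errors of this kind are easy to surface once scripts and auto-generated transcripts sit side by side. The following sketch, which assumes each module is available as a (script, transcript) pair, uses Python's `difflib.SequenceMatcher` to collect word-level substitutions and keeps only those that recur across modules. The sample data and function names are hypothetical.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9']+", text.lower())

def substitutions(script: str, transcript: str) -> list[tuple[str, str]]:
    """Return (script phrase, transcript phrase) pairs where the two disagree."""
    ref, hyp = tokens(script), tokens(transcript)
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            pairs.append((" ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return pairs

def recurring_errors(modules, min_modules: int = 2) -> dict:
    """Count each distinct substitution once per module; keep the repeat offenders."""
    seen = Counter()
    for script, transcript in modules:
        seen.update(set(substitutions(script, transcript)))
    return {pair: n for pair, n in seen.items() if n >= min_modules}

# Hypothetical library of modules narrated by the same avatar.
modules = [
    ("Report any FCPA concern to legal", "Report any FC PA concern to legal"),
    ("The FCPA applies to all vendors", "The FC PA applies to all vendors"),
    ("Complete the training by Friday", "Complete the training by Friday"),
]
print(recurring_errors(modules))
```

A substitution that shows up in most modules for one avatar is a strong candidate for a glossary entry, since fixing it once fixes it everywhere.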
Why the script text is simultaneously the solution and why it is not used
The correct text exists. In Synthesia, HeyGen, and Descript, the script or typed text is the definitive record of what the AI avatar was asked to say. Caption accuracy is trivially achievable: use the script as the caption text. The entire TTS→STT paradox disappears when the script text — not STT reconstruction — is used as the caption source.
The failure occurs because two separate workflows produce two separate artifacts (the video and the caption file), and the caption file is the one that gets lost. The LMS export workflow in most L&D teams is: export the video → upload to LMS. The caption file requires a separate export action, a separate upload step, and often a different upload interface in the LMS than the video upload. When teams learn Synthesia or HeyGen for the first time, they learn the video export. The SRT export is a secondary interface action that is rarely included in the initial training or the standard publishing checklist. The solution is not technical; it is workflow: make SRT export and upload a mandatory step in the publishing checklist with the same status as the video upload step. But because the current failure is invisible — the LMS shows a CC button, learners can attempt to activate captions, and the captions appear to be present until a hearing-impaired learner discovers they are inaccurate — the workflow gap persists.
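One way to make the missing caption file visible before upload is a pre-publish check on the export folder. The sketch below assumes a hypothetical layout in which each exported video sits next to a caption file with the same base name (`module-01.mp4` beside `module-01.srt`); adjust the extensions and layout to match your own export process.

```python
from pathlib import Path

def videos_missing_captions(export_dir, caption_exts=(".srt", ".vtt")):
    """List exported MP4 files that have no same-named caption file beside them."""
    root = Path(export_dir)
    missing = []
    for video in sorted(root.rglob("*.mp4")):
        has_sidecar = any(video.with_suffix(ext).exists() for ext in caption_exts)
        if not has_sidecar:
            missing.append(video)
    return missing

def check_export(export_dir) -> bool:
    """Print a short report; return True only when every video has a sidecar."""
    missing = videos_missing_captions(export_dir)
    for video in missing:
        print(f"NO CAPTION FILE: {video}")
    return not missing
```

Running `check_export("exports/")` as a required step in the publishing checklist turns an invisible omission into a blocking one. It only confirms that a caption file exists, not that its text or timing is correct.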
Timing drift in synthetic speech: WCAG compliance by video length
How drift accumulates in AI-generated video
In Synthesia and HeyGen (which both generate video from script text), captions derived from the script must be synchronized to the rendered audio track. The platform assigns each phrase in the script to a time position in the TTS audio output. These time positions are estimates based on the TTS model's synthesis timing model — the model's prediction of how long it will take to synthesize each phrase. The problem is compounding: each phrase timing estimate has a small error, and those errors sum over the length of the video.
The error per phrase is small in absolute terms, typically 0.05 to 0.15 seconds per phrase in modern neural TTS systems. But a 10-minute training module contains approximately 800–1,200 word tokens and 80–120 phrase-level caption segments. If each caption segment's estimated timing drifts by an average of 0.10 seconds in the same direction relative to the actual TTS audio position, the accumulated drift at segment 100 is 10 seconds. Even at a conservative 0.05 seconds per segment, a 15-minute video (150 segments) accumulates 7.5 seconds of drift. These figures are the worst case, in which every error has the same sign. Where errors partly cancel, the accumulated drift is smaller, but it still grows with the number of segments.
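The arithmetic above multiplies a per-segment error by the segment count, which holds when every error has the same sign. As a rough illustration, the sketch below compares that worst case with a second, assumed model in which the errors are independent and zero-mean, so the typical (root-mean-square) drift grows with the square root of the segment count. Neither model is claimed to describe any platform's actual behaviour, and treating the per-segment figure as a standard deviation is an assumption made for the comparison.

```python
import math

def worst_case_drift(segments: int, error_per_segment: float) -> float:
    """Accumulated drift (seconds) if every segment errs in the same direction."""
    return segments * error_per_segment

def random_walk_drift(segments: int, error_per_segment: float) -> float:
    """Typical RMS drift (seconds) if errors are independent and zero-mean,
    treating error_per_segment as the standard deviation of one error."""
    return math.sqrt(segments) * error_per_segment

# Roughly 10 caption segments per minute, as in the text above.
for minutes, segments in [(3, 30), (10, 100), (15, 150), (30, 300)]:
    for err in (0.05, 0.10):
        worst = worst_case_drift(segments, err)
        typical = random_walk_drift(segments, err)
        print(f"{minutes:>2} min, {err:.2f} s/segment: "
              f"worst case {worst:5.2f} s, independent errors {typical:4.2f} s")
```

Real drift sits somewhere between the two curves depending on how biased the timing estimates are, which is why measuring a rendered video is more reliable than reasoning from its length alone.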
Video encoding adds another layer of drift. When Synthesia renders the final MP4, the TTS audio track is synchronized to the video timeline through a rendering process that may introduce small timing differences between the TTS synthesis timestamps and the final video timeline. Frame-rate quantization (rendering to 24, 25, or 30 fps) truncates sub-frame audio positions to frame boundaries. The encoding process for the audio codec (AAC, typically) may introduce additional buffering latency. These effects are small individually but sum with the TTS synthesis timing error.
The WCAG synchronization threshold
WCAG 2.1 Success Criterion 1.2.2 requires captions that are "synchronized" with the audio. The WCAG definition of synchronized does not specify a numeric tolerance, but accessibility audit practice and the DCMP Captioning Key treat ±2 seconds as the outer limit of acceptable synchronization. Caption segments that display more than 2 seconds before or after the corresponding audio are visibly out of sync, creating a cognitive load for deaf and hard-of-hearing viewers who read ahead of the audio or who rely on caption timing to follow the presentation pace. Office for Civil Rights (OCR) accessibility investigations have cited timing drift as a failure of SC 1.2.2 independently of word-level accuracy: a video can have 100% accurate caption text and still fail WCAG compliance if the captions are consistently displayed 3–5 seconds ahead of or behind the audio.
Timing drift by video length
Based on the accumulation dynamics described above, timing drift in AI-generated video follows a characteristic pattern by length. For videos under 3 minutes (approximately 30 caption segments), accumulated drift from synthesis timing errors typically stays within ±0.5 seconds, which is well within the WCAG tolerance and invisible to most learners. Synthesia and HeyGen videos in this length range are generally timing-compliant with their built-in caption synchronization.
For videos in the 5–10 minute range (50–100 caption segments), accumulated drift reaches the ±1 to ±2 second range. Videos in this length range are in a compliance-uncertain zone: some will be within the WCAG threshold, others will not, and the determination requires measurement rather than assumption. L&D teams cannot assume that a 7-minute Synthesia module is timing-compliant without verifying the caption timing against the audio in the rendered MP4.
For videos over 10 minutes — which describes most compliance training modules, technical onboarding content, and product training videos in L&D libraries — accumulated timing drift regularly exceeds the WCAG ±2-second threshold, and in longer videos (20–30 minutes, common for annual compliance training) the drift at the end of the video is often in the 5–15 second range. These videos have captions with accurate text that are nonetheless WCAG non-compliant due to timing. The auto-captions WCAG compliance status post covers the SC 1.2.2 synchronization requirement and how it is enforced in practice.
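Measuring drift on a rendered video comes down to comparing each caption's start time with the moment the corresponding speech actually begins. The sketch below assumes you already have those reference start times, for example from a manual spot-check or a forced-alignment tool (not shown), and flags any cue whose offset exceeds the ±2-second tolerance. The SRT snippet and reference values are made up for illustration.

```python
import re

TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")

def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:10:03,500 to seconds."""
    h, m, s, ms = map(int, TIME.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000

def cue_starts(srt_text: str) -> list[float]:
    """Start time of every cue, read from the 'start --> end' lines."""
    return [to_seconds(line.split("-->")[0])
            for line in srt_text.splitlines() if "-->" in line]

def drift_report(srt_text: str, reference_starts, tolerance: float = 2.0):
    """Return (largest absolute offset, 1-based numbers of cues out of tolerance)."""
    starts = cue_starts(srt_text)
    if len(starts) != len(reference_starts):
        raise ValueError("cue count does not match reference count")
    offsets = [s - r for s, r in zip(starts, reference_starts)]
    flagged = [i + 1 for i, o in enumerate(offsets) if abs(o) > tolerance]
    return max((abs(o) for o in offsets), default=0.0), flagged

srt = """1
00:00:01,000 --> 00:00:04,000
Welcome to the module.

2
00:10:03,500 --> 00:10:06,000
Report concerns to legal.
"""
reference = [1.2, 600.9]  # hypothetical measured speech onsets, in seconds
worst, flagged = drift_report(srt, reference)
print(f"largest offset {worst:.2f} s, cues out of tolerance: {flagged}")
```

Checking a handful of cues near the end of a long video is usually enough to tell whether drift has crossed the threshold, since accumulated error is largest there.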
Timing drift affects different AI platforms differently
Synthesia uses script-derived timing with synchronization quality that declines as described above with video length. HeyGen, which generates captions by running STT on its own TTS output rather than by synchronizing the script text, produces word timestamps from the STT alignment — timestamps that are more accurate in sub-second precision than Synthesia's script-estimated timestamps but suffer from the STT accuracy problem on technical vocabulary. Descript, which uses word-timestamp grounding from its transcript editor, produces timing that is tightly aligned to the audio (Descript's STT is trained with a word-timestamp emphasis that other models do not prioritize). Lumen5's timing approach depends on whether captions cover text-card slides (where text display timing is set directly) or AI voiceover segments (where captions are generated from TTS audio). The platform-specific sections below cover the timing approach and drift characteristics for each platform in detail.
Synthesia: caption architecture, failure modes, and correct workflow
Caption architecture
Synthesia's built-in caption system derives caption text from the script — the text the L&D author typed. Caption word content is therefore inherently accurate. Synthesia's captioning challenge is exclusively a timing problem: the script text must be synchronized to time positions in the TTS audio, and that synchronization uses Synthesia's TTS timing model to estimate when each phrase will be spoken. As described in the timing-drift section above, this estimation is accurate for short videos and accumulates error for longer ones.
Synthesia's caption feature is available on Creator tier ($67/month) and above. On the Starter plan, captions are not exported. The Synthesia captions workflow page documents the three Synthesia captioning failure modes and the five-step LMS workflow in detail. In brief: the built-in captions work correctly as a preview-quality aid in the Synthesia editor, but the production-quality caption track for LMS delivery requires export, timing verification, and glossary correction before upload.
The three Synthesia captioning failure modes
Timing drift in videos over 10 minutes is the first Synthesia captioning failure mode. The second — and more common — failure is the LMS export workflow: most L&D teams export only the MP4, not the MP4 plus SRT. When the MP4 reaches the LMS without a sidecar SRT, the LMS either shows no captions or activates auto-captioning by running STT on the Synthesia avatar's TTS voice. This is the TTS→STT round-trip described above — and for compliance, engineering, or medical training content, the resulting accuracy will not meet WCAG 2.1 AA requirements.
The third Synthesia failure mode affects technical vocabulary in the script itself. Synthesia's caption system uses the script text verbatim. If the script contains a product name, acronym, or technical term that Synthesia's TTS voice pronounces in a way that diverges from standard pronunciation, the script text in the caption will not match the audio. For example, a script containing "OAuth 2.0" may produce a TTS voice that says "oh-ath" or "oath"; neither matches the standard pronunciation ("oh-auth", /ˈoʊɔːθ/), and the caption shows "OAuth 2.0" while the audio sounds different. This creates a mismatch between what the learner reads and what they hear, which is an accessibility failure for learners who lip-read or who use audio and captions together. The post on proper noun failure modes in captioning covers this class of error in detail.
Synthesia glossary and the GlossCap gap
Synthesia offers a glossary feature in its video editor that helps the TTS engine pronounce custom terms correctly — you can specify that "SCIM" should be pronounced "skim" rather than "S-C-I-M", or that a product name should have specific syllable stress. This glossary improves audio quality but is separate from caption quality. The Synthesia glossary corrects how the TTS voice says a term; a glossary-biased captioning step (like GlossCap's) corrects how the caption text represents that term after the audio is generated. Both steps matter. A Synthesia video in which the TTS voice correctly says "Kubernetes" but the LMS auto-captions it as "Kuba netes" needs the captioning-layer glossary correction — Synthesia's pronunciation glossary does not protect against LMS STT errors on the exported audio.
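To make the captioning-layer correction concrete, the sketch below applies a small organizational glossary to the text lines of an SRT file, leaving cue numbers and timing lines untouched. It is a plain find-and-replace illustration, not a description of how GlossCap or any glossary-biased captioning system works internally, and the patterns in `GLOSSARY` are hypothetical examples of mis-transcriptions.

```python
import re

# Hypothetical glossary: regex for a known mis-transcription -> canonical term.
GLOSSARY = {
    r"\bkuba\s*netes\b": "Kubernetes",
    r"\boh\s*auth\b": "OAuth",
    r"\bf\s*c\s*p\s*a\b": "FCPA",
}

def correct_caption_text(srt_text: str, glossary=GLOSSARY) -> str:
    """Apply glossary substitutions to caption text lines only."""
    compiled = [(re.compile(p, re.IGNORECASE), term) for p, term in glossary.items()]
    corrected = []
    for line in srt_text.splitlines():
        # Skip cue numbers and 'start --> end' timing lines.
        is_structural = line.strip().isdigit() or "-->" in line
        if not is_structural:
            for pattern, term in compiled:
                line = pattern.sub(term, line)
        corrected.append(line)
    return "\n".join(corrected)

srt = "1\n00:00:01,000 --> 00:00:03,000\nDeploy to Kuba netes with oh auth."
print(correct_caption_text(srt))
```

A substitution list like this only fixes errors someone has already seen, so it works best alongside a review of the terms that matter most in each module.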
The correct Synthesia-to-LMS workflow
The correct workflow for WCAG 2.1 AA compliance on Synthesia training video requires five steps. First, export both the MP4 and the SRT file from Synthesia; these are separate export actions and both must be completed. Second, for videos over 5 minutes, run a timing verification step that compares SRT timestamps against the actual audio in the MP4 and corrects any accumulated drift. Third, apply organizational glossary correction to the SRT: check every technical term, product name, and acronym in the exported file, especially those an LMS auto-captioner would be likely to render differently, and correct the caption wherever the TTS voice's pronunciation creates a mismatch with the text. Fourth, upload both the MP4 and the corrected SRT to the LMS using the LMS's caption upload interface (not just the video upload). Fifth, verify in the learner view that the caption track appears, is selectable, and is synchronized. This verification step must be done in the LMS player, not in Synthesia's editor preview.
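The drift-correction part of the second step can be sketched as a piecewise-linear remapping of SRT timestamps. The example assumes you have measured a few anchor pairs, each giving the time a caption currently appears and the time the matching audio actually occurs, both in milliseconds. How those anchors are obtained (manual spot-check or an alignment tool) is outside the sketch, and the function names are illustrative.

```python
import re
from bisect import bisect_right

STAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")

def to_ms(match: re.Match) -> int:
    h, m, s, ms = map(int, match.groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def fmt(ms: float) -> str:
    """Format milliseconds as an SRT timestamp, clamping at zero."""
    ms = max(0, round(ms))
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def remap(t: int, anchors) -> float:
    """Map caption time to audio time by interpolating between the two
    nearest anchors (extrapolating from the first or last pair at the ends)."""
    xs = [caption_ms for caption_ms, _ in anchors]
    i = min(max(bisect_right(xs, t) - 1, 0), len(anchors) - 2)
    (x0, y0), (x1, y1) = anchors[i], anchors[i + 1]
    return y0 + (t - x0) * (y1 - y0) / (x1 - x0)

def correct_drift(srt_text: str, anchors) -> str:
    """Rewrite every timestamp on the timing lines; anchors are
    (caption_ms, audio_ms) pairs with distinct caption times."""
    anchors = sorted(anchors)
    if len(anchors) < 2:
        raise ValueError("need at least two anchors")
    def fix(line: str) -> str:
        if "-->" not in line:
            return line
        return STAMP.sub(lambda m: fmt(remap(to_ms(m), anchors)), line)
    return "\n".join(fix(line) for line in srt_text.splitlines())

# Hypothetical measurement: in sync at the start, captions 6 s late at 10:00.
anchors = [(0, 0), (600_000, 594_000)]
cue = "7\n00:05:00,000 --> 00:05:02,000\nReport concerns to legal."
print(correct_drift(cue, anchors))
```

Two anchors correct a steady drift; adding anchors in the middle of the video handles drift that speeds up or slows down. The corrected file should still be checked in the LMS player, as in the fifth step.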
HeyGen: STT-generated captions, the accuracy gap, and the glossary fix
HeyGen's caption architecture: STT on TTS output
HeyGen (launched 2020, one of the fastest-growing AI video platforms in the L&D market through 2025) has a fundamentally different caption architecture from Synthesia. Where Synthesia derives captions from the script text and synchronizes them to TTS timing, HeyGen generates captions by running speech-to-text on its own TTS audio output. HeyGen renders the avatar video (TTS synthesis of the script), then runs STT on the rendered audio to produce captions with word timestamps.
This architecture has a timing advantage: word timestamps from STT on the actual audio are more precisely aligned to the audio than estimated timestamps from a TTS timing model. HeyGen's caption timing synchronization is generally better than Synthesia's for long videos because the timestamps are derived from the actual audio position of each word rather than from a synthesis estimate. But HeyGen pays for this with the full TTS→STT accuracy paradox. The accuracy of HeyGen's built-in captions is whatever the STT model achieves on HeyGen's TTS voice output, and as described in the acoustic-differences section, the resulting text systematically diverges from the original script on technical vocabulary.
HeyGen caption accuracy on technical L&D vocabulary
HeyGen's built-in caption quality for soft-skills content (interpersonal communication, leadership, customer service scenarios) is typically 91–95% on a DCMP-protocol word-level accuracy measurement. This is within the range that some organisations treat as adequate, but it is below the 99% accuracy threshold that accessibility audits apply for WCAG 2.1 AA, and it will fail a formal accessibility audit. For technical content, the gap widens substantially. Engineering and DevOps onboarding content produced in HeyGen with technical vocabulary (SDK names, command-line identifiers, architectural acronyms) produces caption accuracy in the 73–83% range on the built-in STT-generated captions. Compliance training with regulatory vocabulary (FCPA, GDPR Article citations, OSHA standard references, HIPAA technical safeguards) runs 76–85%. Medical training with pharmacology and procedure terminology falls below 75% on a word-level DCMP protocol measurement.
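A word-level accuracy figure like the ones above can be approximated by comparing the caption text against the script. The sketch below is a minimal WER-style measurement, not the DCMP protocol itself: the tokenizer, the case-sensitive comparison, and the function names are assumptions made for illustration, and a formal audit may score punctuation, speaker labels, and numerals differently.

```python
import re

def tokenize(text):
    # Assumption: words are runs of letters, digits, and apostrophes.
    # Case is preserved, so "terraform" does not match "Terraform".
    return re.findall(r"[A-Za-z0-9']+", text)

def word_accuracy(reference, hypothesis):
    """Return 1 - (word edit distance / reference word count), floored at 0."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        return 1.0
    # Standard Levenshtein distance, computed over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # word missing from caption
                           cur[j - 1] + 1,           # extra word in caption
                           prev[j - 1] + (r != h)))  # substituted word
        prev = cur
    return max(0.0, 1 - prev[-1] / len(ref))

script = "Deploy the service to Kubernetes using Terraform"
caption = "Deploy the service to Kuba netes using terra form"
print(f"{word_accuracy(script, caption):.1%}")
```

Two mis-segmented product names in a seven-word sentence cost four word-level edits, which is why short technical passages pull the overall score down so quickly.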
These numbers represent a finding that is surprising to L&D teams who checked the "captions available" box on HeyGen during platform evaluation. "Captions available" means HeyGen generates and displays a caption track. It does not mean the caption track meets WCAG 2.1 AA accuracy. The caption vendor accuracy evaluation methodology describes the evaluation protocol — and the same methodology applies to evaluating AI video platform built-in captions before relying on them for WCAG compliance.
HeyGen Video Translation: compounded TTS→STT risk
HeyGen's Video Translation feature (which dubs a source video into 29+ languages by cloning the speaker's voice and translating the script) creates an additional layer of TTS→STT complexity. The translated video uses a voice-cloned TTS voice in the target language — a synthetic voice, not a native human speaker. When the LMS auto-captions the translated video, it runs STT on a synthetic voice in a language where the STT model's training data may be substantially less extensive than for English. The TTS→STT accuracy paradox applies to the translated video's captions, and in many target languages, the STT model's coverage of technical vocabulary in that language is lower than it is in English. Organisations using HeyGen Video Translation for multi-language L&D content must apply caption quality verification to each language version independently.
HeyGen SRT export and LMS workflow
HeyGen exports SRT files from the Video editor. The SRT export includes the STT-generated word timestamps and caption text. For LMS delivery, the same pattern as Synthesia applies: the default workflow exports the MP4, not the MP4 plus SRT. Teams that have discovered this failure mode correct their workflow by adding the SRT export to the publishing checklist. The additional step beyond Synthesia's workflow is that HeyGen's SRT requires more extensive text correction than Synthesia's, because HeyGen's captions are STT-generated rather than script-derived — the text itself needs correction for technical vocabulary errors, not just timing verification.
The practical workflow for HeyGen WCAG compliance: export MP4 and SRT → apply text correction for technical vocabulary errors (where the STT mis-transcribed the TTS voice) using a glossary-correction step → verify timing synchronization (typically better than Synthesia but should still be checked for videos over 10 minutes in complex multi-speaker scenarios) → upload both MP4 and corrected SRT to LMS → verify in learner player. The text correction step is more labour-intensive for HeyGen than for Synthesia because HeyGen's errors are vocabulary errors (wrong words) rather than purely timing errors (correct words, wrong position). The caption QA methodology for training video teams covers the systematic approach to identifying and correcting vocabulary errors efficiently.
Descript: transcript-grounded captions, Overdub sections, and re-captioning risk
Descript's editing model and caption architecture
Descript (founded 2017) takes a different approach to video editing than Synthesia or HeyGen. Rather than generating video from scratch, Descript's primary use case is editing existing recorded audio and video by editing the transcript. When a human speaker records a training module, Descript transcribes the audio using STT, produces an editable transcript, and allows the editor to modify the video by modifying the text. Delete a word in the transcript; Descript removes it from the video and audio. Rearrange sentences in the transcript; Descript rearranges the corresponding audio segments. Export the edited video; Descript produces an MP4 in which the audio has been re-assembled according to the transcript edits.
For human-narrated content, Descript's caption quality is high. Descript's STT transcription (which produces the editable transcript) is the source of the caption file, and the caption file is grounded in word timestamps from that STT run. For human voice, this STT is operating in its designed use case — natural human speech — and accuracy is typically 93–97% on L&D content before vocabulary correction. The word timestamps from Descript's STT are precisely aligned to the actual audio position of each word (Descript's STT pipeline is specifically optimized for word-accurate timestamp alignment, because the entire editing model depends on it). Caption timing from Descript for human-narrated content is typically the best available from any of the four platforms discussed here.
Overdub: where the TTS→STT paradox enters Descript content
Descript's Overdub feature allows editors to replace, insert, or modify audio segments using an AI voice model trained on the speaker's voice. An L&D author who mis-spoke a product name, needs to update a figure from last year's training, or wants to add a sentence without re-recording can type the new text in the Descript transcript; Overdub synthesizes the corresponding audio in the author's voice and inserts it into the video timeline. From a viewing perspective, the Overdub segment sounds like the human speaker — the voice model is trained on the specific speaker's voice patterns. From an acoustic STT perspective, the Overdub segment is TTS audio, carrying all five acoustic differences described earlier.
For caption quality within Descript, Overdub sections are handled well: Descript knows which segments are Overdub (they are flagged in the transcript timeline) and uses the typed text as the caption source for Overdub segments rather than re-running STT on the Overdub audio. This is correct behaviour and produces accurate caption text for Overdub sections in Descript's own caption export.
The risk surfaces when the exported MP4 is processed by an LMS. The LMS does not know which audio segments are human-narrated and which are Overdub. When the LMS auto-captions the exported MP4 (because the Descript SRT was not uploaded, or because the LMS re-generates captions on upload), it runs STT on the entire audio track including the Overdub segments. For Overdub segments containing technical vocabulary — the segments most likely to be created via Overdub, because product name corrections and terminology updates are common Overdub use cases — the LMS STT will produce the same accuracy degradation that applies to any TTS audio. A training module with 18 human-narrated minutes and 2 minutes of Overdub corrections on technical terms will produce LMS auto-captions that are nearly correct except at exactly the segments that were corrected, where the most important terminology updates are now mis-captioned.
Identifying Overdub sections and protecting caption accuracy
L&D teams using Descript can identify Overdub sections in the transcript timeline (they are flagged with a distinct colour indicator). The key protection is the same as for Synthesia and HeyGen: always export the Descript SRT file alongside the MP4, and always upload both to the LMS. When Descript's own SRT is used as the LMS caption source, the Overdub sections are captioned correctly (from the typed text), and the LMS auto-captioning STT problem does not arise. When only the MP4 is uploaded and LMS auto-captioning activates, the Overdub-section accuracy problem manifests.
A second risk in Descript is timeline editing that produces SRT timing drift. When an editor rearranges or shortens segments in the Descript transcript, the SRT file's timestamps update to reflect the edited timeline; this is Descript's core feature and it works correctly. The risk arises when the video and SRT are exported from Descript and the video is then processed further (colour correction, title card addition, music mixing) in a separate video editor before final export. If the additional editing shifts the audio timeline without updating the SRT timestamps, even by a few frames for fade-in/fade-out effects, the SRT timing will be incorrect on the final MP4. This is not a Descript-specific problem, but it surfaces often in Descript workflows because downstream video editing is common there.
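When downstream editing only prepends material (a title card, for example), the fix is a constant offset applied to every SRT timestamp. The following is a minimal sketch under that assumption; `shift_srt` is a hypothetical helper, and edits in the middle of the timeline need a real re-alignment rather than a single offset.

```python
import re

STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_srt(srt_text, offset_ms):
    """Shift every timestamp on SRT timing lines by offset_ms (may be negative)."""
    def bump(match):
        h, m, s, ms = map(int, match.groups())
        total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
        total = max(0, total)  # clamp so a negative shift never goes below zero
        h, rem = divmod(total, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    # Only lines containing the SRT arrow are timing lines; caption text is untouched.
    return "\n".join(
        STAMP.sub(bump, line) if "-->" in line else line
        for line in srt_text.splitlines()
    )

original = "1\n00:00:58,500 --> 00:01:01,250\nWelcome to the module"
print(shift_srt(original, 3000))  # a 3-second title card was added in front
```

The shifted file should still be checked against the final MP4, since fades and re-encoding can add small offsets that a single constant does not capture.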
Descript SRT export and LMS workflow
Descript exports SRT files via File → Export → Captions. The export includes word-level timestamps (accurate for human-voice segments, script-derived for Overdub segments). The correct workflow: export the final MP4 and SRT from Descript's final timeline (not an intermediate edit) → verify the SRT timing alignment against the final MP4 (not against the Descript editor preview) → apply glossary correction for any technical vocabulary that may have been mis-transcribed by Descript's STT in the human-voice segments → upload both MP4 and SRT to the LMS → verify in the learner player. For Descript content with minimal Overdub, this is the simplest workflow of the four platforms discussed here.
Lumen5: text-card accuracy vs AI voiceover accuracy
Lumen5's two captioning contexts
Lumen5 (launched 2017) occupies a different position in the AI video stack from the other three platforms. Where Synthesia and HeyGen produce avatar-presenter video from a script, Lumen5 converts text content — articles, blog posts, knowledge-base entries, training outlines — into video by combining stock footage or branded imagery with text overlays and optional voiceover. The primary Lumen5 L&D use case is converting existing text content (job aids, compliance guides, product documentation) into video microlearning modules.
Lumen5 has two distinct caption contexts with very different accuracy profiles. In text-card-primary videos — where the video consists of text cards with stock footage backgrounds and no voiceover — captions are the text cards themselves, and caption accuracy is inherently 100%. The text displayed on screen is the caption. There is no audio to transcribe; the learning content is presented visually. For this use case, Lumen5 video has no TTS→STT accuracy problem, and caption compliance is straightforward.
For Lumen5 videos with AI voiceover — Lumen5's feature that synthesizes a narration audio track from the script text for each slide — the TTS→STT paradox applies fully. The AI voiceover generates TTS audio narrating each slide's content. If the LMS auto-captions the exported Lumen5 MP4 (rather than using a sidecar SRT), it runs STT on the TTS voiceover audio, and the accuracy degradation on technical vocabulary occurs on the narrated segments.
Lumen5 caption export
Lumen5 provides SRT caption export on Business plan ($149/month) and above. On Starter and Basic plans, SRT export is not available. For L&D teams using Lumen5 for WCAG-compliant training video, Business plan is the minimum required plan to access caption file export. Without SRT export capability, the caption workflow is limited to LMS auto-captioning of the exported MP4 — which, for AI-voiceover videos with technical vocabulary, will not meet 99% accuracy.
Lumen5's SRT export covers narrated segments with AI voiceover. For text-card-only segments without narration, the SRT marks those segments with blank or minimal caption content; the text-card content itself is the visual presentation and the SRT does not duplicate it. When uploading to an LMS, L&D teams should verify that the LMS player shows the SRT captions on narrated segments and that text-card segments are visually accessible on their own (which they are by design).
Lumen5 vocabulary specifics
Lumen5's primary L&D use case — converting existing text content into video — means the narration scripts often contain the same technical vocabulary as the source documents. A compliance guide converted into a Lumen5 video will have a voiceover narrating the FCPA provisions, HIPAA requirements, or OSHA citation formats present in the guide. These are exactly the vocabulary categories where TTS→STT accuracy degrades most. Teams using Lumen5 for compliance microlearning should apply the same vocabulary correction workflow as for any AI-voiceover platform: export SRT, check technical terminology accuracy against the source document, correct any mis-transcription errors, and upload the corrected SRT to the LMS.
Lumen5 also supports text-on-screen (lower-third captions) as a separate visual feature from caption SRT files. Text-on-screen is a visual presentation element that is embedded in the MP4; it is not a selectable caption track and does not satisfy WCAG SC 1.2.2. Because the text is burned into the picture rather than carried as a text track, learners who cannot see the video (blind and low-vision users) cannot access it through assistive technology. The WCAG requirement is for a selectable caption track, not for on-screen text. L&D teams who rely on Lumen5's text-on-screen feature as their caption compliance mechanism are not meeting the WCAG 2.1 AA requirement.
Technical vocabulary failure by content type
Engineering and DevOps onboarding: the highest error-density category
Engineering onboarding and DevOps training content has the highest technical vocabulary density of any L&D content category, and produces some of the most severe TTS→STT accuracy degradation. The vocabulary includes product names (Kubernetes, Terraform, HashiCorp Vault, Datadog, Splunk), protocol and API acronyms (OAuth, SAML, OIDC, SCIM, REST, gRPC), infrastructure terminology (CIDR, BGP, NAT, VPC), and version-qualified references (Python 3.11, Node.js 20, OpenSSL 3.x).
Characteristic errors in TTS→STT on engineering content include: "Kubernetes" → "Kuba netes" or "cube a netes"; "Terraform" → "terra form" (phonetically correct, but split into two words and lowercased, which counts as an error in DCMP protocol word-level scoring); "HashiCorp" → "hash e corp" or "Hashey Corp"; "OAuth" → "oh auth" or "author" (semantic error); "SAML" → "Sam" or "same" or "Samuel" (semantic error); "OIDC" → "OI DC" or "oid sea" (phoneme-level error); "gRPC" → "grip see" or "G R P C"; "CIDR" → "cider" (homophonic substitution; the TTS voice may also say "cider"). The proper noun failure mode taxonomy covers the systematic classification of these error types: homophonic substitution, phoneme-level failure, compound-word segmentation error, and acronym expansion.
Round-trip TTS→STT accuracy on engineering onboarding content: 73–82% on DCMP-protocol word-level measurement. Human-narrated equivalent: 87–93% before glossary correction. Glossary-corrected target: 99%+. The glossary correction step is especially impactful for AI-generated engineering content because the error rate before correction is among the highest of any content category and the value of correct technical terminology is high (learners who mis-learn a product name due to caption error have a materially worse training outcome).
Compliance and regulatory training: regulatory vocabulary as a special case
Compliance training (FCPA, GDPR, HIPAA, OSHA, AML/BSA, SOX, CCPA) contains a specific vocabulary that is neither general-purpose English nor standard technical jargon. It includes named laws (Foreign Corrupt Practices Act, abbreviated as FCPA), regulatory agency names (FinCEN, OFAC, OCC, CFPB, SEC, and OCR, where OCR in a compliance context means Office for Civil Rights, not Optical Character Recognition), procedural terms (Suspicious Activity Report, abbreviated as SAR, and similar terminology specific to each regulatory domain), and citation formats ("Article 6(1)(b)" in GDPR, "29 C.F.R. Part 1910" in OSHA).
The TTS→STT paradox is particularly damaging for compliance vocabulary for two reasons. First, compliance training is the content category with the highest regulatory vocabulary density relative to general-English vocabulary. Nearly every sentence in an FCPA training module contains a term to which the STT model assigns a low prior probability. Second, compliance training content serves as documentation: in regulated industries, employee completion of compliance training (and the content of that training) is evidence in regulatory investigations and litigation. A compliance training module with inaccurate captions documenting the wrong law, the wrong citation format, or the wrong regulatory requirement is not just an accessibility failure; it is also a documentation failure with potential regulatory and legal consequences.
Specific error examples in TTS→STT on compliance content: "FinCEN" → "fin sin" or "fin sen" or "fence in"; "FCPA" → "F-C-P-A" (as separate letters) or "fuh cap uh" (phonemic rendering); "GDPR Article 6(1)(b)" → "GDPR Article 6 1 B" (loses parentheses — structural error in a legal citation); "OSHA 300 log" → "OSHA three hundred log" (numeral expansion — a content error when a learner expects to see "300" and sees "three hundred"); "CCPA" → "C-C-P-A" or "sepa" (phonemic). These are not merely formatting preferences — they are accuracy failures in a compliance documentation context. Round-trip TTS→STT accuracy on compliance training: 76–84%. Human-narrated equivalent: 84–92%.
Medical and clinical training: pharmacology and procedure terminology
Medical and clinical training has the most specialised vocabulary of any L&D category and the lowest TTS→STT accuracy. Pharmacological names (methotrexate, tacrolimus, bevacizumab, adalimumab) are low-frequency multi-syllabic terms that STT models trained on general-purpose speech have extremely low prior probability over. Procedure names (hysteroscopy, bronchoscopy, laparoscopic cholecystectomy), anatomy terms with Latin/Greek etymology, and dosage-qualifier sequences (5 mg/kg IV, bid, prn) all represent vocabulary categories where the STT model's uncertainty is highest and where the TTS→STT accuracy gap is most severe.
For clinical training produced in Synthesia or HeyGen (a growing use case as healthcare organisations use AI video for nursing education, clinical protocol updates, and onboarding for clinical staff), the accuracy of STT-generated captions (HeyGen's built-in captions, or LMS auto-captions of an exported Synthesia MP4) can be as low as 69–72% on pharmacology content before correction. A clinical nursing education module in which roughly 30% of the words in the caption track, including drug names, dosage routes, and contraindication terms, are incorrect is not just WCAG-non-compliant; it is potentially dangerous for learners who use captions as their primary access to the training content. The Whisper accuracy benchmarks by vertical post covers medical vocabulary failure rates in detail.
Sales enablement and product training: the vocabulary-churn challenge
Sales enablement and product training content sits between engineering onboarding and soft-skills content in technical vocabulary density. Product names, feature names, pricing tier names, and integration partner names constitute the custom vocabulary that sales teams must learn precisely. The TTS→STT challenge for this content category is less about raw accuracy and more about vocabulary currency: product names change at every launch cycle, pricing tier names are updated, integration partners are added and removed. AI-generated video is particularly popular for this content because re-rendering with updated content is fast — a Synthesia module can reflect a product name change in a new render within 20 minutes.
But the caption vocabulary update is separate from the video re-render. When the script is updated and the video is re-rendered, the SRT must also be re-generated and re-uploaded to the LMS. If the LMS has a cached version of the previous SRT, the new video may show the old caption text. If no SRT was uploaded and the LMS was using auto-captioning, the new TTS audio must be re-captioned after the new render. The vocabulary update creates a caption-version management problem that teams using AI video for fast-cycle content often discover only when a customer-facing learner reports that the captions still show the old product name after the video was updated.
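One lightweight guard against stale captions after a re-render is to scan the SRT that is about to be uploaded for retired terminology. This is a minimal sketch: `stale_terms` and the product names are hypothetical, and the check only catches names you have explicitly listed as retired.

```python
import re

def stale_terms(srt_text, retired_terms):
    """Return the retired terms that still appear in the caption file."""
    found = []
    for term in retired_terms:
        # Whole-word, case-insensitive match so "Acme Basic" also catches "acme basic".
        if re.search(r"\b" + re.escape(term) + r"\b", srt_text, re.IGNORECASE):
            found.append(term)
    return found

srt = (
    "1\n00:00:01,000 --> 00:00:03,000\n"
    "Welcome to Acme Basic onboarding\n"
)
# Hypothetical tier names that were renamed at the last launch.
print(stale_terms(srt, ["Acme Basic", "Acme Lite"]))
```

Running the same check against the caption file downloaded back from the LMS, where the LMS allows that, helps confirm that a cached older SRT is not still being served.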
The organizational glossary as the systematic correction mechanism
The common thread across all four content categories is that the vocabulary errors in TTS→STT are systematic and predictable. The terms that will be mis-captioned are the same terms for every video that uses the same vocabulary: "Kubernetes" will be mis-captioned the same way in every HeyGen video produced with the same HeyGen avatar. "FinCEN" will be mis-captioned the same way in every Synthesia compliance module that the same LMS auto-captions for the same organisation. The error pattern is a function of (TTS voice model, vocabulary term, STT model), and all three are fixed for a given L&D environment.
A systematic correction mechanism is more efficient than per-video manual correction: build a glossary of the terms that are mis-captioned in this organisation's AI-generated video, apply the glossary correction to every caption file before LMS upload, and update the glossary when new terms are added to the organisation's vocabulary. This is the same glossary-biased captioning principle described for human-narrated video in the glossary-biased captioning for engineering terms post — it applies with even greater impact to AI-generated video because the baseline error rate before correction is higher. The caption feedback loop and iterative accuracy improvement covers how the glossary improves over time as corrections are applied and catalogued.
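Because the errors repeat, the glossary can be expressed as a mapping from each correct term to its known mis-transcriptions and applied mechanically to every SRT. The sketch below assumes a plain post-hoc text substitution, which is simpler than glossary-biased recognition; the `GLOSSARY` entries and function names are illustrative, not an export from any tool.

```python
import re

# Hypothetical glossary: correct term -> mis-transcriptions observed in past videos.
GLOSSARY = {
    "Kubernetes": ["kuba netes", "cube a netes"],
    "HashiCorp": ["hash e corp", "hashey corp"],
    "FinCEN": ["fin sin", "fin sen"],
}

TIMING_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> ")

def build_patterns(glossary):
    patterns = []
    for correct, variants in glossary.items():
        # Longest variants first so a longer phrase wins over a shorter prefix.
        for variant in sorted(variants, key=len, reverse=True):
            pattern = re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
            patterns.append((pattern, correct))
    return patterns

def correct_srt(srt_text, glossary):
    """Apply glossary substitutions to caption text lines only.

    Returns (corrected_text, number_of_substitutions)."""
    patterns = build_patterns(glossary)
    out, changes = [], 0
    for line in srt_text.splitlines():
        # Leave cue numbers and timing lines untouched.
        if not (line.strip().isdigit() or TIMING_LINE.match(line)):
            for pattern, correct in patterns:
                line, n = pattern.subn(correct, line)
                changes += n
        out.append(line)
    return "\n".join(out), changes

srt = (
    "1\n00:00:01,000 --> 00:00:04,000\nDeploy to kuba netes today\n\n"
    "2\n00:00:04,500 --> 00:00:07,000\nReport it to Fin Sin promptly\n"
)
fixed, count = correct_srt(srt, GLOSSARY)
print(count)
print(fixed)
```

Two limits are worth noting. A variant that is also ordinary English (such as "fence in" for FinCEN) would cause false corrections and is better handled by manual review, and a term split across two caption lines will not match a single-line pattern.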
The correct caption workflow for AI-generated training video
The five-step workflow
The correct caption workflow for AI-generated training video consists of five steps, completed in sequence for every video before the course goes live. The workflow is the same regardless of which platform produced the video, with platform-specific variations noted below.
Step 1: Export both the video and the caption file from the AI video platform. This is the most commonly skipped step. For Synthesia: File → Download → Download both the MP4 and the SRT. For HeyGen: Export → Video, and then Captions → Export SRT. For Descript: File → Export → Captions (SRT format) in addition to the video export. For Lumen5 (Business plan): Download → Subtitles (SRT). These are separate interface actions in all four platforms, and publishing checklists must explicitly include the caption file export as a mandatory step.
Step 2: For videos over 5 minutes, verify timing synchronization. Open the exported MP4 alongside the SRT file in a media player or caption editor that shows both the video and the SRT timestamps simultaneously. Spot-check five timestamps distributed across the video: approximately at the 20%, 40%, 60%, 80%, and 100% (near-end) points. For each spot-check, verify that the caption text is displaying within ±0.5 seconds of the corresponding audio, which is the target tolerance. If any spot-check shows drift exceeding ±1 second, apply a timing correction pass using a caption editor that allows timestamp adjustment. For Descript content with minimal Overdub, this step can typically be skipped, because Descript's word timestamps are reliably accurate. For Synthesia videos over 10 minutes, this step should be considered mandatory.
Step 3: Apply organizational glossary correction. For HeyGen content (STT-generated captions): review every technical term, product name, acronym, and regulatory identifier against the organizational glossary and correct any mis-transcription. For Synthesia content (script-derived captions): verify that technical terms in the SRT match the intended terminology in the script. For Descript content: review the human-voice segments for any vocabulary terms that Descript's STT may have mis-transcribed, and verify that Overdub sections contain the expected typed text. For Lumen5 AI-voiceover content: check technical terminology in the SRT against the source document. The glossary correction step is most labour-intensive for HeyGen content (vocabulary errors, not just timing) and least labour-intensive for Descript human-voice content (high baseline accuracy before correction).
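The spot-check in Step 2 can be planned and recorded with a few lines of code. This is a sketch under stated assumptions: the reviewer measures each drift value by hand, the 10-second end margin is an arbitrary choice for the near-end check, and the intermediate "review" outcome for drift between the two tolerances is an added convention, not something the workflow above prescribes.

```python
def spot_check_positions(duration_s, percents=(20, 40, 60, 80, 100), end_margin_s=10.0):
    """Return the playback positions (in seconds) to spot-check.

    The 100% point is pulled back by end_margin_s so there is still audio to compare."""
    latest = max(0.0, duration_s - end_margin_s)
    return [min(duration_s * p / 100, latest) for p in percents]

def classify_drift(drifts_s, target_s=0.5, correct_above_s=1.0):
    """Classify measured caption-vs-audio offsets (seconds, positive or negative)."""
    if not drifts_s:
        raise ValueError("at least one drift measurement is required")
    worst = max(abs(d) for d in drifts_s)
    if worst > correct_above_s:
        return "correct"  # run a timing correction pass
    if worst > target_s:
        return "review"   # outside target, below the correction trigger
    return "ok"

# A 20-minute module: where to look, and what the measurements imply.
print(spot_check_positions(1200))
print(classify_drift([0.1, -0.3, 0.4, 0.7, 1.4]))
```

Drift that grows steadily from the first position to the last suggests accumulated timing error rather than a one-off offset, which is the pattern described for long script-synchronized videos.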
Step 4: Upload both the video and the corrected SRT to the LMS. Each LMS has a specific upload interface for caption files. For TalentLMS: navigate to the video content item → Captions tab → Upload SRT. For Docebo: Course Management → video module settings → Captions. For Cornerstone OnDemand: Learning module → content item → Accessibility settings → upload VTT (convert SRT to VTT before upload). For Kaltura: Media Entry → Captions → Add Caption. For Panopto: the video item → Edit → Captions → Import caption file. For Workday Learning: the course content → video settings → Caption File upload. The specific interface differs by LMS, but the principle is consistent: the SRT (or VTT) file must be uploaded as a separate item from the video file, through the LMS's caption management interface, not embedded in the MP4.
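Where an LMS accepts only VTT, the SRT can be converted before upload. The sketch below handles the basic case: it adds the `WEBVTT` header and changes the millisecond separator from a comma to a period on timing lines. `srt_to_vtt` is a hypothetical helper, and SRT files with styling tags or positioning data may need a fuller converter.

```python
import re

def srt_to_vtt(srt_text):
    """Convert basic SRT text to WebVTT text."""
    body = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    lines = ["WEBVTT", ""]
    for line in body.split("\n"):
        if "-->" in line:
            # WebVTT uses a period before milliseconds: 00:00:01,000 -> 00:00:01.000
            line = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", line)
        # Numeric cue indices are kept; WebVTT permits them as cue identifiers.
        lines.append(line)
    return "\n".join(lines) + "\n"

srt = "1\n00:00:01,000 --> 00:00:04,000\nHello, world\n"
print(srt_to_vtt(srt))
```

Commas inside caption text are left alone because the substitution only runs on lines that contain the timing arrow.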
Step 5: Verify the caption track in the learner view. This step must be done in the actual LMS learner interface, not in the admin interface or the AI video platform's preview. Log into the LMS as a learner (or use a learner-preview mode), navigate to the course, play the video, and verify that: (a) the CC or caption button is visible in the player; (b) selecting it activates the caption track; (c) the caption text is accurate for the first 30 seconds; (d) the caption timing is synchronized with the audio. This verification step is the only way to confirm that the SRT upload was successful and that the LMS is serving the correct caption file rather than re-generating captions from the audio. A five-minute verification step after each video upload prevents the class of errors where the LMS silently ignores the SRT file, displays the old cached captions, or generates new auto-captions regardless of the uploaded file.
When to rebuild captions from scratch
The workflow above covers the correction and delivery of captions generated by the AI video platform. In some cases, rebuilding captions from scratch is more efficient than correcting a heavily-erroneous existing file. The rebuild-from-scratch threshold is approximately 15% error rate in the SRT before correction: if more than 15% of caption segments require text correction (HeyGen technical content is frequently in this range), building a new SRT from the script text and synchronizing it to the audio is faster than correcting the existing file segment-by-segment.
Script-to-SRT conversion for Synthesia and HeyGen content (where the script exists as the ground truth) requires a timing synchronization step: forced alignment of the known script text to the rendered audio. This alignment can be done using speech recognition in a forced-alignment mode (where the model aligns the known text to the audio rather than recognising unknown text) or by using the platform's own timing data as the starting point and correcting the timestamps where drift has accumulated. For HeyGen content specifically, the platform's STT-generated SRT provides good timing data even when the text is incorrect — using the HeyGen SRT's timestamps with the script text substituted for the STT-generated text is an efficient rebuild approach for technical content with many text errors.
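The timestamp-reuse approach can be sketched with the standard library: align the STT words to the script words, then place each script word in the cue that its aligned STT word came from. This is a minimal illustration, not forced alignment. It keeps cue-level timing only, assumes cues are already parsed into `(start, end, text)` tuples, and `rebuild_cues` is a hypothetical name.

```python
import difflib

def rebuild_cues(cues, script):
    """Replace STT text with script text while keeping each cue's timing.

    cues: list of (start_s, end_s, stt_text); returns the same shape."""
    stt_words, owner = [], []
    for idx, (_, _, text) in enumerate(cues):
        for word in text.split():
            stt_words.append(word.lower())
            owner.append(idx)  # which cue each STT word belongs to
    script_words = script.split()
    matcher = difflib.SequenceMatcher(
        a=stt_words, b=[w.lower() for w in script_words], autojunk=False
    )
    buckets = [[] for _ in cues]
    last = 0  # cue of the most recently aligned word
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            continue  # STT words with no script counterpart are dropped
        for k, j in enumerate(range(j1, j2)):
            if i2 > i1:
                # Spread script words proportionally over the aligned STT span.
                last = owner[i1 + k * (i2 - i1) // (j2 - j1)]
            # Pure insertions (no STT span) stay in the last aligned cue.
            buckets[last].append(script_words[j])
    return [(s, e, " ".join(b)) for (s, e, _), b in zip(cues, buckets)]

cues = [(0.0, 2.0, "deploy to kuba netes"), (2.0, 4.5, "using terra form today")]
script = "Deploy to Kubernetes using Terraform today"
for cue in rebuild_cues(cues, script):
    print(cue)
```

Where the STT output diverges heavily from the script, a cue can end up empty or overloaded, so the rebuilt file still needs the timing spot-check and a read-through before upload.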
Team workflow and publishing checklist integration
Caption compliance for AI-generated video is a workflow discipline problem, not a technology problem. The technology solution (SRT export, glossary correction, LMS upload) is straightforward. The workflow solution requires that every L&D author who publishes to the LMS has the caption workflow steps embedded in their publishing checklist and understands that completing the video export is not the same as completing the publishing process.
In teams where L&D authors are frequent Synthesia or HeyGen users, the most effective intervention is a two-item publishing checklist prominently placed in the LMS publishing workflow: (1) "Did you upload the SRT file to the Captions tab?" and (2) "Did you play the video in learner preview and verify that captions appear and are accurate?" These two checkboxes, added to the LMS admin publishing form or to the team's course-publishing Jira/Asana template, catch the majority of AI video caption failures before the course goes live. The caption programme governance policy template provides a complete governance framework including publishing checklists, exception procedures, and audit triggers that L&D teams can adapt for AI-generated video.
WCAG 2.1 AA compliance requirements for AI-generated video
SC 1.2.2 and AI-generated content
WCAG 2.1 Success Criterion 1.2.2, Captions (Prerecorded), is a Level A criterion and is therefore required for Level AA conformance. It requires that captions are provided for all prerecorded audio content in synchronized media. The requirement applies identically to AI-generated training video and human-recorded training video. The production method (human narrator, AI avatar, or any other mechanism) is not a factor in whether SC 1.2.2 applies or what it requires. A Synthesia training module that an employee is assigned to complete is prerecorded synchronized media; it must have compliant captions.
Two elements of SC 1.2.2 are frequently tested in WCAG audits of AI-generated video. First, the accuracy requirement: captions must include equivalent information — which accessibility audit practice interprets as requiring the 99% word-level accuracy standard from the DCMP Captioning Key. AI-generated captions that fall below this standard (HeyGen's built-in STT captions on technical content at 73–84%, Synthesia LMS-auto-captioned at similar accuracy levels) do not meet SC 1.2.2. Second, the synchronization requirement: captions must be synchronized with the audio. Timing drift in Synthesia and HeyGen captions that exceeds ±2 seconds fails SC 1.2.2's synchronization element independently of word accuracy. A video can have accurate caption text but non-compliant timing. Both elements must be satisfied simultaneously.
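The two audit elements are independent checks, which a small gate function makes explicit. This is a sketch only: the 99% and ±2-second values are the thresholds this article attributes to audit practice, not numbers defined in the WCAG text, and `CaptionAudit` and `audit_findings` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class CaptionAudit:
    word_accuracy: float  # 0.0 to 1.0, measured against the script
    max_drift_s: float    # worst caption-vs-audio offset found, in seconds

def audit_findings(audit, min_accuracy=0.99, max_drift_s=2.0):
    """Return the list of failed elements; an empty list means both checks pass."""
    findings = []
    if audit.word_accuracy < min_accuracy:
        findings.append("accuracy")
    if abs(audit.max_drift_s) > max_drift_s:
        findings.append("synchronization")
    return findings

# Perfect text with drifted timing still fails, and so does the reverse case.
print(audit_findings(CaptionAudit(word_accuracy=1.0, max_drift_s=3.1)))
print(audit_findings(CaptionAudit(word_accuracy=0.80, max_drift_s=0.2)))
print(audit_findings(CaptionAudit(word_accuracy=0.995, max_drift_s=0.4)))
```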
ADA Title I applicability to AI-generated training video
ADA Title I covers employers with 15 or more employees and prohibits discrimination in terms, conditions, and privileges of employment, including training and professional development. The EEOC's interpretive guidance (29 C.F.R. Part 1630, Appendix) identifies training as a covered benefit that must be provided accessibly. When an employer assigns a Synthesia-produced onboarding module, HeyGen-produced product training, or Descript-edited compliance training to employees, those modules must be accessible to employees who are deaf or hard of hearing.
This means: if the AI-generated training video does not have WCAG-compliant captions, and an employee who is deaf or hard of hearing is assigned to complete it, the employer has failed its ADA Title I obligation. The employee may request an accommodation; if the accommodation cannot be provided in a timely way (because the SRT file does not exist, or is incorrect, or the LMS cannot deliver it), the employer is in a potentially adverse ADA position. The scale of AI video adoption makes this a systemic risk: organisations that have deployed Synthesia or HeyGen at scale for training production without establishing a WCAG-compliant caption workflow have a portfolio of potentially non-compliant training content that grows with every new module produced.
ADA Title II and the 2026 deadline
The 2026 ADA Title II accessibility deadline (which passed on April 24, 2026, for public entities with population 50,000+) covers new web content and digital content, including training video hosted on public-entity websites, LMS platforms, and learning portals. Public universities that adopted Synthesia or HeyGen for lecture capture, course supplement video, or online course development fell under this deadline for new content. The US caption compliance matrix maps the specific deadlines and compliance obligations by entity type.
Public universities and community colleges that have deployed AI video at scale for online learning, a substantial and growing use case, face a specific compliance risk: the video production workflow was optimised for Synthesia's production efficiency, while no caption compliance workflow was established in parallel. University L&D and online learning teams that reached ADA Title II's April 2026 deadline with a library of AI-generated course video lacking WCAG-compliant captions faced the same remediation problem as organisations with large backlogs of human-recorded video. They also faced an additional complication: the TTS→STT accuracy gap makes LMS auto-captioning less reliable than it would be for human-recorded content.
The European Accessibility Act and AI-generated training video
The European Accessibility Act (EAA), which has been enforceable since June 2025, covers digital services provided to EU consumers and employees, including internal training platforms at EU-based employers and EU subsidiaries of non-EU employers. AI-generated training video deployed on LMS platforms accessible to EU employees is covered by the EAA's requirements for accessible digital services, which incorporate EN 301 549 and WCAG 2.1 AA as the technical standard.
EU-based organisations that adopted Synthesia or HeyGen for L&D (both platforms are used extensively in EU and UK L&D teams) must meet the same caption quality requirements as any other training video under EAA. The TTS→STT accuracy paradox does not create a regulatory exception; the EAA requires captions that accurately represent the audio content, and the origin of the audio (human narrator vs AI avatar) is not relevant to this requirement. EU L&D teams should also note that some EU member states have adopted national implementing legislation for EAA with additional enforcement mechanisms beyond the EAA's directive requirements — the compliance obligation is not reduced by the fact that EAA enforcement infrastructure is still maturing.
LMS-specific delivery considerations for AI-generated video
How LMS platforms handle AI-generated video captions
Each LMS platform has specific behaviors when AI-generated video MP4 files are uploaded without a sidecar SRT. These behaviors determine what the learner experiences and what the compliance risk profile is for organisations that have not established the correct five-step caption workflow.
TalentLMS
TalentLMS (TalentLMS caption workflow) supports SRT and VTT sidecar upload per video content item. When a Synthesia or HeyGen MP4 is uploaded without a sidecar, TalentLMS does not auto-caption the video. The learner's video player displays no caption option. For hearing-impaired learners, the video is effectively inaccessible. This failure is visible to the learner, because the player has no CC button to press, but it is silent at the admin level: TalentLMS does not warn on upload that no caption file was provided. Teams must verify caption delivery by inspecting the video in the learner view.
Docebo
Docebo (Docebo caption workflow) supports SRT and VTT caption upload per video item. In some Docebo configurations, video is hosted through Docebo's integrated video hosting (Docebo Shape) or through external video hosting (YouTube, Vimeo, Wistia). When Synthesia MP4 is uploaded to Docebo Shape, no auto-captioning activates on upload. When the same MP4 is embedded from YouTube (a common workflow for Docebo L&D teams that publish to both YouTube and Docebo), YouTube's auto-captioning activates on the YouTube-hosted copy and those captions appear in the Docebo player — but they are YouTube's STT-generated captions on the TTS audio, not the Synthesia SRT. This creates a false confidence: the Docebo player shows captions for the video, the admin sees the CC button active, and the assumption is that captioning has been handled. The captions are YouTube auto-captions on AI avatar TTS audio, with accuracy on technical content in the 73–84% range.
Cornerstone OnDemand
Cornerstone OnDemand (Cornerstone caption workflow) has a specific behavior for video uploaded to Cornerstone's learning modules: if no caption sidecar is provided at upload, Cornerstone in some configurations activates its auto-captioning pipeline, which runs STT on the video audio. For Synthesia or HeyGen video, this activates the TTS→STT paradox in Cornerstone's auto-captioning pipeline. The resulting auto-generated caption track is flagged in the LMS admin as "captioned" — which it is, in the sense that a caption track exists. Whether that caption track meets 99% accuracy on the technical content in the video requires independent measurement. L&D teams using Cornerstone with AI-generated video should disable auto-captioning at the module level and manually upload the verified SRT for each AI video asset.
Kaltura
Kaltura (Kaltura caption workflow) offers a captioning service tier (Kaltura REACH) that automatically generates captions for uploaded video. For AI-generated training video, Kaltura REACH's machine-generated tier (which uses STT on the audio) will produce the TTS→STT accuracy degradation described in this post. Kaltura REACH's human-reviewed tier (which sends the audio to human transcribers and applies machine pre-processing) is more accurate, but the service does not know that the audio source is TTS rather than human speech. Human reviewers correct the machine output, yet fixing TTS→STT errors on technical vocabulary requires additional context (the organization's glossary) that Kaltura REACH transcribers do not have. The correct approach for Kaltura deployments with AI-generated video is to provide the SRT file directly (bypassing REACH auto-captioning) and use GlossCap's glossary-corrected output rather than Kaltura REACH's machine tier.
Panopto
Panopto (Panopto caption workflow) is the dominant video platform in higher education, and corporate Panopto deployments often host AI-generated lecture supplement video and online course content. Panopto automatically generates captions for all uploaded video using its ASR captioning engine. For Synthesia or HeyGen video uploaded to Panopto, this auto-captioning activates and produces TTS→STT captions. Panopto marks these videos as captioned in the admin interface. The auto-generated captions are editable in Panopto's caption editor, but most content owners do not edit auto-generated captions unless specifically prompted to do so.
For university L&D teams that have adopted Synthesia for course supplement video and deployed it through Panopto, the result is that the entire Synthesia video library in Panopto is marked as "captioned" with auto-generated captions on TTS audio — accuracy unknown without measurement, and likely below 99% for any content with technical vocabulary. This is the portfolio-level compliance risk for AI video in higher education: auto-captioning on TTS video produces a library of nominally captioned but potentially non-compliant content at scale. The enterprise LMS caption audit methodology covers the approach to auditing this kind of portfolio systematically.
Workday Learning
Workday Learning (Workday Learning caption workflow) supports caption file upload for video content. Workday does not auto-caption uploaded video; if no caption file is provided, no captions appear in the Workday Learning video player. The consequence for AI-generated video without a sidecar SRT is a video that renders in Workday Learning with no caption option — the same silent failure as TalentLMS. For large enterprises that use both Workday as the HRIS/LMS and Synthesia or HeyGen for training video production, establishing the SRT upload step in the Workday content-publishing workflow is the critical intervention. Workday Learning's content publishing interface requires that the caption file be attached as a "Related Task" or through the specific accessibility configuration in the Learning content type — the exact path depends on the Workday implementation.
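The no-sidecar behaviours described above can be summarised as a lookup table. The sketch below hard-codes them as this post describes them; it does not query any LMS API, and the dictionary keys and the `caption_outcome` helper are hypothetical names chosen for illustration. Actual behaviour depends on each deployment's configuration.

```python
# Behaviour when an MP4 is uploaded WITHOUT a sidecar SRT, as described in
# this post. Hard-coded for illustration; real deployments vary by config.
NO_SIDECAR_BEHAVIOUR = {
    "talentlms": "no_captions",
    "docebo_shape": "no_captions",
    "docebo_youtube_embed": "auto_captions",       # YouTube's STT track
    "cornerstone": "auto_captions_some_configs",
    "kaltura_reach_machine": "auto_captions",
    "panopto": "auto_captions",
    "workday_learning": "no_captions",
}

RISK = {
    "no_captions": "inaccessible: player shows no caption option",
    "auto_captions": "false confidence: STT captions on TTS audio, accuracy unmeasured",
    "auto_captions_some_configs": "check config: may auto-caption TTS audio",
}


def caption_outcome(lms: str, has_verified_srt: bool) -> str:
    """Describe what a learner gets for an AI-generated video on a given LMS."""
    if lms not in NO_SIDECAR_BEHAVIOUR:
        raise KeyError(f"unknown LMS key: {lms}")
    if has_verified_srt:
        return "ok: verified SRT is the caption source"
    return RISK[NO_SIDECAR_BEHAVIOUR[lms]]
```

A table like this is mainly useful as a publishing checklist: any row that is not "ok" means the learner-view inspection step cannot be skipped.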
Eight failure modes in AI-generated video captioning
- Failure mode 1 — LMS auto-captioning activated on AI avatar audio
The most common failure: the AI video platform produces the correct SRT file, but the L&D author exports only the MP4, uploads only the MP4 to the LMS, and the LMS activates auto-captioning on the MP4 audio. The auto-captioning STT runs on the TTS avatar voice and produces the TTS→STT accuracy degradation on technical vocabulary. The LMS marks the video as captioned. The admin sees a CC button. The course goes live. The compliance reporting system shows the course as accessible. No one measures whether the auto-generated caption track meets 99% accuracy until a hearing-impaired learner reports inaccurate captions. By this point, the video has been live for weeks and potentially thousands of employees have completed it. This failure mode is platform-agnostic — it occurs with Synthesia, HeyGen, Descript, and Lumen5 video alike.
- Failure mode 2 — Timing drift in long AI avatar videos
For Synthesia and HeyGen videos over 10 minutes, the built-in caption timing accumulates drift beyond the WCAG ±2-second synchronization threshold. The caption text is accurate, but the timing is non-compliant. This failure is invisible to admins and to hearing users, because the captions look approximately right in a casual viewing. Hearing-impaired learners who read captions carefully notice that the text is consistently behind or ahead of the audio, making it difficult to follow the presentation at pace. In a formal WCAG audit, timing drift is testable and documented as an SC 1.2.2 failure. Long compliance training modules (20–30 minutes) produced in Synthesia are particularly susceptible — the most legally significant training content has the most timing drift.
- Failure mode 3 — HeyGen STT captions failing on content-specific vocabulary
HeyGen's built-in caption generation runs STT on its own TTS audio output. For L&D teams that evaluate HeyGen and test captions using a general-English demo script, the caption quality looks acceptable — general-English vocabulary is within the STT model's high-accuracy range. When the same team deploys HeyGen for their actual engineering onboarding, compliance training, or medical education content, the caption accuracy drops to the 73–84% range on content-specific vocabulary. The evaluation test did not expose the failure because the failure is vocabulary-specific. The correct evaluation approach — testing caption accuracy on the organisation's own technical vocabulary, not on a generic demo script — is described in the caption vendor accuracy evaluation methodology.
- Failure mode 4 — Descript Overdub re-captioning by LMS
An L&D team using Descript for training video editing uses Overdub to correct terminology errors, update product names after a launch, or add new regulatory requirements to an existing compliance training module. Descript's Overdub produces audio for these corrected segments that sounds like the original human narrator. The team exports the updated MP4 to the LMS. The LMS auto-captions the MP4 — including the Overdub segments — by running STT on the audio. The Overdub segments contain the organisation's highest-priority terminology updates, and those are exactly the segments where TTS→STT accuracy is lowest. The result: the corrections the team made are partially or fully mis-captioned in the LMS. The same update that fixed the video content introduced a captioning error at the correction point. This is invisible until the Overdub segment caption is specifically reviewed.
- Failure mode 5 — Re-rendered AI video with stale SRT timestamps
A product name changes; the L&D team updates the Synthesia script and re-renders the video. The re-rendered video has new audio timing for the revised script: phrase durations may differ slightly because the TTS model synthesizes the new content at slightly different timing. The existing SRT file (from the previous render) has timestamps aligned to the previous audio. The team uploads the new MP4 and the old SRT to the LMS. The captions now fail in two ways. Their timestamps no longer match the new audio, and their text for the revised script sections (which the team updated in the script but not in the SRT) still shows the old content. The SRT must be regenerated and re-verified on every video re-render, not only when the SRT itself is changed.
- Failure mode 6 — Systematic vocabulary error across the AI video library
As noted in the acoustic differences section: TTS→STT errors for a given (avatar, vocabulary term) combination are systematic, not random. If "HashiCorp" is mis-captioned as "Hashey Corp" by the LMS STT engine on a specific HeyGen avatar's voice, it will be mis-captioned identically in every HeyGen video in the library that uses that avatar and includes the word "HashiCorp." A team that has produced 40 DevOps training modules using HeyGen over the past year, all using the same avatar, has 40 videos in the LMS with the same systematic vocabulary errors. Without a glossary-based systematic correction, each of these errors must be found and corrected in 40 separate SRT files. With a glossary-based correction applied at upload time, a single glossary entry corrects all 40 instances simultaneously. This is the economic argument for glossary investment in AI video workflows.
- Failure mode 7 — Multi-language AI video without per-language caption verification
An organisation using HeyGen Video Translation or Synthesia's multi-language rendering feature produces training video in English and five European languages. The English caption workflow has been established — SRT export, glossary correction, LMS upload. The translated versions are produced by the AI platform in the target languages and exported as MP4 files. The assumption is that the translation process handles captioning. In most cases, it does not: the translated MP4 is delivered to the LMS without a sidecar SRT, and the LMS auto-captions the translated audio in each target language. The TTS→STT paradox applies to each target language independently — in some cases more severely than in English, because STT model coverage of technical vocabulary is lower in lower-resource languages. The same compliance training that was properly captioned in English may be auto-captioned below standard in French, German, Polish, and Portuguese. Each language version requires independent caption quality verification.
- Failure mode 8 — Lumen5 text-on-screen mistaken for caption compliance
L&D teams using Lumen5 to convert text content into video frequently use Lumen5's text-on-screen feature — lower-third captions, slide text, and headline text that appears as a visual element in the video. This text is embedded in the MP4 as a visual element; it is not a selectable caption track and is not accessible to screen-reader users or to learners who use the browser's accessibility tree to access caption content. When an accessibility reviewer asks whether the Lumen5 video has captions, the answer "yes, the text appears on screen" is not a compliant response. WCAG SC 1.2.2 requires synchronized captions delivered as a separately selectable caption track, not as embedded video graphics. Organisations that have deployed Lumen5 video for compliance training relying on text-on-screen as the accessibility mechanism have a non-compliant content library regardless of how well the text content itself is written.
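Failure mode 6 argues that one glossary entry can correct the same systematic error across a whole library. The following is a minimal sketch of that idea, assuming standard SRT cues (index line, timing line, then text lines) separated by blank lines. The `apply_glossary` function is a hypothetical illustration, not GlossCap's implementation, and it assumes glossary keys begin and end with word characters.

```python
import re


def apply_glossary(srt_text: str, glossary: dict[str, str]) -> str:
    """Replace known mis-transcriptions in caption text, leaving timing untouched.

    `glossary` maps a wrong form (as the STT engine wrote it) to the correct
    term. Matching is case-insensitive and limited to whole words.
    """
    patterns = [
        (re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE), right)
        for wrong, right in glossary.items()
    ]
    cues = re.split(r"\n{2,}", srt_text.replace("\r\n", "\n").strip())
    fixed_cues = []
    for cue in cues:
        lines = cue.split("\n")
        head, body = lines[:2], lines[2:]   # index + timing, then caption text
        fixed_body = []
        for line in body:
            for pattern, right in patterns:
                # A function replacement avoids backslash escapes in `right`.
                line = pattern.sub(lambda _m, r=right: r, line)
            fixed_body.append(line)
        fixed_cues.append("\n".join(head + fixed_body))
    return "\n\n".join(fixed_cues) + "\n"
```

Running the same glossary over every SRT in the library before upload is what turns 40 separate manual corrections into a single glossary entry.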
Seven-question FAQ
- Are Synthesia's built-in captions WCAG 2.1 AA compliant?
It depends on video length and on whether the SRT is delivered to the LMS. For videos under 3 minutes, Synthesia's script-derived captions are generally timing-compliant (within ±0.5 seconds) and text-accurate. For videos over 5 minutes, timing drift accumulates and may exceed the WCAG ±2-second synchronization threshold, producing a timing failure even when the text is accurate. For videos over 10 minutes, timing drift exceeds the WCAG threshold in most cases and requires a timing correction pass before LMS delivery.
The larger compliance risk is that the Synthesia SRT is lost at the LMS export step. If the LMS receives only the MP4 and auto-captions it by running STT on the Synthesia avatar's TTS voice, the resulting auto-captions do not meet 99% accuracy on technical vocabulary content. Whether Synthesia's built-in captions are WCAG-compliant is therefore conditional on the complete publishing workflow: export both MP4 and SRT, correct timing drift for longer videos, upload both to LMS, verify in the learner player. If any of these steps is missed, the compliance claim cannot be made. The Synthesia captions workflow page provides the step-by-step detail for each LMS platform.
- HeyGen says it provides captions automatically — are they compliant for ADA purposes?
Not reliably for technical L&D content. HeyGen generates captions by running STT on its own TTS audio output. For general-English soft-skills content, the accuracy is typically 91–95%, which is below the 99% WCAG 2.1 AA threshold but not dramatically so. For technical content (engineering onboarding, compliance training, medical education), accuracy drops to 73–84% on content-specific vocabulary. Neither range meets the 99% WCAG 2.1 AA standard.
"Captions available" in HeyGen's interface means a caption track is generated and displayed. It does not certify that the caption track meets WCAG accuracy requirements. ADA Title I requires effective captions: captions that provide equivalent access to the audio content for hearing-impaired employees. A caption track that misrepresents product names, regulatory citations, and procedure terminology at a 16–27% error rate (the complement of 73–84% accuracy) does not provide equivalent access. To use HeyGen video for ADA-compliant training delivery, the HeyGen SRT must be exported, corrected for technical vocabulary errors using an organizational glossary, and uploaded to the LMS as the caption source in place of (not in addition to) LMS auto-generated captions.
- If we use Descript to produce training video, do we need additional captioning steps beyond Descript's built-in captions?
For human-narrated segments, Descript's STT-grounded captions are typically the most accurate available from any of the four platforms discussed here, with word-level accuracy of 93–97% on L&D content before glossary correction. The additional steps required are: (1) export the Descript SRT alongside the MP4 (not just the video), and upload both to the LMS; (2) apply glossary correction for any technical vocabulary that Descript's STT mis-transcribed in human-voice segments; and (3) specifically verify that Overdub sections contain the correct text in the exported SRT.
The Overdub risk is the primary additional concern beyond standard captioning workflow. Any segment produced with Descript's Overdub feature is TTS audio in the exported MP4. If the LMS auto-captions the exported MP4, it will mis-transcribe Overdub segments on technical vocabulary at the same accuracy rate as any other TTS audio. The protection is consistently exporting and uploading Descript's SRT, which uses typed text for Overdub segments and does not rely on STT reconstruction of those segments. If the Descript SRT is always in the LMS caption chain, Descript content can meet 99% accuracy with a lighter correction burden than HeyGen content.
- Can we use the script file as our caption SRT directly, without going through any STT step?
In principle, yes — the script text is 100% accurate, and using it as the caption text eliminates the TTS→STT accuracy problem entirely. In practice, the challenge is timing synchronization: a plain script file does not have timestamps indicating when each sentence begins and ends in the rendered audio. A script file must be converted to an SRT by adding per-segment timestamps before it can be used as a caption file.
For Synthesia and HeyGen, the platforms provide this timing as part of their SRT export: Synthesia estimates timing from the TTS synthesis model, while HeyGen derives timing by running STT on the TTS audio. Both approaches have the limitations described in this post. A third approach, forced alignment of the script text to the rendered audio, uses a speech recognition model in forced-alignment mode, where the model aligns the known text to the audio rather than recognising unknown text. Forced alignment produces accurate word timestamps without the TTS→STT accuracy degradation, because the model is not classifying unknown phonemes; it is aligning known text to audio positions. Forced alignment tools (e.g., Montreal Forced Aligner, WhisperX in alignment mode) can process Synthesia or HeyGen audio and produce an SRT with the script text and accurate timing, combining the text accuracy of the script with timing accuracy close to Descript's word-timestamp grounding. For high-volume AI video production with significant technical vocabulary, forced alignment with organizational glossary application is the most reliable path to 99% caption accuracy.
- We produced 50 training videos with Synthesia over the past year. Do we need to audit all of them for caption compliance?
Yes, but prioritise by risk tier rather than auditing all 50 simultaneously. Tier 1 (highest priority): videos over 10 minutes (highest timing drift risk), videos assigned to employees as mandatory training (ADA Title I coverage), and videos with high technical vocabulary density (compliance, engineering, medical). For these, run the timing verification step described in this post and verify the SRT has been uploaded to the LMS as the caption source.
Tier 2 (medium priority): videos 5–10 minutes in length with moderate technical vocabulary. These are in the compliance-uncertain zone — some will be within WCAG timing tolerance, others will not. Spot-check timing at the 50% and 80% mark of each video; if drift is visible, apply correction. Tier 3 (lower priority): videos under 5 minutes, soft-skills content, and videos not assigned as mandatory training. These have lower timing drift risk and lower regulatory coverage; audit after Tier 1 and 2 are complete. For a 50-video library, the full audit typically takes 2–4 hours of focused effort — proportionally less than re-doing any of the videos, and substantially less than managing an ADA accommodation request or regulatory complaint about an inaccessible training module.
- How does ADA Title I apply to AI-generated training video — does the AI production method matter?
The production method is not legally relevant. ADA Title I prohibits discrimination in terms, conditions, and privileges of employment. Training is a covered benefit. An employer who assigns a Synthesia-produced onboarding module to an employee is providing training, and that training must be accessible to employees with disabilities who require captions. The employer's obligation is defined by the training assignment, not by the production method.
The AI production method is relevant operationally: it creates specific captioning challenges (TTS→STT accuracy paradox, timing drift, LMS export workflow gaps) that human-recorded video does not have in the same form. But it does not create a legal exception, a grace period, or a reduced compliance standard. An employer cannot argue that captions on an AI-generated training video meet the "equivalent access" standard if those captions are 73–84% accurate on the technical vocabulary in the video. The legal analysis is identical to human-recorded video: WCAG 2.1 AA is the accepted standard, 99% accuracy is the operationalised threshold, and an employee who is deaf or hard of hearing who cannot access the training content due to inaccurate captions has a cognisable ADA claim against the employer.
- What about HeyGen's video translation feature — does each translated version need its own caption compliance verification?
Yes. Each language version of a HeyGen-translated video is an independent media asset with its own audio track and its own caption compliance status. HeyGen Video Translation (which dubs the video into a target language using AI voice synthesis) produces a TTS voice in the target language for the dubbed audio. This TTS voice is then potentially processed by LMS STT in the target language for auto-captioning. The TTS→STT paradox applies to each translated version independently.
The additional complication for translated versions is that STT model coverage of technical vocabulary varies by language. English technical vocabulary (DevOps acronyms, regulatory identifiers, medical terminology) has more STT training data than the same vocabulary translated into French, German, Spanish, or Polish, because the technical L&D content universe is proportionally larger in English. STT accuracy on translated AI video with technical content may be lower in the target languages than in the English source — the same compliance training module that produces 76% TTS→STT accuracy in English may produce 68–72% in translated versions where the STT model's technical vocabulary coverage in the target language is thinner. Each language version requires independent caption quality measurement, SRT export, vocabulary correction in the target language (which requires target-language glossary coverage, not just English glossary), and LMS upload verification. The multi-language caption workflow adds per-language overhead that should be built into the AI video production budget for any organisation deploying training video to multilingual employee populations.
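The risk-tier triage from the library audit question above can be sketched as a small classifier. This is an illustration under stated assumptions: the `Video` fields and function names are hypothetical, and where the tier definitions in this post leave a case ambiguous (for example, a short video with moderate technical vocabulary), the sketch assigns the higher-priority tier.

```python
from dataclasses import dataclass


@dataclass
class Video:
    title: str
    minutes: float
    mandatory: bool        # assigned to employees as mandatory training
    vocab_density: str     # "low", "moderate", or "high" technical vocabulary


def audit_tier(video: Video) -> int:
    """Return 1 (audit first), 2, or 3 (audit last)."""
    if video.minutes > 10 or video.mandatory or video.vocab_density == "high":
        return 1
    # Assumption: either condition alone is enough to place a video in Tier 2.
    if video.minutes >= 5 or video.vocab_density == "moderate":
        return 2
    return 3


def spot_check_points(minutes: float) -> tuple[float, float]:
    """Playback positions (in seconds) at the 50% and 80% marks of the video."""
    total_s = minutes * 60
    return (total_s / 2, total_s * 4 / 5)


def audit_order(library: list[Video]) -> list[Video]:
    """Sort by tier, then longest first, since drift risk grows with length."""
    return sorted(library, key=lambda v: (audit_tier(v), -v.minutes))
```

Sorting the library this way puts long, mandatory, vocabulary-dense modules at the top of the audit queue, which matches the order of regulatory exposure described above.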
Achieve 99% caption accuracy on your AI-generated training video
If your L&D team uses Synthesia, HeyGen, Descript, or Lumen5 to produce training video, GlossCap can help you close the TTS→STT accuracy gap at scale. GlossCap applies your organizational glossary — product names, SDK identifiers, regulatory acronyms, procedure terms — to every caption file, converting the systematic vocabulary errors in AI-generated video captions into correct terminology before the SRT reaches the LMS. The result: 99% WCAG 2.1 AA caption accuracy on your AI-generated training library, regardless of which platform produced it or which LMS delivers it. Start with a free accuracy spot-check on one of your existing AI-generated training videos.