
Synthesia AI video captions: built-in caption limits, AI avatar voice and STT accuracy, and the LMS export workflow

Synthesia is the #1 AI video creation platform for L&D — used by 50,000+ companies including 60% of the Fortune 100 to produce professional training video from a text script and an AI avatar, with no camera, studio, or narrator required. Synthesia solves the video production problem. But it creates a distinctive new captioning problem that plays out in three failure modes: (1) Synthesia's built-in caption timing drifts in longer videos in ways that fail WCAG 2.1 AA's synchronization requirement; (2) the standard LMS export workflow — export the MP4, upload to the LMS — breaks the caption chain by leaving the SRT behind; and (3) when the LMS auto-captions the MP4 by running speech-to-text on the AI avatar's synthetic voice, accuracy degrades precisely at the technical vocabulary — SDK names, product identifiers, regulatory acronyms — that compliance and product training exist to teach. This page documents all three failure modes and the correct Synthesia-to-LMS caption workflow for WCAG 2.1 AA, ADA Title I, and EAA compliance.

TL;DR

Synthesia's built-in captions take their text from the script, so the text is accurate, but their timing may fail WCAG 2.1 AA synchronization in videos longer than ~5 minutes. The bigger practical failure: most teams export the MP4 without the SRT and upload only the MP4 to their LMS, so the LMS either shows no captions or runs auto-STT on the AI avatar voice, which degrades accuracy on technical vocabulary at exactly the terms the training covers. The correct workflow: export the Synthesia SRT alongside the MP4, run it through GlossCap's glossary-biased timing-correction pipeline, and upload both files to the LMS. ADA Title I requires accessible captions for every hearing-impaired employee assigned mandatory Synthesia training. The EAA applies from June 2025; EU companies should review whether its scope covers their internal training video. The Synthesia captioning paradox: eliminating the human speaker also eliminates what makes STT reliable (a natural human voice), making round-trip accuracy (script → TTS audio → STT → text) meaningfully lower than captioning a video of a human reading the same script.

What Synthesia is and where captioning fits

The AI avatar production model

Synthesia (founded 2017, London HQ) produces training video from text: the L&D author writes a script, selects an AI avatar, and Synthesia's rendering engine synthesizes an MP4 in which the avatar reads the script aloud with synchronized lip movement. No camera, studio, actor, or timeline required. A 10-minute training module that previously required a filming day can be produced in under an hour and updated at any time by editing the script and re-rendering. Plans: Starter ($22/month), Creator ($67/month), Enterprise (custom). SRT caption export is available on Creator tier and above.

The captioning paradox

Synthesia eliminates the human voice actor and replaces it with a synthetic TTS voice. This produces a captioning paradox. On one side: because Synthesia generates video from a text script, the exact caption text is already known — the script. Captioning should be trivial. Synthesia's built-in caption feature does derive captions from the script, so caption text accuracy is typically good. On the other side: when script-derived captions are not used (SRT lost in export, LMS runs auto-captioning instead), the fallback is STT on the MP4 audio. That audio is a synthetic TTS voice — a type STT models were not primarily trained on, making round-trip accuracy (text → TTS → STT → text) meaningfully lower than captioning a natural human voice reading the same content. Synthesia's most valuable feature — no human speaker needed — is exactly the feature that makes the captioning fallback unreliable.

The three Synthesia captioning failure modes

Synthesia captioning failures in real L&D deployments occur through three distinct mechanisms. All three can leave hearing-impaired learners without compliant captions even when the L&D team believes captions have been handled.

Failure mode 1 — Built-in caption timing drift in longer videos

Synthesia's built-in caption engine maps the known script text to estimated timing positions based on the TTS synthesis timing model. Caption text is accurate (derived from the script). Timing synchronization is the failure point.

WCAG 2.1 SC 1.2.2 requires captions to be synchronized with the audio but does not set a numeric tolerance; accessibility audit practice commonly treats ±2 seconds as the outer limit of acceptable caption-to-audio synchronization. In Synthesia videos up to roughly 3–5 minutes, the built-in synchronization typically stays within this tolerance. As video length increases, drift accumulates: a TTS timing model that is off by 0.1 seconds per phrase, always in the same direction, accumulates several seconds of error by the middle of a 15-minute compliance training video and visually obvious desynchronization by the end of a 20–30-minute onboarding module. L&D teams producing Synthesia videos longer than ~5 minutes should not rely on Synthesia's built-in caption timing as their WCAG compliance mechanism. The caption text may be accurate; the timing may not be.
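The drift arithmetic can be sketched in a few lines. This is a worst-case illustration, not a model of Synthesia's actual timing engine: it assumes every phrase adds the same error in the same direction, and the 10-second phrase length and the function names are assumptions chosen for the example.

```python
def cumulative_drift_ms(per_phrase_error_ms: int, phrase_len_s: int, at_time_s: int) -> int:
    """Worst-case drift at a playback position, assuming every completed
    phrase adds the same timing error in the same direction."""
    phrases_elapsed = at_time_s // phrase_len_s
    return per_phrase_error_ms * phrases_elapsed


def seconds_until_exceeded(tolerance_ms: int, per_phrase_error_ms: int, phrase_len_s: int) -> int:
    """Playback position (in seconds) at which drift first exceeds the tolerance."""
    phrases_needed = tolerance_ms // per_phrase_error_ms + 1
    return phrases_needed * phrase_len_s


# Assumed inputs: 100 ms error per phrase, one phrase every 10 s, 2 s tolerance.
print(cumulative_drift_ms(100, 10, 450))    # midpoint of a 15-minute video -> 4500
print(cumulative_drift_ms(100, 10, 1800))   # end of a 30-minute video -> 18000
print(seconds_until_exceeded(2000, 100, 10))  # -> 210 (3.5 minutes)
```

Under these assumed numbers the tolerance is crossed at about 3.5 minutes, which is consistent with the 3–5 minute range described here. Real errors may partly cancel, so treat the result as an upper bound.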

Failure mode 2 — The LMS MP4-without-SRT export

Synthesia exports video as MP4. On Creator tier and above, the SRT caption file is available as a separate export action. In practice, the dominant workflow is: export the MP4 → upload to LMS. The SRT is a separate interface action, and the mental model of "download the video" does not include "download the caption file" unless explicitly mandated in the publishing checklist.

When an MP4 without a sidecar SRT reaches the LMS, three outcomes are possible:

The MP4-without-SRT failure is invisible at the content-creation stage. The L&D author saw captions in Synthesia's preview. The failure occurs at the LMS upload step and is typically discovered only when a hearing-impaired learner reports broken captions — by which point the course has been live for weeks.

Failure mode 3 — AI avatar voice STT accuracy degradation

When the LMS auto-captions a Synthesia MP4, it runs speech-to-text on the audio track. That audio track is a synthetic TTS voice. All major STT systems — Whisper, Google Speech-to-Text, AWS Transcribe, Azure Speech, proprietary LMS STT engines — are trained primarily on natural human speech. Synthetic TTS voices are acoustically different from natural human speech in the dimensions STT models weight most heavily: prosodic regularity patterns, formant precision, and the absence of natural-speech boundary signals (disfluencies, breathing, co-articulation). STT accuracy on Synthesia audio is lower than on natural human speech for the same content, and the accuracy gap concentrates on technical vocabulary — acronyms, product names, regulatory terms, SDK identifiers — precisely the vocabulary that matters most in Synthesia's dominant L&D use cases.

The round-trip problem: Synthesia starts with text (the script), converts it to TTS audio, and the LMS converts the TTS audio back to text via auto-captioning STT. Each conversion step introduces error. The round-trip — script text → TTS voice → STT → caption text — does not recover the original script accurately. For technical vocabulary, the round-trip error rate can be high enough to render auto-captions on a cybersecurity or compliance Synthesia video semantically unreliable — even though the correct text exists as the script.
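One way to make the round-trip loss measurable is to check how many glossary terms from the script survive verbatim in the auto-generated transcript. The sketch below is a minimal illustration; the function names are hypothetical, and the sample transcript is hand-written to show the kind of substitutions generic STT makes on acronyms, not measured output from any engine.

```python
import re


def contains_term(text: str, term: str) -> bool:
    """True if the term appears verbatim (case-sensitive) as a whole token."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text) is not None


def term_recall(glossary: list[str], script: str, transcript: str) -> float:
    """Fraction of glossary terms used in the script that also appear,
    spelled and capitalized identically, in the transcript."""
    used = [t for t in glossary if contains_term(script, t)]
    if not used:
        return 1.0  # nothing to lose
    kept = [t for t in used if contains_term(transcript, t)]
    return len(kept) / len(used)


script = "Configure SAML SSO, then enable SCIM provisioning with OAuth 2.0."
# Illustrative transcript only (assumed, not real STT output).
transcript = "Configure Sam SSO, then enable skim provisioning with Oh auth 2.0."

print(term_recall(["SAML", "SSO", "SCIM", "OAuth"], script, transcript))  # -> 0.25
```

A recall well below 1.0 on a sample of published courses is a signal that the LMS is showing STT output rather than the script-derived captions.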

The AI avatar voice and STT accuracy: why synthetic speech is harder to transcribe

Acoustic differences between TTS voice and natural human speech

Modern TTS systems — including Synthesia's AI avatars — produce audio that sounds natural to human listeners, primarily because rhythm and intonation sound human-like. But the acoustic characteristics STT models use for phonemic discrimination differ from what human listeners use for naturalness. A TTS voice can sound highly natural while presenting an acoustic signal systematically different from the natural-speech training data in the dimensions that matter for STT accuracy: TTS prosody is model-generated and may not match the irregular stress patterns in the STT corpus; TTS formant frequencies are phonemically precise in ways natural speech (with its co-articulation and speaker variance) is not; and TTS voices lack the disfluency signals (ums, uhs, breathing) that natural speech uses as prosodic boundary markers. Together these differences lower STT accuracy on TTS voice, concentrated on the uncommon phoneme sequences where technical vocabulary lives.

Technical acronyms and product names contain phoneme combinations that are low-frequency in the conversational and broadcast speech corpora on which most STT models are trained. When a TTS voice rather than a natural human speaker produces them, these uncommon phoneme sequences sit even further from the STT model's training distribution, which amplifies the error rate. Practical examples from Synthesia training audio: "SAML" → "Sam" or "same"; "OAuth" → "Oh auth" or "author"; "FIDO2" → "fi do two" or "fi do to"; "FinCEN SAR" → "fin sin star"; "SCIM" → "skim" or "scheme"; "Kubernetes" → "Kuba nettys" or "cube netes."

Why the script text is the solution — and why it is not being used

The accurate text exists: the Synthesia script. It is 100% correct, written in the author's intended terminology and capitalization, and it is exactly what the avatar will say. The entire STT accuracy problem disappears if the script text is correctly synchronized to the video audio and delivered as the caption file. The captioning challenge for Synthesia video is not a speech recognition problem. It is a timing synchronization problem: mapping the known script phrases to the time positions in the rendered audio when each phrase is spoken, with sufficient precision for WCAG 2.1 AA.
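The idea of keeping the script text and borrowing only the timing from the audio can be sketched as follows. This is a simplified illustration, not GlossCap's or Synthesia's method: it assumes word-level timestamps are already available from some recognizer or aligner (the `timed_words` input is hypothetical), and it uses the standard library's `difflib.SequenceMatcher` to match script words to recognized words. A production pipeline would more likely use a dedicated forced aligner.

```python
import re
from difflib import SequenceMatcher


def _norm(word: str) -> str:
    """Lowercase and strip punctuation so 'SSO,' matches 'sso'."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


def align_phrases(phrases, timed_words):
    """phrases: script cue texts, in order.
    timed_words: (word, start_s, end_s) tuples from an assumed recognizer.
    Returns (start_s, end_s, phrase) per cue; times are None if no word matched."""
    script_tokens = []  # (normalized token, index of owning phrase)
    for i, phrase in enumerate(phrases):
        for word in phrase.split():
            token = _norm(word)
            if token:
                script_tokens.append((token, i))

    a = [token for token, _ in script_tokens]
    b = [_norm(word) for word, _, _ in timed_words]
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)

    times = [[None, None] for _ in phrases]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            phrase_idx = script_tokens[block.a + k][1]
            _, start, end = timed_words[block.b + k]
            if times[phrase_idx][0] is None:
                times[phrase_idx][0] = start  # first matched word opens the cue
            times[phrase_idx][1] = end        # last matched word closes it
    return [(s, e, p) for (s, e), p in zip(times, phrases)]


phrases = ["Enable SAML SSO first.", "Then configure SCIM provisioning."]
timed_words = [  # misrecognized terms ("sam", "skim") still leave enough anchors
    ("enable", 0.0, 0.4), ("sam", 0.4, 0.8), ("sso", 0.8, 1.3), ("first", 1.3, 1.7),
    ("then", 2.0, 2.3), ("configure", 2.3, 2.9), ("skim", 2.9, 3.3),
    ("provisioning", 3.3, 4.1),
]
cues = align_phrases(phrases, timed_words)
```

The cue text keeps "SAML" and "SCIM" exactly as written in the script, while the start and end times come from the surrounding words that were recognized correctly.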

The reason the script is not being used is workflow friction. Synthesia's built-in captions use the script but synchronize imperfectly in longer videos (Failure Mode 1). The LMS export workflow loses the SRT entirely (Failure Mode 2). When the SRT is lost, the LMS falls back to STT on the TTS voice (Failure Mode 3). The workflow needs to be constructed deliberately to keep the script-derived captions in the caption chain from Synthesia through to the LMS player — which is exactly what the five-step workflow below does.

Synthesia to LMS export: where the caption chain breaks

LMS-specific caption chain requirements

Each LMS platform has a specific mechanism for attaching caption files to video content, and the failure mode when captions are omitted differs by platform:

The YouTube and host-platform auto-caption trap

When the Synthesia MP4 is uploaded to YouTube, Vimeo, or Microsoft Stream for hosting and embedded in the LMS via embed URL, the host platform's auto-captions (which activate by default on YouTube) run STT on the audio and produce Failure Mode 3 degradation on the AI avatar voice. Replacing the auto-captions with a corrected SRT requires four steps not in any default workflow: recognizing that auto-captions are inaccurate on TTS voice, downloading the corrected SRT from GlossCap, uploading it to the host platform as a replacement, and verifying the LMS player shows the uploaded track. These steps must be in the publishing checklist to be executed consistently.

Technical vocabulary in Synthesia's most common L&D use cases

Synthesia's dominant L&D categories all carry vocabulary surfaces that make generic STT inadequate:

In all cases, the Synthesia script is the authoritative vocabulary source. The captioning challenge is not recognizing what the avatar says — the script tells us. It is keeping the script's vocabulary in the caption chain through LMS delivery.
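Because the script is the vocabulary source, it can help to lint it before rendering. The sketch below flags places where a glossary term appears with a different capitalization than its canonical form. The function name and the sample glossary are assumptions for illustration, and the check only catches casing differences, not spacing variants such as "Fido 2".

```python
import re


def casing_issues(script: str, glossary: list[str]) -> list[tuple[int, str, str]]:
    """Return (position, found, canonical) for each glossary term that
    appears in the script with non-canonical capitalization."""
    issues = []
    for term in glossary:
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for match in pattern.finditer(script):
            if match.group(0) != term:
                issues.append((match.start(), match.group(0), term))
    return sorted(issues)


script = "Enable saml sso, then set up Oauth 2.0 and FIDO2."
for pos, found, canonical in casing_issues(script, ["SAML SSO", "OAuth 2.0", "FIDO2"]):
    print(f"char {pos}: '{found}' should be '{canonical}'")
```

Running a check like this before each render keeps casing errors out of every caption file derived from the script.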

WCAG 2.1 AA, ADA Title I, and EAA for Synthesia video

WCAG 2.1 SC 1.2.2: what accurate and synchronized means

WCAG 2.1 Success Criterion 1.2.2 (Captions, Prerecorded) requires captions for all prerecorded synchronized media. Synthesia MP4 training videos are synchronized media. The WCAG 2.1 AA standard requires:

ADA Title I: mandatory training for hearing-impaired employees

ADA Title I (42 U.S.C. § 12112) requires employers with 15+ employees to provide reasonable accommodations for employees with disabilities. Any hearing-impaired employee assigned mandatory training — harassment prevention, cybersecurity awareness, compliance training, safety training, product onboarding — through an LMS has an ADA Title I right to receive that training accessibly. The fact that the video was produced by an AI avatar does not change the obligation. Synthesia's caption capability does not satisfy it if those captions are absent in the LMS player (Failure Mode 2) or inaccurate (Failure Mode 3). Mandatory compliance training is the highest-priority target: inaccurate captions on required compliance training are not a technical gap — they are a failure to train hearing-impaired employees on the required content.

EAA: EU-based companies and internal training video

The European Accessibility Act (Directive 2019/882) requires digital services meeting its scope to satisfy WCAG-based accessibility standards from June 2025. EU-based companies — including EU headquarters of global organizations and EU subsidiaries of US multinationals — using Synthesia for employee L&D must review their EAA posture for internal training video delivered to EU employees. See our EAA captions requirements reference for the full scope analysis. Where EAA applies, the captioning requirement is WCAG 2.1 AA: accurate, synchronized captions for all prerecorded training video with audio.

GDPR considerations

Synthesia videos are created from scripts and AI avatars — they do not contain real individuals' faces, voices, or biometric data, which removes the biometric and voice-data GDPR concerns that arise when captioning meeting recordings or camera-captured training. The script-based production model means Synthesia video is lower GDPR risk for captioning workflows than most other video types. GDPR Article 28 applies if Synthesia scripts contain personal data (names of real individuals, employee information, customer data) — in which case both Synthesia's data processing of your script content and GlossCap's processing of the audio or script for captioning require Data Processing Agreements. For the majority of Synthesia L&D content (compliance training, product training, technical onboarding) where scripts are free of personal data, the captioning workflow presents no GDPR concern beyond standard data processing controls.

The correct Synthesia-to-LMS caption workflow

Five-step workflow for WCAG 2.1 AA compliant Synthesia captions

  1. Review the Synthesia script for technical vocabulary. Before rendering, ensure the script uses the exact spelling, capitalization, and acronym form that should appear in captions — "SAML SSO" not "saml sso," "OAuth 2.0" not "oauth," "FIDO2" not "Fido 2." The script is the primary caption text source; accuracy in the script propagates to accuracy in the captions.
  2. Export the MP4 from Synthesia. Standard Synthesia MP4 export with all overlays, screen-record clips, and callouts included in the rendered output.
  3. Generate a corrected SRT via GlossCap. Submit the Synthesia MP4 audio (or the Synthesia script text) to GlossCap with your company glossary. GlossCap applies glossary-biased timing synchronization — aligning script phrases to the rendered audio with WCAG-compliant precision, using your product vocabulary and regulatory term register to verify and correct technical terms throughout. If you provide the script text directly, GlossCap uses it as the primary text source with audio-aligned timing for each phrase. Output: a corrected SRT with script-accurate vocabulary at audio-precise timing.
  4. Upload both the MP4 and the SRT to your LMS. Both files. In your LMS course builder, attach the SRT (or VTT — convert if required by the LMS) to the video item. Do not rely on LMS auto-captioning. Verify the caption track is attached before publishing. See our TalentLMS, Docebo, Cornerstone OnDemand, and Workday Learning captions references for LMS-specific upload instructions.
  5. Verify caption display in the LMS player before publishing. Play the course as a learner would. Confirm: CC button present and functional; captions appear when enabled; caption text accurate at the opening, midpoint, and end of the video. The end-of-video check detects timing drift — if drift was present in the SRT and not corrected, caption-audio desynchronization will be visible at the end of longer videos.
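Step 4 mentions converting SRT to VTT when the LMS requires it. A minimal conversion sketch is shown below, assuming a well-formed SRT: WebVTT needs a `WEBVTT` header line and uses a dot instead of a comma before the milliseconds, while numeric SRT cue indexes can stay as cue identifiers. The function name is hypothetical, and the sketch does not handle styling tags or caption text that itself contains "-->".

```python
import re

_SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT caption text to WebVTT text."""
    # Drop a byte-order mark and normalize Windows line endings.
    body = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    out_lines = []
    for line in body.split("\n"):
        if "-->" in line:  # only touch timing lines, never caption text
            line = _SRT_TIMESTAMP.sub(r"\1.\2", line)
        out_lines.append(line)
    return "WEBVTT\n\n" + "\n".join(out_lines) + "\n"


srt = (
    "1\n00:00:00,000 --> 00:00:02,500\nEnable SAML SSO first.\n\n"
    "2\n00:00:02,500 --> 00:00:05,000\nThen configure SCIM, if needed.\n"
)
vtt = srt_to_vtt(srt)
```

Commas inside the caption text are left untouched because only lines containing the timing arrow are rewritten. Verify the converted file in the LMS player as described in step 5.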

Building caption upload into the publishing checklist

The MP4-without-SRT failure is a process failure, not a technology failure. L&D teams that consistently produce compliant Synthesia captions make the caption upload an explicit required step in the course-publishing checklist — not an optional enhancement. A minimal Synthesia caption checklist:

Checklist integration is especially important for rapid-production compliance training rollouts — Synthesia's speed advantage (scripts to MP4 in minutes) increases caption-chain failure risk if the captioning workflow does not run at the same operational cadence as the video production.
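Part of the checklist can be automated. The sketch below scans an export folder and reports any MP4 that has no caption file with the same base name. The folder layout and the same-name convention are assumptions for the example, not a Synthesia or LMS requirement.

```python
from pathlib import Path


def missing_sidecars(export_dir, caption_exts=(".srt", ".vtt")) -> list[str]:
    """Return names of MP4 files in export_dir that have no caption file
    with the same base name (e.g. module1.mp4 without module1.srt)."""
    missing = []
    for mp4 in sorted(Path(export_dir).glob("*.mp4")):
        if not any(mp4.with_suffix(ext).is_file() for ext in caption_exts):
            missing.append(mp4.name)
    return missing
```

A publishing script could refuse to proceed while `missing_sidecars("exports/")` returns a non-empty list (the `exports/` path is a placeholder), which turns the caption upload from a habit into a gate.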

Synthesia vs. other AI video platforms

HeyGen, D-ID, Pictory, Lumen5, and Descript share all three Synthesia captioning failure modes. These are structural properties of the AI-video-from-text production model: script-derived timing imprecision, the MP4-only export pattern, and STT accuracy degradation on synthetic TTS voice. Synthesia is the market entry point for this category because of its scale, but the workflow documented here applies equally to any AI avatar or AI video platform output. For screen-record training video (Camtasia, Loom), the captioning challenge is different — natural human voice, so STT accuracy on the voice is higher, but no authoritative script text exists. See our Camtasia captions reference for the screen-record comparison.


FAQ — Synthesia AI video captions

Does Synthesia's built-in caption feature meet WCAG 2.1 AA?

Synthesia's built-in captions derive their text from the script, so caption text accuracy is typically good. Whether they meet WCAG 2.1 AA SC 1.2.2 depends on timing synchronization, which degrades in longer videos. For videos up to ~5 minutes, built-in timing is typically adequate. For longer videos (10–30 minutes is common for compliance, onboarding, and product training modules), timing drift accumulates and may exceed the ±2-second synchronization tolerance commonly applied in accessibility audits by the midpoint and end of the video. WCAG 2.1 AA compliance for longer Synthesia videos requires timing-corrected captions, not the raw Synthesia export. And regardless of timing quality, Synthesia's captions reach the learner only if the SRT file was exported from Synthesia and uploaded to the LMS alongside the MP4, which the standard export workflow does not do.

Why does LMS auto-captioning fail on Synthesia video?

LMS auto-captioning and video-platform auto-captioning (YouTube, Vimeo, Wistia, Microsoft Stream) fail on Synthesia video because they run speech-to-text on the MP4 audio, which is a synthetic TTS voice. All major STT systems are trained primarily on natural human speech. Synthetic TTS voices are acoustically different from natural human speech in the dimensions STT phoneme models weight most heavily — prosodic regularity, formant precision patterns, and the absence of natural disfluency signals. The accuracy gap between natural-voice STT and TTS-voice STT concentrates at technical vocabulary: acronyms, product names, regulatory terms, and SDK identifiers fail more severely on TTS voice audio than on natural human voice audio. Since Synthesia is predominantly used for technical training content — compliance, cybersecurity, product training, software onboarding — LMS auto-captioning on Synthesia audio fails most at exactly the vocabulary the training exists to teach. The solution is not better STT: it is using the Synthesia script as the caption text source, bypassing the TTS-voice STT accuracy problem entirely.

HeyGen, D-ID, Pictory — do they have the same captioning problem?

Yes. HeyGen, D-ID, Pictory, Lumen5, Descript, and any other AI avatar or AI video platform share all three Synthesia captioning failure modes: script-derived timing imprecision, the MP4-without-SRT LMS export pattern, and STT accuracy degradation on synthetic TTS voice audio. These are structural properties of the AI-video-from-text production model, not Synthesia-specific limitations. The five-step captioning workflow documented here applies to any AI video platform output: script vocabulary review, MP4 export, external SRT generation with glossary correction and timing alignment, MP4 + SRT LMS upload, and LMS player verification.

What is the difference between script-based captions and STT-based captions for Synthesia video?

Script-based captions use the Synthesia script text as caption content, synchronized to the video audio's timing. STT-based captions run speech-to-text on the MP4 audio to generate both caption text and timing simultaneously. For Synthesia video, script-based captions are the superior approach: (1) the script text is 100% accurate — it is what the author intended — while STT on TTS voice produces accuracy degradation on technical vocabulary; and (2) the script uses the author's intended capitalization and acronym forms, while STT normalization may apply different capitalization to the same terms. The practical challenge with script-based captions is timing precision: the script must be synchronized to the audio with WCAG 2.1 AA precision, and Synthesia's built-in synchronization may not achieve this for longer videos. GlossCap's Synthesia workflow uses the script as the primary text source and applies audio-aligned timing synchronization — combining script-based accuracy with audio-based timing precision — to produce a WCAG-compliant SRT that neither Synthesia's built-in captions nor LMS auto-captioning reliably achieves.

Is there a GDPR issue with sending Synthesia video to a captioning service?

Sending Synthesia video to a captioning service is data processing under GDPR Article 28 if the video or script contains personal data. Synthesia's script-based production model means the video itself typically contains no personal data — no real individuals' faces, voices, or biometric data. The GDPR question is whether the script text or any visual overlay contains personal data: names of real individuals, employee information, customer data, or other GDPR-regulated content. For most Synthesia L&D content (compliance training, product training, technical onboarding scripts without personal data), the captioning workflow presents no GDPR concern beyond standard data processing controls. If scripts contain personal data, a Data Processing Agreement with the captioning service is required. GlossCap maintains DPA terms for enterprise customers. Separately: Synthesia's processing of your script content on its servers is an Article 28 relationship; review Synthesia's DPA terms and data residency options for EU-hosted content if your scripts include sensitive business or personal information.
