Format Guide · Published 2026-06-01

SRT vs VTT vs TTML vs STL: the L&D operator's guide to caption formats, LMS compatibility, and conversion pitfalls

Four caption file formats have survived long enough to matter in the training video world: SRT (the plain-text universal), VTT (the HTML5 native), TTML (the XML structured format inherited from broadcast), and STL (the binary broadcast legacy). Every L&D operator eventually faces a platform that refuses one of them, and the rejection message rarely explains why. This post covers what each format actually contains, which platforms accept which, the three silent corruption bugs that break your captions during conversion before a single frame plays, and a decision tree for choosing the right export format for your LMS stack.

TL;DR

Use SRT unless your platform explicitly requires something else. It is the only format accepted by every major LMS without additional configuration. Use VTT for Panopto, Vimeo, and custom HTML5 players that render styling. Use TTML/DFXP only when Kaltura's REACH integration or a broadcast contract specifically calls for it. Use STL only when a station or broadcaster mandates it and you have a conversion tool — never generate it by hand. The biggest source of upload failures is not the format choice itself but file hygiene: a UTF-8 BOM, Windows CRLF line endings, or a timing-separator typo will corrupt your captions silently in most parsers. Fix the hygiene before you fix the format.

Quick-reference: format support by platform (✓ = accepted, ✗ = rejected, ⚠ = accepted with conditions)
Platform	SRT	VTT	TTML/DFXP	STL	Notes
TalentLMS	✓	✗	✗	✗	SRT only; comma timing separator required
Docebo	⚠	⚠	✗	✗	BCP-47 lang code required (en-US not en); rejects bare ISO 639-1
Absorb LMS	⚠	✗	✗	✗	SRT, no BOM, LF line endings only; CRLF breaks cue boundaries
Kaltura	✓	✓	✓	✗	DFXP (TTML profile) via caption asset API; REACH service uses SRT internally
Cornerstone OnDemand	✓	✓	✗	✗	SRT recommended for broadest player compatibility
Workday Learning	✓	✓	✗	✗	VTT preferred for native HTML5 player; SRT supported via conversion
Brightspace / D2L	✓	✓	✗	✗	HTML5 <track> element accepts both; SRT broadly recommended
Canvas LMS	✓	✓	✗	✗	Instructure Media uses VTT natively; SRT auto-converted on upload
Moodle	✓	✓	✗	✗	HTML5 video <track> element; VTT is the W3C standard for <track>
Panopto	✓	✓	✗	✗	VTT preferred for import; SRT also accepted
Vimeo	✓	✓	✗	✗	API upload requires raw VTT body (not multipart); SRT accepted via UI
Wistia	✓	✓	✗	✗	Requires ISO 639-2/T three-letter codes (eng, not en) via API
YouTube	✓	✓	✓	✗	TTML/DFXP accepted; SRT and VTT indexed for video search
LinkedIn Learning	✓	✓	✗	✗	Custom content upload; LinkedIn's library courses already captioned by LinkedIn
SAP Enable Now	⚠	✗	✗	✗	No built-in STT; MP4 exports require external captioning → SRT by convention
Salesforce Trailhead	✓	✓	✗	✗	Embedded video via Vimeo; format follows Vimeo rules
360Learning	✓	✓	✗	✗	SRT recommended; VTT accepted by underlying HTML5 player
Schoology	✓	✓	✗	✗	HTML5 <track> standard applies; SRT broadly supported
Microsoft Stream	✓	✓	✗	✗	Auto-captions in Teams/Stream; manual SRT/VTT upload supported
Loom	✓	✓	✗	✗	Custom caption upload via SRT; VTT accepted in embed contexts
Camtasia	✓	✓	✗	✗	Authoring tool imports SRT; exports SRT/VTT on publish

The four formats at a glance

The four formats occupy distinct niches that have almost nothing to do with caption quality and almost everything to do with where the format was invented and who adopted it first.

SRT (SubRip Subtitle) is the output format of SubRip, a Windows program released in 2000 for ripping subtitles off DVDs to pair with VCD encodes. It is plain text with no formal standard, which is both why every player in the world supports it and why small deviations silently break some players. Its universality makes it the right default for every L&D workflow that does not have a specific reason to use something else.

VTT (WebVTT, Web Video Text Tracks) is the W3C format for the HTML5 <track> element, first drafted in 2010 and published as a W3C Candidate Recommendation in 2019. It is structurally similar to SRT but uses a period rather than a comma as the millisecond separator, adds a mandatory WEBVTT header, and supports cue metadata, voice spans, region positioning, and CSS styling hooks that SRT cannot express. Every major browser's built-in video player parses VTT natively. Platforms built on HTML5 video (Panopto, Vimeo, Canvas, Workday, Brightspace) prefer or require it.

TTML (Timed Text Markup Language) is an XML-based format maintained by the W3C Timed Text Working Group, with its lineage in DFXP (Distribution Format Exchange Profile), which was defined for broadcast-content interchange. TTML1 was a W3C Recommendation in 2010; TTML2 in 2018. Its XML structure makes it extensible and styleable (font, colour, positioning, region layout) in ways that SRT and VTT cannot match, which is why broadcast contracts and some high-end LMS configurations require it. For most L&D work, TTML's verbosity is a cost with no benefit — the content accuracy that determines WCAG compliance is the same regardless of format.

STL (EBU STL, specified in EBU Tech 3264) is a binary format defined by the European Broadcasting Union in 1991. It is unrelated to the text-based Spruce STL format from DVD authoring, which shares the same extension. Unlike the other three, it is not plain text: it is a binary file built from fixed-size records, a 1024-byte General Subtitle Information (GSI) block followed by 128-byte Text and Timing Information (TTI) blocks. You will encounter STL in broadcast-adjacent training programmes (TV production, media-company onboarding), PBS/CPB-contracted content under older Section 508 specifications, and any context where the deliverable is a master subtitled broadcast file rather than a web-served caption sidecar. You will almost never generate STL manually; the right tool for STL is a broadcast-authoring application or a dedicated converter.

SRT: anatomy, pitfalls, and what every L&D LMS actually parses

The exact format

An SRT file is a sequence of cue blocks separated by blank lines. Each block has exactly three required elements: a sequence index (a positive integer, starting at 1), a timecode line, and one or more lines of caption text.

1
00:00:02,500 --> 00:00:05,200
The glossary prompt is applied before
Whisper decodes each segment.

2
00:00:05,200 --> 00:00:08,900
This keeps product names like Kubernetes
and Docebo from being mangled.

3
00:00:09,100 --> 00:00:12,400
The accuracy floor is 99% on DCMP criteria.

The timecode format is HH:MM:SS,mmm --> HH:MM:SS,mmm. The key details:

Hours are always two digits (00, not 0). Some parsers accept single-digit hours; many do not.
The decimal separator between seconds and milliseconds is a comma. This is the most common source of parse failures when converting from VTT (which uses a period).
The separator between start and end timecodes is -->: two hyphens and a greater-than sign, with a single space on each side. Some tools emit the arrow without the surrounding spaces (00:00:02,500-->00:00:05,200); this breaks VLC, ffmpeg, and several LMS parsers.
Milliseconds are always three digits. 00:00:05,20 is incorrect; it should be 00:00:05,200.

That is the entire format. There is no header, no footer, no encoding declaration, no schema. The simplicity is why it became universal. The lack of a formal spec is why edge cases break things in ways that vary by parser.
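
The timecode rules above can be checked mechanically before upload. The following is a minimal sketch, not a reference parser: the function name parse_srt_timing is illustrative, and the pattern deliberately enforces the strictest reading of the rules (two-digit hours, comma separator, three-digit milliseconds, single spaces around the arrow) even though some players accept looser input.

```python
import re

# Strict SRT timecode line: two-digit hours, comma before three-digit
# milliseconds, and " --> " with exactly one space on each side.
SRT_TIMING = re.compile(
    r"^(\d{2}):([0-5]\d):([0-5]\d),(\d{3}) --> "
    r"(\d{2}):([0-5]\d):([0-5]\d),(\d{3})$"
)

def parse_srt_timing(line: str) -> tuple[int, int]:
    """Return (start_ms, end_ms), or raise ValueError for a malformed line."""
    match = SRT_TIMING.match(line)
    if match is None:
        raise ValueError(f"not a strict SRT timecode line: {line!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
    start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    return start, end

print(parse_srt_timing("00:00:02,500 --> 00:00:05,200"))  # (2500, 5200)
```

Working in integer milliseconds keeps later comparisons and conversions exact.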

The three silent corruption bugs

The SRT bugs that L&D teams encounter are overwhelmingly caused by file hygiene rather than content errors. The three most common are invisible in a text editor and produce platform-specific failures that are difficult to diagnose without knowing what to look for.

Bug 1: UTF-8 BOM at the start of the file

A UTF-8 BOM (Byte Order Mark) is a three-byte sequence (EF BB BF) that some Windows applications, including Notepad, older versions of Word, and some Windows-native caption editors, prepend to UTF-8 files as an encoding signature. On Windows, it is harmless and invisible. To an SRT parser that does not strip it, as on many Linux-based LMS backends, it is a three-byte non-printable prefix on the sequence index of the first cue. Parsers that expect the first cue to begin with a digit encounter a non-digit character and skip the first cue entirely or, in the worst case, fail to parse the file at all and report "invalid caption file" with no further explanation.

The fix: strip the BOM before upload. In Python: open('file.srt', encoding='utf-8-sig').read() reads and strips the BOM automatically. In a hex editor, delete the first three bytes. In VS Code: bottom-right corner shows "UTF-8 with BOM" — switch to "UTF-8" and save.

Affected platforms include TalentLMS (first cue is always dropped if BOM is present) and Absorb LMS (parser fails entirely on BOM-prefixed files, returning a generic upload error).

Bug 2: Windows CRLF line endings

SRT files created on Windows typically use CRLF line endings (\r\n); files created on macOS or Linux use LF (\n). SRT has no formal spec to say which is correct, and most parsers handle both. Absorb LMS's SRT parser is sensitive to CRLF in a specific way: it treats the trailing \r character as part of the line, so the blank line that separates cue blocks is read as a line containing a lone \r rather than as an empty line. Cue boundaries are then misidentified: all cues after the first appear concatenated as a single cue, or alternate cues are dropped, depending on the parser version.

The fix: convert to LF before upload. In Python: content.replace('\r\n', '\n'). In VS Code: bottom-right corner shows "CRLF"; click it, select "LF," and save. With GNU sed: sed -i 's/\r$//' file.srt.
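
Both hygiene fixes so far (BOM and line endings) can be applied in one pass over the raw bytes. This is a small sketch with an illustrative function name, normalize_srt_bytes; it works on bytes rather than text so it does not depend on guessing the file's encoding first.

```python
import codecs

def normalize_srt_bytes(raw: bytes) -> bytes:
    """Strip a leading UTF-8 BOM and convert CRLF (and any stray CR) to LF."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Replace CRLF first so a Windows line ending becomes one LF, not two.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

dirty = codecs.BOM_UTF8 + b"1\r\n00:00:02,500 --> 00:00:05,200\r\nHello\r\n"
print(normalize_srt_bytes(dirty))
```

Reading the file with open(path, "rb") and writing the result back with open(path, "wb") avoids any newline translation by Python itself.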

Bug 3: Comma-vs-period timing separator mismatch

The most common mistake when converting between SRT and VTT is inverting the millisecond separator. SRT uses a comma (00:00:05,200); VTT uses a period (00:00:05.200). A file with the WEBVTT header stripped and the extension changed from .vtt to .srt still contains period separators. When uploaded to TalentLMS or Docebo, the timecode line fails to parse, and the platform either skips affected cues or reports the file as invalid. The opposite error — a VTT file with comma separators — fails VTT validation entirely because the W3C WebVTT parser spec requires periods.

The fix is mechanical: a text replace of the millisecond separator throughout the file, combined with adding or removing the WEBVTT header. Any reliable conversion tool (ffmpeg, caption.ninja, SubtitleEdit) handles this correctly. The risk is in ad-hoc string operations — always validate the output with a player before uploading to the LMS.
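
To illustrate why ad-hoc string operations are risky: a blanket replace of commas with periods would also rewrite caption text such as "1,000". The sketch below swaps the separator only on lines that match the timecode pattern. It is a simplified illustration (the name srt_to_vtt is ours), it assumes LF line endings, and it does not replace a real converter for files with styling or positioning.

```python
import re

# Match whole timecode lines only, in either dialect (comma or period).
TIMING = re.compile(
    r"^(\d{2}:\d{2}:\d{2})[,.](\d{3}) --> (\d{2}:\d{2}:\d{2})[,.](\d{3})$",
    re.MULTILINE,
)

def srt_to_vtt(srt_text: str) -> str:
    """Add the WEBVTT header and use periods on timecode lines only."""
    body = TIMING.sub(r"\1.\2 --> \3.\4", srt_text.lstrip("\ufeff"))
    return "WEBVTT\n\n" + body

srt = "1\n00:00:02,500 --> 00:00:05,200\nCosts rose by 1,000 units.\n"
print(srt_to_vtt(srt))
```

The caption text keeps its comma while the timecode line gets periods, which is the behaviour a naive text replace gets wrong.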

Encoding, line length, and WCAG line limits

SRT files must be UTF-8. The only widespread alternative you will encounter is ISO-8859-1 (Latin-1), which was used for subtitle files before Unicode became universal. A Latin-1 SRT looks correct in a browser or text editor that detects encoding, but fails silently when the LMS backend expects UTF-8 and encounters the high-byte characters that appear in French (é, à, ç), German (ä, ö, ü), Spanish (ñ), and any other language with diacritics. The diagnostic: accented characters appear as garbled sequences (é → é or Ã©) in the caption player. The fix: re-encode the file as UTF-8, either in a text editor ("Save As → UTF-8") or with iconv -f latin1 -t utf-8 input.srt > output.srt.
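
The same re-encoding can be done in a script. This is a hedged sketch: read_caption_text is an illustrative name, and the fallback to Latin-1 is an assumption that holds only if Latin-1 really is the one legacy encoding in your archive. A file in another legacy encoding (Windows-1252, for example) would decode without error but with some wrong characters.

```python
def read_caption_text(raw: bytes) -> str:
    """Decode caption bytes as UTF-8, falling back to Latin-1 (assumed legacy)."""
    try:
        return raw.decode("utf-8-sig")  # also drops a leading BOM if present
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")

legacy = "Résumé de la séance".encode("latin-1")
text = read_caption_text(legacy)
utf8_bytes = text.encode("utf-8")  # what should be uploaded to the LMS
print(text)
```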

For WCAG 2.1 AA compliance, Success Criterion 1.2.2 (Captions — Prerecorded) does not specify line length or cue duration. The DCMP (Described and Captioned Media Program) scoring criteria — the accuracy floor used by most compliance auditors — set the following practical limits:

Maximum two lines per cue. Three-line cues obscure video content and indicate that cue boundaries were drawn at the wrong points.
Maximum ~42 characters per line for broadcast-standard readability; up to 80 characters per line is acceptable for non-broadcast training video where the player is a large window. The DCMP guide uses 32–42 characters for broadcast. For LMS-embedded players, 60–80 characters per line is common practice.
Minimum cue duration: 0.3 seconds. Very short cues (100ms or less) from AI segmentation tools are unreadable in the player and cause visible flickering. The DCMP floor is 0.3 seconds; for comfortably readable captions, aim for at least 1.0 second.
Maximum reading speed: 17 characters per second (3Play Media / DCMP standard). For a 5-second cue, that is 85 characters of text — or about two lines at 42 characters. If your AI tool is generating cues with more text than that, the segmentation algorithm needs adjustment.

These limits are content quality standards, not format standards. They apply equally to SRT, VTT, TTML, and STL. The format determines how the content is represented; the quality criteria determine whether the content passes an audit. For the full accuracy measurement protocol, see our post on the DCMP 99% accuracy threshold.
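
The four limits above lend themselves to an automated pre-audit check. The sketch below uses the thresholds quoted in this section as defaults; the Cue class and check_cue function are illustrative names, and counting characters without line breaks for the reading-speed figure is our assumption, since auditors differ on that detail.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str  # caption lines joined with "\n"

def check_cue(cue: Cue, max_lines: int = 2, max_chars: int = 42,
              min_ms: int = 300, max_cps: float = 17.0) -> list[str]:
    """Return human-readable problems; an empty list means the cue passes."""
    problems = []
    lines = cue.text.split("\n")
    duration_ms = cue.end_ms - cue.start_ms
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines (max {max_lines})")
    longest = max(len(line) for line in lines)
    if longest > max_chars:
        problems.append(f"line of {longest} chars (max {max_chars})")
    if duration_ms < min_ms:
        problems.append(f"duration {duration_ms} ms (min {min_ms})")
    if duration_ms > 0:
        cps = sum(len(line) for line in lines) / (duration_ms / 1000)
        if cps > max_cps:
            problems.append(f"reading speed {cps:.1f} cps (max {max_cps})")
    return problems

ok = Cue(0, 5000, "x" * 42 + "\n" + "y" * 42)  # 84 chars over 5 s = 16.8 cps
flicker = Cue(0, 100, "Hi")                     # 100 ms cue from bad segmentation
print(check_cue(ok))  # []
print(check_cue(flicker))
```

For LMS-embedded players, passing max_chars=80 matches the looser line-length practice described above.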

VTT: anatomy, when it beats SRT, and the Docebo BCP-47 interaction

The WEBVTT header and timing syntax

A VTT file must begin with the string WEBVTT on the first line, optionally followed by a space and a description. A blank line follows the header, then the cue blocks. This header is not optional — an SRT file renamed to .vtt without the header will fail W3C VTT validation and will be rejected by platforms that validate format before accepting the upload.

WEBVTT GlossCap export 2026-06-01

1
00:00:02.500 --> 00:00:05.200
The glossary prompt is applied before
Whisper decodes each segment.

2
00:00:05.200 --> 00:00:08.900
This keeps product names like Kubernetes
and Docebo from being mangled.

NOTE
This file was exported by GlossCap with glossary-biased Whisper decoding.

3
00:00:09.100 --> 00:00:12.400
The accuracy floor is 99% on DCMP criteria.

Key differences from SRT:

The millisecond separator is a period (00:00:05.200), not a comma.
Cue identifiers are optional in VTT (not required as they are in SRT). They can be strings (intro-1) or integers.
The NOTE block allows inline comments that are ignored by players but preserved in the file — useful for QC audit trails.
Hours can be omitted from timecodes when the video is under one hour: 00:02.500 --> 00:05.200. This is valid VTT but invalid SRT.
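
The optional-hours rule is the difference most likely to trip a hand-written converter. The following sketch parses a VTT timestamp with or without hours and re-emits it in SRT form; the function names are illustrative, and the pattern covers only the timestamp itself, not cue settings that may follow it on the same line.

```python
import re

# VTT timestamp: optional hours (two or more digits), then MM:SS.mmm.
VTT_TS = re.compile(r"^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$")

def vtt_timestamp_to_ms(ts: str) -> int:
    match = VTT_TS.match(ts)
    if match is None:
        raise ValueError(f"not a VTT timestamp: {ts!r}")
    hours = int(match.group(1) or 0)  # hours group is absent in short form
    minutes, seconds, millis = (int(match.group(i)) for i in (2, 3, 4))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis

def ms_to_srt_timestamp(ms: int) -> str:
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

print(ms_to_srt_timestamp(vtt_timestamp_to_ms("00:02.500")))  # 00:00:02,500
```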

Styling, voice spans, and region positioning

VTT's differentiating features over SRT are its styling and positioning capabilities. Three mechanisms are relevant to L&D work:

CSS styling via ::cue: VTT files can include a STYLE block that applies CSS to the rendered captions. The ::cue pseudo-element targets cue text; ::cue(b) targets bold tags within cues; ::cue(.class) targets cue-level class annotations. In practice, LMS platforms and video hosts vary wildly in whether they honour VTT styling — Panopto ignores it, Vimeo applies some of it, YouTube ignores all inline style blocks in uploaded VTT files. For training video where the caption content is the compliance artifact (not the visual presentation), styling support is rarely the deciding factor in format choice.

Voice spans: VTT supports <v Speaker Name> tags inside cue text that identify the speaker. This is particularly useful for multi-speaker training video — panel discussions, interview-format compliance training, roleplay scenarios for sales enablement. A cue with a voice span renders as: <v Alex>The kubectl command applies the Helm chart.</v>. Platforms that support voice rendering (notably Panopto) display the speaker name in the caption overlay. Platforms that do not support voice rendering strip the tag and display the text without the speaker attribution. Either way, the cue text remains readable — the voice tag degrades gracefully.
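
When a VTT file with voice spans has to be downgraded to SRT, the speaker name is lost unless it is folded into the text. A minimal sketch of that fallback follows; split_voice and to_srt_text are illustrative names, and the pattern handles only the simple case where the whole cue is a single voice span.

```python
import re
from typing import Optional, Tuple

# A cue whose whole text is one voice span: <v Name>text</v> (closing tag optional).
VOICE_SPAN = re.compile(r"^<v(?:\.[^\s>]+)?\s+([^>]+)>(.*?)(?:</v>)?$", re.DOTALL)

def split_voice(cue_text: str) -> Tuple[Optional[str], str]:
    """Return (speaker, text); speaker is None when there is no voice span."""
    match = VOICE_SPAN.match(cue_text.strip())
    if match is None:
        return None, cue_text.strip()
    return match.group(1).strip(), match.group(2).strip()

def to_srt_text(cue_text: str) -> str:
    """Fold the speaker into the caption text for formats without voice tags."""
    speaker, text = split_voice(cue_text)
    return f"{speaker}: {text}" if speaker else text

print(to_srt_text("<v Alex>The kubectl command applies the Helm chart.</v>"))
# Alex: The kubectl command applies the Helm chart.
```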

Region positioning: VTT supports REGION blocks that define named screen regions with scroll behaviour. This is primarily a broadcast-accommodation feature (positioning captions away from on-screen text in news programmes) and is rarely used in training video. Most LMS players ignore region positioning from uploaded VTT files even when they otherwise parse VTT correctly.

Platform support: where VTT is preferred

Panopto prefers VTT for manual caption import. Its captions API accepts both SRT and VTT, but the Panopto viewer was rebuilt on a WebVTT-native rendering engine — VTT files import without conversion overhead and preserve cue identifier metadata that Panopto uses in its transcript search index.

Vimeo's caption API requires a raw VTT body in the upload request — not a multipart form-data payload that most HTTP upload libraries default to. The specific API call is a PUT to the text track URI with Content-Type: text/vtt and the raw VTT content as the body. This is documented in the Vimeo API reference but catches many L&D teams who generate working SRT files and assume they can upload SRT through the same endpoint. The Vimeo UI (browser upload) accepts both formats and handles the conversion internally.
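
The raw-body distinction is easier to see in code. The sketch below only builds the request object and does not send it; the URL is a placeholder, authentication headers are omitted, and the exact endpoint and flow should be taken from the Vimeo API reference rather than from this example.

```python
import urllib.request

def build_raw_vtt_upload(upload_url: str, vtt_text: str) -> urllib.request.Request:
    """Build a PUT whose body is the VTT file itself, not a multipart form."""
    return urllib.request.Request(
        upload_url,
        data=vtt_text.encode("utf-8"),         # raw bytes of the caption file
        method="PUT",
        headers={"Content-Type": "text/vtt"},  # not multipart/form-data
    )

# Placeholder URL for illustration only; a real one comes from the API response.
req = build_raw_vtt_upload(
    "https://example.invalid/texttrack-upload",
    "WEBVTT\n\n00:00:02.500 --> 00:00:05.200\nHello\n",
)
print(req.get_method(), req.get_header("Content-type"))  # PUT text/vtt
```

Most HTTP libraries switch to multipart encoding when given a "files" argument, which is the default this paragraph warns about; passing the bytes as the body avoids it.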

Canvas LMS uses Instructure Media (formerly Arc) as its video infrastructure. Instructure Media normalises all caption uploads to VTT internally. SRT files uploaded to Canvas are silently converted at ingest; VTT files skip the conversion step. Either format reaches the learner as VTT — the difference is only in upload reliability and whether the conversion step introduces any timing drift (it does not, in Canvas's current implementation).

The Docebo BCP-47 interaction

The most operationally disruptive VTT platform requirement is Docebo's language code enforcement. Docebo's caption API (the POST /learn/v1/videos/{id}/subtitles endpoint) requires a language_code parameter that must be a BCP-47 language tag with a region subtag. The API accepts en-US, en-GB, fr-FR, de-DE, and other full locale codes. It returns HTTP 422 on bare ISO 639-1 two-letter codes (en, fr, de), even though those are themselves valid BCP-47 tags, with a validation error in the response body.

This matters for VTT because the format can mark language inside cue text with <lang> spans, but Docebo does not read any language information from the VTT file. The language code is a separate API parameter. A VTT file that is correctly formed and contains accurate captions will be rejected by Docebo if the calling code passes language_code: "en" instead of language_code: "en-US". The SRT format has the same requirement because language code is a separate API field regardless of format. The fix is to update the integration script to always pass the full BCP-47 locale. For a detailed walkthrough of the Docebo subtitle endpoint, see the LMS caption ingestion engineering post.
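
A guard in the integration script can catch bare codes before the API does. This is a sketch under stated assumptions: the default-region table below is our own choice using the locales named in this post, not something Docebo defines, and the pattern checks only the common language-region shape rather than the full BCP-47 grammar.

```python
import re

# language (2-3 letters), optional script, then a region (2 letters or 3 digits)
LOCALE_TAG = re.compile(r"^[A-Za-z]{2,3}(?:-[A-Za-z]{4})?-(?:[A-Za-z]{2}|\d{3})$")

# Assumed defaults for bare codes; adjust to your own audience.
DEFAULT_LOCALE = {"en": "en-US", "fr": "fr-FR", "de": "de-DE", "es": "es-ES"}

def to_full_locale(tag: str) -> str:
    """Return a tag with a region subtag, or raise if none can be chosen."""
    if LOCALE_TAG.match(tag):
        return tag
    try:
        return DEFAULT_LOCALE[tag.lower()]
    except KeyError:
        raise ValueError(f"no default region known for {tag!r}") from None

print(to_full_locale("en"))      # en-US
print(to_full_locale("es-419"))  # es-419
```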

TTML: when the broadcaster demands XML

File structure

TTML is XML. A minimal TTML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="en-US"
    xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling"
    xmlns:ttp="http://www.w3.org/ns/ttml#parameter">
  <head>
    <styling>
      <style xml:id="s1"
             tts:fontFamily="Arial"
             tts:fontSize="100%"
             tts:color="white"
             tts:backgroundColor="black"/>
    </styling>
    <layout>
      <region xml:id="r1"
              tts:origin="10% 80%"
              tts:extent="80% 15%"
              tts:displayAlign="before"/>
    </layout>
  </head>
  <body>
    <div>
      <p begin="00:00:02.500" end="00:00:05.200" style="s1" region="r1">
        The glossary prompt is applied before
        <br/>
        Whisper decodes each segment.
      </p>
    </div>
  </body>
</tt>

Key structural elements:

The <tt> root element must declare the xmlns="http://www.w3.org/ns/ttml" namespace. Without the namespace, the file is valid XML but not valid TTML — parsers that check the namespace will reject it.
Timing uses the same period-as-millisecond-separator as VTT: begin="00:00:02.500".
The <head> section is optional but standard. It defines named styles and regions. Omitting it produces a valid minimal TTML file, but platforms that render TTML styling (YouTube, Kaltura) will use their default styles instead.
The <body> → <div> → <p> hierarchy is required. Each <p> element is one cue.
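
The required skeleton can be generated with Python's standard library, which also takes care of XML escaping. The sketch below is illustrative (build_ttml and ms_to_clock are our names), omits the optional head section, and produces a minimal document rather than one tuned for any particular platform.

```python
import xml.etree.ElementTree as ET

TT_NS = "http://www.w3.org/ns/ttml"
XML_NS = "http://www.w3.org/XML/1998/namespace"

def ms_to_clock(ms: int) -> str:
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"

def build_ttml(cues, lang: str = "en-US") -> str:
    """cues: iterable of (start_ms, end_ms, text) with lines joined by '\\n'."""
    ET.register_namespace("", TT_NS)  # serialise TTML as the default namespace
    tt = ET.Element(f"{{{TT_NS}}}tt", {f"{{{XML_NS}}}lang": lang})
    div = ET.SubElement(ET.SubElement(tt, f"{{{TT_NS}}}body"), f"{{{TT_NS}}}div")
    for start, end, text in cues:
        p = ET.SubElement(div, f"{{{TT_NS}}}p",
                          {"begin": ms_to_clock(start), "end": ms_to_clock(end)})
        first, *rest = text.split("\n")
        p.text = first
        for line in rest:  # each extra line follows a <br/> element
            ET.SubElement(p, f"{{{TT_NS}}}br").tail = line
    return ET.tostring(tt, encoding="unicode")

print(build_ttml([(2500, 5200, "The glossary prompt is applied before\n"
                               "Whisper decodes each segment.")]))
```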

DFXP: the Kaltura profile

DFXP (Distribution Format Exchange Profile) was the original name for a subset of TTML1 designed for content distribution. Kaltura's caption asset API refers to DFXP in its format selector (format=2 in the captionAsset.add call is DFXP/TTML). In practice, a TTML1-compliant file with a .dfxp extension is what Kaltura accepts. TTML2 files are not guaranteed to be accepted unless the Kaltura instance is on a recent enough backend version.

The operational implication for L&D teams using Kaltura: if your captioning workflow produces SRT and you need to send DFXP to Kaltura's REACH service, use ffmpeg to convert. The conversion command is:

ffmpeg -i input.srt output.ttml

ffmpeg generates a valid TTML1 file from SRT, including the namespace declaration and the head/body/div/p structure. The result can be renamed to .dfxp without modification. Timing is preserved exactly — ffmpeg does not alter timecodes in this conversion.

The one thing ffmpeg's SRT→TTML conversion does not preserve is speaker metadata (voice span annotations from VTT). If speaker attribution matters in your Kaltura content, the workflow is VTT→TTML using a VTT-aware converter that maps <v Speaker> to TTML tts:color or a separate region per speaker.

TTML profiles: TTML1, TTML2, EBU-TT, SMPTE-TT

The TTML family has fractured into several profiles used in different broadcast and streaming contexts. For L&D work, the distinction rarely matters — your LMS likely does not accept any TTML profile — but if you encounter a broadcast contract or PBS/CPB accessibility specification, you will see these names:

TTML1 / DFXP: The first W3C Recommendation (2010). What YouTube, Kaltura, and most streaming platforms accept when they accept TTML at all. The safe default.
TTML2: The second W3C Recommendation (2018). Adds animation, audio description, and extended region control. Not widely supported outside specialist players.
EBU-TT (EBU Tech 3350) and EBU-TT-D (EBU Tech 3380): European Broadcasting Union profiles used for European broadcast distribution and HbbTV (Hybrid Broadcast Broadband TV) streaming. Required for some European public broadcasting deliverables.
SMPTE-TT (ST 2052): Society of Motion Picture and Television Engineers profile used in US professional broadcast. Extends TTML1 with SMPTE time code addressing (drop-frame and non-drop-frame). You will see this in NBC/ABC/CBS content deliverables and high-end post-production workflows.

For Section 508 compliance in training video, no specific TTML profile is mandated. Section 508 (36 CFR Part 1194) references WCAG 2.0 Level AA for electronic content. WCAG 2.0 SC 1.2.2 requires captions but does not specify format. Any caption file that delivers synchronized, accurate text in a format the player renders correctly satisfies the standard.

Roundtrip fidelity: what TTML loses in conversion

TTML→SRT roundtrips lose styling information and region positioning — that is expected. The fidelity risk runs the other direction: SRT or VTT files that are converted to TTML and then back to SRT can accumulate timing drift if the conversion tool uses floating-point arithmetic for timecode representation. In practice, this is rare with current tools (ffmpeg, SubtitleEdit, Subtitle Workshop), but it is measurable when you diff the original and roundtripped files frame-by-frame. For compliance audit purposes, always retain the original file as the source of record and document the conversion chain.
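
Measuring drift does not require a frame-accurate player. Once both files are parsed to integer milliseconds, a comparison like the sketch below reports the worst deviation. The function name and the (start_ms, end_ms) tuple shape are assumptions for illustration.

```python
def max_drift_ms(original, roundtripped) -> int:
    """Largest start or end deviation, in ms, between two cue timing lists."""
    if len(original) != len(roundtripped):
        raise ValueError("cue counts differ; the roundtrip merged or dropped cues")
    return max(
        (max(abs(a_start - b_start), abs(a_end - b_end))
         for (a_start, a_end), (b_start, b_end) in zip(original, roundtripped)),
        default=0,  # two empty files have no drift
    )

source = [(2500, 5200), (5200, 8900), (9100, 12400)]
after_roundtrip = [(2500, 5200), (5201, 8899), (9100, 12400)]
print(max_drift_ms(source, after_roundtrip))  # 1
```

Recording this number alongside the conversion chain gives the audit trail a concrete figure rather than an assertion that timing was preserved.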

STL: the binary broadcast legacy

Why STL is binary

The EBU STL format (EBU Tech 3264, first published 1991) predates the web and was designed for broadcast distribution workflows where subtitles traveled on fixed-length tape-to-tape or satellite-uplink data streams alongside the video signal. Fixed-length binary records with a fixed header block were standard practice for machine-to-machine data interchange before XML or plain-text formats were viable at broadcast scale.

An STL file has two sections: a single 1024-byte General Subtitle Information (GSI) block at the start of the file that contains metadata — code page, display standard, frame rate, time code start, creation date, revision number, programme title, episode title — followed by a variable number of 128-byte Text and Timing Information (TTI) blocks, one per subtitle cue.

The GSI block format matters because two of its fields declare how text is encoded. The Code Page Number (CPN) field, typically 850 (the DOS multilingual Latin code page, which is not the same as ISO-8859-1), governs the text fields of the GSI block itself. The Character Code Table (CCT) field selects the table used for caption text in the TTI blocks: 00 for Latin (the usual choice for Western European content), 01 for Latin/Cyrillic, 02 for Latin/Arabic, 03 for Latin/Greek, and 04 for Latin/Hebrew. A file whose GSI declares the Latin table but whose TTI blocks contain UTF-8 text will display as garbled characters on broadcast STL readers. This is why "STL files with Unicode" is an ongoing frustration: the standard predates widespread Unicode adoption, and UTF-8 STL files rely on non-standard extensions to the EBU subtitling data exchange format that are not universally supported.
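
A quick sanity check on a delivered STL file is possible without a broadcast tool. The sketch below reads the first few GSI fields and verifies the block arithmetic. The byte offsets (CPN at 0-2, DFC at 3-10, CCT at 12-13) reflect our reading of EBU Tech 3264 and should be verified against the specification before relying on them; read_gsi_summary is an illustrative name.

```python
GSI_SIZE = 1024  # one General Subtitle Information block
TTI_SIZE = 128   # each Text and Timing Information block

def read_gsi_summary(data: bytes) -> dict:
    """Summarise the start of the GSI block and count the TTI blocks."""
    if len(data) < GSI_SIZE:
        raise ValueError("file is shorter than one GSI block")
    payload = len(data) - GSI_SIZE
    if payload % TTI_SIZE != 0:
        raise ValueError("TTI section is not a whole number of 128-byte blocks")
    return {
        # Offsets below are assumptions taken from the spec's field table.
        "code_page_number": data[0:3].decode("ascii", errors="replace"),
        "disk_format_code": data[3:11].decode("ascii", errors="replace"),
        "character_code_table": data[12:14].decode("ascii", errors="replace"),
        "tti_blocks": payload // TTI_SIZE,
    }

# Synthetic file for illustration: a padded GSI header plus two empty TTI blocks.
fake = (b"850" + b"STL25.01" + b"0" + b"00").ljust(GSI_SIZE, b" ") + bytes(2 * TTI_SIZE)
print(read_gsi_summary(fake))
```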

When you encounter STL in L&D

STL surfaces in L&D workflows in four narrow contexts:

Broadcast-affiliated training content. If your organisation produces training video for public television, public access channels, or any broadcaster governed by FCC closed-captioning rules (47 CFR Part 79), the broadcast distribution workflow typically requires STL as the caption deliverable alongside the video master.
PBS/CPB-contracted educational media. Some PBS station content agreements reference EBU STL in their technical delivery specifications, particularly for pre-2015 contracts that were drafted before VTT became standard.
Media and TV production company training. Internal onboarding and production training at broadcast or post-production companies may use STL because the company's existing QC tools read STL natively.
Archive or master delivery. When delivering a caption master file to a broadcaster's archive system, STL may be specified because the archive system was designed around STL and has never been updated to accept VTT.

For L&D teams outside these contexts, STL will not appear in your workflow. The format has no support in any LMS, any HTML5 player, or any web video host. If a vendor or specification document asks for STL for web-delivered training content, the specification is almost certainly copied from a broadcast delivery template and can be negotiated to SRT or VTT without loss of compliance.

STL conversion: tools and workflow

Never attempt to write STL by hand. The binary format, fixed-length fields, and code page encoding make manual creation impractical and error-prone. The correct workflow:

Generate captions in SRT or VTT — the format you are already comfortable with and that your AI captioning tool produces.
Convert to STL using SubtitleEdit (free, Windows/Linux, open source), EZConvert (commercial, broadcast-grade, Mac/Windows), or FAB Subtitler (commercial, the broadcast standard).
Configure the GSI block in the conversion tool: set the code page number (850 is typical) and the character code table (Latin for Western European languages), the frame rate (25 for PAL, 29.97 or 30 for NTSC, or the frame rate of your video master), and the time code start (usually 00:00:00:00 for a production master, but confirm with the broadcaster).
Validate the output in the tool before delivering. SubtitleEdit shows the decoded GSI metadata in a properties dialog; verify that the frame rate and code page match the specification.

For organisations that produce STL regularly for broadcast delivery alongside web-hosted training content: maintain separate SRT masters for the web content and generate STL from those masters per-delivery, rather than trying to manage STL files in your LMS workflow. The SRT is your source of truth; STL is a broadcast-specific derivative.

Platform-by-platform deep dives: the decisions that matter

TalentLMS: SRT only, comma separator mandatory

TalentLMS accepts SRT and only SRT. The caption upload surface (Course → Unit → Video → Captions tab) rejects VTT, TTML, and STL with a "file type not supported" error. The SRT parser requires a comma as the millisecond separator — a VTT file with period separators uploaded with an .srt extension will either fail to parse or display all cues at timecode 00:00:00,000 (a common manifestation when the parser misreads the period as a second-level boundary rather than a millisecond separator). The UTF-8 BOM issue causes the first cue to disappear silently in TalentLMS, which is particularly damaging if your video opens with a title card that introduces the module. TalentLMS does not report BOM-related parse failures — the upload succeeds, but the caption track is shorter than expected.

The language code in TalentLMS captions is set via the interface dropdown at upload time, not derived from the file. The SRT file itself requires no language header.

Absorb LMS: SRT with strict hygiene requirements

Absorb LMS has the strictest SRT hygiene requirements of any LMS in common L&D use. The requirements are undocumented in Absorb's public knowledge base (as of mid-2026) but have been confirmed by L&D engineering teams through trial and error and in Absorb's partner-channel support portal:

No UTF-8 BOM. BOM-prefixed files cause a generic "caption upload failed" error.
LF line endings only, no CRLF. CRLF line endings cause cue boundaries to be misidentified — subsequent cues after the first may be concatenated into a single cue, or the file may fail to upload entirely.
No trailing whitespace on cue text lines. Some AI captioning tools add trailing spaces to cue text; Absorb's parser includes those spaces in the rendered caption, which can cause visual artifacts in certain font sizes.
Maximum 500 cues per file. Files with more than 500 cue blocks (approximately 25 minutes of densely captioned content) may fail to upload via the Absorb admin API in some instance configurations. For videos longer than 25 minutes, verify with your Absorb instance configuration whether this limit applies.
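
The four requirements above can be turned into a preflight report that runs before any upload attempt. This is a sketch based only on the behaviour described in this section, not on Absorb documentation; preflight is an illustrative name, and counting lines that contain " --> " is a simple approximation of the cue count.

```python
import codecs

def preflight(raw: bytes, max_cues: int = 500) -> list[str]:
    """Return hygiene issues found in raw SRT bytes; empty means clean."""
    issues = []
    if raw.startswith(codecs.BOM_UTF8):
        issues.append("UTF-8 BOM present")
    if b"\r" in raw:
        issues.append("CR or CRLF line endings present")
    # Normalise before the remaining checks so \r is not reported twice.
    text = raw.decode("utf-8-sig", errors="replace")
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    if any(line != line.rstrip(" \t") for line in lines):
        issues.append("trailing whitespace on one or more lines")
    cue_count = sum(1 for line in lines if " --> " in line)
    if cue_count > max_cues:
        issues.append(f"{cue_count} cues (limit {max_cues})")
    return issues

sample = codecs.BOM_UTF8 + b"1\r\n00:00:02,500 --> 00:00:05,200\r\nHello \r\n"
print(preflight(sample))
```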

Absorb has no bulk-caption upload API as of mid-2026 — each video requires individual SRT upload through the admin interface or the partner API endpoint per course. For a full discussion of bulk-retrofit workflows for Absorb, see the LMS caption ingestion engineering post, which documents the three workarounds: admin CSV import, partner API, and browser automation.

Kaltura: the most format-flexible LMS video platform

Kaltura is the most format-flexible video platform in the LMS market, accepting SRT, VTT, and TTML/DFXP via the caption asset API. The Kaltura REACH service (automated captioning) produces SRT and sends it through the captionAsset.add → uploadToken.add → captionAsset.setContent API sequence automatically. For manual caption uploads (your own SRT files, GlossCap-generated VTT exports, TTML from a broadcast workflow), the same three-step API sequence accepts all three formats. The format field in the captionAsset.add call determines the expected format: 1 = SRT, 2 = DFXP/TTML, 3 = WebVTT.
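
In an upload script, the format value is best derived from the file rather than hard-coded. The sketch below maps file extensions to the three values listed above; the function name and the choice to treat .ttml the same as .dfxp are our assumptions, and the numeric values should be confirmed against your Kaltura client library.

```python
from pathlib import Path

# Values as listed in this post: 1 = SRT, 2 = DFXP/TTML, 3 = WebVTT.
CAPTION_FORMAT = {".srt": 1, ".dfxp": 2, ".ttml": 2, ".vtt": 3}

def kaltura_caption_format(filename: str) -> int:
    """Pick the captionAsset format value from the file extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return CAPTION_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"unsupported caption extension: {suffix!r}") from None

print(kaltura_caption_format("module-01.en-US.vtt"))  # 3
```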

Kaltura's search index on caption content (which powers its in-video search and transcript features) ingests SRT and VTT with equal fidelity. TTML files are converted to SRT internally before indexing. The xAPI caption-viewed statement that Kaltura emits for Docebo LRS integration references the caption asset ID regardless of format — there is no xAPI-level difference between SRT and VTT caption assets from a learning record store perspective.

Docebo: the BCP-47 and format-agnosticism interaction

Docebo's caption system is more format-agnostic than most — it accepts both SRT and VTT at the subtitle API endpoint. The critical constraint is not the format but the language code parameter. Because Docebo stores captions by language and presents a language selector to learners with multiple-language caption tracks, the BCP-47 locale code is load-bearing metadata. A script that uploads English captions as en instead of en-US will receive HTTP 422, and the learner will see no caption track at all even though the caption file was correctly generated. The fix is to update the integration to always pass full BCP-47 locale codes. For multi-language training content, pass fr-FR for French, de-DE for German, es-ES or es-419 for Spanish (Spain or Latin America), not the bare two-letter codes.

Docebo does not require the SRT or VTT file itself to contain any language metadata — the format file is language-neutral. The language is declared in the API call, not in the file.
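One way to keep bare two-letter codes out of the upload call is to normalise the language parameter before the request is built. The sketch below is an assumption-heavy illustration: the default-region table and the function name are hypothetical, and the pattern only covers the simple language-REGION shape used in this section, not the whole of BCP-47.

```python
import re

# Hypothetical default regions for bare two-letter codes; adjust to your audience.
DEFAULT_LOCALE = {"en": "en-US", "fr": "fr-FR", "de": "de-DE", "es": "es-ES"}

# language-REGION, where REGION is two letters (US) or three digits (419).
LOCALE_RE = re.compile(r"^[a-z]{2,3}-(?:[A-Z]{2}|\d{3})$")


def to_locale(code: str) -> str:
    """Return a language-REGION code, or raise ValueError if none can be derived."""
    code = code.strip().replace("_", "-")
    if LOCALE_RE.match(code):
        return code
    lang, _, region = code.partition("-")
    lang = lang.lower()
    if region:
        # Fix casing only, e.g. "EN-us" becomes "en-US".
        candidate = f"{lang}-{region.upper()}"
        if LOCALE_RE.match(candidate):
            return candidate
    elif lang in DEFAULT_LOCALE:
        return DEFAULT_LOCALE[lang]
    raise ValueError(f"Cannot derive a full locale code from {code!r}")
```

Calling to_locale("en") returns "en-US", while "es-419" passes through unchanged. An unknown bare code raises instead of guessing a region.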

Panopto: VTT preferred, transcript search, voice attribution

Panopto's video infrastructure is built on WebVTT natively. Manual caption import via the editor accepts both SRT and VTT; the import workflow converts SRT to VTT internally before indexing. The Panopto captions API (POST /Panopto/api/v1/sessions/{id}/captions) accepts JSON with a fileUrl pointing to either format, but VTT files skip the server-side conversion step and index more reliably in edge cases.

Panopto's transcript search feature uses the caption content for full-text indexing and jump-to-timestamp navigation. Both SRT and VTT produce identical search index entries. Voice spans (<v Speaker> in VTT) are rendered by Panopto as speaker labels in the transcript panel — a capability that has no SRT equivalent. For lecture-capture content with multiple speakers or Q&A sections, VTT with voice spans produces a richer transcript experience in Panopto than SRT without speaker attribution.

Wistia: ISO 639-2/T language codes and API multipart upload

Wistia has two platform-specific requirements that are easy to miss. First, Wistia's caption API uses ISO 639-2/T three-letter language codes (eng for English, fra for French, deu for German), not the ISO 639-1 two-letter codes (en, fr, de) or BCP-47 locale codes. Passing language=en to the Wistia captions API returns an error; the correct call uses language=eng. Second, the Wistia captions API endpoint accepts the caption file as multipart form data, unlike Vimeo, which requires a raw body. Sending a raw body to the Wistia endpoint, which expects multipart, results in a "file not received" error that is easy to misdiagnose as a format problem.

Wistia accepts both SRT and VTT. For the UI upload, format is auto-detected. For the API upload, specify the format in the request body. VTT is recommended for Wistia because it is the format that Wistia's player renders natively — SRT is converted internally at ingest.
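If the same pipeline feeds Docebo (locale codes) and Wistia (three-letter codes), a small translation step avoids hand-editing the language parameter. This is a sketch under stated assumptions: the table covers only four languages, and the function name is hypothetical.

```python
# ISO 639-1 to ISO 639-2/T for the languages mentioned in this guide.
# Extend the table for any other languages in your catalogue.
ISO_639_1_TO_2T = {"en": "eng", "fr": "fra", "de": "deu", "es": "spa"}


def wistia_language(code: str) -> str:
    """Map 'en', 'en-US' or 'en_US' to the three-letter code 'eng'."""
    primary = code.strip().replace("_", "-").split("-")[0].lower()
    if len(primary) == 3:
        # Assumed to be a three-letter code already; not checked against ISO 639-2/T.
        return primary
    try:
        return ISO_639_1_TO_2T[primary]
    except KeyError:
        raise ValueError(f"No three-letter mapping for {code!r}") from None
```

The region part of a locale code is simply dropped, because the three-letter codes described above carry no region.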

The five conversion traps that corrupt captions silently

Format conversion between SRT, VTT, TTML, and STL is a routine part of multi-platform caption management. Most conversion tools handle the straightforward cases correctly. The five failure modes below are where tools fail in ways that produce corrupted captions that appear to upload successfully but display incorrectly or not at all.

Trap 1: Timing separator inversion (comma ↔ period)

The most common conversion bug. A VTT file with period separators is renamed to .srt without replacing periods with commas in the timecode lines. Or an SRT file is renamed to .vtt without replacing commas with periods. The result is a file that looks correct to a casual inspection — it has the right structure, the right text — but every timecode line fails to parse because the millisecond separator is wrong for the target format.

The diagnostic: if all your captions appear at the start of the video (00:00:00,000) or all at the end, or if cues are missing in the player, timing separator inversion is the first thing to check. Open the file in a text editor and look at two or three timecode lines. The separator should be a comma for SRT and a period for VTT.

The fix: a text-replace of the separator throughout the file, combined with adding or removing the WEBVTT header and optional cue identifiers. Any reliable conversion tool performs this step. The risk is in ad-hoc scripts — Python's str.replace(',', '.') applied naively to an SRT file will also replace commas in caption text (numbers in the thousands, lists), producing incorrect VTT output. The correct implementation uses a regex that matches only the timecode line pattern: re.sub(r'(\d{2}:\d{2}:\d{2}),(\d{3})', r'\1.\2', line).
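A minimal sketch of the SRT-to-VTT direction, assuming well-formed SRT input with hh:mm:ss,mmm timecodes. It rewrites the separator only on lines that are complete timecode lines, so commas in caption text are untouched. The function name is illustrative, and SRT sequence numbers are kept as VTT cue identifiers.

```python
import re

# Matches a complete SRT timecode line: start --> end, plus anything trailing.
TIMECODE_LINE = re.compile(
    r"^(\d{2,}:\d{2}:\d{2}),(\d{3})(\s+-->\s+)(\d{2,}:\d{2}:\d{2}),(\d{3})(.*)$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT text without touching caption text."""
    # Drop a leading BOM and normalise CRLF so the line split is predictable.
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n")
    out = ["WEBVTT", ""]  # header line followed by the required blank line
    for line in text.split("\n"):
        # Only full timecode lines match; caption text passes through unchanged.
        out.append(TIMECODE_LINE.sub(r"\1.\2\3\4.\5\6", line))
    return "\n".join(out)
```

A caption line such as "Revenue grew to 1,200,000 units" survives intact, because it never matches the timecode pattern.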

Trap 2: UTF-8 BOM added during Windows copy or save

A correctly formatted SRT or VTT file on a Linux server is downloaded to a Windows workstation for review, edited in Notepad or an older text editor, and saved back as UTF-8. Windows adds a BOM. The file is reuploaded. On platforms that reject BOM (TalentLMS, Absorb), the upload fails or the first cue disappears. On platforms that tolerate BOM (Docebo, Canvas), the file appears correct but may cause issues if the file is later processed by a Linux-based tool in the pipeline.

The diagnostic: download the file, open it in a hex editor (or VS Code with the Hex Editor extension), and check the first three bytes. If they are EF BB BF, a BOM is present. VS Code's status bar shows "UTF-8 with BOM" or "UTF-8" — check before saving.

The fix: strip the BOM before the file leaves the Windows workstation. In VS Code: click "UTF-8 with BOM" in the status bar → "Save with Encoding" → "UTF-8." In Notepad++: Encoding → "Convert to UTF-8 without BOM." Never use Windows Notepad for caption file editing: in older Windows versions it adds a BOM on save even when the file did not previously have one.

Trap 3: CRLF line endings from Windows editors

As described in the SRT section above, Windows CRLF line endings cause Absorb LMS to misidentify cue boundaries. The same issue can affect, less severely, other Linux-based parsers that are strict about blank-line interpretation. A "blank line" between SRT cues on Windows is \r\n\r\n; on Linux it is \n\n. A parser that splits on \n alone sees the separator line as containing a single carriage return rather than as an empty line, so it does not recognise the cue boundary and runs the next cue into the current one.

The diagnostic: if cue text is being merged across what should be boundaries, or if the LMS reports fewer cues than you expect, check the line endings. In VS Code, the status bar shows "CRLF" or "LF." In a hex editor, look for 0D 0A sequences (CRLF) vs 0A only (LF).

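The fix: convert to LF before upload. In VS Code: click "CRLF" in the status bar → "LF." In a Unix terminal with GNU sed: sed -i 's/\r$//' file.srt. In Python: content.replace('\r\n', '\n'), reading and writing the file with newline='' (or in binary mode) so that Python's own newline translation does not interfere. This is safe to apply unconditionally, because a file that already uses LF is left unchanged.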
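Traps 2 and 3 can be handled in one pass over the raw bytes. The sketch below is an illustration with hypothetical function names; it works on bytes so that Python's text-mode newline handling never gets involved.

```python
import codecs
from pathlib import Path


def clean_caption_bytes(raw: bytes) -> bytes:
    """Strip a leading UTF-8 BOM and convert CRLF line endings to LF."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Only CR LF pairs are rewritten; lone carriage returns are left alone.
    return raw.replace(b"\r\n", b"\n")


def clean_caption_file(path: str) -> bool:
    """Clean a caption file in place. Returns True if the file was changed."""
    p = Path(path)
    original = p.read_bytes()
    cleaned = clean_caption_bytes(original)
    if cleaned != original:
        p.write_bytes(cleaned)
    return cleaned != original
```

Running the function twice on the same bytes gives the same result as running it once, so it can sit in a pipeline without a prior check.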

Trap 4: Zero-indexed sequence numbers

The SRT spec, to the extent that one exists, expects sequence numbers to start at 1. Some batch conversion tools renumber cues starting at 0. The gap between the spec and common practice is that most SRT parsers (including those used by TalentLMS, Panopto, and Vimeo) accept zero-indexed files without complaint. A small number of legacy players and validators, including some older Articulate Storyline export parsers and certain caption-validation tools used in Section 508 audits, reject files where the first sequence number is 0 on the grounds that the de-facto spec says "starts at 1." The fix is to add 1 to every sequence number in the file during post-processing. In Python: re.sub(r'^(\d+)$', lambda m: str(int(m.group(0)) + 1), content, flags=re.MULTILINE). Two cautions: run it only on files confirmed to be zero-indexed, since it shifts every number unconditionally, and be aware that the pattern also matches any caption text line made up solely of digits (a year, a quantity), so a stricter implementation increments a number only when the next line is a timecode.

Trap 5: Sequence number reset across multi-segment video

Video production tools that export per-chapter or per-segment caption files sometimes reset the sequence counter to 1 for each segment. When multiple SRT segments are concatenated into a single file (as is required when uploading to an LMS that expects one caption file per video), the sequence numbers restart mid-file: 1, 2, 3 ... 47, 48, 1, 2, 3 ... A caption file with repeated sequence numbers is technically malformed by the de-facto spec, and some parsers behave unpredictably. The most common behaviour is to treat the reset as the beginning of a new caption block and render only the cues from the last reset point, discarding all cues before the reset. The fix: renumber all cues sequentially after concatenation, from 1 to N, before upload. This is a short post-processing step in any scripting language.
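Traps 4 and 5 share one remedy: renumber every cue from 1 to N. The sketch below is one way to do it, with an illustrative function name. It treats a digits-only line as a sequence number only when the next line is a timecode, so caption text such as "2024" is left alone. It does not shift timecodes, which concatenated segments may also need.

```python
import re

# Start of an SRT timecode line, e.g. 00:00:01,000 --> 00:00:02,000
ARROW = re.compile(r"^\d{2,}:\d{2}:\d{2},\d{3}\s+-->\s+\d{2,}:\d{2}:\d{2},\d{3}")


def renumber_srt(srt_text: str) -> str:
    """Renumber SRT cues sequentially from 1, whatever the original numbers were."""
    lines = srt_text.replace("\r\n", "\n").split("\n")
    out = []
    counter = 0
    for i, line in enumerate(lines):
        next_is_timecode = i + 1 < len(lines) and ARROW.match(lines[i + 1])
        if line.strip().isdigit() and next_is_timecode:
            # A sequence number: replace it with the running count.
            counter += 1
            out.append(str(counter))
        else:
            out.append(line)
    return "\n".join(out)
```

Because the numbers are rewritten rather than incremented, the same function handles zero-indexed files and mid-file resets, and running it on an already correct file changes nothing.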

Format selection: a decision tree for L&D operators

The right caption format is determined by the intersection of your primary LMS, your video host, and any compliance or broadcast constraints. Work through these questions in order:

Question 1: Does your primary LMS accept only one format?

If you use TalentLMS as your primary LMS: the answer is SRT, and there is no choice to make. TalentLMS does not accept any other format. All subsequent questions become moot for the primary LMS delivery.

If you use Absorb LMS: SRT with strict hygiene (no BOM, LF only). Same situation.

If you use SAP Enable Now: the platform does not natively caption video; caption files are delivered as part of the MP4 export workflow. SRT is the conventional format for the post-production step.

Question 2: Does your video host require VTT?

If you publish to Vimeo via API: VTT is required for the API endpoint. SRT is accepted via the browser UI upload. If your workflow involves automated publishing via the Vimeo API, maintain a VTT export alongside your SRT master.

If you publish to Panopto: VTT is preferred but SRT is accepted. If your content has multiple speakers and you want speaker attribution in the Panopto transcript panel, VTT with voice spans is required — SRT cannot express speaker attribution.

Question 3: Do you need TTML/DFXP?

Only if: (a) you are using Kaltura and the DFXP format selector in the captionAsset API is specifically required by your integration, (b) you are delivering to YouTube and the original video has significant styling (coloured text, repositioned captions) that you want preserved in the uploaded caption, or (c) a broadcast contract explicitly requires TTML or one of its profiles.

In all other cases, TTML adds complexity without benefit for L&D use. Generate SRT or VTT and convert to TTML when specifically required.

Question 4: Do you need STL?

Only if you are delivering to a broadcast system that explicitly requires EBU STL. If the specification document says "STL" and your delivery is for web-based LMS-hosted training, the specification is almost certainly copied from a broadcast template. Clarify with the requester whether SRT or VTT would satisfy the requirement — in most cases, they will.

The pragmatic answer for most L&D teams

Generate both SRT and VTT for every video. The SRT covers every LMS that does not accept VTT (TalentLMS, Absorb). The VTT covers every platform that prefers it (Panopto, Vimeo API, Canvas, Brightspace, Workday). Store both alongside your source video. Upload SRT to the LMS, upload VTT to the video host. Convert to TTML on demand when Kaltura or YouTube requires it. Ignore STL unless you work in broadcast.
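The questions above reduce to a small lookup once both files exist. The sketch below is a simplified illustration: the platform keys are labels invented for this example, not identifiers from any API, and the sets contain only the platforms named in Questions 1 and 2.

```python
# Platform keys are illustrative labels chosen for this sketch.
SRT_ONLY = {"talentlms", "absorb"}
VTT_PREFERRED = {"panopto", "vimeo-api"}


def upload_format(platform: str, needs_dfxp: bool = False) -> str:
    """Pick which stored export to upload to a given platform."""
    key = platform.strip().lower()
    if key in SRT_ONLY:
        return "srt"
    if key in VTT_PREFERRED:
        return "vtt"
    if key == "kaltura" and needs_dfxp:
        # Converted on demand from the SRT or VTT master.
        return "ttml"
    # Default from the TL;DR: SRT unless the platform requires otherwise.
    return "srt"
```

STL is left out on purpose, since it applies only to broadcast delivery and is not generated from this kind of lookup.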

GlossCap exports SRT and VTT simultaneously from the same glossary-biased Whisper decode — the two files are generated from the same transcript, so they are guaranteed to have identical content with format-appropriate timing separators and headers. There is no accuracy difference between the SRT and VTT exports.

WCAG 2.1 AA: does format matter for compliance?

No. WCAG 2.1 Success Criterion 1.2.2, Captions (Prerecorded), requires "captions are provided for all prerecorded audio content in synchronized media." The standard does not specify a caption file format. An SRT file, a VTT file, a TTML file, and an STL file that contain identical accurate caption text and identical timecodes are equally compliant with SC 1.2.2, provided that the format is one that the player renders correctly. A caption file in a format the player cannot parse is effectively no captions at all, which is a compliance failure, but that failure is a player-compatibility problem, not a format requirement imposed by WCAG.

What actually determines WCAG compliance

The WCAG 2.1 AA caption requirement is satisfied by content accuracy and synchrony, not format:

Accuracy: The DCMP scoring criteria — the accuracy measurement framework used by most compliance auditors — require ≥99% accuracy on a scored word-by-word comparison. Auto-captions from YouTube, Teams, Zoom, and similar platforms typically reach 80–90% on general-vocabulary content and 60–80% on domain-specific vocabulary (product names, drug INNs, regulatory citations, CLI commands). This accuracy gap is what the GlossCap glossary-biased decoding is designed to close. For a detailed breakdown of what 99% means in practice, see our post on the DCMP threshold.
Synchrony: Captions must be synchronized within ±2 seconds of the spoken audio. This is a timing quality requirement, not a format requirement: SRT, VTT, and TTML express timing to the millisecond, and STL timecodes are frame-accurate, which is far finer than this tolerance.
Coverage: All spoken audio must have a corresponding caption, including speaker identification when multiple speakers are present and environmental sounds that are relevant to understanding the content (e.g., "[alarm sounds]" in safety training).

The compliance documentation that an LMS administrator or HR team needs to produce under ADA Title II or Section 508 is a statement that the caption file meets the accuracy and synchrony standards — not a declaration of which file format was used. Section 508 compliance documentation refers to the WCAG 2.0 Level AA standard by citation; it does not prescribe SRT, VTT, TTML, or any other format.

Format and legal exposure

In the OCR complaint investigations and ADA Title II enforcement actions that have been publicly resolved as of mid-2026, no resolution agreement has ever cited a caption file format as a deficiency. The deficiencies cited are: no captions at all, captions that are grossly inaccurate, captions that are unsynchronized with the audio, and captions that exist as files on a server but cannot be accessed through the LMS interface by a learner who needs them. The format is irrelevant if those substantive failures are absent. For the compliance context around ADA Title II enforcement for training video, see our ADA Title II post from the enforcement date.

Validating caption files before upload

The most common cause of LMS "invalid caption file" errors is a hygiene problem, not a content problem. Before blaming the LMS, run the following checks:

VTT validation

The W3C Nu HTML Checker (validator.w3.org/nu/) accepts VTT files for upload and validates them against the W3C WebVTT specification. Paste the file URL or upload the file directly. The validator reports: missing WEBVTT header, malformed timecode lines, invalid cue block structure, and unrecognised tags. This is the authoritative validator for the VTT format — if it passes, the file is correctly formed.

SRT validation

There is no official SRT validator — the format has no formal spec body. The practical validation approach is to open the file in VLC Media Player with a dummy video and confirm that all cues render at the correct times. Alternatively, SubtitleEdit (free, Windows/Linux) opens SRT files and highlights parsing errors with specific line numbers and error descriptions. SubtitleEdit's "Check → Fix Minor Errors" function also auto-corrects common hygiene issues (BOM removal, CRLF→LF, sequence renumbering).

The byte-level check: BOM and CRLF

For the two hygiene issues most likely to cause silent upload failures, a simple Python script checks both in under a second:

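import sys

# Usage (script name is illustrative): python check_caption_bytes.py captions.srt
if len(sys.argv) != 2:
    sys.exit("usage: python check_caption_bytes.py <caption-file>")

# Read raw bytes so that no decoding or newline translation hides the problem.
with open(sys.argv[1], 'rb') as f:
    raw = f.read()

# A UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
if raw[:3] == b'\xef\xbb\xbf':
    print("BOM DETECTED: strip before upload")
else:
    print("No BOM: OK")

# Any CR LF pair means the file has Windows line endings somewhere.
if b'\r\n' in raw:
    print("CRLF line endings: convert to LF before upload to Absorb")
else:
    print("No CRLF line endings: OK")

# Show the size and leading bytes for a quick visual check.
print(f"File size: {len(raw)} bytes, first 20 bytes: {raw[:20]}")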

Run this against every SRT file before bulk upload, especially if the file originated on a Windows workstation or was edited outside your standard toolchain.

Timecode range validation

Two timecode problems that are not caught by format validators but cause player failures:

Overlapping cues: If cue N's end time is later than cue N+1's start time, some players render both cues simultaneously, stacking one caption on top of the other. AI tools with aggressive segmentation sometimes produce very brief cues that overlap by a few milliseconds. The fix: ensure end(N) <= start(N+1) for all adjacent cues.
End before start: A cue where the end timecode is earlier than the start timecode (e.g., 00:01:30,200 --> 00:01:29,800) is invalid. This can be produced by batch conversion tools that round timecodes independently. Players typically skip these cues silently, producing gaps in the caption track.

SubtitleEdit's "Check → Find Overlapping Subtitles" and "Check → Fix Timing" functions autodetect and correct both problems.
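For pipelines without a desktop tool in the loop, both checks can be scripted. The sketch below is a minimal illustration with hypothetical names. It assumes full hh:mm:ss timecodes with either separator, so the short mm:ss.ttt form that VTT also allows is not covered, and it only reports problems rather than fixing them.

```python
import re

# Start and end timecodes of a cue line; accepts comma (SRT) or period (VTT).
CUE_TIMES = re.compile(
    r"^(\d{2,}):(\d{2}):(\d{2})[,.](\d{3})\s+-->\s+"
    r"(\d{2,}):(\d{2}):(\d{2})[,.](\d{3})",
    re.MULTILINE,
)


def _to_ms(h: str, m: str, s: str, ms: str) -> int:
    """Convert timecode parts to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def find_timing_problems(text: str) -> list[str]:
    """Report cues that end before they start or overlap the next cue."""
    cues = [(_to_ms(*g[:4]), _to_ms(*g[4:])) for g in CUE_TIMES.findall(text)]
    problems = []
    for n, (start, end) in enumerate(cues, start=1):
        if end < start:
            problems.append(f"cue {n}: end before start")
        # cues[n] is the cue after cue n, because n counts from 1.
        if n < len(cues) and end > cues[n][0]:
            problems.append(f"cue {n}: overlaps cue {n + 1}")
    return problems
```

Cue numbers in the report are positions in the file, counted from 1, not the sequence numbers written in the file.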

The GlossCap format export workflow

GlossCap exports SRT and VTT simultaneously on every caption job. Both files are generated from the same glossary-biased Whisper decode — the transcript is produced once, the glossary-corrected token sequence is shared, and the two format files are written from the same data. There is no accuracy difference between the SRT and VTT exports.

The SRT export:

UTF-8 encoding, no BOM
LF line endings (Linux-safe for all LMS platforms)
Comma millisecond separator (SRT spec)
Sequence numbers starting at 1
Maximum two lines per cue
Maximum 80 characters per line
Minimum 0.5 seconds per cue (AI segmentation minimum after post-processing)

The VTT export:

WEBVTT header on line 1
Period millisecond separator (VTT spec)
Cue identifiers preserved from the SRT sequence numbers
Voice spans for speaker-attributed content (multi-speaker mode)
No STYLE block (styled captions are not a GlossCap feature; caption accuracy is)

For TTML/DFXP conversion — required for some Kaltura integrations and YouTube uploads with styling preservation — GlossCap's export workflow uses ffmpeg to convert the SRT master to TTML after the job completes. The TTML file is available in the download package on request for Team and Org plan accounts. For STL conversion for broadcast delivery: contact the GlossCap team — STL generation requires broadcast-specific metadata (frame rate, code page, time code start) that is input manually per engagement.

The LMS caption ingestion engineering post documents the exact API calls, upload sequences, and language code parameters for TalentLMS, Docebo, Absorb, Kaltura, Panopto, Vimeo, and Wistia. If you are building an automated caption pipeline on top of GlossCap exports, that post is the reference document for the upload layer.

Format-specific notes for vertical use cases

Healthcare and medical training

Medical and clinical training content — HIPAA training modules, EHR workflow training, Joint Commission survey prep — is typically delivered via HealthStream, Relias, or an enterprise LMS like Cornerstone OnDemand. All three platforms accept SRT. HealthStream's caption upload surface is an admin-panel manual upload; Relias supports SRT via the content management portal. Cornerstone OnDemand accepts SRT and VTT. For medical content, the format decision is straightforward: SRT, with glossary-corrected captions for the drug names and clinical vocabulary that Whisper mangles at default settings.

Government and federal training

Federal agencies and federal contractors subject to Section 508 typically use SharePoint Online (with Microsoft Stream), USALearning, or agency-specific LMS platforms. Microsoft Stream accepts SRT and VTT. USALearning content delivery follows the SCORM standard, where the caption file is embedded in the SCORM package — SRT is the conventional format for Articulate Storyline and Lectora SCORM packages that include caption tracks. The Section 508 conformance standard does not specify format; the question is whether the caption file renders correctly in the player used by the target system.

For DOT/FMCSA CDL and HazMat training content — which has the highest proper-noun density of any regulated training category — the format is almost always SRT delivered via TalentLMS, Absorb, or a custom LMS. The critical quality issue is accuracy on regulatory vocabulary (49 CFR citations, CDL endorsement codes, HazMat class names), not format selection. See the full vocabulary breakdown in the DOT/FMCSA transportation training captions reference.

University and academic settings

Higher education institutions use Canvas, Brightspace, Moodle, and Blackboard as their primary LMS platforms. All four accept SRT and VTT. Lecture capture is typically via Panopto or Mediasite, both of which prefer VTT for manual imports. ADA Title II now requires public universities to provide WCAG 2.1 AA-compliant captions on all prerecorded instructional video; for institutions covered by the April 24, 2026 effective date, the compliance clock started on that date. For universities without an existing captioning workflow, the format question comes second: the first question is accuracy, which requires addressing the proper-noun problem in lecture content (faculty names, department-specific terminology, course-specific jargon). The proper noun failure taxonomy post maps each category to the relevant academic vertical.

Corporate L&D with multi-LMS complexity

The most common scenario for a large L&D team is multi-platform: a primary LMS (TalentLMS or Cornerstone or Workday) plus embedded video on Vimeo or Wistia, plus Panopto for lecture capture. The format requirement set for this scenario is:

SRT for the primary LMS (covers TalentLMS, Cornerstone, Workday, Absorb)
VTT for Panopto and Vimeo API
TTML/DFXP only if Kaltura is in the stack

Both SRT and VTT from the same GlossCap job. Store both. Upload the right one to the right platform. The two-file approach eliminates the need to ever do format conversion in the upload pipeline — conversion is where hygiene bugs are introduced.

Frequently asked questions

Does caption format affect SEO?

For video content on YouTube, caption files in SRT, VTT, or TTML are indexed by Google for video search results. The content of the caption text — the actual words — is what drives search visibility; the format is irrelevant to the search crawler. VTT files embedded as HTML5 <track> elements in a webpage's <video> tag are crawlable by Googlebot if the page is publicly indexed. SRT files uploaded to an LMS are not publicly accessible to search crawlers, so they have no direct SEO effect for the LMS-hosted content. The SEO impact of captions for LMS-hosted training content is near zero — the relevant performance metric is completion rate and learner comprehension, not organic search traffic.

Can I use the same SRT file for TalentLMS and Docebo without modification?

Almost. The SRT file itself carries no language parameter: TalentLMS takes the language from the dropdown at upload time, and Docebo takes it as a BCP-47 language code in the API call. The file can therefore be identical for both platforms. The API call to Docebo needs to include language_code: "en-US" (or the appropriate locale); the TalentLMS upload UI sets language separately. One SRT file, two different upload procedures, no file modification needed.

Our LMS says "invalid caption file" but the SRT looks correct. What's wrong?

In our experience, the cause is one of three things in roughly this order of frequency: (1) UTF-8 BOM — check the first bytes of the file, (2) CRLF line endings — check what your text editor is saving, (3) a timing separator mismatch (period instead of comma) — check a random timecode line. Together, these three causes account for roughly 80% of "invalid caption file" errors on correctly structured SRT content. Run the Python byte-level check from the validation section above and report what it finds before investigating the file structure further.

What format should I give to a video production agency for captioning work?

Give them SRT. It is the format that every post-production tool — Adobe Premiere, DaVinci Resolve, Final Cut Pro, Avid Media Composer — imports and exports natively. A production agency can convert SRT to VTT, TTML, or STL as needed for any subsequent delivery requirement. Giving them VTT adds no value (they will convert it to SRT internally anyway) and giving them TTML adds formatting metadata that will be stripped in post. SRT is the common currency of the captioning industry.

Is DFXP the same as TTML?

DFXP (Distribution Format Exchange Profile) is a profile of TTML1 — a constrained subset of the TTML1 specification designed for content distribution interchange. In practice, "DFXP" and "TTML" are used interchangeably in most platform documentation, including Kaltura's. A TTML1-compliant file with the correct namespace declaration will be accepted by Kaltura regardless of whether it uses a .ttml or .dfxp extension. TTML2 is a superset of TTML1 and is not the same as DFXP — but for LMS and video-host purposes, this distinction does not arise because no current platform requires or distinguishes TTML2-specific features.

Our federal procurement office asked for "Section 508-compliant captions." Which format do they want?

They want captions that meet the WCAG 2.0 Level AA accuracy and synchrony standards — not a specific format. Section 508 (36 CFR Part 1194, updated January 2017) adopts WCAG 2.0 Level AA by reference via the Access Board's ICT Standards and Guidelines. WCAG 2.0 SC 1.2.2 requires synchronized captions but does not specify SRT, VTT, TTML, or any format. If the procurement document specifies a format (SMPTE-TT or DFXP or similar), that language is almost always copied from a broadcast-content delivery template and does not reflect a genuine technical requirement for web-based training content. Clarify with the contracting officer — in practice, they will accept SRT if the content is accurate. If they insist on a specific format, generate SRT first, then convert.

How do I validate a VTT file before upload to Panopto or Vimeo?

Use the W3C Nu HTML Checker at validator.w3.org/nu/ — it accepts VTT file uploads directly and validates against the W3C WebVTT specification. A file that passes the Nu checker will be accepted by Panopto, Vimeo, and any other platform that implements the W3C VTT spec. For a quick local check, VLC Media Player opens VTT files alongside any video (or a blank video) and renders the captions — if VLC renders all cues correctly, the file is correctly formed. For Vimeo API uploads specifically, also verify that your upload code sends the VTT content as a raw request body with Content-Type: text/vtt, not as multipart form data.

Get caption files that work on the first upload

GlossCap exports clean SRT and VTT simultaneously from every caption job — UTF-8 without BOM, LF line endings, correct timing separators, glossary-corrected accuracy for your product names and domain vocabulary. No post-processing hygiene step required before upload to TalentLMS, Docebo, Absorb, Kaltura, Panopto, Vimeo, or Wistia.
