Format reference
SRT captions for training videos: the format, the pitfalls, and how to export them right
Every LMS from TalentLMS to Kaltura accepts SRT. The format dates from 2000, has no schema, and fits in an email. This is what an SRT actually looks like on disk, the three mangles auto-captioners leave behind on training content, and how GlossCap exports one that passes an audit on the first try.
TL;DR
An SRT (SubRip Subtitle) file is plain text: a numeric index, a timecode range using HH:MM:SS,mmm with a comma as the fractional separator, one or two lines of caption text, then a blank line. That is the whole spec. Upload it next to your training video and every modern player will render it. What breaks is the content — auto-captioners mangle product names, SDK symbols, and drug names, and those are the exact words auditors sample. GlossCap biases the Whisper decoder toward your company glossary, so kubectl, Docebo, and tirzepatide come out right the first time.
What an SRT file actually contains
The SubRip format was written in 2000 for the SubRip CD-ripping tool. It has never been formally standardized by a standards body, which is both why every player supports it and why small deviations break some tools. The de-facto spec is blocks of three or four lines (numeric index, timecode line, one or two lines of text), separated by a blank line:
1
00:00:03,200 --> 00:00:06,400
[Alex]: First, run kubectl get pods to see
what's running in the cluster.
2
00:00:06,400 --> 00:00:09,100
Then apply the Helm chart with helm install.
3
00:00:09,100 --> 00:00:12,800
[laughter] You'll see the deployment start immediately.
Four things to note because they trip up hand-written SRTs:
- Timecodes use commas, not periods, for milliseconds. 00:00:03,200 is valid, 00:00:03.200 is not — and VLC will accept the latter but YouTube Studio will reject it silently.
- The blank line between blocks is mandatory. Two consecutive caption blocks with no separator get merged by some parsers and dropped entirely by others.
- A line break inside a block is a hard return, usually \n. Two-line cues are fine; three-line cues push reading speed over the DCMP 160 wpm threshold on most content.
- No styling, no positioning. SRT has no native way to express colour, italics-for-emphasis, or caption placement. Some players informally honor <i> and <b> HTML tags; the WebVTT format on our sibling VTT page is the one to reach for when you need typography.
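The comma rule and the blank-line rule are both easy to check mechanically before upload. Here is a minimal parser sketch in Python; it is illustrative rather than a full SRT implementation, since real-world files add BOMs, CRLF line endings, and vendor quirks this ignores:

```python
import re

# Minimal SRT block parser: split on blank lines, validate the comma-based
# timecode, keep the caption text. A sketch, not a production parser.
TIMECODE = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$"
)

def parse_srt(text):
    cues = []
    for block in text.strip().split("\n\n"):      # blank line separates blocks
        lines = block.strip().splitlines()
        if len(lines) < 3:
            raise ValueError(f"malformed block: {block!r}")
        index, timing, caption = lines[0], lines[1], lines[2:]
        if not TIMECODE.match(timing):            # catches periods-for-commas
            raise ValueError(f"bad timecode (comma required): {timing}")
        cues.append({"index": int(index),
                     "timing": timing,
                     "text": "\n".join(caption)})
    return cues
```

A file that uses periods instead of commas fails the timecode check immediately, which is exactly the failure YouTube Studio would otherwise swallow silently.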
Why SRT is the default for training video
Training teams pick SRT for one reason: compatibility. Every LMS of consequence — TalentLMS, Docebo, Absorb, Kaltura, Panopto, Bridge, Cornerstone — accepts an SRT drag-and-drop for subtitle tracks on its video player. YouTube accepts SRT for Creator Studio uploads. The HTML5 <track> element formally requires WebVTT, but most LMS-embedded players will accept either. If you are exporting exactly one caption file per video for an internal training library, SRT is the format that will still open unaltered in 2036.
The tradeoff is that SRT carries no accessibility metadata. You cannot encode a "caption language" field in the file itself — the filename convention (module-01.en.srt) and the player's manifest carry that signal. If you need CEA-708 colour coding, positional cues for speaker identification, or a schema that an LMS can parse to auto-populate a language dropdown, TTML or STL is the better fit.
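Since the file itself carries no language field, anything that needs the language has to read it back out of that filename convention. A sketch, assuming the name.<lang>.srt pattern described above (the helper is hypothetical, not an LMS API):

```python
# Recover the caption language from the "name.<lang>.srt" filename
# convention, since SRT has no in-file language field. Hypothetical helper.
def caption_language(filename):
    parts = filename.split(".")
    if len(parts) >= 3 and parts[-1] == "srt":
        return parts[-2]      # e.g. "en" from "module-01.en.srt"
    return None               # no language signal in the name
```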
The three mangles auto-captioners leave in every SRT
General-purpose speech-to-text models — YouTube auto-captions, Otter, Whisper base — produce a technically valid SRT on the first pass. The timecodes are fine. The block structure is fine. The text is where training-ops teams burn an hour per video.
- Product names and SDK symbols. kubectl becomes "cube control"; pytorch becomes "pie torch"; Docebo becomes "doh say boh"; Helm becomes "helmet"; gRPC becomes "g. r. p. c." General-purpose models have never seen these terms in training data often enough to beat the phonetic prior.
- Drug and medical names. tirzepatide becomes "tier zip a tide"; semaglutide becomes "sema glue tide"; metformin is usually fine but empagliflozin is a coin flip. On a 45-minute diabetes-care training module, auditors will sample the segments where drug names are spoken because that is where clinical mis-captioning has real downstream cost.
- Acronyms and proper nouns. SDK becomes "as decay"; GDPR becomes "G D P R" as four words with no concept grouping; custom product names like "Retool", "Linear", "Figma" get reinterpreted as common nouns or dictionary words whenever they appear out of context. On compliance training, every acronym needs to survive verbatim because the auditor's checklist is built around them.
These are the exact words an auditor sampling WCAG 2.1 AA compliance looks at first — the verbatim-for-dialogue requirement of SC 1.2.2 is load-bearing on technical content, and the generic 90%-accurate track fails on the 10% that matters.
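One cheap QA pass before upload is to scan the exported text for known phonetic mangles of your glossary terms. A sketch; the mangle table here is illustrative and would be built from your own glossary and your own captioner's observed failure modes:

```python
# Flag cues whose text contains a known phonetic mangle of a glossary
# term. The table below is illustrative, not a shipped GlossCap asset.
KNOWN_MANGLES = {
    "cube control": "kubectl",
    "pie torch": "pytorch",
    "tier zip a tide": "tirzepatide",
    "as decay": "SDK",
}

def flag_mangles(cue_text):
    """Return (mangle, intended_term) pairs found in one cue's text."""
    lowered = cue_text.lower()
    return [(m, term) for m, term in KNOWN_MANGLES.items() if m in lowered]
```

Running this over every cue gives a shortlist of segments to hand-check, which is usually far smaller than the file.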
How GlossCap exports a clean SRT
The mechanic is the one sentence on our homepage: captions that know your jargon. You paste in (or sync from Notion / Confluence / Google Docs) the terms your training content uses — product names, SDK symbols, drug names, acronyms. On export, Whisper-large's decoder gets a logit boost applied to the BPE tokens for each glossary term, so when the audio is phonetically ambiguous between "cube control" and kubectl, the decoder has already been nudged toward kubectl. The biased decode runs once, emits an SRT that is already correct on the words that matter, and we format it with speaker labels, non-speech-sound cues, and a ≤160 wpm reading-speed ceiling.
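In spirit, the bias is a constant added to the logits of the glossary terms' subword tokens before the decoder picks the next token. A toy sketch of that one step — the vocabulary, the subword pieces, and the boost value are invented for illustration; the real pipeline works over Whisper's actual BPE vocabulary:

```python
# Toy subword vocabulary so the sketch runs standalone; a real Whisper
# deployment would use the model's own tokenizer and a tuned boost.
VOCAB = ["cube", "control", "ku", "bectl", "get", "pods"]

def bias_logits(logits, glossary_pieces, vocab=VOCAB, boost=4.0):
    """Add a constant to the logits of glossary subword pieces, nudging
    the decoder toward glossary spellings before argmax/sampling."""
    boosted = set(glossary_pieces)
    return [l + boost if tok in boosted else l
            for tok, l in zip(vocab, logits)]
```

With the boost applied, a token like "ku" can outscore the acoustically similar "cube" even when the raw acoustic evidence slightly favors the common word.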
What you download is a standards-compliant SRT with: dialogue verbatim, speakers identified where not on-camera, non-speech sounds bracketed ([applause], [alarm]), cues synchronized within sub-second precision, and every term from your glossary preserved the way you wrote it. Drop it next to your training video on any LMS and the audit is boring.
Related questions
Is SRT good enough for WCAG 2.1 AA compliance?
Yes — the format itself is WCAG-neutral. What WCAG actually requires is the content (synchronized, ≈99% accurate, speaker-identified, non-speech sounds marked), and all of that fits inside an SRT. Our WCAG 2.1 AA reference walks through what auditors actually sample.
SRT vs VTT — which should I use?
SRT for LMS upload, VTT for HTML5 embed with styling. VTT (WebVTT) is the HTML5 spec-preferred format and supports colour, positioning, and voice tags; SRT is the format every upload form accepts. Most training teams export both from GlossCap and pick per destination. The VTT page has the comparison.
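Mechanically, the two formats differ mainly in the header and the millisecond separator, so a bare-bones conversion is a few lines. A sketch covering only that difference; it does not generate VTT styling, positioning, or voice tags:

```python
import re

# Minimal SRT -> WebVTT conversion: prepend the WEBVTT header and switch
# the millisecond separator from comma to period. Illustrative only.
def srt_to_vtt(srt_text):
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + body
```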
Can I edit the SRT after export?
Yes — it is a text file. GlossCap's edit UI is the recommended path (it re-runs the alignment pass if you change timing), but you can crack an SRT open in any text editor and fix a line. The danger is drift: hand-editing timecodes is where the "my caption is 3 seconds late after minute 20" bug comes from.
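When the whole track is late by the same amount, a global shift is safe to script rather than hand-edit. A sketch, assuming a uniform offset for every cue; drift that grows over the runtime needs a re-alignment pass instead:

```python
import re

# Shift every timecode in an SRT by a signed number of milliseconds.
# Useful for the "whole track is N seconds late" case; not for
# progressive drift, which needs re-alignment.
TC = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_srt(text, offset_ms):
    def shift(m):
        h, mi, s, ms = map(int, m.groups())
        total = max(0, ((h * 60 + mi) * 60 + s) * 1000 + ms + offset_ms)
        h, rem = divmod(total, 3_600_000)   # clamp at 00:00:00,000
        mi, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{mi:02}:{s:02},{ms:03}"
    return TC.sub(shift, text)
```

For example, shift_srt(text, -3000) pulls every cue 3 seconds earlier, the exact fix for the late-after-minute-20 symptom when the offset is constant.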
Does SRT handle multiple languages?
One language per file. For multi-language training you ship one SRT per language next to the same video, named module-01.en.srt, module-01.fr.srt, etc. GlossCap exports the source-language SRT; a future translation pipeline is on the v2 roadmap.