VTT captions for training videos: WebVTT for HTML5 embeds

If your training video is embedded via an HTML5 <video> element — on your help center, inside your product, or on a custom LMS — WebVTT is the format the spec actually asks for. This is the file layout, the cue-timing rules, the styling hooks, and when to reach for VTT over SRT.

TL;DR

WebVTT is a W3C Candidate Recommendation. A valid VTT file starts with the literal WEBVTT line, a blank line, then one or more cues. Timing uses HH:MM:SS.mmm with a period (not the comma SRT uses); the hours field is optional. Cues can carry voice tags (<v Alex>), inline styling (<i>, <c.highlight>), and positional hints for speaker identification on split-screen content. The HTML5 <track kind="captions" src="..."> element expects VTT; you will see better cross-browser behaviour than with SRT, and on an embedded custom player it is the right default.

What a VTT file actually looks like

A small valid file:

WEBVTT

00:00:03.200 --> 00:00:06.400
<v Alex>First, run kubectl get pods to see
what's running in the cluster.

00:00:06.400 --> 00:00:09.100
Then apply the Helm chart with helm install.

00:00:09.100 --> 00:00:12.800
[laughter] You'll see the deployment start immediately.
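
The layout rules above (a WEBVTT header, cues separated by blank lines, a period before the milliseconds) can be checked mechanically. What follows is a minimal Python sketch under stated assumptions, not an implementation of the spec's full parsing algorithm: `parse_vtt`, `to_millis`, and `SAMPLE` are hypothetical names, cue settings after the timestamps are tolerated but ignored, and NOTE, STYLE, and REGION blocks are simply skipped.

```python
import re

# One timestamp: optional hours, then MM:SS.mmm with a period before the milliseconds.
TIMESTAMP = r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})"
# A timing line: start, "-->", end, then optional cue settings (ignored in this sketch).
TIMING_LINE = re.compile(rf"^{TIMESTAMP}[ \t]+-->[ \t]+{TIMESTAMP}(?:[ \t]+.*)?$")


def to_millis(hours, minutes, seconds, millis):
    """Convert the four matched timestamp fields to integer milliseconds."""
    return ((int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_vtt(text):
    """Return a list of (start_ms, end_ms, payload) tuples, or raise ValueError."""
    text = text.lstrip("\ufeff").replace("\r\n", "\n").strip("\n")
    blocks = re.split(r"\n{2,}", text)
    if not re.match(r"^WEBVTT(?:[ \t].*)?$", blocks[0].split("\n")[0]):
        raise ValueError("file must start with a WEBVTT line")
    cues = []
    for block in blocks[1:]:
        lines = block.split("\n")
        if lines[0].startswith(("NOTE", "STYLE", "REGION")):
            continue  # not a cue
        # An optional cue identifier may sit on the line before the timing line.
        idx = 0 if "-->" in lines[0] else 1
        if idx >= len(lines):
            raise ValueError(f"block has no timing line: {block!r}")
        match = TIMING_LINE.match(lines[idx])
        if match is None:
            raise ValueError(f"bad timing line: {lines[idx]!r}")
        fields = match.groups()
        start, end = to_millis(*fields[:4]), to_millis(*fields[4:])
        if end <= start:
            raise ValueError("cue end must be later than cue start")
        cues.append((start, end, "\n".join(lines[idx + 1:])))
    return cues


SAMPLE = """WEBVTT

00:00:03.200 --> 00:00:06.400
<v Alex>First, run kubectl get pods to see
what's running in the cluster.

00:00:06.400 --> 00:00:09.100
Then apply the Helm chart with helm install.
"""

for start_ms, end_ms, payload in parse_vtt(SAMPLE):
    print(start_ms, end_ms, payload.split("\n")[0])
```

A file that still carries SRT-style commas in its timing lines fails the `TIMING_LINE` match and raises `ValueError`, which is the most common slip when a file is renamed from .srt to .vtt by hand.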

Differences from SRT that matter in practice:

When VTT is the right choice (and when it isn't)

Use VTT when: the training video is served by an HTML5 <video> element you control — a custom help-center player, an in-product tutorial overlay, a React/Vue video component, or an LMS that uses an HTML5 player natively (Docebo, Thinkific, Teachable). HTML5 <track> elements are spec'd to consume VTT, and browser rendering is consistent across Chrome, Firefox, Safari, and Edge without vendor shim code. Styling hooks like ::cue selectors also require VTT — you cannot CSS-style an SRT track.

Use SRT when: you are uploading to an LMS that accepts "subtitle files" and you need the widest compatibility matrix. TalentLMS, Absorb, Kaltura, Panopto, and YouTube all accept both formats, but SRT is the lowest common denominator. Our SRT page covers that path.

Use both: most training teams export both from GlossCap and pick per destination. The content is identical; the delivery format differs.
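
Because the content is identical, a plain SRT file can be turned into VTT mechanically. The sketch below illustrates the three changes involved (add the WEBVTT header, swap the comma for a period in timing lines, drop the numeric cue counters). It is an assumption-laden illustration, not GlossCap's exporter: `srt_to_vtt` is a hypothetical name, and SRT-specific markup such as font tags is passed through untouched.

```python
import re

# SRT timing lines use a comma before the milliseconds; WebVTT requires a period.
SRT_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text):
    """Convert SRT text to WebVTT text."""
    text = srt_text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    out = ["WEBVTT"]
    for block in re.split(r"\n\s*\n", text):
        if not block.strip():
            continue
        lines = block.split("\n")
        # SRT cues start with a numeric counter; WebVTT does not need it.
        if lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # Rewrite only the timing line; the cue text is left as-is.
        lines[0] = SRT_TIMING.sub(r"\1.\2 --> \3.\4", lines[0])
        out.append("\n".join(lines))
    return "\n\n".join(out) + "\n"


srt = (
    "1\n00:00:03,200 --> 00:00:06,400\nFirst line\n\n"
    "2\n00:00:06,400 --> 00:00:09,100\nSecond line\n"
)
print(srt_to_vtt(srt))
```

The conversion only changes the wrapper; any transcription mistakes in the SRT text carry over unchanged.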

The HTML5 embed that actually works

For an embedded training-video player, the <track> element wires up captions:

<video controls crossorigin preload="metadata">
  <source src="/training/module-01.mp4" type="video/mp4">
  <track kind="captions" label="English"
         src="/training/module-01.en.vtt" srclang="en" default>
</video>

Three things to get right:

The auto-caption problem is the same

Switching file format does not change the underlying speech-recognition accuracy. Whether you export an SRT or a VTT, if the model guessed "cube control" for kubectl, both files have that mistake. The verbatim-for-dialogue requirement of WCAG SC 1.2.2 applies to the text, not the wrapper. So the 1-2 hours of hand-fixing per video falls on both formats equally, and GlossCap's glossary-biased decode fixes both in the same export pass.

Our SRT page walks through the three mangles auto-captioners leave behind — product names, drug names, and acronyms — and the mitigation is identical for VTT output: you paste in your company glossary, the decoder gets logit-boosted on the BPE tokens for each term, and Docebo comes out as "Docebo" instead of "doh say boh" on the first pass. GlossCap then wraps the result in a WebVTT header and emits voice tags for speaker changes where they are identifiable.
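
To make the idea of logit boosting concrete, here is a toy Python sketch. It is not GlossCap's decoder: the vocabulary, token ids, scores, and bias value are all invented for illustration, and a real system would apply the bias to a full vocabulary-sized logit tensor at every decoding step. It only shows why adding a fixed bias to glossary tokens can flip a greedy pick from a sound-alike to the intended term.

```python
def boost_logits(logits, glossary_token_ids, bias=2.0):
    """Return a copy of `logits` with `bias` added to every glossary token id."""
    return {
        token_id: score + (bias if token_id in glossary_token_ids else 0.0)
        for token_id, score in logits.items()
    }


def greedy_pick(logits):
    """Pick the token id with the highest score."""
    return max(logits, key=logits.get)


# Hypothetical vocabulary and one decoding step's scores (assumed values).
VOCAB = {0: "cube", 1: "kube", 2: "ctl", 3: " control"}
step_logits = {0: 3.1, 1: 2.4, 2: 0.5, 3: 1.0}

# Assume the glossary term "kubectl" tokenizes to ids 1 and 2.
glossary_ids = {1, 2}

before = VOCAB[greedy_pick(step_logits)]
after = VOCAB[greedy_pick(boost_logits(step_logits, glossary_ids))]
print(before, "->", after)
```

With these invented numbers the unboosted pick is "cube" and the boosted pick is "kube". The bias size is a tuning choice: too small and the sound-alike still wins, too large and glossary terms start appearing where they were never spoken.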

Related questions

Can I style VTT captions with CSS?

Yes, via the ::cue pseudo-element. ::cue { background: rgba(0,0,0,.75); color: #fff; font-family: system-ui; } is a reasonable default. More targeted: ::cue(.highlight) { color: #f4b942; } to style cues that wrap text in <c.highlight>...</c>. Only VTT supports this — SRT has no styling surface.

Is VTT accepted by every LMS?

Most, but not all. Docebo, Thinkific, and Teachable accept VTT natively; Kaltura accepts both VTT and TTML; TalentLMS and Absorb default to SRT but usually accept VTT as an alternative upload. Check the caption-upload UI for each LMS — if the drop zone says "SRT", upload SRT; if it says "VTT or SRT", pick VTT.

Does VTT support chapters and descriptions too?

Yes. Both kind="chapters" and kind="descriptions" are valid on the HTML5 <track> element with a VTT file as source. Chapter tracks are what power the chapter-marker UI in some players. Chapter-file export is a planned v2 feature in GlossCap; right now we emit captions-only VTT.

What's the max cue length?

The spec sets no maximum, but the DCMP Captioning Key recommends a presentation rate of at most 160 words per minute and a maximum of 2 lines per cue. GlossCap enforces the 160 wpm cap at export time by splitting long cues across multiple time-synced blocks.
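
The rate check is simple arithmetic: words in the cue divided by its duration in minutes. The Python sketch below shows one way to compute it. The function names are hypothetical, tags such as <v Alex> are stripped before counting, and "word" is taken to mean a whitespace-separated token, which is an assumption rather than a DCMP definition.

```python
import re

TAG = re.compile(r"<[^>]+>")  # voice, class, and styling tags inside a cue


def count_words(cue_text):
    """Count whitespace-separated words after removing cue tags."""
    return len(TAG.sub("", cue_text).split())


def words_per_minute(cue_text, start_s, end_s):
    """Presentation rate of one cue in words per minute."""
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("cue end must be later than cue start")
    return count_words(cue_text) / duration * 60


def min_duration_s(cue_text, max_wpm=160):
    """Shortest on-screen time, in seconds, that keeps the cue at or under max_wpm."""
    return count_words(cue_text) * 60 / max_wpm


cue = "<v Alex>First, run kubectl get pods to see\nwhat's running in the cluster."
print(count_words(cue))
print(round(words_per_minute(cue, 3.2, 6.4)))
print(min_duration_s(cue))
```

For the first sample cue on this page (12 words over 3.2 seconds) the rate comes out at about 225 wpm, so a 160 wpm cap calls for at least 4.5 seconds on screen, or for the text to be split across more time.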

Further reading