Format reference

VTT captions for training videos: WebVTT for HTML5 embeds

If your training video is embedded via an HTML5 <video> element (on your help center, inside your product, or on a custom LMS), WebVTT is the format the spec actually asks for. This page covers the file layout, the cue-timing rules, the styling hooks, and when to reach for VTT over SRT.

TL;DR

WebVTT is a W3C Candidate Recommendation. A valid VTT file starts with the literal WEBVTT line, a blank line, then one or more cues. Timing uses HH:MM:SS.mmm with a period (not the comma SRT uses); the hours field is optional. Cues can carry voice tags (<v Alex>), inline styling (<i>, <c.highlight>), and positional hints for speaker identification on split-screen content. The HTML5 <track kind="captions" src="..."> element expects VTT, and browsers do not parse SRT in a <track> natively, so on an embedded custom player VTT is the right default.

What a VTT file actually looks like

The minimal valid file:

WEBVTT

00:00:03.200 --> 00:00:06.400
<v Alex>First, run kubectl get pods to see
what's running in the cluster.

00:00:06.400 --> 00:00:09.100
Then apply the Helm chart with helm install.

00:00:09.100 --> 00:00:12.800
[laughter] You'll see the deployment start immediately.
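
If you generate captions from code rather than by hand, the layout above is easy to emit. The following is a minimal Python sketch, not a full WebVTT writer: the names format_timestamp and build_vtt are hypothetical, and it assumes cues arrive as (start_seconds, end_seconds, text) tuples with any voice tags already embedded in the text.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm, the WebVTT timestamp form (period, not comma)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_vtt(cues) -> str:
    """Build a VTT document from (start_seconds, end_seconds, text) tuples."""
    blocks = ["WEBVTT"]  # mandatory signature line
    for start, end, text in cues:
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{timing}\n{text}")
    # A blank line separates the header from the first cue and each cue from the next.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(build_vtt([
        (3.2, 6.4, "<v Alex>First, run kubectl get pods to see\nwhat's running in the cluster."),
        (6.4, 9.1, "Then apply the Helm chart with helm install."),
    ]))
```

Rounding to whole milliseconds before splitting into fields avoids floating-point drift producing values like 03.199.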

Differences from SRT that matter in practice:

The WEBVTT signature is mandatory — the first line must start with it (a UTF-8 BOM is allowed before it). Browsers silently ignore a VTT track missing this header.
Fractional seconds use a period. 00:00:03.200 is valid; 00:00:03,200 is not. This is the single most common migration bug when someone pastes an SRT in and calls it a VTT.
Cues can have IDs. An optional line above the timestamp range gives you a name to reference from CSS pseudo-element styling (::cue(#intro)).
Voice tags identify speakers. <v Alex> is structurally richer than SRT's [Alex]: convention — screen readers can announce it, and CSS can style ::cue(v[voice="Alex"]).
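Inline text formatting is legal. <i> for italics, <b> for bold, <u> for underline, <c.className>text</c> for custom CSS classes. SRT has no standardized equivalent: some players honor <i>, <b>, and <u> in SRT files, but class spans are VTT-only.
Cue positioning via settings. Append line:90% align:center to the timestamp line to place a cue at a specific vertical position. Percentages are measured from the top of the video, so line:90% sits near the bottom and a value like line:10% moves the cue up, which is useful when on-screen text occupies the bottom of the frame.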
Comments start with NOTE. A line beginning with NOTE (followed by a space or newline) is a comment and is ignored by the parser. Handy for editorial tracking.
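
The first two differences in the list above (the header and the decimal separator) are the ones a mechanical SRT-to-VTT conversion has to handle. Here is a rough Python sketch under simple assumptions: the function name srt_to_vtt is hypothetical, the input is well-formed SRT, and the numeric SRT counters are kept because WebVTT accepts them as cue identifiers. It only rewrites commas on timing lines, so commas in the caption text are left alone.

```python
import re

# Matches an SRT timing line such as "00:00:03,200 --> 00:00:06,400".
SRT_TIMING = re.compile(
    r"^(\d{2,}:\d{2}:\d{2}),(\d{3})(\s+-->\s+)(\d{2,}:\d{2}:\d{2}),(\d{3})(.*)$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add the signature, switch ',' to '.' in timings."""
    # Drop a leading BOM and normalize Windows line endings.
    lines = srt_text.lstrip("\ufeff").replace("\r\n", "\n").split("\n")
    out = ["WEBVTT", ""]  # signature line, then the required blank line
    for line in lines:
        m = SRT_TIMING.match(line)
        if m:
            # Only timing lines are rewritten; cue text passes through unchanged.
            line = f"{m[1]}.{m[2]}{m[3]}{m[4]}.{m[5]}{m[6]}"
        out.append(line)
    return "\n".join(out).rstrip("\n") + "\n"


if __name__ == "__main__":
    sample = "1\n00:00:03,200 --> 00:00:06,400\nFirst, run kubectl get pods.\n"
    print(srt_to_vtt(sample))
```

This does not translate SRT speaker conventions into voice tags or add cue settings; those need a pass that understands the content.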

When VTT is the right choice (and when it isn't)

Use VTT when: the training video is served by an HTML5 <video> element you control, such as a custom help-center player, an in-product tutorial overlay, a React/Vue video component, or an LMS that uses an HTML5 player natively (Docebo, Thinkific, Teachable). HTML5 <track> elements are spec'd to consume VTT, and Chrome, Firefox, Safari, and Edge all parse it natively without vendor shim code, although how much ::cue styling each browser honors still varies. Styling hooks like ::cue selectors also require VTT; you cannot CSS-style an SRT track.

Use SRT when: you are uploading to an LMS that accepts "subtitle files" and you need the widest compatibility matrix. TalentLMS, Absorb, Kaltura, Panopto, and YouTube all accept both, but SRT is the lowest common denominator. Our SRT page covers that path.

Use both: most training teams export both from GlossCap and pick per destination. The content is identical; the delivery format differs.

The HTML5 embed that actually works

For an embedded training-video player, the <track> element wires up captions:

<video controls crossorigin preload="metadata">
  <source src="/training/module-01.mp4" type="video/mp4">
  <track kind="captions" label="English"
         src="/training/module-01.en.vtt" srclang="en" default>
</video>

Three things to get right:

kind="captions", not kind="subtitles". Captions are for the same-language hearing-impaired audience and include non-speech sound cues; subtitles are for translated dialogue. WCAG 2.1 AA requires captions.
srclang="en" should always be set. The HTML spec strictly requires srclang only when kind="subtitles", but setting it on caption tracks too lets the browser and assistive technology identify the caption language.
default auto-selects the track on page load. Only one default track per video.
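
The three points above are easy to check automatically before a page ships. The sketch below uses Python's standard html.parser; the names TrackLinter and lint_tracks are hypothetical, and the rules encode this page's advice rather than the full HTML spec (for example, a page with translated tracks would legitimately use kind="subtitles").

```python
from html.parser import HTMLParser


class TrackLinter(HTMLParser):
    """Collects the attributes of every <track> element in a snippet."""

    def __init__(self):
        super().__init__()
        self.tracks = []

    def handle_starttag(self, tag, attrs):
        if tag == "track":
            # Boolean attributes such as `default` arrive with a value of None.
            self.tracks.append(dict(attrs))


def lint_tracks(html: str) -> list:
    """Return human-readable problems found in the <track> elements."""
    parser = TrackLinter()
    parser.feed(html)
    problems = []
    for t in parser.tracks:
        src = t.get("src", "?")
        if t.get("kind") != "captions":
            problems.append(f"{src}: kind is {t.get('kind')!r}, expected 'captions'")
        if not t.get("srclang"):
            problems.append(f"{src}: missing srclang")
    if sum(1 for t in parser.tracks if "default" in t) > 1:
        problems.append("more than one track is marked default")
    return problems


if __name__ == "__main__":
    snippet = '<video><track kind="subtitles" src="a.vtt" default></video>'
    for problem in lint_tracks(snippet):
        print(problem)
```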

The auto-caption problem is the same

Switching file format does not change the underlying speech-recognition accuracy. Whether you export an SRT or a VTT, if the model guessed "cube control" for kubectl, both files have that mistake. The verbatim-for-dialogue requirement of SC 1.2.2 applies to the text, not the wrapper. So the 1-2-hours-per-video hand-fix burden falls on both formats equally — and GlossCap's glossary-biased decode fixes both in the same export pass.
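
To illustrate that the mistake lives in the text and not the wrapper, here is a small Python sketch that scans caption text for known mis-hearings and behaves the same on SRT and VTT input. It is an audit aid under stated assumptions, not a description of how GlossCap works: the function find_mangles and the KNOWN_MANGLES table are hypothetical, seeded with examples from this page.

```python
import re

# Hypothetical table of known mis-hearings mapped to the intended term.
KNOWN_MANGLES = {
    "cube control": "kubectl",
    "doh say boh": "Docebo",
}

# Timing lines look the same in both formats apart from ',' versus '.'.
TIMING_LINE = re.compile(r"^\S+ --> \S+")


def find_mangles(caption_text: str, mangles=KNOWN_MANGLES) -> list:
    """Return (line_number, wrong_phrase, intended_term) for each hit in cue text."""
    hits = []
    for lineno, line in enumerate(caption_text.splitlines(), start=1):
        if TIMING_LINE.match(line):
            continue  # skip timestamps; only cue text can contain a mis-hearing
        lowered = line.lower()
        for wrong, right in mangles.items():
            if wrong in lowered:
                hits.append((lineno, wrong, right))
    return hits


if __name__ == "__main__":
    vtt = "WEBVTT\n\n00:00:03.200 --> 00:00:06.400\nFirst, run cube control get pods.\n"
    print(find_mangles(vtt))
```

A list like this only catches errors you have already seen once, which is why fixing the transcript at decode time scales better than post-hoc search.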

Our SRT page walks through the three mangles auto-captioners leave behind — product names, drug names, and acronyms — and the mitigation is identical for VTT output: you paste in your company glossary, the decoder gets logit-boosted on the BPE tokens for each term, and Docebo comes out as "Docebo" instead of "doh say boh" on the first pass. GlossCap then wraps the result in a WebVTT header and emits voice tags for speaker changes where they are identifiable.
