Use case · Medical training

Medical training video captions: drug names, procedures, and HIPAA-aware workflow

Medical training video is the highest-stakes content type for caption accuracy in any compliance regime. The proper-noun surface — drug names, procedure names, anatomy, ICD codes, regulatory acronyms — is exactly what general speech-to-text mangles, and exactly what auditors and clinicians both sample. Tirzepatide miscaptioned as "tier zip a tide" is not a mere comprehension nuisance; it's a clinical risk and an audit finding waiting to happen. Here is what auditors actually check, why general STT can't pass that bar, and the glossary-biased workflow that fixes it.

TL;DR

Medical training video terminology — drug names, procedure names, anatomy, ICD codes — is dense with proper nouns that general speech models have never seen. YouTube auto-captions and vanilla Whisper write phonetic guesses. GlossCap's glossary-biased decoding pulls your formulary or training-deck term list, logit-boosts those tokens into Whisper-large's decoder, and ships SRT/VTT where tirzepatide, empagliflozin, and cholecystectomy land right the first time. HIPAA workflow note: source video stays on your tenant; only audio plus the glossary text is processed; no PHI is required for the captioning pipeline (and shouldn't be present in training content anyway).

The exact words that fail in medical training

Across the medical training video we've audited from L&D leads at health systems, life-sciences companies, and academic medical centres, the failures cluster on the proper-noun surface: drug names, procedure names, anatomy terms, ICD codes. These are exactly the words a clinician learning the protocol must see correctly.

What auditors and clinicians both sample

Accessibility auditors and clinician reviewers sample from different angles, but both regimes converge on the same surface: drug names, procedure names, ICD codes. These are exactly the categories where glossary-aware captioning has the largest delta over general STT.

The glossary-biased workflow for medical content

  1. One-time formulary or training-deck glossary sync. Most health systems and life-sciences L&D teams already maintain a controlled vocabulary or formulary in Confluence, SharePoint, or a Google Docs folder. Connect that source, or paste a flat list of the drug names, procedure names, and acronyms used across your training catalogue.
  2. Upload modules in batches. Each batch is transcribed with Whisper-large, with the formulary tokens logit-boosted into the decoder. The output is SRT/VTT/TTML with the proper-noun surface preserved.
  3. Reviewable edit UI. The amber-highlight UI shows every glossary-applied term in context. A subject-matter reviewer (a pharmacist for the drug modules, a clinician for procedure modules) can scrub through and confirm; corrections feed back into the workspace glossary.
  4. Export and attach in your LMS. Most health-system L&D runs Absorb or Cornerstone; pharma L&D runs Docebo; academic medical centres run Kaltura via Canvas. SRT covers Absorb cleanly (see Absorb captions); VTT for the others.
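GlossCap's decoder internals aren't published, so as an illustration only, here is a minimal numpy sketch of the logit-boost idea behind step 2: add a constant bias to glossary-token logits before the softmax so the decoder prefers glossary surface forms without forbidding anything else. The five-token vocabulary, the token id, and the bias value of 4.0 are all hypothetical.

```python
import numpy as np

def boost_glossary_logits(logits: np.ndarray,
                          glossary_token_ids: set[int],
                          bias: float = 4.0) -> np.ndarray:
    """Add a constant bias to glossary-token logits before the
    softmax, nudging the decoder toward glossary surface forms
    while leaving every other token available."""
    boosted = logits.copy()
    for tid in glossary_token_ids:
        boosted[tid] += bias
    return boosted

# Toy 5-token vocabulary; pretend token 3 is the first sub-word
# piece of "tirzepatide" in the tokenizer.
logits = np.array([2.0, 1.5, 1.4, 1.2, 0.1])
boosted = boost_glossary_logits(logits, {3})
print(int(boosted.argmax()))  # 3 — the glossary token now wins the greedy step
```

In a real decoder this runs at every decoding step (e.g. as a logits processor), so multi-token drug names get boosted piece by piece rather than as whole words.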

HIPAA workflow notes

Training video that you produce internally for clinical staff is not, by itself, PHI — it's training content, and a well-run training programme deliberately scrubs PHI from the source material at the scripting stage. The captioning pipeline therefore typically processes audio that contains drug names, procedure descriptions, and clinical workflow instruction, with no patient-identifying information. Operationally: source video stays on your tenant; only the audio track plus the glossary text is processed; and no PHI is required by the captioning pipeline (or should be present in the training content in the first place).

Compliance landscape

Health-system and life-sciences training is exposed to multiple overlapping compliance regimes.

The shared denominator is WCAG SC 1.2.2 (Captions, Prerecorded): accurate, synchronized captions for prerecorded media. The glossary-biased path is the only realistic way to hit that bar on terminology-dense clinical content without per-module manual rework.

See pricing

Related questions

What goes in a medical-training glossary?

The drug formulary used in your training programme; the procedures named in your protocol library; the ICD-10 / DSM-5-TR / CPT codes referenced in coding training; anatomy terms used at the level of detail your modules reach; brand-vs-generic mappings (e.g., Ozempic / semaglutide) so the decoder knows both surface forms are valid.
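The brand-vs-generic point is worth making concrete. A hypothetical sketch of a flat-glossary builder (none of these function names are GlossCap's API) that registers both surface forms of each mapping as valid decoder targets:

```python
# Brand -> generic mappings; both spellings are valid caption output.
BRAND_TO_GENERIC = {
    "Ozempic": "semaglutide",
    "Mounjaro": "tirzepatide",
    "Jardiance": "empagliflozin",
}

def build_glossary(terms: list[str],
                   brand_map: dict[str, str]) -> set[str]:
    """Flatten free-standing terms plus brand/generic pairs
    into a single set of valid surface forms."""
    glossary = set(terms)
    for brand, generic in brand_map.items():
        glossary.add(brand)
        glossary.add(generic)
    return glossary

glossary = build_glossary(["cholecystectomy", "ICD-10"], BRAND_TO_GENERIC)
print("Ozempic" in glossary and "semaglutide" in glossary)  # True
```

The point of the mapping is that the decoder should accept whichever form the narrator actually said, rather than "correcting" a brand name to the generic or vice versa.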

Can GlossCap handle multi-speaker case discussions?

Yes — Whisper-large handles speaker turns reasonably well; the captions reflect overlapping speech as faithfully as the audio allows. For formal multi-speaker case-review video, the typical move is per-speaker labels in the SRT (Speaker 1 / Speaker 2), which keeps the audit posture clean even when the audio overlaps.

What about drug names with multiple stress patterns?

The glossary stores the surface form (the spelling). Pronunciation variation across narrators doesn't matter — Whisper-large handles the acoustics; the glossary biases toward the right surface form regardless of which valid pronunciation the narrator used.

Does this work for podcast-style audio modules without video?

Yes — GlossCap accepts audio-only inputs (mp3, wav, m4a). The output is the same SRT/VTT, which can attach to a static-image video shell in your LMS or be served as a transcript alongside the audio.
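When the LMS wants VTT rather than SRT, the mechanical difference is small: a `WEBVTT` header and a period instead of a comma before the milliseconds. A minimal sketch (not GlossCap's exporter) that restricts the swap to timestamp lines so commas in caption text survive:

```python
import re

# Matches only the HH:MM:SS,mmm shape, so text commas are untouched.
TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")

def srt_to_vtt(srt_text: str) -> str:
    """Minimal SRT -> WebVTT conversion: prepend the WEBVTT
    header and switch the millisecond separator from comma to
    period in timestamp lines only."""
    body = TIMESTAMP.sub(r"\1.\2", srt_text)
    return "WEBVTT\n\n" + body

srt = "1\n00:00:00,000 --> 00:00:02,500\nStart empagliflozin, 10 mg.\n"
print(srt_to_vtt(srt))
# WEBVTT
#
# 1
# 00:00:00.000 --> 00:00:02.500
# Start empagliflozin, 10 mg.
```

Note the comma in "empagliflozin, 10 mg" is preserved; only the timing line changes.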

Further reading