Use case · Medical training
Medical training video captions: drug names, procedures, and HIPAA-aware workflow
Medical training video is the highest-stakes content type for caption accuracy in any compliance regime. The proper-noun surface — drug names, procedure names, anatomy, ICD codes, regulatory acronyms — is exactly what general speech-to-text mangles, and exactly what auditors and clinicians both sample. Captioning tirzepatide as "tier zip a tide" is not a comprehension nuisance; it's a clinical and audit-finding risk. Here is what auditors actually check, why general STT can't pass that bar, and the glossary-biased workflow that fixes it.
TL;DR
Medical training video terminology — drug names, procedure names, anatomy, ICD codes — is dense with proper nouns that general speech models have never seen. YouTube auto-captions and vanilla Whisper write phonetic guesses. GlossCap's glossary-biased decoding pulls your formulary or training-deck term list, logit-boosts those tokens into Whisper-large's decoder, and ships SRT/VTT where tirzepatide, empagliflozin, and cholecystectomy land right the first time. HIPAA workflow note: source video stays on your tenant; only audio plus the glossary text is processed; no PHI is required for the captioning pipeline (and shouldn't be present in training content anyway).
The exact words that fail in medical training
Across the medical training videos we've audited from L&D leads at health systems, life-sciences companies, and academic medical centres, the failures cluster:
- Drug names. tirzepatide → "tier zip a tide" or "tear zep a tide". semaglutide → "see ma glue tide". empagliflozin → "em pag lif lozin" or split into nonsense fragments. apixaban → "a picks a ban". tofacitinib → "toe fa city nib".
- Procedure names. cholecystectomy → "co la cyst ectomy". endoscopic retrograde cholangiopancreatography → splintered across multiple lines with internal mis-cuts.
- Anatomy. "hippocampus" is usually right; "duodenum" frequently right; less common terms (e.g., vermiform appendix, lateral pterygoid plate) get garbled.
- Codes and acronyms. "ICD-10" → "I C D ten" with hyphenation lost; "DSM-5-TR" → "DSM five TR"; "CPT 99213" → "CPT nine nine two one three".
- Brand-vs-generic. Ozempic ≈ semaglutide; the speaker says one and the auto-caption sometimes blends or substitutes — both wrong relative to the script.
The failures are concentrated on exactly the words a clinician learning the protocol must see correctly.
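The mangles above sit close to the true term in edit distance, which is part of why a glossary can recover them at all. A minimal detection sketch in plain Python (`difflib` from the standard library; the glossary list and threshold are illustrative, not GlossCap's implementation):

```python
from difflib import SequenceMatcher

# Illustrative term list — in practice this comes from your formulary.
GLOSSARY = ["tirzepatide", "semaglutide", "empagliflozin", "apixaban"]

def normalize(phrase: str) -> str:
    # Collapse a split phonetic guess ("tier zip a tide") into one token.
    return "".join(phrase.lower().split())

def closest_term(caption_phrase: str, threshold: float = 0.75):
    """Return (term, ratio) for the best glossary match, or None."""
    candidate = normalize(caption_phrase)
    best = max(GLOSSARY, key=lambda t: SequenceMatcher(None, candidate, t).ratio())
    ratio = SequenceMatcher(None, candidate, best).ratio()
    return (best, ratio) if ratio >= threshold else None

print(closest_term("tier zip a tide"))  # matches tirzepatide
```

The same similarity check, run in reverse, is a cheap audit pass over existing auto-captions: any phrase near a formulary term but not spelled as one is a likely miscaption.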
What auditors and clinicians both sample
Two different sampling regimes converge on the same surface:
- WCAG 2.1 AA / Section 508 audit. An auditor pulls a representative slice of training modules, opens captions on a few sampled segments, and reads. The 99% accuracy threshold is character-level on the standard reading; in practice auditors look for "obvious" failures, and a mis-spelled drug name is the most obvious failure on a clinical training video. See the WCAG 2.1 AA reference.
- Clinical learner. A nurse or pharmacist watching the module reads captions to confirm spelling because they will write it on a chart, look it up in the formulary, or quote it on a patient call. A wrong surface form is a downstream clinical error vector.
Both regimes converge on: drug names, procedure names, ICD codes. These are exactly the categories where glossary-aware captioning has the largest delta over general STT.
The glossary-biased workflow for medical content
- One-time formulary or training-deck glossary sync. Most health systems and life-sciences L&D teams already maintain a controlled vocabulary or formulary in Confluence, SharePoint, or a Google Docs folder. Connect that source, or paste a flat list of the drug names, procedure names, and acronyms used across your training catalogue.
- Upload modules in batches. A module batch processes the audio against Whisper-large with the formulary tokens logit-boosted into the decoder. The output is SRT/VTT/TTML with the proper-noun surface preserved.
- Reviewable edit UI. The amber-highlight UI shows every glossary-applied term in context. A subject-matter reviewer (a pharmacist for the drug modules, a clinician for procedure modules) can scrub through and confirm; corrections feed back into the workspace glossary.
- Export and attach in your LMS. Most health-system L&D runs Absorb or Cornerstone; pharma L&D runs Docebo; academic medical centres run Kaltura via Canvas. SRT covers Absorb cleanly (see Absorb captions); VTT for the others.
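The logit-boost step in that batch pass can be illustrated in miniature. This is not GlossCap's or Whisper's actual decoding code — just a NumPy sketch of what adding a bias to glossary token ids before the decoder's softmax looks like:

```python
import numpy as np

def boost_glossary_logits(logits, glossary_token_ids, boost=4.0):
    """Add a fixed boost to glossary-token logits before softmax.

    logits: (vocab_size,) next-token scores from the decoder.
    glossary_token_ids: ids of tokens that spell the glossary terms.
    """
    biased = logits.copy()
    biased[list(glossary_token_ids)] += boost
    return biased

# Toy vocabulary of 6 tokens; pretend token 3 spells part of a drug name.
rng = np.random.default_rng(0)
logits = rng.normal(size=6)
biased = boost_glossary_logits(logits, {3})
probs = np.exp(biased) / np.exp(biased).sum()
# Token 3's probability rises relative to the unbiased distribution;
# all other tokens' logits are untouched.
```

A real decoder applies this per step, typically conditioning the boost on whether the partial hypothesis is mid-way through spelling a glossary term, so unrelated words are not distorted.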
HIPAA workflow notes
Training video that you produce internally for clinical staff is not, by itself, PHI — it's training content, and a well-run training programme deliberately scrubs PHI from the source material in scripting. The captioning pipeline therefore typically processes audio that contains drug names, procedure descriptions, and clinical workflow instruction, with no patient-identifying information. The relevant operational notes:
- Source video stays on your tenant. GlossCap pulls a copy for processing; the source remains in your LMS or content store of record.
- Glossary content is term lists only. Drug names, procedures, acronyms — never patient identifiers.
- If a training video does contain PHI (e.g., a recorded case-review with identifying details), that's a content-governance failure upstream of captioning, and should be remediated at the source rather than at the caption layer.
- Business Associate Agreement. If your compliance posture requires a BAA on any system that processes audio derived from clinical operations, talk to us before processing — there are scoping decisions to make about what your specific tenant requires.
Compliance landscape
Health-system and life-sciences training is exposed to multiple overlapping compliance regimes:
- ADA Title II — applies to public hospitals and academic medical centres tied to public universities. The first compliance deadline is 2026-04-24 for larger public entities, with smaller entities following in 2027.
- Section 508 — applies to any federal contractor or grant recipient; many academic medical centres carry NIH funding and so are in scope.
- Joint Commission and state-level health authority audits — increasingly include accessibility of mandated training content as a sub-bullet under organisational compliance.
- EAA — relevant for EU operations and for US health systems with EU-located staff or remote-learning programmes.
The common denominator across all of these is the WCAG SC 1.2.2 requirement: synchronized captions for prerecorded media, read in practice at high accuracy. The glossary-biased path is the only realistic way to hit that bar on terminology-dense clinical content without per-module manual rework.
Related questions
What goes in a medical-training glossary?
The drug formulary used in your training programme; the procedures named in your protocol library; the ICD-10 / DSM-5-TR / CPT codes referenced in coding training; anatomy terms used at the level of detail your modules reach; brand-vs-generic mappings (e.g., Ozempic / semaglutide) so the decoder knows both surface forms are valid.
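As a concrete shape for such a glossary (terms drawn from this page; the structure itself is illustrative, not GlossCap's import format):

```python
# Illustrative glossary structure: surface forms plus brand/generic aliases.
GLOSSARY = {
    "drugs": ["tirzepatide", "semaglutide", "empagliflozin",
              "apixaban", "tofacitinib"],
    "procedures": ["cholecystectomy",
                   "endoscopic retrograde cholangiopancreatography"],
    "codes": ["ICD-10", "DSM-5-TR", "CPT 99213"],
    "aliases": {"Ozempic": "semaglutide"},  # both surface forms are valid
}

def all_surface_forms(glossary):
    """Flatten every spelling the decoder should treat as valid."""
    forms = [t for key in ("drugs", "procedures", "codes")
             for t in glossary[key]]
    forms += list(glossary["aliases"]) + list(glossary["aliases"].values())
    return sorted(set(forms))
```

The alias map is the important part: it tells the pipeline that Ozempic and semaglutide are both correct surfaces, so neither gets "corrected" into the other when the narrator says one of them.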
Can GlossCap handle multi-speaker case discussions?
Yes — Whisper-large handles speaker turns reasonably well; the captions reflect overlapping speech as faithfully as the audio allows. For formal multi-speaker case-review video, the typical move is per-speaker labels in the SRT (Speaker 1 / Speaker 2), which keeps the audit posture clean even if the audio has overlap.
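The per-speaker labels end up as plain prefixes in the SRT cue text. A minimal generator sketch (timings and lines illustrative):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timecode, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(cues):
    """cues: list of (start_s, end_s, speaker, text) tuples."""
    blocks = []
    for i, (start, end, speaker, text) in enumerate(cues, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
                      f"{speaker}: {text}")
    return "\n\n".join(blocks) + "\n"

print(to_srt([(0.0, 2.5, "Speaker 1", "We started empagliflozin."),
              (2.5, 4.0, "Speaker 2", "Any renal contraindication?")]))
```

Speaker labels inside the cue text survive every SRT-capable player, which is why they're preferred over player-specific styling for audit purposes.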
What about drug names with multiple stress patterns?
The glossary stores the surface form (the spelling). Pronunciation variation across narrators doesn't matter — Whisper-large handles the acoustics; the glossary biases toward the right surface form regardless of which valid pronunciation the narrator used.
Does this work for podcast-style audio modules without video?
Yes — GlossCap accepts audio-only inputs (mp3, wav, m4a). The output is the same SRT/VTT, which can attach to a static-image video shell in your LMS or be served as a transcript alongside the audio.
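Serving a transcript alongside the audio is a mechanical strip of the SRT cue metadata — a sketch, assuming well-formed SRT input:

```python
import re

def srt_to_transcript(srt_text: str) -> str:
    """Drop cue indices and timecodes, keep the caption text in order."""
    lines = []
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue  # blank separators and cue indices
        if re.match(r"^\d{2}:\d{2}:\d{2},\d{3} --> ", line):
            continue  # timecode lines
        lines.append(line)
    return " ".join(lines)

sample = """1
00:00:00,000 --> 00:00:02,500
Start empagliflozin 10 mg daily.

2
00:00:02,500 --> 00:00:04,000
Monitor renal function."""
print(srt_to_transcript(sample))
# → Start empagliflozin 10 mg daily. Monitor renal function.
```

The same file therefore covers both postures: attach the SRT to a static-image video shell, or strip it to a transcript for the audio page.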