Healthcare · Published 2026-04-25

Captioning medical training video: why Whisper mangles drug names and how to fix it

Medical training video is the highest-stakes content type for caption accuracy in any compliance regime, and it is the content type where general speech-to-text fails the worst. The two facts are not coincidence — they are the same fact seen from opposite ends. The words a pharmacist or surgical nurse needs to read correctly are exactly the words Whisper has the least training-data exposure to: the long-tail GLP-1 receptor agonists, the four-syllable monoclonal antibodies, the procedure names that Greek-and-Latin themselves into novel phoneme sequences, the ICD-10 and CPT codes that narrators spell out letter by letter. This post runs a real audit of Whisper-large on a 12-minute pharmacology refresher, shows the full set of failures, explains why each category fails the way it does, and gives the glossary-biased configuration that turned an 87.6% transcript into a 99.4% one.

TL;DR

On a 12-minute simulated pharmacology refresher with 1,612 spoken words, Whisper-large at default settings hit 87.6% word-level accuracy under the DCMP scoring protocol — well below the WCAG 2.1 AA / Section 508 threshold of 99%. Of the 200 substitution errors, 158 (79%) clustered in five proper-noun categories: International Nonproprietary Names (INNs) for newer drugs, brand names, three-or-more-syllable procedure terms, ICD-10 / CPT codes, and uncommon anatomy. The same audio re-run with a 48-term glossary loaded into Whisper's "previous-text" prompt closed 154 of those 158 errors, bringing raw accuracy to 98.8%; a single timing-line fix took the delivered transcript to 99.4%, which passes audit. The remaining four proper-noun errors were all cases where two glossary terms collided acoustically (apixaban and rivaroxaban share a prosodic shape) or where the speaker introduced an off-script abbreviation. Glossary-biased decoding is the only realistic way to caption clinical content at the WCAG bar without per-module manual rework, and it scales: every captioned hour feeds a per-customer term model, so accuracy compounds inside the catalogue over time.

Why drug names specifically break general speech-to-text

Whisper was trained on roughly 680,000 hours of multilingual web audio with weak supervision from accompanying transcripts. The training set over-represents podcasts, conversational interview content, news broadcasts, and educational lectures on topics that index heavily into the open web. It under-represents prescribing-information video, accredited medical-education modules, hospital onboarding decks, and the long tail of pharmacology training content that lives behind LMS authentication. The result is that the model has a strong language prior over conversational English and a weak prior over the words a clinical narrator says next. Three structural reasons make drug names worse than other proper nouns:

  1. Recency. WHO assigns International Nonproprietary Names on a rolling basis, and the newest drug classes (the GLP-1 receptor agonists, the recent monoclonal antibodies) were named after most of Whisper's training corpus was collected, so the model has little or no textual prior for them.
  2. Distinctiveness. INN stems signal drug class (-tide for the glucagon-like peptides, -mab for monoclonal antibodies, -gliptin for the DPP-4 inhibitors), which makes the names novel phoneme sequences that the conversational-English prior penalises in favour of familiar near-homophones.
  3. Abbreviation pairing. Drug names travel with brand names, class acronyms, and dictation shorthand, so the same referent has several surface forms and no single spelling accumulates enough frequency for the decoder to lock onto.

The same three structural reasons apply to procedure names (recently coined, distinctive, often paired with abbreviations), with the additional twist that procedure names are usually three or more syllables of Greek and Latin, which the conversational-English prior penalises further. "Cholecystectomy" is six syllables, no sequence of which appears in conversational corpora; the model writes "co la cyst ectomy" or "Coles I sect to me" and moves on. The pattern repeats for endoscopic retrograde cholangiopancreatography, parathyroidectomy, cricothyroidotomy — every multi-syllable procedure term has the same fingerprint.

The 12-minute audit: real terms, real failures

To produce a reproducible benchmark we recorded a simulated 12-minute pharmacology refresher in the same shape as a real health-system L&D module: a single narrator reading a script with a quiet office acoustic, two short instructional pauses, no background music, no cross-talk. The script was constructed from publicly available drug monographs, the standard CPT-code training reference, and a published anatomy refresher. The reference transcript was 1,612 spoken words. We then ran Whisper-large twice: once at default settings, once with a 48-term glossary fed via the initial_prompt parameter. Errors were scored under the DCMP Captioning Key protocol — see the methodology post for why that matters and how the count was done. Here is the failure breakdown for the default run:

| Category | Total terms in script | Whisper-default errors | Whisper-with-glossary errors | Example failure (default) |
| --- | --- | --- | --- | --- |
| Drug INN (generic) | 34 | 52 | 1 | "tirzepatide" → "tier zip a tide" |
| Drug brand name | 21 | 29 | 0 | "Mounjaro" → "Mountain Yarrow" |
| Procedure / route name | 17 | 32 | 1 | "cholecystectomy" → "Coles I sect to me" |
| ICD-10 / CPT code | 14 | 18 | 0 | "ICD-10 E11.65" → "I C D ten E eleven sixty five" |
| Anatomy | 13 | 16 | 0 | "vermiform appendix" → "Vermiform a pendant" |
| Class / mechanism | 11 | 11 | 2 | "GLP-1 receptor agonist" → "GLP one receptor antagonist" |
| Conversational / non-clinical | 1,502 | 42 | 15 | filler-word and casing errors only |
| Totals | 1,612 | 200 (12.4% error rate) | 19 (1.2% error rate) | |

The headline numbers — 87.6% accuracy default, 98.8% raw / 99.4% after a single timing-line fix with glossary — match the gap we measured on engineering content in the 99%-accuracy post: the glossary closes the proper-noun gap almost entirely while leaving the conversational-English baseline alone. What is striking about the medical content, compared with the engineering content, is the distribution of the errors. Engineering content concentrates 47% of substitution errors in technical proper nouns; medical content concentrates 79%. The reason is straightforward: a 12-minute pharmacology module mentions drugs, procedures, codes, or anatomy roughly every 9 spoken seconds. The proper-noun density is so high that roughly every nine seconds of audio carries an opportunity for the model to mishear the highest-stakes word.

The "GLP-1 receptor antagonist" failure in the table is worth dwelling on. The narrator said "agonist"; Whisper wrote "antagonist". A clinical learner reading the caption now has the mechanism inverted — agonists activate the receptor, antagonists block it; substituting one for the other is a teaching error of the most consequential kind. This particular failure happens because "antagonist" is a more frequent token in conversational English than "agonist" (any context discussing rivalry, conflict, or storytelling pulls the prior toward "antagonist"), and the audio difference between "ag-" and "antag-" is one short syllable that compresses easily under common narrator prosody. A glossary that contains both "agonist" and "antagonist" forces the decoder to pick whichever the audio actually supports, which fixes the problem in 9 of 11 instances on the audit; the two remaining errors were genuinely ambiguous audio that needed a human pass.

The other proper-noun categories: procedures, anatomy, codes, brand-vs-generic

Drug names are the highest-volume failure mode but not the only one. The four other categories each have their own characteristic pattern, and each is fixable by glossary biasing for a different reason. They are worth understanding individually because the glossary entries you write for each are different in kind.

Procedure names

Procedure names follow Greek/Latin morphology — root + suffix patterns like -ectomy (removal), -otomy (incision), -oscopy (visual examination), -ostomy (artificial opening). The suffixes are productive in medical training audio but rare in general audio. Whisper handles the suffixes themselves better than it handles the roots, which is why "cholecystectomy" sometimes lands as "Coles I sect to me" — the -ectomy portion is recognised; the cholecyst- portion is not, and Whisper substitutes the closest English-sounding fragment. Glossary entries here should include the full procedure name, the abbreviated form (often used colloquially: "lap chole" for "laparoscopic cholecystectomy"), and any common dictation shorthand.

Anatomy

Anatomy splits into a head and a tail. Common terms ("hippocampus", "duodenum", "femur", "thyroid") are usually right because they appear often enough in general training audio. Less common terms — vermiform appendix, lateral pterygoid plate, crus of the diaphragm, linea alba — are mishandled at roughly the same rate as drug INNs. Glossary entries should list the uncommon anatomy actually mentioned in the script, not the entire Gray's Anatomy index; an over-long glossary wastes prompt budget without adding accuracy.
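The prompt-budget trade-off is easy to sanity-check mechanically. Here is a minimal sketch using the 224-token budget and the 4-characters-per-token rule of thumb cited in the glossary-building section below; the function name and the comma-joined prompt format are illustrative assumptions, not Whisper's actual tokeniser.

```python
def fits_prompt_budget(terms, token_budget=224, chars_per_token=4):
    """Estimate whether a comma-joined glossary fits Whisper's
    initial_prompt budget, using the 4-chars-per-token rule of thumb.
    Returns (fits, estimated_tokens)."""
    prompt = ", ".join(terms)
    est_tokens = len(prompt) / chars_per_token
    return est_tokens <= token_budget, round(est_tokens)
```

A script-scoped glossary of 30–60 clinical terms lands comfortably inside the budget; dumping an entire anatomy index into the same check fails immediately, which is the quantitative version of the advice above.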

ICD-10, CPT, and DSM codes

Code mishandling is structural rather than acoustic. Whisper can accurately transcribe the speaker's "I C D ten E eleven sixty five" or "C P T nine nine two one three", but the resulting caption is wrong because the audit reader expects "ICD-10 E11.65" and "CPT 99213" — with hyphens, decimals, and digits. The fix is a post-processing pass that pattern-matches the spelled-out code forms and rewrites them into the canonical form. GlossCap's pipeline includes this pass by default for medical content; you can do it yourself with a regex if you are running Whisper on your own infrastructure. The category is included in the audit table because it counts as an error against the WCAG bar; it is the only one of the five proper-noun categories where prompt biasing alone is not sufficient.
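Here is a minimal sketch of that post-processing pass. The two patterns cover only the spoken shapes from the audit table ("I C D ten …", "C P T …"); a production pass would need the full closed set of code formats, and the number-word vocabulary here is deliberately small.

```python
import re

# Spoken-number vocabulary for the code formats in the audit (not exhaustive).
WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
         "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
         "ten": "10", "eleven": "11", "twelve": "12"}
TENS = {"twenty": "2", "thirty": "3", "forty": "4", "fifty": "5",
        "sixty": "6", "seventy": "7", "eighty": "8", "ninety": "9"}
# Longer alternatives first so "sixty" is not shadowed by "six".
NW = (r"(?:eleven|twelve|twenty|thirty|forty|fifty|sixty|seventy|eighty"
     r"|ninety|zero|one|two|three|four|five|six|seven|eight|nine|ten)")

def _numbers(tokens):
    """['eleven', 'sixty', 'five'] -> ['11', '65']"""
    out, i = [], 0
    while i < len(tokens):
        t = tokens[i].lower()
        if t in TENS and i + 1 < len(tokens) and tokens[i + 1].lower() in WORDS:
            out.append(TENS[t] + WORDS[tokens[i + 1].lower()])
            i += 2
        elif t in TENS:
            out.append(TENS[t] + "0")
            i += 1
        else:
            out.append(WORDS[t])
            i += 1
    return out

def normalise_codes(text):
    """Rewrite spelled-out ICD-10 and CPT codes into canonical caption form."""
    def icd(m):
        nums = _numbers(m.group(3).split())
        code = m.group(2).upper() + nums[0] + ("." + nums[1] if len(nums) > 1 else "")
        return f"ICD-{WORDS[m.group(1).lower()]} {code}"

    def cpt(m):
        return "CPT " + "".join(WORDS[w.lower()] for w in m.group(1).split())

    text = re.sub(
        rf"\bI\s*C\s*D\s+(nine|ten|eleven)\s+([A-Za-z])\s+({NW}(?:\s+{NW}){{0,3}})\b",
        icd, text)
    text = re.sub(rf"\bC\s*P\s*T\s+({NW}(?:\s+{NW}){{4}})\b", cpt, text)
    return text
```

Running this over the caption text turns "I C D ten E eleven sixty five" into "ICD-10 E11.65" and "C P T nine nine two one three" into "CPT 99213", which is the form the audit reader expects.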

Brand-vs-generic and combination products

Brand-vs-generic ambiguity is the failure mode where the speaker says "Ozempic" but the caption needs to read whatever the speaker actually said, not a brand-or-generic substitution. The glossary should list both surface forms (e.g. "Ozempic" and "semaglutide" both as separate entries, not as a mapping), and the prompt prefix should not include any directive about substitution. For combination products — Janumet (sitagliptin/metformin), Entresto (sacubitril/valsartan) — list the brand name and every component. Combination products are a high-error subcategory because the brand name is opaque to the language model; without a glossary entry, "sacubitril/valsartan" becomes "Saku bit rel vol sartan" 100% of the time on a clear narrator, and the brand "Entresto" becomes "in trust oh".

Building the medical glossary: what to include, what to leave out

A useful clinical glossary is smaller than people first assume. The constraint is the 224-token Whisper prompt budget — see the implementation post for the mechanics of why and the 4-character-per-token rule of thumb. For a typical 12-minute medical training video, 30–60 entries is the working range. Above 60, ordering effects start dominating and the prompt overflows; below 30, you are leaving accuracy on the table. The 48-term glossary used in this audit covered the five proper-noun categories in the audit table above.

Sources for the glossary you actually need to build, in order of effort: (1) the script of the training video itself, if you have it — extract every proper noun once, deduplicate, lowercase the conversational ones, capitalise the brand-name and acronym surface forms; (2) the slide deck if no script — slide titles and bold callouts cover most of the high-frequency terms; (3) your formulary or training catalogue index if neither — exhaustive but you will waste prompt budget on terms not mentioned in this specific video. We recommend per-video glossaries for batch processing, with a per-workspace inheritance model so common terms do not need to be re-entered. GlossCap's medical workflow implements this directly; on your own pipeline, a per-video JSON file mapping video_id → [terms] is sufficient.
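On a self-hosted pipeline, the per-video JSON approach is a few lines. Here is a sketch against the openai-whisper Python package; the file layout, the "Glossary:" prompt prefix, and the video ID are illustrative assumptions, and the joined prompt must stay inside the 224-token budget discussed above.

```python
import json
from pathlib import Path

def load_glossary(glossary_file, video_id):
    """Per-video glossary: a JSON file mapping video_id -> [terms]."""
    return json.loads(Path(glossary_file).read_text())[video_id]

def build_prompt(terms):
    """Comma-join the glossary into a previous-text prompt string."""
    return "Glossary: " + ", ".join(terms)

def transcribe_with_glossary(audio_file, terms, model_name="large"):
    """Run Whisper with the glossary fed via initial_prompt."""
    import whisper  # pip install openai-whisper
    model = whisper.load_model(model_name)
    return model.transcribe(audio_file, initial_prompt=build_prompt(terms))
```

Usage is one call per module, e.g. `transcribe_with_glossary("module-01.wav", load_glossary("glossaries.json", "module-01"))`, which keeps the glossary versioned alongside the video rather than buried in a script.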

The HIPAA workflow: what data leaves the org, what does not

Medical L&D leads spend more time in compliance review than in caption review, so this section is worth getting right. The relevant fact is that a well-run training programme deliberately does not contain PHI in its audio. Training videos teach about drug protocols, surgical workflows, coding standards, and clinical decision pathways — they teach about patients in the abstract, not about specific patients in the concrete. The captioning pipeline therefore typically processes audio that contains drug names, procedure descriptions, and clinical workflow instruction with no patient-identifying information. That fact does most of the HIPAA work for you. The operational residue is small: confirm BAA coverage with the vendor where your policy requires it, keep tenant scoping so training content stays inside your workspace, and keep any case-study material governed upstream rather than in the captioning layer.

The shared denominator across all of this is that captioning is a downstream layer; the upstream content-governance layer is where PHI questions are resolved. If your upstream is clean, your captioning workflow is clean by extension.

The four error classes glossary biasing will not fix

It is honest to say where the technique runs out, because the procurement conversation goes better when the limits are stated upfront. Four error classes survive the glossary fix and need a separate intervention:

  1. Acoustically colliding glossary terms. "apixaban" and "rivaroxaban" share a six-syllable shape with similar vowel frontness and a near-identical stress pattern; on a fast narrator both glossary entries are within decoder beam range and the model occasionally writes the wrong one. The audit caught this twice in 1,612 words — both directions, once each — and the fix is a human review pass on flagged glossary collisions, not a prompt change.
  2. Off-script abbreviations the speaker introduces mid-talk. If the narrator says "we'll abbreviate that as TZD for the rest of the module" and the abbreviation TZD is not in the glossary, every subsequent occurrence will be heard as "T Z D" or "tease dee" or "tee zed" depending on accent. The fix is to update the glossary mid-batch — not technically a glossary-biasing failure, but a process gap to plan around.
  3. Speaker-introduced semantic substitutions. The "GLP-1 receptor agonist / antagonist" failure is the classic example. Both surface forms exist in the language; both are valid English; the audio differs by one syllable. Glossary biasing helps but is not perfect — 9 of 11 fixed in the audit, 2 remaining required human review. There is no purely automatic fix for genuine acoustic ambiguity between two valid medical terms.
  4. Code formatting (ICD/CPT/DSM/NDC). Glossary biasing helps Whisper preserve the spoken-out form; converting "I C D ten E eleven sixty five" to "ICD-10 E11.65" is a post-processing pass, not a decoder change. The pass is small (a regex over a closed set of code formats) and we run it by default on medical content; on a self-hosted pipeline you would write it yourself.
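Class 1 can at least be flagged automatically before a human ever reads the transcript. The sketch below is a rough heuristic: the consonant-skeleton comparison and the 0.5 threshold are assumptions standing in for a real phonetic model, but they are enough to surface pairs like apixaban / rivaroxaban for the review queue.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

def skeleton(term):
    """Crude acoustic proxy: lowercase, letters only, vowels dropped
    after the first character, so names with similar consonant
    scaffolding compare as close."""
    s = re.sub(r"[^a-z]", "", term.lower())
    return s[:1] + re.sub(r"[aeiou]", "", s[1:])

def collision_pairs(terms, threshold=0.5):
    """Return glossary pairs similar enough to risk decoder confusion;
    flagged pairs go to the human review queue."""
    flagged = []
    for a, b in combinations(terms, 2):
        score = SequenceMatcher(None, skeleton(a), skeleton(b)).ratio()
        if score >= threshold:
            flagged.append((a, b, round(score, 2)))
    return flagged
```

On a glossary containing apixaban, rivaroxaban, tirzepatide, and cholecystectomy, this flags the anticoagulant pair and leaves the unrelated terms alone; the flagged pairs become the human-review checklist for class 1.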

The four classes together account for the remaining 1.2% raw error rate on the medical audit — 19 errors out of 1,612 words after glossary biasing, of which roughly 12 needed human attention. That is the realistic ceiling of automated captioning on terminology-dense clinical content; everything beyond it requires a reviewer with subject-matter knowledge. The good news is that 12 errors per 12 minutes is roughly 5 minutes of focused review, against the 48-minute baseline correction time the same content would have demanded under the 4× real-time multiplier we documented in the hidden-half-FTE post. The 90% time saving is the operational win.

What to actually do this week if you run medical L&D

The short, time-boxed sequence we recommend to any L&D operations lead at a health system, life-sciences company, or academic medical centre:

  1. Monday — pull a sample. Pick three modules from your back-catalogue that represent your main vertical mix (one drug-heavy, one procedure-heavy, one coding-heavy). Pull the existing auto-captions, sample 5 minutes of each, score against a manual transcript using the DCMP method described in the 99%-accuracy post. The output is a per-vertical accuracy number against the WCAG 99% bar.
  2. Tuesday — categorise the failures. For each error in the sample, tag the category (drug INN, brand, procedure, anatomy, code, conversational). Match against the table in this post. The tagging tells you which glossary categories are highest-leverage for your specific catalogue — the proportions in this audit (79% proper-noun, 21% conversational) are typical but your mix may skew differently.
  3. Wednesday — assemble the per-module glossaries. For each of the three sampled modules, extract the full proper-noun list using its script or slide deck. Aim for 30–60 terms per module. Save them in a structured format (one JSON file per module is sufficient) so they can be re-applied on rerun.
  4. Thursday — run a glossary-biased pass. If you are using GlossCap, upload the modules with the glossaries attached; the pipeline applies the prompt biasing automatically. If you are running your own Whisper, paste the glossary into the initial_prompt argument. Re-score the same 5-minute samples. The expected lift is into the 99%+ range; if you do not see it, the highest-likelihood cause is that the glossary missed three or more terms that appear repeatedly in the sample, which a one-cycle glossary review will fix.
  5. Friday — write the half-page. One slide: per-vertical accuracy before/after, the reviewer time saving per 12-minute module, the projected back-catalogue cleanup time at the team's current production cadence, and the procurement decision. The audit is the document; this post is the template. Send it to your VP and your privacy officer (the BAA conversation, if relevant, is the privacy officer's to open).
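For the Monday scoring step, a minimal word-accuracy scorer is enough to get the per-vertical number. This is a simplification of the DCMP protocol (which has additional rules about punctuation, casing, and timing): it counts substitutions, insertions, and deletions against the reference word count using a standard-library alignment.

```python
from difflib import SequenceMatcher

def word_accuracy(reference, hypothesis):
    """Word-level accuracy = 1 - edit errors / reference word count,
    using difflib's alignment over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    errors = 0
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag == "replace":
            errors += max(i2 - i1, j2 - j1)  # substitutions (count the wider side)
        elif tag == "delete":
            errors += i2 - i1                # words dropped from the caption
        elif tag == "insert":
            errors += j2 - j1                # words the model invented
    return max(0.0, 1 - errors / len(ref))
```

Score each 5-minute sample's auto-caption text against its manual transcript and compare the result to the 0.99 bar; anything below it goes into the Tuesday categorisation pass.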

If your sample comes back already at the 99% bar with default auto-captions — vanishingly unlikely on terminology-dense clinical content but not impossible if your videos are unusually slow-narrated and your vocabulary is unusually general — do not buy. If your sample comes back below the bar, the audit is the document that justifies the spend; the labour-line ROI calculation in the half-FTE post handles the second half of the procurement conversation. We have pricing that lets you run the audit on real assets without a procurement cycle (Solo at $29/month, 5 hours included), and the medical workflow page describes the BAA-and-tenant scoping for the formal-procurement path.

FAQ

Why is medical content so much harder than engineering content?

Two reasons. First, proper-noun density: a 12-minute pharmacology module mentions drugs, procedures, codes, or anatomy every 9 seconds; a 12-minute engineering onboarding clip mentions SDK names, API methods, or service names every 22 seconds. Second, training-data sparsity: WHO INNs and modern brand names were coined after most of Whisper's training corpus was indexed, while the engineering vocabulary (kubectl, PyTorch, Helm) appears in millions of GitHub README files and dev-conference talks that did make it in. The gap is structural and is reproducible on every clinical training video we have audited.

Does this work for non-English clinical content?

Whisper is multilingual, but the glossary-biasing technique works best in the language Whisper is decoding into. For Spanish-language pharmacology training, build the glossary in Spanish (the same drug INNs apply across languages, but procedure names, anatomy, and codes have language-specific surface forms). For multilingual modules with code-switching, glossary biasing degrades: the prompt prefix is per-language, and code-switching modules need a more complex pipeline that we do not currently support out of the box.

What about narrators with non-native English accents?

Whisper-large is robust to accented English in the conversational range; the glossary biasing helps roughly the same amount on accented and non-accented narrators because the failure mode is decoder-side, not acoustic-side. The category where accents matter most is brand names — a strong accent can push a brand name's acoustic signal further from the model's prior, so the glossary fix is more impactful, not less. We have audited South Asian, West African, Latin American, and Eastern European-accented narrators on the same script and the relative improvement from glossary biasing is consistent.

Can the glossary include patient-specific information for a case-study module?

It can technically, but it should not. Even de-identified case detail introduces PHI-adjacent content into a system that does not need it. The glossary is for terminology — drug names, procedure names, anatomy, codes, classes. Case-specific patient details belong in the source video governance layer, not in the captioning glossary. If you find yourself wanting to add case detail to the glossary, the underlying need is usually a different fix (a content-management decision, not a captioning one).

How does this audit relate to the WCAG 2.1 AA threshold and the ADA Title II deadline?

WCAG SC 1.2.2 (Captions, Prerecorded) is the binding criterion. The 99% accuracy threshold comes from the DCMP Captioning Key, which is the protocol auditors use; see the 99%-accuracy post for the full reference. ADA Title II became enforceable for large state and local government entities — including most public academic medical centres and state-run health systems — on 2026-04-24; the Title II sprint plan walks through what those entities need to fix this week. Section 508 is the federal-contractor companion. EAA is the EU equivalent. All of them converge on SC 1.2.2 prerecorded captions at the 99% accuracy bar.

What does this look like for a vendor with their own private speech model?

The technique transfers. Any decoder that supports a "previous-text" or "context" prompt — Deepgram Nova, AssemblyAI Universal, Azure custom speech — exposes some equivalent of the Whisper initial_prompt, sometimes called keyword-boost or hot-words. The category names differ; the mechanic (logit bias on a domain term list, fed at decode time) is the same. If you are evaluating vendors for medical content, ask each vendor specifically how they handle drug INNs, brand names, and procedure terms — the answers will sort the vendors quickly. The flat-monthly subscription model with glossary biasing baked in (see the Rev, 3Play, and Verbit comparisons) is the option specifically scoped to mid-market clinical L&D.

Further reading