Technical · Published 2026-05-31
The proper noun problem in training video captions: 15 categories of words that break auto-captions, and what to do about each one
Every post in this blog is, at some level, about the same underlying problem: automatic speech recognition models were trained on general English, and training video is not general English. This post is the systematic version of that argument. We catalogued every category of word that reliably fails ASR in domain-specific training content — 15 categories in total — and for each one we explain why it fails, what the failure looks like in practice, and what glossary injection can and cannot do about it. If you have read our earlier posts on drug-name failures in medical training and IUPAC chemical-name failures in HazCom training, this post gives those observations their taxonomic home and extends the frame to every other vertical we cover.
TL;DR
ASR models fail on proper nouns for a single root cause: the words that matter most in your training video are the words that appear least frequently in the model's training data. Low frequency means low decoder confidence, which means the model substitutes a phonetically similar high-frequency word instead. Fifteen categories of domain vocabulary reliably trigger this failure: software product names, proprietary acronyms, pharmaceutical INNs, IUPAC chemical names, regulatory citations, financial instrument codes, medical eponyms, person names, geographic and institutional names, operator codes, part numbers, certification codes, CLI commands, regulatory body names, and non-English loanwords. Glossary injection — passing domain terms to the Whisper decoder as a soft prompt — closes 85–95% of proper-noun errors in most verticals. The remaining 5–15% require a post-processing text-normalisation pass for citation formats and code strings, plus a human-review step for low-confidence segments. The engineering implementation and the DCMP accuracy measurement protocol are in companion posts.
Why ASR models fail on domain-specific proper nouns
Training distribution is the root cause
Whisper-large-v3, the state-of-the-art open-source ASR model at the time of writing, descends from a model family originally trained on 680,000 hours of speech scraped from the internet; OpenAI describes the large-v3 corpus as larger still, with much of the added audio pseudo-labelled by an earlier Whisper model. That corpus is dominated by conversational English, broadcast news, audiobooks, YouTube commentary, and podcast dialogue. The vocabulary distribution of those sources is heavily weighted toward the 50,000 most common English words. Domain-specific proper nouns (pharmaceutical international nonproprietary names, IUPAC systematic chemical names, OSHA section citations, Kubernetes sub-commands) appear in that corpus at frequencies orders of magnitude lower than their frequency in your training video.
The practical consequence is this: when the Whisper decoder encounters the acoustic pattern for "etanercept," it has a very small number of training examples where that sound pattern was paired with that orthographic form. It has a very large number of training examples where phonetically overlapping words like "ethernet" appeared. Beam search picks the hypothesis that maximises the posterior over the entire sequence — which means a word the model has seen 500,000 times will beat a word it has seen 300 times, even if the phonetic evidence slightly favours the rarer word.
This is not a flaw in the model. It is a flaw in the deployment context. A model optimised for general English will perform correctly on general English. It will degrade predictably on domain-specific vocabulary, and the degree of degradation scales directly with how far the domain vocabulary lies from the general-English training distribution.
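The frequency argument above can be made concrete with a toy calculation. This is a deliberately simplified noisy-channel sketch, not how Whisper actually factorises its scores: the acoustic likelihoods are invented for illustration, and the counts are the round figures from the paragraph above.

```python
import math

# Illustrative numbers only; nothing here is measured from a real model.
candidates = {
    # hypothesis: (assumed acoustic likelihood, assumed training-corpus count)
    "etanercept": (0.60, 300),
    "ethernet kept": (0.40, 500_000),
}

total = sum(count for _, count in candidates.values())


def log_score(acoustic: float, count: int) -> float:
    """Log of (acoustic likelihood x frequency-based prior)."""
    return math.log(acoustic) + math.log(count / total)


scores = {word: log_score(a, c) for word, (a, c) in candidates.items()}
winner = max(scores, key=scores.get)

for word, score in scores.items():
    print(f"{word:15s} {score:7.3f}")
print("winner:", winner)
```

Even with the acoustic evidence favouring the rarer word, the prior term dominates the sum, so the high-frequency hypothesis wins.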
The decoder confidence gap
The substitution pattern is not random. ASR models fail on proper nouns in two systematic ways:
- Phonetically plausible substitution: the model outputs a real English word that sounds like the proper noun. "Etanercept" becomes "ethernet kept." "Kubernetes" becomes "cube ernetus." "SOFR" becomes "soccer" or "sofa." "FinCEN" becomes "fin sin" or "thin skin." These are the most damaging errors because they produce grammatically plausible output that a reviewer reading without audio might not catch.
- Orthographic collapse: the model outputs the correct pronunciation but the wrong spelling. "Datadog" becomes "Data Dog" (split into two words). "CircleCI" becomes "circle CI" (mixed-case collapse). "kubectl" becomes "cube control" (phonetic expansion of the abbreviation). These errors are easier to catch in review because the format is visually wrong, but they are still incorrect for a caption file that will be searched or indexed.
In both cases the underlying cause is the same: the decoder's confidence on the domain-specific token is low enough that a nearby high-frequency alternative wins the beam search.
Why glossary injection helps
Whisper accepts a prompt: a string prepended to the decoding context as if it were prior output (the parameter is called "prompt" in OpenAI's hosted transcription API and "initial_prompt" in the open-source package). When you pass a 50-term glossary as this prompt, the decoder sees those terms as "words that recently appeared in this transcript." The attention mechanism gives extra weight to hypotheses that reuse recently-seen vocabulary. For tokens where the domain form and the phonetically-similar general-English form are close in likelihood, the glossary prompt tips the balance toward the domain form.
This is not magic: it works because the glossary terms appear in the model's training data (just at low frequency), so their token representations exist in the embedding space. The prompt does not teach the model new words; it biases the prior toward words the model already knows, in contexts where those words are plausible. The engineering implementation post covers the prompt-construction algorithm, the 224-token budget, and when biasing fails.
For words that are genuinely outside the model's vocabulary — extremely rare chemicals, proprietary codes, newly-coined product names — glossary injection may not help. Post-processing text normalisation (regex passes for known code patterns) and human review of low-confidence segments are the backstop.
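A minimal sketch of building such a prompt under the 224-token budget follows. The function names are hypothetical, and the default token counter is a crude four-characters-per-token assumption standing in for the real Whisper tokenizer; swap in an exact counter if you have one.

```python
from typing import Callable, Iterable


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: assume roughly 4 characters per token."""
    return max(1, (len(text) + 3) // 4)


def build_glossary_prompt(
    terms: Iterable[str],
    budget: int = 224,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> str:
    """Join terms (highest priority first) into one prompt that fits the budget."""
    kept: list[str] = []
    seen: set[str] = set()
    for term in terms:
        term = term.strip()
        if not term or term.lower() in seen:
            continue  # skip blanks and case-insensitive duplicates
        candidate = ", ".join(kept + [term])
        if count_tokens(candidate) > budget:
            break  # stop at the first overflow so priority order is preserved
        kept.append(term)
        seen.add(term.lower())
    return ", ".join(kept)


prompt = build_glossary_prompt(["Kubernetes", "PagerDuty", "etanercept", "kubernetes"])
print(prompt)

# With the open-source `whisper` package installed, the prompt would be passed
# roughly like this (the file name is a placeholder):
#   model = whisper.load_model("large-v3")
#   result = model.transcribe("lesson.mp3", initial_prompt=prompt)
```

Ordering the input by priority matters here: once the budget is hit, everything after that point is dropped.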
Category 1: Software and platform product names
What it is
SaaS product names, open-source project names, and platform brand names that appear in technology, enterprise software, and developer-tooling training video. This category is most prominent in engineering onboarding, sales enablement, and cybersecurity training.
Why ASR fails
Platform product names are usually CamelCase compound words, portmanteau constructions, or deliberate misspellings designed for trademark distinctiveness. These characteristics make them visually memorable but acoustically ambiguous. Compound words where both halves are common English words (Datadog, Salesforce, ServiceNow, PagerDuty) tend to split into two tokens in the transcript. Coined words with foreign-language roots (Kubernetes from Greek "kybernetes," Kafka from the author's surname) may be decoded correctly when they are common enough in the training corpus, but newer or rarer platform names will degrade consistently.
Representative failure examples
| Spoken | ASR output (default) | Correct |
|---|---|---|
| Kubernetes | "cube ernetus" / "cube ernetes" | Kubernetes |
| PagerDuty | "pager duty" (split) | PagerDuty |
| Datadog | "data dog" (split) | Datadog |
| Terraform | "terra form" (split) | Terraform |
| HashiCorp | "hash he corp" / "hash corp" | HashiCorp |
| Grafana ("gra-FAH-na") | "Grapevine" / "Grafana" (inconsistent) | Grafana |
| Splunk | "Splunk" (usually correct — phonetically simple) | Splunk |
| CircleCI | "circle CI" / "circle SEI" (inconsistent capitalisation) | CircleCI |
| Chronosphere | "chrono sphere" (split) | Chronosphere |
| OpenTelemetry | "open telemetry" (split, inconsistent capitalisation) | OpenTelemetry |
Glossary injection result
This is one of the categories where glossary injection works best. Platform names appear in the model's training data (blog posts, documentation, podcast transcripts) at low but non-negligible frequency, so their token representations are present in the embedding space. A glossary of 20–40 product names from your tech stack will close 90–95% of substitution errors in this category. The remaining failures are usually orthographic (capitalisation, compound vs. split) rather than phonetic, and can be caught by a case-sensitive post-processing pass.
List every product name your training content mentions. Do not list common English words that happen to be used as product names (e.g., "Notion," "Slack") — their general-English frequency is high enough that the model handles them correctly. Focus on names with unusual orthography, portmanteau construction, or foreign-language roots.
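The post-processing pass for split compounds could look something like the sketch below. The table of parts is a hypothetical example of a tech-stack list, not a complete one, and a generic phrase such as "open telemetry" used in its ordinary sense would be rewritten too, so review the list against your content.

```python
import re

# Hypothetical stack list: canonical name -> the parts ASR tends to split it into.
PRODUCT_PARTS = {
    "PagerDuty": ("pager", "duty"),
    "Datadog": ("data", "dog"),
    "OpenTelemetry": ("open", "telemetry"),
    "CircleCI": ("circle", "ci"),
}


def compile_rules(parts_by_name):
    """Build one case-insensitive pattern per name, tolerating spaces or hyphens between parts."""
    rules = []
    for canonical, parts in parts_by_name.items():
        body = r"[\s-]*".join(re.escape(part) for part in parts)
        rules.append((re.compile(rf"\b{body}\b", re.IGNORECASE), canonical))
    return rules


RULES = compile_rules(PRODUCT_PARTS)


def fix_product_names(text: str) -> str:
    """Rewrite split or mis-cased product names to their canonical spelling."""
    for pattern, canonical in RULES:
        text = pattern.sub(canonical, text)
    return text


print(fix_product_names("Alerts flow from data dog to pager duty via open telemetry."))
```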
Category 2: Proprietary acronyms and initialisms
What it is
Technology and business initialisms that are either spoken letter-by-letter ("S-D-K") or pronounced as a word ("SCIM," "SAML," "OAuth"). This category appears across every vertical — engineering, compliance, HR, sales — because every domain has its own set of initialisms.
Why ASR fails
The decoder faces a choice for each initialism: output the letters individually (SDK → "S-D-K") or output a single word form (SDK → "sdk"). The model has seen both forms in training data, but the frequencies vary by initialism. For very common initialisms (API, SDK, SQL), the model usually gets the word form correct but may drop or add hyphens. For rarer initialisms, the model may swap to a phonetically similar word or spelling: SCIM comes out as "skim"; YAML (spoken "yam-ul") comes out as "YAML" (correct) or "yamel" (incorrect); OIDC comes out as "OI-DC" or "OIDC."
Initialisms that are pronounced as acronyms are particularly fragile. SAML is "sam-ul" — close to "Samuel." OAuth is "oh-auth" — the model may produce "O'Auth" with an apostrophe, "o auth" (split), or "OAuth" (correct). The capitalisation convention for multi-word acronyms like "CI/CD" is also inconsistently handled: the model may output "CI CD," "CICD," "C-I/C-D," or "CI/CD" depending on context.
Representative failure examples
| Spoken | ASR output (default) | Correct |
|---|---|---|
| SCIM | "skim" / "scheme" | SCIM |
| OAuth 2.0 | "O auth 2.0" / "O'Auth 2.0" | OAuth 2.0 |
| OIDC | "OI-DC" / "OIDC" | OIDC |
| CI/CD | "CI CD" / "CICD" (no slash) | CI/CD |
| GitOps | "git ops" (split) | GitOps |
| RBAC | "R-BAC" / "are back" | RBAC |
| YAML | "YAML" / "yamel" | YAML |
| JSON | "Jason" (very common failure) | JSON |
| IAM | "I am" / "IAM" | IAM |
| mTLS | "M TLS" / "em TLS" / "mutual TLS" | mTLS |
Glossary injection result
Works well for pronunciation-to-form mapping but does not fully control capitalisation. Including "JSON" in the glossary will pull the decoder away from "Jason," but the capitalisation of the output depends on the surrounding context. A post-processing pass that converts case-insensitive matches of known initialisms to their canonical form (json → JSON, yaml → YAML, oauth → OAuth) handles the residual capitalisation errors efficiently. The combination of glossary injection plus a case-normalisation regex pass closes 90–98% of errors in this category.
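A minimal sketch of that case-normalisation pass, assuming a hand-maintained list of canonical forms (the list here is illustrative). It only fixes casing; a phonetic substitution such as "Jason" is left for the glossary to prevent.

```python
import re

# Illustrative list; in practice this would come from your glossary.
CANONICAL = ["JSON", "YAML", "OAuth", "SCIM", "RBAC", "GitOps", "mTLS", "OIDC"]

_LOOKUP = {term.lower(): term for term in CANONICAL}
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(_LOOKUP, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def normalise_initialisms(text: str) -> str:
    """Replace any casing of a known initialism with its canonical form."""
    return _PATTERN.sub(lambda m: _LOOKUP[m.group(1).lower()], text)


print(normalise_initialisms("Parse the json payload, then check the oauth scopes in yaml."))
```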
Category 3: Pharmaceutical drug names (INNs and brand names)
What it is
International nonproprietary names (INNs) — the WHO-assigned generic names for active pharmaceutical ingredients — plus brand names, generic drug names, and combination-product names. This category dominates errors in HIPAA training, medical training video, HealthStream content, Relias training, and FDA-regulated industry training. We published a full post on this category in the context of medical training: Captioning medical training video: why Whisper mangles drug names and how to fix it.
Why ASR fails
INNs are constructed according to WHO stem conventions designed to indicate the drug class phonetically, but those conventions use Greek and Latin morphemes that are rare in general English. Monoclonal antibodies (suffix "-mab"), kinase inhibitors ("-nib"), receptor antagonists ("-sartan"), and PDE5 inhibitors ("-afil") are constructed according to rules that are systematic within pharmacology but do not map to any English word-formation pattern that a general-language ASR model would have learned. If a model has seen "pembrolizumab" only 50 times in its training data, the correct spelling will consistently lose to phonetically close alternatives such as "pembrolisome" or "pembro-liz-you-map."
Brand names add a second problem: they are coined to be distinctive, both as trademarks and because regulators (the FDA among them) screen proposed names for look-alike and sound-alike confusion with existing drug names as a safety measure. By design, that pushes them into low-frequency acoustic territory for ASR. Humira, Keytruda, Eliquis, and Xarelto are all brand names constructed to be non-confusable with anything familiar, which is exactly why ASR replaces them with something familiar.
Representative failure examples
| Drug name | ASR output | Error type |
|---|---|---|
| pembrolizumab | "pembro-liz-you-map" / "pembrolism ab" | Phonetic expansion |
| adalimumab | "adal-imum-ab" / "adaliminium" | Phonetic expansion |
| apixaban | "apex-a-ban" / "apex ban" | Plausible substitution |
| sirolimus | "serious imus" / "siro-limus" | Plausible substitution |
| oseltamivir | "oselta-miver" / "oseltam-i-veer" | Phonetic expansion |
| etanercept | "ethernet kept" / "ethan-er-sept" | Plausible substitution |
| Humira | "Hyundai" / "Humor" / "Humira" (inconsistent) | Plausible substitution |
| Eliquis | "elixirs" / "El-ee-kwis" | Phonetic expansion |
| Xarelto | "Xarelt" / "Sarah owe" | Plausible substitution |
Glossary injection result
Drug names are the category where glossary injection has the most dramatic measurable effect. Our medical training post measured 87.6% accuracy at default settings, rising to 99.4% with a 48-term drug-name glossary — 154 of 158 proper-noun errors closed. The reason the improvement is so large is that drug names, while rare in general English, appear in Whisper's training data in clinical and pharmaceutical contexts. The token representations exist; the prior probability just needs to be elevated by the glossary prompt. Start with every drug that appears in your training content by INN and brand name. Include the stem suffix where the class matters (e.g., "-mab" class: include adalimumab, pembrolizumab, trastuzumab, rituximab — the model will generalise within the class once primed).
Category 4: IUPAC chemical names (systematic names)
What it is
IUPAC systematic chemical names — the internationally standardised chemical nomenclature used in safety data sheets, laboratory training, and hazardous-materials procedures. Dominant in manufacturing training, construction safety training, and EHS/safety training. We published a detailed post on this category: Captioning HazCom training: why SDS chemical names break ASR.
Why ASR fails
IUPAC names are constructed from Greek and Latin roots combined with structural locant numerals embedded mid-word (2,4-dichlorophenoxyacetic acid; 1-bromopropane; methyl 2-methylpropanoate). These numerals are spoken as cardinal numbers that the decoder must reconstitute as embedded locant notation — a format that does not appear in general English text. The resulting acoustic sequence is a long, unusual phoneme chain with embedded number words, which the model has almost never seen as a unit in training data.
IUPAC names are also substantially longer than pharmaceutical INNs — a systematic name for a common industrial solvent may be 8–12 syllables long, giving the decoder many points at which it can deviate from the correct form. The HazCom post measured 87.7% overall accuracy on a 15-minute chemical-safety module, with 83% of errors clustering in IUPAC and related chemical-vocabulary categories.
Representative failure examples
| Chemical name | ASR output | Error type |
|---|---|---|
| 2,4-dichlorophenoxyacetic acid | "2,4-dichloro-phenoxy-acetic acid" (usually split) / "2,4 dichloro phenoxy acetic acid" | Locant format, spacing |
| methyl isocyanate | "methyl ice-o-cyanate" / "methyl isocyanate" (correct sometimes) | Plausible expansion |
| tetrachloroethylene | "tetra-chloro-ethylene" (split) / "tetrachlorethylene" | Spacing, syllable deletion |
| 1,3-butadiene | "1,3 buta-diene" / "1,3 butadiene" (correct form needs punctuation) | Locant format |
| diisocyanate | "die-iso-cyanate" / "di-iso-cyanate" | Prefix expansion |
| trichloroethylene (TCE) | "trichloroethylene" (often correct) / "tri-chloro-ethylene" | Spacing (common enough to handle) |
Glossary injection result
More limited than for pharmaceutical INNs, because the long length of IUPAC names gives the decoder many positions to deviate. Glossary injection raises accuracy significantly for names under 6 syllables, but for 10+ syllable systematic names, a combined approach works better: (1) include the name in the glossary prompt; (2) add a post-processing normalisation pass that converts spoken-number forms back to locant notation (e.g., "two comma four" → "2,4-"); (3) flag any segment where a CAS number or chemical name is spoken for human review. The HazCom post measures the combined approach at 99.1% vs. 87.7% default.
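Step (2) might be sketched as follows. To limit false positives, this version only rewrites number words that sit directly in front of a name from a known list; both the list and the number-word table are illustrative assumptions.

```python
import re

NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5", "six": "6"}
# Hypothetical list of names that take locants in this course's SDS content.
LOCANT_STEMS = ["dichlorophenoxyacetic acid", "butadiene", "bromopropane"]

_NUM = r"(?:\d|" + "|".join(NUMBER_WORDS) + r")"
_PATTERN = re.compile(
    rf"\b({_NUM}(?:(?:\s*,\s*|\s+comma\s+|\s+){_NUM})*)[\s-]+("
    + "|".join(re.escape(stem) for stem in LOCANT_STEMS)
    + r")\b",
    re.IGNORECASE,
)


def _to_locants(match: re.Match) -> str:
    """Turn 'two comma four' or '1,3' plus a stem into '2,4-stem' or '1,3-stem'."""
    tokens = re.findall(r"\d|[a-z]+", match.group(1).lower())
    digits = [NUMBER_WORDS.get(tok, tok) for tok in tokens if tok != "comma"]
    return ",".join(digits) + "-" + match.group(2)


def normalise_locants(text: str) -> str:
    return _PATTERN.sub(_to_locants, text)


print(normalise_locants("Store two comma four dichlorophenoxyacetic acid away from heat."))
```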
Category 5: Regulatory citations and code references
What it is
Regulatory citation formats used when instructors reference specific rules: CFR sections (29 CFR § 1910.147), NIST SP numbers (NIST SP 800-53), ICD-10/CPT codes (E11.9, 99213), ASTM and ISO standards (ASTM E119, ISO 9001:2015 clause 7.2), and similar. This category is critical in compliance training, government employee training, banking compliance training, HIPAA training, and construction safety training.
Why ASR fails
Regulatory citations are spoken as cardinal numbers but must be formatted as dotted-decimal reference codes. When an instructor says "twenty-nine CFR Section nineteen ten point one forty-seven," the decoder must reconstitute that as "29 CFR § 1910.147." The challenge is the format conversion: the spoken form contains no signal for the decimal point placement, the section-mark character (§), or the period-versus-decimal-point ambiguity. The model defaults to whichever numeric notation appears most often after similar phoneme sequences in its training data, which is usually prose cardinal form: "29 CFR 1910 dot 147" or "29 CFR 1910.147" — sometimes correct, but without the section mark and inconsistently formatted.
ICD-10 codes are spoken with dot notation embedded: "E-11 point 9" for E11.9 (type 2 diabetes mellitus without complications). The model may produce "E 11.9," "E11.9," or "E 11 point 9" — correct in content but inconsistent in format. CPT codes (five-digit numeric) are usually handled correctly as five-digit numbers but the context annotation ("CPT ninety-nine two thirteen" for office visit 99213) sometimes produces "CPT 99213" vs "CPT code 99213" vs the bare number.
Representative failure examples
| Spoken form | ASR output | Intended |
|---|---|---|
| "twenty-nine CFR section nineteen ten point one forty-seven" | "29 CFR 1910.147" / "29 CFR Section 1910 dot 147" | 29 CFR § 1910.147 |
| "NIST SP eight hundred dash fifty-three" | "NIST SP 800-53" / "NIST SP 800 53" | NIST SP 800-53 |
| "ISO nine thousand one colon twenty fifteen clause seven point two" | "ISO 9001:2015 clause 7.2" / "ISO 9001 2015 clause 7.2" | ISO 9001:2015 clause 7.2 |
| "E eleven point nine" | "E 11.9" / "E eleven point nine" | E11.9 |
| "HIPAA forty-five CFR one sixty-four point three oh eight" | "HIPAA 45 CFR 164.308" / "HIPAA 45 CFR 164 point 308" | 45 CFR § 164.308 |
| "NFPA seventy E article one thirty" | "NFPA 70E article 130" / "NFPA 70-E Article 130" | NFPA 70E Article 130 |
Glossary injection result
Glossary injection alone is not the right tool for this category. The problem is not word-level substitution but format conversion — the spoken form and the written form are structurally different. The most effective approach is a post-processing text-normalisation pass:
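- A regex that matches CFR citation patterns, with or without the word "Section" and with the separator written as a period or spoken as "dot" or "point" ("(\d+) CFR (?:[Ss]ection )?(\d+)(?:\.| dot | point )(\d+)"), and normalises to canonical form ("$1 CFR § $2.$3").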
- A regex for NIST SP, ISO, IEC, and ASTM patterns.
- A lookup for the 30–50 most-frequently-cited codes in your training content, stored as spoken-form → written-form pairs.
Include the major citation stems in the glossary prompt ("29 CFR," "NIST SP 800," "ISO 9001") to anchor the model around the correct numeric sequence, then normalise the format in post-processing. This combined approach handles 90–95% of citation-format errors.
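The CFR normalisation described above can be sketched in a few lines. The pattern is one plausible way to write it, not a complete citation grammar: it covers the "Section" and "dot"/"point" variants from the table and leaves subsection suffixes such as "(c)(1)" untouched.

```python
import re

CFR = re.compile(
    r"\b(\d+)\s+CFR\s+(?:(?:section|§)\s*)?(\d+)\s*(?:\.|dot|point)\s*(\d+)\b",
    re.IGNORECASE,
)


def normalise_cfr(text: str) -> str:
    """Rewrite CFR citations to the canonical 'TITLE CFR § PART.SECTION' form."""
    return CFR.sub(r"\1 CFR § \2.\3", text)


for raw in [
    "29 CFR Section 1910 dot 147",
    "29 CFR 1910.147",
    "HIPAA 45 CFR 164 point 308",
]:
    print(normalise_cfr(raw))
```

Because an already-canonical citation matches the same pattern and is rewritten to itself, the pass is safe to run more than once.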
Category 6: Financial codes and instrument names
What it is
Securities identifiers, financial regulation acronyms, market-infrastructure codes, and interest-rate benchmark names. Concentrated in banking compliance training and financial-services compliance more broadly — FINRA Series licensing CE, AML/BSA, capital adequacy, and consumer-compliance training modules.
Why ASR fails
Financial codes fall into two failure sub-types:
- Initialisms with non-obvious pronunciation conventions: SOFR is spoken "SO-fur" by market participants. ASR models familiar with the words "soccer" and "sofa" will frequently produce one of those. LIBOR ("LIE-bore") is common enough to be handled correctly. CECL ("SEE-sul") maps acoustically to the common English name "Cecil," and the model outputs "Cecil" consistently. CCAR ("SEE-car") maps to "scar." DFAST ("DEE-fast") maps to "deaf" or "daft." These false cognates are particularly damaging in compliance training because a regulator reading the caption file will see an error in a key regulatory term.
- Alphanumeric codes with spoken-form ambiguity: CUSIP numbers (nine alphanumeric characters) cannot be pronounced as words; they come up when an instructor references a specific instrument and reads the identifier aloud character by character. ISIN codes (the name itself is spoken "I-S-I-N" or "eye-sin") have the same problem. In both cases the model must decide whether to output the individual characters or collapse them into a word.
Representative failure examples
| Term | ASR output | Correct |
|---|---|---|
| SOFR | "soccer" / "sofa" / "SOFR" | SOFR |
| CECL | "Cecil" (consistently) | CECL |
| CCAR | "scar" / "CCAR" | CCAR |
| DFAST | "daft" / "D-fast" | DFAST |
| TLAC | "T-LAC" / "t-lack" / "TLAC" | TLAC |
| FinCEN | "fin sin" / "thin skin" / "fin-KEN" | FinCEN |
| OFAC | "oh-fax" / "O-FAC" / "OFAC" | OFAC |
| NSFR | "N-S-F-R" / "NSFR" | NSFR |
| CET1 | "set one" / "C-E-T-1" | CET1 |
| Basel III | "Basil III" / "Basel III" | Basel III |
Glossary injection result
Works well for terms where the phonetic form is close to the correct spelling (OFAC, TLAC, SOFR) once the glossary biases the prior. Works poorly for false-cognate pairs where the general-English word is much more frequent than the financial term (CECL/Cecil, CCAR/scar, Basel/Basil). For false-cognate pairs, include the term in the glossary AND add it to a post-processing substitution list: "Cecil → CECL" in financial contexts. Context detection (check if the surrounding words include "allowance," "current expected," "loss reserve") can disambiguate before substituting.
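A context-gated substitution list along those lines might look like this. The cue phrases and the 60-character window are illustrative assumptions and would need tuning against real transcripts.

```python
import re

# Hypothetical rules: misheard form -> (canonical term, context cues).
RULES = {
    "cecil": ("CECL", {"allowance", "current expected", "loss reserve", "credit loss"}),
    "scar": ("CCAR", {"stress test", "capital plan", "federal reserve"}),
    "basil": ("Basel", {"capital requirements", "tier 1", "liquidity coverage"}),
}
WINDOW = 60  # characters of context inspected on each side of the match

_PATTERN = re.compile(r"\b(" + "|".join(RULES) + r")\b", re.IGNORECASE)


def fix_false_cognates(text: str) -> str:
    """Substitute the financial term only when a context cue appears nearby."""

    def replace(match: re.Match) -> str:
        canonical, cues = RULES[match.group(0).lower()]
        start, end = match.span()
        context = (text[max(0, start - WINDOW):start] + " " + text[end:end + WINDOW]).lower()
        return canonical if any(cue in context for cue in cues) else match.group(0)

    return _PATTERN.sub(replace, text)


print(fix_false_cognates("Under Cecil, the allowance reflects current expected losses."))
print(fix_false_cognates("My uncle Cecil grows basil in his garden."))
```

The second call is left unchanged because no cue appears near either word, which is the point of gating on context rather than substituting blindly.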
Category 7: Medical eponyms and clinical nomenclature
What it is
Disease names, surgical procedures, anatomical structures, scoring systems, and clinical syndromes that bear the names of the physicians or researchers who described them. Common in medical training, nursing education, continuing medical education (CME), and HealthStream/Relias module content.
Why ASR fails
Eponymous medical terms are predominantly surnames from German, French, Scottish, Japanese, and other non-English origins that entered English medical vocabulary in the 19th and early 20th centuries. Creutzfeldt-Jakob, Buerger's disease, Paget's disease, Dupuytren's contracture, Virchow's triad — these names have low frequency in general English but high frequency in medical training audio. The phoneme sequences corresponding to these names do not match any high-frequency English words closely enough for the model to produce a plausible substitution, so it typically produces a phonetic expansion that is wrong in both content and format.
Clinical scoring systems present a related problem: APACHE II score, SOFA score, Child-Pugh score, CHADS₂, CHA₂DS₂-VASc, HAS-BLED. These combine acronyms with subscript numerals that are spoken but not intuitively formatted. CHA₂DS₂-VASc is spoken "CHADS two VASC" and must appear as CHA₂DS₂-VASc in the transcript, which is a format conversion problem similar to regulatory citations.
Representative failure examples
| Term | ASR output | Correct |
|---|---|---|
| Creutzfeldt-Jakob disease | "Croits-felt Yak-ob" / "Kreuts-felt Jacob" | Creutzfeldt-Jakob disease |
| Dupuytren's contracture | "Dupe-ee-tren's" / "Dupuy-trin's" | Dupuytren's contracture |
| Virchow's triad | "Veer-kos triad" / "Virco's triad" | Virchow's triad |
| CHA₂DS₂-VASc score | "CHADS 2 VASC" / "Chad-S-VASc" | CHA₂DS₂-VASc |
| APACHE II score | "Apache II" / "Apache 2 score" | APACHE II score |
| Buerger's disease | "Burger's disease" / "Burgers disease" | Buerger's disease |
| Whipple procedure | "Whipple procedure" (usually correct) | Whipple procedure (OK) |
| Bundle of His | "Bundle of his" / "Bundle of hiss" | Bundle of His |
Glossary injection result
Works moderately well for eponyms where the phonetic form is consistent (Virchow, Dupuytren). Works poorly for compound German-origin names (Creutzfeldt-Jakob) where the phoneme sequence is genuinely rare in any English-language corpus. For the most problematic eponyms, include the surname alone in the glossary (just "Creutzfeldt" forces the model to attempt the correct token sequence even if it fails to fully reproduce it) and then use a human-review flag on the segment. Scoring system formats (CHA₂DS₂-VASc, APACHE II) need a post-processing normalisation pass to convert from spoken form to standard clinical notation.
Category 8: Person names (speakers, executives, researchers)
What it is
Names of individuals referenced in training video: industry executives whose names appear in product-strategy or sales training, researchers cited in scientific or compliance training, historical figures referenced in ethics or regulatory-history modules. Appears across all verticals but is most prominent in sales enablement, executive communications training, and academic or research-oriented content.
Why ASR fails
Executives and researchers worldwide carry names that are far more linguistically diverse than the names in the training data. English-language training data is heavily weighted toward Anglo-Saxon, Romance-language, and common Slavic surname patterns. East Asian (Chinese, Japanese, Korean), South Asian (Indian), Middle Eastern, and African surnames appear at substantially lower frequencies in the ASR training corpus, even for globally prominent individuals. The result is systematic misrecognition of names from non-dominant phonological backgrounds.
A secondary failure mode is name-length mismatch: some prominent individuals are known by a shortened or informal version of their name in speech but a formal version in text. "Andy" (Jassy) vs. "Andrew Jassy." "Sundar" vs. "Sundar Pichai." The decoder may produce the correct first name but an incorrect last name if the surname is low-frequency.
Representative failure examples
| Person | ASR output | Correct |
|---|---|---|
| Satya Nadella | "Satya Nadela" / "Satya Nah-della" | Satya Nadella |
| Jensen Huang | "Jensen Wang" / "Jensen Wong" | Jensen Huang |
| Yoshua Bengio | "Joshua Ben-jio" / "Yoshua Ben-joe" | Yoshua Bengio |
| Yann LeCun | "Yan La-Cun" / "Yan Le-Kuhn" | Yann LeCun |
| Sundar Pichai | "Sundar Pick-eye" / "Sundar Pi-chay" | Sundar Pichai |
| Arvind Krishna | "Arvind Krish-na" (usually correct) | Arvind Krishna |
| Dario Amodei | "Dario Amo-day" / "Dario Amor-day" | Dario Amodei |
Glossary injection result
Works well for person names because the full name (first + last) is a compact token sequence. Include the names of executives, researchers, or historical figures referenced more than twice in your content. The model will generalise once primed — if you include "Satya Nadella," the decoder will correctly produce "Nadella" in subsequent mentions even without the first name. For any individual who appears repeatedly in a series of training videos, include their name in the standing glossary rather than reconstructing it per-session.
Category 9: Geographic names and institutional names
What it is
City and country names outside the mainstream English-language geographic corpus, names of non-anglophone institutions (universities, hospitals, regulatory bodies in non-English-speaking countries), and names of specific government agencies or office locations referenced in training. Most prominent in international-compliance training, cross-border legal-entity training, and content designed for globally distributed teams.
Why ASR fails
Well-known city names (Tokyo, Paris, Beijing) appear with high frequency in English training data and are handled correctly. Less-known city names in non-English phonological systems (Guadalajara is handled correctly; Wolmers, Ouagadougou, or Kharkiv may not be) fail in proportion to their rarity in English text. Institution names are a compound problem: university names in German (Technische Universität München), Japanese (Waseda Daigaku), or other languages appear in English training data only in transliterated form, and the transliteration quality varies.
Names of regulations and regulatory bodies that are initialisms of non-English phrases present a specific failure. DSGVO (Datenschutz-Grundverordnung, the German name for GDPR) is rarely spoken in full in an English-language training context: it is typically read as its German initialism or replaced by "GDPR." The model may produce "DSG-VO," "DSGVO," or silently substitute "GDPR."
Glossary injection result
Include non-English geographic and institutional names that appear more than twice in your training content. For institution names, include the most common English short form (TU Munich vs. Technische Universität München). For non-English regulatory body names, add the English translation equivalent to the glossary alongside the original acronym so the model sees both forms in context. Most well-known international cities will not need glossary support — focus on names that are genuinely rare in English-language text.
Category 10: Operator codes, keyboard shortcuts, and error codes
What it is
Keyboard shortcuts (Ctrl+Alt+Del, Cmd+Shift+P, Alt+F4), Unix/Windows error codes (EACCES, ENOMEM, 0xC0000005), HTTP status codes in context ("the API returns 403 Forbidden"), and system operator codes (power-plant SCADA alarm codes, aviation transponder codes). Prominent in engineering onboarding and cybersecurity training.
Why ASR fails
Operator codes and keyboard shortcuts are mixed speech-and-visual constructs. Instructors speak them in one of two ways: by reading the keys (Ctrl+Alt+Del → "control alt delete") or by reading the code literally (EACCES → "E-A-C-C-E-S" letter by letter, or "e-access"). Neither spoken form is consistent across instructors, and neither maps to a standard ASR output format. The model may produce "control alt delete," "CTRL+ALT+DEL," "Ctrl+alt+del," or various mixed-case permutations — all phonetically correct but none conforming to the standard format expected in technical documentation.
Hexadecimal error codes (0xC0000005 — the Windows access violation exception) are spoken as a mixture of "zero x" or "hex" followed by the hex digits. The decoder may produce "0xC0000005," "0x C0000005," "hex C0000005," or other variants. These errors are difficult to normalise in post-processing because the same hex digit string may appear in many different contexts.
Glossary injection result
Limited effectiveness for this category. Keyboard shortcuts should be normalised in post-processing using a lookup table of spoken-form → written-form mappings (control alt delete → Ctrl+Alt+Del; command shift P → Cmd+Shift+P). Common error codes that appear in training video should go in the glossary in their canonical form, with a matching lookup entry that maps the spoken form (key) to the canonical code (value). For hex codes, a post-processing regex that matches "zero x" or "0x" followed by hex digits and normalises the casing and spacing is more reliable than glossary injection.
Category 11: Part numbers, model numbers, and SKUs
What it is
Hardware model numbers (Dell PowerEdge R740xd, HP EliteBook 840 G8), AWS and cloud instance types (m5.xlarge, g4dn.12xlarge), software version numbers (Python 3.11.4, Node.js 20.9.0 LTS), and retail SKUs. Appears in IT asset management training, procurement training, sales enablement, and IT infrastructure onboarding.
Why ASR fails
Alphanumeric strings that mix letters, numbers, and punctuation are never spoken as units in general-language training data. An instructor who says "the PowerEdge R740xd" is speaking a compound word followed by an alphanumeric code that the decoder has almost certainly never seen as a unit in training. The decoder will try to parse "R740xd" as a combination of letter names and numbers: "R-740-X-D" or "R seven forty X-D." Cloud instance types are worse: "m5.xlarge" spoken as "M five dot X-large" may come out as "M5.xlarge," "M 5 X large," or "M-five extra-large" depending on how the decoder interprets the alphanumeric sequence.
Representative failure examples
| Spoken | ASR output | Correct |
|---|---|---|
| "R740xd" | "R-740-X-D" / "R seven forty XD" | R740xd |
| "m5.xlarge" | "M5 X large" / "M-5 extra large" | m5.xlarge |
| "g4dn.12xlarge" | "G4 DN dot 12 X large" / "G4-DN twelve extra large" | g4dn.12xlarge |
| "Python 3.11.4" | "Python 3.11.4" (usually correct) | Python 3.11.4 |
| "Node.js 20.9.0 LTS" | "Node JS 20.9.0 LTS" / "node.js 20.9 LTS" | Node.js 20.9.0 LTS |
| "SKU B07XQXWZ4G" | "SKU B07 X-Q-X-W-Z-4-G" (letter-by-letter) | SKU B07XQXWZ4G |
Glossary injection result
Works for specific model numbers and instance types that appear repeatedly in training content (add "R740xd," "m5.xlarge," "g4dn.12xlarge" to the glossary for content that covers those specific resources). Does not scale to catalogues of hundreds of SKUs — the 224-token glossary budget would be exhausted. For content with many different part numbers, a post-processing normalisation pass that recognises alphanumeric strings matching the pattern of your product catalogue is more practical. Version numbers (Python 3.11.4) are usually handled correctly by the model without glossary support.
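One way to sketch the catalogue-driven pass is to compare transcript word windows against catalogue entries after stripping punctuation and case from both. The function name, the `max_window` default, and the handling of the spoken word "dot" are all assumptions made for illustration.

```python
import re


def _key(s: str) -> str:
    """Collapse to lowercase letters and digits so 'M5 X large' equals 'm5.xlarge'."""
    return re.sub(r"[^0-9a-z]", "", s.lower())


def normalise_part_numbers(text: str, catalogue: list[str], max_window: int = 6) -> str:
    """Replace runs of words that spell a catalogue entry with the canonical form."""
    lookup = {_key(p): p for p in catalogue}
    words = text.split()
    out, i = [], 0
    while i < len(words):
        # Try the longest window first so a full code wins over a fragment of it.
        for n in range(min(max_window, len(words) - i), 0, -1):
            chunk = words[i:i + n]
            # A spoken "dot" may sit inside a code but should not start or end one.
            if _key(chunk[0]) == "dot" or _key(chunk[-1]) == "dot":
                continue
            key = _key("".join(w for w in chunk if w.lower() != "dot"))
            if key in lookup:
                # Keep sentence punctuation attached to the last word.
                trailing = re.search(r"[.,;:!?]*$", chunk[-1]).group(0)
                out.append(lookup[key] + trailing)
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

The sketch does not convert number words ("seven forty") or expansions such as "extra large", and it collapses runs of whitespace. Both would need handling before production use.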
Category 12: Certification codes and qualification designations
What it is
Professional certification designations and their associated exam codes: CISSP, CISM, OSCP, PMP, CKA (Certified Kubernetes Administrator), CompTIA A+, AWS SAA-C03, PCNSE, CPA, CFA, FINRA Series 7. Heavily present in professional-development training, IT certification prep content, and cybersecurity training.
Why ASR fails
Certification codes have pronunciation conventions that are domain-specific and inconsistent. CISSP is spoken "sis-P" by security professionals — the model may produce "sips," "CISP," or "CISSP" depending on context. CompTIA is spoken "comp-TEE-uh" — a portmanteau that has no obvious pronunciation from spelling and that the model may produce as "Comptia," "comp-tee-uh," or "Comp-tia." AWS exam codes like "SAA-C03" ("S-A-A C zero three") present the same format-conversion problem as other alphanumeric codes. FINRA Series 7 is spoken naturally and handled correctly. CPA and CFA are common enough to handle correctly.
Glossary injection result
Works well for the specific certifications that appear in your training content. Include the full expansion alongside the abbreviation in your glossary: "CISSP (Certified Information Systems Security Professional)" — seeing both forms in the prompt context helps the model match the spoken form to the abbreviation. CompTIA should always be in the glossary for any content that mentions CompTIA certifications. Exam codes (SAA-C03) are better handled by a post-processing lookup of known exam codes than by glossary injection.
Category 13: CLI commands and code tokens
What it is
Terminal commands, shell syntax, programming language keywords, and function names spoken by an instructor while demonstrating or explaining code. Concentrated in engineering onboarding, developer training, and cybersecurity training where terminal walkthroughs are standard content.
Why ASR fails
CLI commands mix natural-language words with code syntax in ways that are unique to technical instruction. "Run kubectl apply dash F deployment dot yaml" is a sentence that no general-language training corpus contains as a unit. The instructor speaks the flag ("-f") as "dash F" and the model must decide between outputting "dash F," "-f," "-F," or "dash f." The file extension (".yaml") is spoken as "dot yaml" and may come out as ".yaml," ".YAML," ".yml," or "dot yaml" (prose form).
Variable names and function names spoken mid-sentence create a camelCase normalisation problem. "Set the userAuthToken variable" should produce "userAuthToken" but the model may output "user auth token" (separated, lowercase). "Throw a TypeError" should produce "TypeError" but may produce "type error" or "type Error." The audio carries no signal about whether the instructor means a camelCase identifier or separate words.
Representative failure examples
| Spoken | ASR output | Correct |
|---|---|---|
| "kubectl apply -f deployment.yaml" | "cube control apply -F deployment dot yaml" / "kubectl apply dash F deployment.yaml" | kubectl apply -f deployment.yaml |
| "git rebase --interactive HEAD~3" | "git rebase dash-dash interactive HEAD tilde 3" | git rebase --interactive HEAD~3 |
| "docker build -t myimage:latest ." | "docker build -T my image colon latest dot" / "docker build -t my image latest dot" | docker build -t myimage:latest . |
| "the userAuthToken variable" | "the user auth token variable" / "the user-auth-token variable" | the userAuthToken variable |
| "throw a NullPointerException" | "throw a null pointer exception" / "throw a NullPointerException" | throw a NullPointerException |
| "the .env file" | "the dot env file" / "the .env file" (inconsistent) | the .env file |
Glossary injection result
Works well for command-name tokens (kubectl, docker, terraform, git — include these as they are the anchor of the rest of the command). Works poorly for flag syntax (-f, --interactive, -t) because flags are short and the model may handle them inconsistently regardless of the glossary. A code-block-aware post-processing pass is the right tool for this category: when a segment contains a spoken command sequence, apply a normalisation pass that converts "dash dash" → "--", "dash" (before a single letter) → "-", and "dot yaml" → ".yaml". Combine with a case normalisation pass for camelCase identifiers you can predict.
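The conversions named above can be sketched as a few ordered regex substitutions. This is a sketch under assumptions: the function name is hypothetical, flags are lower-cased (which is wrong for commands that use upper-case flags), only a few file extensions are handled, and the pass is meant to run only on segments already identified as containing a command.

```python
import re


def normalise_cli(text: str, identifiers: tuple[str, ...] = ()) -> str:
    """Convert spoken command syntax to written form in a command segment."""
    # "dash dash interactive" / "dash-dash interactive" -> "--interactive"
    text = re.sub(r"\bdash[- ]dash\s+(?=[A-Za-z])", "--", text, flags=re.IGNORECASE)
    # "dash F" (single letter) -> "-f"
    text = re.sub(
        r"\bdash\s+([A-Za-z])\b",
        lambda m: "-" + m.group(1).lower(),
        text,
        flags=re.IGNORECASE,
    )
    # "deployment dot yaml" -> "deployment.yaml"
    text = re.sub(
        r"\s+dot\s+(ya?ml|json)\b",
        lambda m: "." + m.group(1).lower(),
        text,
        flags=re.IGNORECASE,
    )
    # Restore predictable camelCase identifiers from their spoken word sequence.
    for ident in identifiers:
        parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", ident)
        pattern = r"\b" + r"[\s-]*".join(re.escape(p) for p in parts) + r"\b"
        text = re.sub(pattern, lambda m, i=ident: i, text, flags=re.IGNORECASE)
    return text
```

Passing `identifiers=("userAuthToken",)` rewrites "user auth token" and "user-auth-token" alike. Only list identifiers you expect in the video, because a name like "TypeError" would also rewrite a genuine prose "type error".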
Category 14: Regulatory body and program names
What it is
Short regulatory body initialisms and acronyms, typically three to six letters: US financial (FINRA, FinCEN, OFAC, FFIEC, CFPB, NCUA, OCC, PCAOB), US health (CMS, HHS, OIG, HRSA, AHRQ), international (BIS, IOSCO, BCBS, ESMA, EBA, FINMA), and multi-sector (OSHA, EPA, EEOC, DOL). The failure mode is distinct from general acronyms (Category 2) because regulatory body names have institution-specific pronunciation conventions that general-language models have not learned at sufficient frequency. Prominent in banking compliance, government employee training, and cybersecurity compliance training.
Why ASR fails
Short initialisms and acronyms are acoustically ambiguous. OFAC ("O-FAC") sounds like "oh fax" or "oh fact." FinCEN ("fin-KEN") sounds like "fin sin," "thin skin," or "finsen." PCAOB ("pee-KAY-ob") sounds like "pea-cave" or "pee-cave." The model does not have a strong prior for these pronunciations because regulatory training documents appear in text form far more often than in audio form: the model has seen the text "FinCEN" many times, but has seen the phonetic form "fin-KEN" mapped to that spelling far fewer times.
Agency names such as OSHA (spoken as a word) and EPA and EEOC (spelled out letter by letter) are handled correctly because they appear at high enough frequency in general-language training data. The failure concentration is in agencies with unusual pronunciation conventions or lower general-language frequency.
Representative failure examples
| Agency | ASR output | Correct |
|---|---|---|
| FinCEN | "fin sin" / "thin skin" / "finn-KEN" | FinCEN |
| OFAC | "oh-fax" / "O-FAC" | OFAC |
| PCAOB | "pea-cave" / "pee-KAY-ob" | PCAOB |
| FFIEC | "F-F-I-E-C" / "FFIEC" | FFIEC |
| NCUA | "N-C-U-A" / "NCUA" | NCUA |
| HRSA | "H-R-S-A" / "HRSA" | HRSA |
| BCBS (Basel Committee) | "B-C-B-S" / "BCBS" | BCBS |
| ESMA | "EZ-ma" / "ESMA" | ESMA |
| OSHA | "OSHA" (correct — common enough) | OSHA |
| EPA | "EPA" (correct) | EPA |
Glossary injection result
Works well. Include every regulatory body name that appears in your training content. For multi-agency content (a federal contractor compliance curriculum might reference OSHA, EPA, EEOC, DOL, OFAC, FFIEC, OIG, and more), a standing regulatory-body glossary for your sector is worth building and reusing. The glossary for a bank's compliance training will be 10–15 regulatory body names; for a federal contractor it may be 20–30. All are worth including — the glossary is not a per-video artifact but a standing domain-knowledge resource.
Category 15: Non-English terms and domain-specific loanwords
What it is
Technical loanwords from Japanese (lean/Six Sigma terminology), German (regulatory and quality-management terms), French (legal terminology), Latin (medical, legal, and scientific), and other languages that appear as specialised vocabulary in English-language training. Most prominent in manufacturing training (lean/Six Sigma), compliance training (Latin and French legal terms), and medical training (Latin anatomical and pharmacological terms).
Why ASR fails
This category is more nuanced than the others. Common loanwords that have fully entered English usage (kanban, kaizen, poka-yoke, in vitro, prima facie, force majeure) are handled correctly in most cases because they appear at sufficient frequency in English-language text that the model has learned their pronunciation. The failure concentration is in loanwords that are common in a specific domain but rare in general English.
Japanese lean terminology presents an interesting gradient: kanban and kaizen appear correctly; gemba ("genba" is the correct romanisation but "gemba" is the industry standard) may appear as "gender" or "gamba"; heijunka ("hay-JUNE-kah") may appear as "hedgehog" or "heyunka"; jidoka ("jih-DOH-kah") may appear as "judoka" (a different Japanese word for judo practitioner). Muda/muri/mura are short enough that the model usually handles them. The Six Sigma Greek letter designations (sigma, mu, alpha, beta) are handled correctly as common English borrowings.
German regulatory vocabulary presents a more severe problem: DSGVO, BFSG, BFSGV — these are German-language acronyms that appear in training content for EU-facing teams, spoken as German initialisms even in English context. The model has no reliable mapping for these and will typically output a phonetic expansion or substitute a similar English phoneme cluster.
Representative failure examples
| Term | Language | ASR output | Correct |
|---|---|---|---|
| heijunka | Japanese | "hedgehog" / "heyunka" / "hay-junka" | heijunka |
| jidoka | Japanese | "judoka" / "jidoka" (inconsistent) | jidoka |
| andon (cord) | Japanese | "andon" (usually correct) | andon |
| DSGVO | German | "D-S-G-V-O" / "dis-give-oh" | DSGVO |
| BFSG | German | "B-F-S-G" / "bef-sig" | BFSG |
| inter alia | Latin | "inter alia" (usually correct) | inter alia |
| res ipsa loquitur | Latin | "res ipsa low-kwit-ur" / "res ipsa loquitur" (inconsistent) | res ipsa loquitur |
| in vitro | Latin | "in vitro" (correct) | in vitro |
| kanban | Japanese | "kanban" (correct) | kanban |
| poka-yoke | Japanese | "poca yoke" / "poka-yoke" (inconsistent) | poka-yoke |
Glossary injection result
Works well for terms that appear in Whisper's training data at low-but-nonzero frequency (heijunka, jidoka, poka-yoke — these appear in English-language manufacturing and quality-management documents). Works poorly for terms that are essentially absent from English-language training data (DSGVO, BFSG — include these in the glossary and expect 50–60% improvement, with residual errors requiring human review). For Latin legal and medical terms that are common in your domain, include them in the standing glossary; for rare Latin terms, include them in the per-video prompt.
Per-vertical breakdown: which categories appear where
The table below maps each failure category to the verticals where it dominates. Use it to prioritise your glossary-building effort — focus on the categories that are highest-frequency in your specific training content.
| Vertical | Dominant categories | Glossary scope | Reference page |
|---|---|---|---|
| Engineering onboarding | 1 (platform names), 2 (acronyms), 10 (operator codes), 11 (part numbers), 13 (CLI commands) | Platform stack names, API and SDK acronyms, kubectl/docker/terraform commands, instance types | Engineering onboarding captions |
| Medical / healthcare | 3 (drug INNs), 5 (regulatory citations), 7 (eponyms), 8 (person names), 14 (agency names) | Drug INNs + brand names, procedure names, HIPAA/CMS citation formats, agency names (CMS, OIG, HRSA) | Medical training captions, HIPAA training captions |
| Financial services | 2 (acronyms), 5 (regulatory citations), 6 (financial codes), 14 (agency names) | SOFR, CECL, CCAR, FinCEN, OFAC, FINRA Series codes, CFR citation formats | Banking compliance captions |
| Manufacturing / EHS | 4 (chemical names), 5 (regulatory citations), 12 (certifications), 15 (loanwords) | IUPAC chemical names, OSHA CFR citations, lean/Six Sigma Japanese terms, ISO 9001 citations | Manufacturing training captions, Safety training captions |
| Construction safety | 5 (regulatory citations), 8 (person names in standards), 14 (agency names) | OSHA 29 CFR citation formats, ANSI/ASTM standards, SSHO/competent-person terminology | Construction safety training captions |
| Cybersecurity | 1 (platform names), 2 (acronyms), 5 (regulatory citations), 12 (certifications), 13 (CLI), 14 (agency names) | SIEM/SOAR/XDR/EDR, MITRE ATT&CK technique IDs, CVE numbering, CISSP/OSCP/CompTIA, NIST SP citations, CMMC/PCI citations | Cybersecurity training captions |
| Government / federal | 2 (acronyms), 5 (regulatory citations), 9 (institutional names), 14 (agency names) | 440+ agency abbreviations, OMB M-memo formats, NIST SP 800-53 control IDs, CFR Title 5 citations | Government employee training captions |
| FDA-regulated | 3 (drug names), 4 (chemical names), 5 (regulatory citations), 14 (agency names) | 21 CFR citations, IND/NDA filing terminology, GxP vocabulary (cGMP, GCP, GLP), drug names | FDA-regulated training captions |
| Sales enablement | 1 (platform names), 2 (acronyms), 8 (person names), 11 (part numbers / SKUs) | CRM names (Salesforce, HubSpot), competitor product names, executive names referenced in training | Sales enablement captions |
| Compliance (general) | 2 (acronyms), 5 (regulatory citations), 14 (agency names), 15 (loanwords) | EEOC/DOL/OSHA agency names, Title VII/ADA/ADEA citation formats, Latin legal terms | Compliance training captions |
Where in the pipeline to intervene
The 15 categories require intervention at different stages of the captioning pipeline. Trying to fix all categories with a single tool will fail. A complete solution uses three intervention points:
Stage 1: Decoder-side glossary injection (before transcription)
Pass a domain glossary as the Whisper prompt. This is the primary intervention for categories 1, 2, 3, 6, 7, 8, 9, 12, 14, and 15: all cases where the correct word exists in the model's vocabulary but needs its prior probability elevated. Category 4 (chemical names) also starts here but needs the Stage 2 pass as well. The prompt is constructed at transcription time and is specific to the content type (medical, engineering, financial). Maintain a standing domain glossary and a per-video extension for content-specific terms.
Technical implementation: the Whisper initial_prompt parameter accepts a string of up to 224 tokens. Structure the glossary as a comma-separated list of domain terms, not a sentence. Terms that tokenise compactly, ideally to a single token, use the budget more efficiently than long expansions. See the engineering implementation post for the prompt-construction algorithm and the 224-token budget management.
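As a rough illustration, merging the standing glossary with the per-video extension and handing the result to Whisper might look like the sketch below. It assumes the open-source `openai-whisper` package; `build_prompt` and `transcribe_with_glossary` are hypothetical helper names, and this is not the prompt-construction algorithm from the companion post.

```python
def build_prompt(standing: list[str], per_video: list[str]) -> str:
    """Merge standing and per-video terms into one comma-separated prompt.

    Duplicates are dropped case-insensitively; the first spelling seen wins.
    """
    seen, terms = set(), []
    for term in [*standing, *per_video]:
        cleaned = term.strip()
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            terms.append(cleaned)
    return ", ".join(terms)


def transcribe_with_glossary(audio_path: str, prompt: str, model_name: str = "large-v3"):
    """Transcribe one file with the glossary passed as initial_prompt."""
    # Imported lazily so build_prompt can be used without the package installed.
    import whisper

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, initial_prompt=prompt)
```

For example, `build_prompt(["FinCEN", "OFAC"], ["CECL"])` gives the string `FinCEN, OFAC, CECL`, which is then passed as `prompt`.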
Stage 2: Post-processing text normalisation (after transcription, before export)
A regex-based normalisation pass that converts spoken-form patterns to canonical written forms. This is the primary intervention for categories 5 (regulatory citations), 10 (operator codes), 11 (part numbers), and 13 (CLI commands). The normalisation pass:
- Converts CFR citation spoken forms to § notation (regex: match "(\d+) CFR [Ss]ection (\d+)[.](\d+)", output "$1 CFR § $2.$3").
- Normalises keyboard shortcuts (match "control alt delete", output "Ctrl+Alt+Del").
- Fixes initialism capitalisation (match case-insensitive "json", output "JSON"; match "yaml", output "YAML").
- Converts command-flag spoken forms (match "dash dash" before an identifier, output "--").
- Substitutes known false-cognate financial terms (match "Cecil" in financial context, output "CECL").
The normalisation pass must be context-aware: "Cecil" should become "CECL" only in financial training content, not in training videos about a person named Cecil.
Stage 3: Human review of low-confidence segments
Whisper produces per-token log-probability scores that can be used to flag low-confidence segments for human review. A segment where the mean log-probability falls below a threshold (typically around -0.3 to -0.5, depending on the content) is likely to contain a substitution error that glossary injection did not catch. Route these segments to a reviewer with domain knowledge — not a general transcription reviewer, but someone who knows the domain vocabulary. For medical training, that is a clinical educator; for banking compliance, it is a compliance officer. The reviewer's corrections should also feed back into the glossary — if a segment was flagged because the model substituted "Cecil" for "CECL," add "CECL" to the glossary for that domain.
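A small sketch of the flagging step, assuming segment dictionaries shaped like those the open-source `openai-whisper` package returns under `result["segments"]`, each with an `avg_logprob` field. The function name and the -0.4 default are illustrative; the threshold should be tuned per content type.

```python
def flag_low_confidence(segments: list[dict], threshold: float = -0.4) -> list[dict]:
    """Return the segments whose mean token log-probability is below threshold."""
    return [
        {
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),
            "avg_logprob": seg["avg_logprob"],
        }
        for seg in segments
        if seg["avg_logprob"] < threshold
    ]
```

The returned list carries the start and end times, so a review tool can jump the domain reviewer straight to the audio for each flagged segment.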
This three-stage pipeline is what the LMS caption ingestion engineering post calls the "generate → normalise → verify" sequence. The generate stage uses glossary injection, the normalise stage applies the post-processing pass, and the verify stage flags low-confidence segments for human review before the file is uploaded to the LMS. For a detailed breakdown of the quality-measurement protocol, see why 99% caption accuracy matters.
Building and maintaining your glossary
What to include
Your glossary should contain every proper noun in your training content that falls into one of the 15 categories above. A practical starting point:
- Every product name, platform name, and software tool mentioned in your training content (Category 1).
- Every domain initialism that is not a common English word in its own right (Category 2 — exclude API, SQL, HTML; include SCIM, OIDC, OAuth, RBAC).
- Every drug name (INN and brand) if you produce medical or pharmaceutical training (Category 3).
- Every chemical name that appears in your SDS training (Category 4).
- The 10–20 regulatory citations most frequently referenced in your domain (Category 5 — as citation stems, not full citations).
- Every financial code, benchmark name, or instrument type your compliance training references (Category 6).
- Every medical eponym, scoring system, or clinical procedure name (Category 7).
- Every person name referenced more than twice across your content (Category 8).
- Every certification name and code that appears in your training (Category 12).
- Every regulatory agency name that is not a common English word (Category 14 — exclude EPA, OSHA as they're common enough; include FinCEN, OFAC, PCAOB).
- Every domain-specific loanword that is not common in general English (Category 15 — include heijunka, jidoka, DSGVO; exclude kanban, kaizen as they're common enough).
What not to include
Do not include common English words even when they have technical meanings in your domain. "Database," "server," "network," "process," "role," "policy," "record," "account" — these are all technical terms in IT, but they appear at high frequency in general English and the model handles them correctly. Including them in the glossary wastes token budget and may introduce unexpected biasing on surrounding context. A glossary of 50–200 terms that are genuinely low-frequency is more effective than a glossary of 500 terms that includes common words alongside rare ones.
Glossary size and the token budget
The Whisper prompt window is 224 tokens. A token is roughly 3–4 characters in English. In a comma-separated glossary list, common terms use approximately 1–2 tokens each including the comma; rare domain terms often split into more sub-word tokens, so measure the prompt with the tokenizer rather than relying on the estimate. A well-constructed glossary of 80 terms fits in around 120–150 tokens, leaving headroom for a sentence-length context setter ("The following terms appear in this training: [glossary]"). Avoid exceeding 200 tokens in the prompt: the remaining 24 tokens are needed for the model's own initialisation overhead, and overflowing the window can cause degraded generation quality on the first few seconds of audio.
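The budget check can be sketched as a greedy packer that keeps terms in priority order until the limit is reached. `estimate_tokens` is a crude stand-in based on the characters-per-token rule of thumb above, not a real tokenizer, and both function names are hypothetical.

```python
import math
from typing import Callable


def estimate_tokens(text: str, chars_per_token: float = 3.5) -> int:
    """Crude token estimate; swap in a real tokenizer for production use."""
    return math.ceil(len(text) / chars_per_token)


def fit_to_budget(
    terms: list[str],
    max_tokens: int = 200,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> tuple[str, list[str]]:
    """Keep terms, in the order given, until the prompt would exceed max_tokens.

    Returns the prompt string and the terms that did not fit.
    """
    kept: list[str] = []
    for term in terms:
        candidate = ", ".join([*kept, term])
        if count_tokens(candidate) > max_tokens:
            break
        kept.append(term)
    return ", ".join(kept), terms[len(kept):]
```

If `openai-whisper` is installed, a more faithful counter would likely be `lambda s: len(get_tokenizer(multilingual=True).encode(s))` using `whisper.tokenizer.get_tokenizer`. Order the input list so the highest-value terms come first, since the tail is what gets dropped.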
Maintenance cadence
The standing glossary should be reviewed quarterly or whenever a major update to your product, regulatory environment, or drug formulary occurs. Per-video extensions should be added when a video covers content significantly different from the standing glossary. Terms that were added for a specific content series but no longer appear in new content can be retired — a smaller, more focused glossary outperforms a large, diffuse one.
Feed the human-review log back into the glossary. Every correction made by a domain reviewer is a data point: the term was either absent from the glossary (add it) or present but not effective (investigate whether the token representation exists in the model — if not, the term needs a post-processing workaround instead). The glossary vs. prompting vs. fine-tuning strategy post covers when glossary injection reaches its limits and when a stronger intervention (per-customer compounding glossary model or LoRA fine-tuning) is warranted.
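The triage described above can be sketched as a split of reviewer corrections into two queues. The record shape (a dict with a `corrected` field) and the function name are assumptions for illustration; a real review log will have its own schema.

```python
def triage_corrections(corrections: list[dict], glossary: list[str]) -> tuple[list[str], list[str]]:
    """Split corrected terms into 'add to glossary' and 'investigate' queues.

    A corrected term missing from the glossary is queued for addition. A term
    already in the glossary did not bias the decoder enough, so it is queued
    for investigation and a possible post-processing workaround.
    """
    known = {term.lower() for term in glossary}
    to_add: list[str] = []
    to_investigate: list[str] = []
    for record in corrections:
        term = record["corrected"]
        queue = to_investigate if term.lower() in known else to_add
        if term not in queue:  # each term appears once per queue
            queue.append(term)
    return to_add, to_investigate
```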
Measuring improvement across categories
The natural measurement protocol for this taxonomy is to audit a sample of your training content using the DCMP scoring protocol described in why 99% caption accuracy matters — then classify each error by which of the 15 categories it falls into. This tells you which categories are driving the most accuracy loss in your specific content and where to focus glossary and post-processing investment.
A typical audit of 10 minutes of technical training video without glossary support will find error concentrations in 3–5 of the 15 categories. Very few content types exhibit all 15 simultaneously. Engineering onboarding will concentrate in categories 1, 2, 10, 13. Medical content will concentrate in 3, 7, 14. Financial compliance will concentrate in 2, 5, 6, 14. Build the audit protocol into your quality-control workflow: every new content series should start with a 10-minute spot-audit to identify the dominant error categories before the full glossary is built.
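Tallying the spot-audit is simple enough to sketch in a few lines. The input format, a list of (observed error, category number) pairs produced by the reviewer, is an assumption, as is the function name.

```python
from collections import Counter


def dominant_categories(errors: list[tuple[str, int]], top_n: int = 3) -> list[tuple[int, int]]:
    """Return the top_n (category, error_count) pairs from an audit log."""
    counts = Counter(category for _, category in errors)
    return counts.most_common(top_n)
```

For a financial-compliance sample, the top entries would typically be some mix of categories 2, 5, 6, and 14, which tells you which glossary sections and normalisation rules to build first.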
Once you know your dominant categories, you can set realistic targets. Categories 1, 2, 3, 8, 12, 14, and 15 typically reach 95%+ accuracy with glossary injection alone. Categories 4 and 7 (long systematic names and eponyms) require glossary injection plus a post-processing pass and typically reach 90–95%. Categories 5, 10, 11, and 13 (citation formats, operator codes, part numbers, CLI commands) require primarily post-processing normalisation and typically reach 95%+ with a well-tuned normalisation pass. Category 6 (financial codes) requires both glossary injection for the phonetic false-cognate pairs and post-processing for format normalisation.
FAQ
Does upgrading to a larger Whisper model (medium → large-v3) fix the proper noun problem?
Partially, but not enough to close the compliance gap without domain adaptation. Larger models are trained on broadly the same kind of general-language data, just at higher capacity, so they perform better on proper nouns that appear at moderate frequency in that data (common product names, well-known drug names) but do not close the gap on genuinely rare terms (new regulatory codes, obscure chemical names, non-English proper nouns). The improvement from medium to large-v3 on general English is substantial (roughly 3–5% WER reduction on benchmark datasets); the improvement on domain-specific proper nouns is typically 1–2%, which is meaningful but not sufficient to replace domain adaptation. Glossary injection on large-v3 dramatically outperforms large-v3 without glossary and massively outperforms medium without glossary. Invest in glossary depth before model size.
Can I use phonetic respelling to help the model transcribe terms it consistently misfires on?
Yes, as a workaround for terms where the correct orthographic form genuinely doesn't appear in the model's vocabulary, most often in Categories 4 and 7. Phonetic respelling means including both the pronunciation hint and the correct spelling in the glossary prompt, for example "heijunka (hay-JUNE-kah)" for a Category 15 loanword. The model sees both forms and uses the phonetic hint to match the acoustic input to the correct spelling. This technique works best for short terms under 4 syllables. For longer terms, the phonetic expansion may occupy too many tokens to be efficient.
Which of the 15 categories is hardest to fix with glossary injection alone?
Category 4 (IUPAC chemical names) and Category 5 (regulatory citations) are the hardest. IUPAC names are long enough that the decoder can deviate from the glossary form at multiple points, and the embedded locant numerals require format conversion that glossary injection cannot provide. Regulatory citation formats require structural conversion (spoken cardinal numbers → dotted-decimal notation with section marks) that is also beyond what a glossary prompt can do. Both categories require post-processing normalisation passes in addition to glossary injection. Category 13 (CLI commands) is a close third — flag-syntax normalisation requires a code-aware post-processor, not just a word-level glossary.
Do the same categories apply to live captioning (Zoom, Teams, Webex) as to recorded video captioning?
The categories are the same but the intervention options differ. Live captioning uses streaming ASR (typically Google Cloud Speech-to-Text, Azure Cognitive Services, or similar) that accepts a phrase hints/boost parameter analogous to Whisper's prompt — this is the live-captioning equivalent of glossary injection. Post-processing normalisation is harder to apply in real-time because the output is streamed, but a lag-tolerant correction layer (applying normalisation with a 2–5 second delay) can handle citation-format and initialism-capitalisation corrections. Human review is not feasible in real-time. The practical implication is that live caption accuracy for proper-noun-heavy content will be lower than recorded-video accuracy, even with phrase boosting applied. For compliance purposes, live captions should be archived and corrected within 24–48 hours for content where accurate captions are a regulatory obligation. See our pages on Zoom captions for training, Microsoft Teams captions, and Webex captions for platform-specific implementation notes.
How do I know which of the 15 categories applies to my content before I build the glossary?
The fastest diagnosis is to run a 10-minute representative sample through Whisper at default settings (no glossary) and score it against a hand-corrected reference transcript using the DCMP protocol. Nearly every proper-noun error will fall into one of the 15 categories above: categorise each one, count by category, and build the glossary for the top 3 categories first. This audit typically takes 30–45 minutes for a skilled reviewer and tells you exactly where to invest. As a shortcut, use the per-vertical breakdown table in this post to identify the likely dominant categories for your content type (engineering, medical, financial, manufacturing, or other) and start the glossary from there. You will usually catch 70–80% of errors by addressing the 3 highest-frequency categories for your vertical.
Is there a category that glossary injection makes worse, not better?
Yes: over-glossing common English words can degrade performance. If you add "record" to a medical glossary (because you are thinking of "medical record"), the model may start producing "record" in contexts where a different word was spoken — because the glossary biases the prior toward "record" in every segment. Only add terms that are genuinely low-frequency in general English and high-frequency in your domain. Test the glossary on held-out audio before deploying it, and compare DCMP scores with and without the glossary on a variety of segments — including segments that do not contain any glossary terms — to verify you are not introducing regressions on the general-English portions of your content.
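A held-out regression check could be sketched as below. It uses plain word error rate as a stand-in for the DCMP score, which is an assumption made to keep the example self-contained; `wer` and `regressions` are hypothetical names.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def regressions(samples: list[tuple[str, str, str]]) -> list[str]:
    """Return references where the glossary run scored worse than the baseline.

    Each sample is (reference, baseline_hypothesis, glossary_hypothesis).
    """
    return [
        ref for ref, baseline, with_glossary in samples
        if wer(ref, with_glossary) > wer(ref, baseline)
    ]
```

Include segments with no glossary terms in `samples`. Any reference that appears in the result is a segment where the glossary made the transcript worse, such as "chart" turning into "record".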
What is the recommended process for keeping the glossary accurate as our product vocabulary evolves?
Treat the glossary as a versioned document, not a static file. Store it alongside your content authoring resources (in Notion, Confluence, or Google Docs — wherever your instructional designers already work). Define an update trigger: any new product, drug, chemical, regulation, or executive name that enters your training script should enter the glossary review queue. Assign a glossary owner (typically the L&D operations lead or a content-quality coordinator) who reviews new additions monthly and retires terms that have not appeared in new content for 12 months. Link the glossary owner to the caption-quality review log so they see the human-correction data — every correction is a signal that the glossary needs updating. The captioning RFP playbook includes a section on vendor glossary-management architecture and which vendor contractual terms govern who owns and maintains the glossary over the contract term.
Further reading
Category-specific deep dives
- Captioning medical training video: why Whisper mangles drug names and how to fix it — detailed audit of Category 3 (pharmaceutical INNs and brand names) in a 12-minute pharmacology refresher
- Captioning HazCom training: why SDS chemical names break ASR — detailed audit of Category 4 (IUPAC systematic names) and OSHA § 1910.1200(h) compliance implications
- Glossary-biased captioning: how a Whisper prompt beats YouTube auto-captions on engineering terms — engineering implementation of the decoder-side glossary injection (Stage 1 intervention)
- Why 99% caption accuracy matters: the WCAG 2.1 AA threshold explained — the DCMP measurement protocol for assessing accuracy improvement across the 15 categories
Pipeline and workflow
- The LMS caption ingestion workflow: bulk retrofit across TalentLMS, Docebo, Absorb, Kaltura — the generate → normalise → verify pipeline in detail, including the format-normalisation failures by platform
- Glossary vs. prompting vs. fine-tuning: how to actually decide for captioning in 2026 — when glossary injection reaches its limits and when per-customer compounding glossary models or LoRA fine-tuning is warranted
- The hidden half-FTE in your L&D budget: video caption correction costs — the labour-cost case for investing in pipeline-level proper-noun handling rather than absorbing the correction cost manually
Compliance and procurement
- How we ran a captioning vendor RFP: scoring sheets, vendor responses — how proper-noun accuracy scored across six anonymised vendors in a 17-named-entity test sample
- Captioning under a Joint Commission triennial survey — healthcare-specific compliance implications of Category 3 (drug names) and Category 7 (clinical nomenclature) errors
Vertical and platform reference pages
- Engineering onboarding captions — Categories 1, 2, 10, 13 in depth
- Medical training video captions — Categories 3, 7, 14
- HIPAA training captions — Categories 3, 5, 14
- Banking compliance captions — Categories 2, 5, 6, 14
- Manufacturing training captions — Categories 4, 5, 12, 15
- Construction safety training captions — Categories 5, 8, 14
- Cybersecurity training captions — Categories 1, 2, 5, 12, 13, 14
- Government employee training captions — Categories 2, 5, 9, 14
- FDA-regulated training captions — Categories 3, 4, 5, 14
- Safety training captions — Categories 4, 5, 14, 15
- Compliance training captions — Categories 2, 5, 14, 15
- Sales enablement captions — Categories 1, 2, 8, 11
- WCAG 2.1 AA captions — SC 1.2.2 and the 99% accuracy standard
- WCAG SC 1.2.2 Captions (Prerecorded) — the exact requirement
- Section 508 captions — federal contractor and agency obligations
- GlossCap live demo — see glossary injection in action on your own content