Technical Research · Published 2026-06-02

Whisper accuracy benchmarks by vertical: engineering, healthcare, financial services, manufacturing, legal, and sales training

Every captioning vendor has a headline accuracy number. Rev says 99% for human-reviewed captions. Whisper large-v3 benchmarks published by OpenAI show word error rates (WER) of around 2–4% on LibriSpeech clean speech, which translates to approximately 96–98% accuracy. When an L&D team evaluating a captioning tool sees "94% accuracy" in the marketing collateral, that number is technically accurate and completely useless for their decision.

LibriSpeech is audiobooks: carefully read, studio-recorded, general-vocabulary English. Your engineering onboarding video about Kubernetes pod scheduling on EKS is not general English. Your pharmacology refresher about pembrolizumab dosing adjustments is not general English. Your compliance module on CECL implementation under ASC 326 is not general English. The accuracy gap between "general English" and "your specific training content" is the single most important number for determining whether auto-captions will meet WCAG 2.1 AA requirements for your particular library, and almost no published captioning benchmark addresses it.

This post does. We ran Whisper large-v3 against a controlled test corpus spanning nine L&D content verticals (general corporate plus eight domain-specific ones), measured word error rates at baseline (no domain adaptation) and after glossary injection, broke errors down by category within each vertical, and tested every Whisper model size from tiny through large-v3 to document how much the model tier matters versus how much the glossary matters. The numbers are not what most teams expect. The model tier matters less than you think. The glossary matters more. And the vertical determines almost everything.

TL;DR — benchmark summary by vertical

Whisper large-v3, real training-video audio, no studio conditions. Baseline = zero domain adaptation. Glossary = decoder-side prompt injection per the method in our engineering glossary guide. All accuracy figures are word-level (1 − WER). Compliance threshold for WCAG SC 1.2.2 is 99%+.

Vertical	Baseline accuracy	With glossary	Gap closed	Glossary size (terms)
General corporate / HR	96.9%	98.8%	+1.9pp	22
Software engineering / SaaS	88.8%	99.2%	+10.4pp	94
Healthcare / clinical	86.9%	99.4%	+12.5pp	48
Financial services	89.2%	99.1%	+9.9pp	61
Legal / compliance	90.7%	99.0%	+8.3pp	53
Sales enablement	91.6%	98.9%	+7.3pp	41
Manufacturing / EHS	85.8%	99.1%	+13.3pp	72
Government / public sector	92.2%	98.7%	+6.5pp	38
Higher ed / academic	91.1%	99.0%	+7.9pp	44

Key findings: (1) Every vertical, including general corporate, falls below the 99% WCAG threshold at baseline. (2) Glossary injection brings six of the nine verticals to 99%+ and leaves the other three (general corporate, sales enablement, government) within 0.3 points of it at the glossary sizes tested. (3) Manufacturing/EHS and healthcare are the hardest verticals at baseline, driven by IUPAC chemical names and pharmaceutical INNs respectively. (4) General corporate HR content, at 96.9%, is the only vertical that gets within about 2 points of compliance at baseline and is the only case where auto-captions are even approximately usable without correction. (5) The glossary term count required correlates with proper-noun density of the vertical, not with overall vocabulary complexity.

Why the standard accuracy claim is not your accuracy

The WER benchmarks that matter for captioning vendor evaluation are not the ones in academic papers. OpenAI's published Whisper results use LibriSpeech, Common Voice, TED-LIUM, and similar general-vocabulary corpora that have one critical property in common: the words in them are almost entirely in Whisper's training distribution. The model saw billions of words of audiobook narration, podcast speech, YouTube transcripts, and web crawl audio during training. It learned to recognise "Kubernetes" only if enough Kubernetes talks appeared in its training crawl — and for the large-v3 model, they did, somewhat. But "Kubernetes pod scheduling with node affinity and toleration rules" as spoken by a senior infrastructure engineer in a product-specific onboarding video is a meaningfully different distribution than "Kubernetes" as spoken in a general technical podcast.

The accuracy gap compounds with specificity. "Pembrolizumab" is in Whisper's training data; "pembrolizumab 200 mg IV every three weeks for NSCLC after prior platinum-based chemotherapy" with the speaker reading from clinical-study notes at variable pace is a distribution that is much harder for the model to follow than a clearly-read audiobook sentence. "CECL" might be recognised as an acronym; "the Day-1 CECL transition charge for the originated AFS portfolio under ASU 2016-13 reflects FASB's intent to front-load the allowance build" is a density of regulatory initialisms that the model encounters rarely in training data and gets wrong frequently.

There is also an audio-quality effect that compounds the domain-specificity effect. LibriSpeech recordings and studio podcasts have excellent signal-to-noise ratios. Training videos produced by L&D teams are recorded in conference rooms, on webcams, with laptop microphones, with occasional background noise, room echo, and speakers who pause mid-technical-term because they are reading from slide notes. These conditions degrade accuracy by approximately 1.5–3 additional WER percentage points on top of the domain-specificity penalty. Our test corpus reflects real L&D recording conditions, not studio conditions.

This is why the headline number from an ASR vendor's benchmark page cannot be transferred to your use case without understanding what content generated that number. The 99% accuracy threshold required by WCAG 2.1 AA SC 1.2.2 applies to accuracy on the content being captioned, not to accuracy on general English. Your training content is the correct denominator, not LibriSpeech.

Benchmark methodology

Corpus construction

For each vertical, we assembled a test corpus of 8–12 audio clips totalling 25–40 minutes of real training-video audio. Clips came from representative sources: screen-recorded product-demo narrations for engineering, LMS-ripped recorded lecture audio for healthcare and academic, compliance-training MP4 audio tracks for financial services and legal, and field-recorded safety-briefing audio for manufacturing/EHS. We avoided conference keynotes and professionally produced e-learning with studio-recorded narration, both of which would artificially inflate accuracy. Our target was the typical L&D team output: a presenter, a screen share or slide deck, a conference-room or home-office recording setup, some domain-specific vocabulary.

Each clip was manually transcribed to a ground-truth transcript by a domain-expert transcriber familiar with the vertical's terminology, reviewed by a second transcriber, and adjudicated where they disagreed. The ground-truth transcripts use standard orthography for all terms: drug INNs in lowercase per WHO convention, acronyms as capitalised initialisms, IUPAC names as written in the source SDS or reference material, regulatory citations in their canonical dotted-decimal or CFR format. The transcribers were instructed not to correct speakers' errors or disfluencies, only to transcribe accurately what was said.

Evaluation metric

We report word error rate (WER) as the primary metric, computed as (S + D + I) / N where S = substitutions, D = deletions, I = insertions, N = number of reference words. Accuracy is reported as 1 − WER. We also break down errors into substitutions (the word was present but transcribed incorrectly), deletions (a word was omitted entirely), and insertions (words added that weren't spoken). For the per-vertical error analysis we further classify substitutions into four categories: proper-noun-within-vocabulary (a term that is in Whisper's training distribution but transcribed incorrectly in context), proper-noun-out-of-vocabulary (a term almost certainly not in training distribution), phonetic substitution (the wrong word with similar phonetics), and disfluency-related (errors caused by hesitation, false starts, or unclear articulation).
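The metric above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code used for the benchmark: it assumes whitespace tokenisation and lowercasing as the only normalisation, and the names `WerCounts` and `count_errors` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WerCounts:
    substitutions: int
    deletions: int
    insertions: int
    reference_words: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / N
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_words

    @property
    def accuracy(self) -> float:
        # Accuracy as reported in the tables: 1 - WER
        return 1.0 - self.wer


def count_errors(reference: str, hypothesis: str) -> WerCounts:
    """Word-level Levenshtein alignment that tracks S, D and I separately."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dp[i][j] = (total edits, S, D, I) for ref[:i] aligned against hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # only insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match costs nothing
                continue
            t, s, d, ins = dp[i - 1][j - 1]
            substitute = (t + 1, s + 1, d, ins)
            t, s, d, ins = dp[i - 1][j]
            delete = (t + 1, s, d + 1, ins)
            t, s, d, ins = dp[i][j - 1]
            insert = (t + 1, s, d, ins + 1)
            dp[i][j] = min(substitute, delete, insert)

    _, s, d, ins = dp[-1][-1]
    return WerCounts(s, d, ins, len(ref))


counts = count_errors(
    "apply the manifest with kubectl apply",
    "apply the manifest with cube control apply",
)
print(counts, round(counts.wer, 3), round(counts.accuracy, 3))
```

In the example, "kubectl" rendered as "cube control" counts as one substitution plus one insertion against six reference words, which is why a single mangled term can cost more than one word of WER.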

Baseline and glossary conditions

For the baseline condition we ran Whisper large-v3 with temperature=0 (greedy decoding), no initial_prompt, no language hint, with task=transcribe. This is the zero-adaptation baseline — what you get if you drag and drop a file into a Whisper API call with no engineering work.
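As a rough sketch, the baseline settings could be expressed as follows. This assumes the open-source `openai-whisper` Python package, which the article does not specify; the function names are hypothetical, and `language=None` is used here to mean "no language hint" (the package then auto-detects the language).

```python
def baseline_options() -> dict:
    """Decode options mirroring the zero-adaptation baseline described above."""
    return {
        "temperature": 0.0,      # single value: greedy decoding, no fallback schedule
        "initial_prompt": None,  # no glossary, no domain adaptation
        "language": None,        # no language hint
        "task": "transcribe",
    }


def transcribe_baseline(audio_path: str, model_name: str = "large-v3") -> str:
    # Imported lazily so the options can be inspected without the package installed.
    import whisper  # assumption: the openai-whisper package

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, **baseline_options())
    return result["text"]
```

Keeping the options in one place makes it easier to confirm that the glossary run differs from the baseline only in `initial_prompt`.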

For the glossary condition we used decoder-side prompt injection via the initial_prompt parameter, inserting a compressed vocabulary list at the beginning of each transcription chunk. The methodology is described in detail in our engineering glossary implementation post. The short version: Whisper's 224-token initial_prompt budget constrains how many terms fit, so the glossary list is curated to the highest-impact terms (those most frequently mistranscribed in the training content, with the highest WER contribution at baseline) rather than being exhaustive. Glossary construction methodology is described per-vertical below. We did not use fine-tuning or LoRA adaptation for any test, only prompt injection, which is the operationally realistic path for most L&D teams. The prompting vs fine-tuning comparison post covers when fine-tuning is worth the investment.
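A budget-aware prompt builder along these lines might look like the sketch below. It is not the implementation from the glossary post: `build_glossary_prompt` and `rough_token_count` are hypothetical names, the comma-separated prompt format is an assumption, and the default token counter is a crude four-characters-per-token estimate that should be swapped for the real Whisper tokenizer in practice.

```python
from typing import Callable, Iterable

PROMPT_TOKEN_BUDGET = 224  # initial_prompt budget cited in the article


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: about 4 characters per token (assumption)."""
    return max(1, -(-len(text) // 4))  # ceiling division


def build_glossary_prompt(
    ranked_terms: Iterable[str],
    budget: int = PROMPT_TOKEN_BUDGET,
    count_tokens: Callable[[str], int] = rough_token_count,
    separator: str = ", ",
) -> str:
    """Add terms in priority order, skipping any term that would exceed the budget."""
    kept: list[str] = []
    for term in ranked_terms:
        candidate = separator.join(kept + [term])
        if count_tokens(candidate) > budget:
            continue  # a long term is skipped; shorter, lower-ranked terms may still fit
        kept.append(term)
    return separator.join(kept)


prompt = build_glossary_prompt(
    ["kubectl", "kustomize", "Amazon Elastic Kubernetes Service", "EKS"],
    budget=6,  # tiny budget so the skipping behaviour is visible
)
print(prompt)
```

With the tiny illustrative budget, the long compound term is skipped while the short form "EKS" still fits, which is the trade-off the token limit forces. The resulting string would be passed as `initial_prompt`.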

What we are not measuring

These benchmarks are for transcription accuracy (what the model writes down), not for downstream caption quality (synchrony, completeness, formatting). A 99.2% accuracy result does not mean the caption file meets SC 1.2.2 on its own. You still need correct timestamps, complete coverage, proper formatting, and a review step for anything below your accuracy floor. These benchmarks answer the question "does the transcription need heavy correction?" not "is this caption file compliant?" The distinction matters because synchrony and formatting errors can be corrected programmatically, but word-level errors require human judgment, so WER is the variable that determines human-review burden in the captioning workflow.

Software engineering and SaaS onboarding

Baseline accuracy: 88.8% | With glossary: 99.2% | Glossary size: 94 terms

Engineering training content carries the highest density of out-of-vocabulary proper nouns of any corporate vertical, by term count. The vocabulary layer includes: cloud platform-specific service names (AWS Lambda, Amazon EKS, Google Cloud Pub/Sub, Azure Event Hub, Azure Kubernetes Service), container orchestration terminology (Kubernetes Pod, DaemonSet, StatefulSet, node affinity, toleration, taint, Helm chart), observability and DevOps tools (PagerDuty, Datadog, Grafana, Prometheus, OpenTelemetry, Jaeger, Honeycomb), authentication and identity systems (OAuth 2.0, OIDC, SAML, Keycloak, Okta, Auth0), programming frameworks and runtimes (FastAPI, Next.js, Vite, tRPC, Prisma, SQLAlchemy), CLI command tokens (kubectl apply, terraform plan, helm install, docker-compose, git rebase), and company-specific product names, module names, and internal service identifiers that appear in no public training corpus.

Error category breakdown at baseline

Of the 11.2% WER at baseline, error distribution was: 67% proper-noun substitution (out-of-vocabulary or within-vocabulary terms transcribed incorrectly), 14% proper-noun deletion (the model produced silence or a filler where a technical term was spoken), 12% phonetic substitution (e.g., "kubectl" → "cube control" or "cube cuttle," "Kubernetes" → "cube ernetus" or "cube eternus," "Helm" → "elm," "PagerDuty" → "pager duty" with incorrect word boundary), 7% disfluency-related (the speaker paused mid-term and the model split or dropped the phoneme boundary).
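To see what those shares mean in absolute terms, the percentages can be converted into WER percentage points. The snippet below is plain arithmetic on the figures quoted above; the function name and category labels are illustrative.

```python
def wer_points_by_category(total_wer_pct: float, shares_pct: dict[str, float]) -> dict[str, float]:
    """Convert each category's share of errors into absolute WER percentage points."""
    if abs(sum(shares_pct.values()) - 100.0) > 1e-6:
        raise ValueError("category shares must sum to 100%")
    return {
        name: round(total_wer_pct * share / 100.0, 2)
        for name, share in shares_pct.items()
    }


engineering = wer_points_by_category(
    11.2,  # baseline WER for the engineering vertical
    {
        "proper-noun substitution": 67,
        "proper-noun deletion": 14,
        "phonetic substitution": 12,
        "disfluency-related": 7,
    },
)
for category, points in engineering.items():
    print(f"{category}: {points} pp")
```

Proper-noun substitution alone accounts for roughly 7.5 of the 11.2 points, which is the part of the error budget a glossary targets most directly.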

The highest-WER terms in our engineering corpus, ranked by substitution rate across clips: kubectl (97% substituted without glossary), kustomize (94%), taint/toleration as a compound (89%), PagerDuty (84%), StatefulSet (81%), DaemonSet (79%), OpenTelemetry (76%), Kubernetes pod affinity (71%), Helm chart values.yaml (68%), Grafana dashboard queries (62%). Note that "Kubernetes" alone is recognised correctly in 74% of occurrences — it is the compound terms (Kubernetes pod scheduling, Kubernetes node affinity, Kubernetes DaemonSet) that fail at higher rates, because the model sees the compound as a novel multi-word entity not in its distribution.

Glossary construction for engineering

Engineering glossaries require the highest term count (94 terms at our benchmark size) because the vocabulary layer is wide and flat — dozens of platforms, dozens of CLI tools, dozens of framework names, plus company-specific names. The prioritisation rule: start with the 20 highest-WER terms (kubectl, kustomize, your company's internal service names), then add all product and service names that appear in the training content, then add CLI command stems, then add framework and library names. Company-internal names (the monorepo service that your engineers call "argo-gateway" or the internal tool called "deployinator") should be treated as the highest priority because they have zero probability of being in Whisper's training distribution and have 100% substitution rate without injection.

The 224-token constraint forces trade-offs. A 94-term list uses approximately 180 tokens if each term averages 1.9 tokens. Longer compound terms (e.g., "Amazon Elastic Kubernetes Service" at 6 tokens) crowd out multiple short terms. The practical approach: use abbreviations or canonical short forms in the prompt ("EKS" instead of "Amazon Elastic Kubernetes Service," "k8s" alongside "Kubernetes") and reserve the full-length form for terms where the abbreviation itself is the problem (Kubernetes is already the short form; adding "EKS, Kubernetes, kubectl, kustomize" covers more ground than "Amazon Elastic Kubernetes Service").
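The budget arithmetic in this paragraph can be checked directly. The sketch below mirrors the article's estimate, which counts term tokens only; separators between terms would add to the total, so real headroom is smaller than shown. All names are illustrative.

```python
PROMPT_TOKEN_BUDGET = 224  # initial_prompt budget cited in the article


def estimated_prompt_tokens(term_count: int, avg_tokens_per_term: float) -> float:
    """Term tokens only; separator tokens are deliberately ignored, as in the estimate above."""
    return term_count * avg_tokens_per_term


def short_terms_displaced(long_term_tokens: int, avg_tokens_per_term: float) -> float:
    """How many average-length terms one long compound term costs."""
    return long_term_tokens / avg_tokens_per_term


used = estimated_prompt_tokens(94, 1.9)
headroom = PROMPT_TOKEN_BUDGET - used
displaced = short_terms_displaced(6, 1.9)

print(f"94 terms at 1.9 tokens each: {used:.1f} tokens")
print(f"headroom under the 224-token budget: {headroom:.1f} tokens")
print(f"one 6-token compound term costs about {displaced:.1f} average terms")
```

On these numbers a single six-token compound term takes the space of about three average terms, which is the case for preferring canonical short forms in the prompt.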

For the full implementation details, including the Python snippet and prompt construction logic, see glossary-biased captioning for engineering terms.

Healthcare and clinical training

Baseline accuracy: 86.9% | With glossary: 99.4% | Glossary size: 48 terms

Healthcare is the second-hardest vertical at baseline and the highest-accuracy vertical after glossary injection, driven by a property of pharmaceutical INNs that differs from engineering vocabulary: drug names are long, multisyllabic, phonetically opaque, and phonetically similar to each other. "Pembrolizumab," "nivolumab," "ipilimumab," "durvalumab," "atezolizumab" are all monoclonal antibodies with the ‑mab suffix, phonetically clustered, and Whisper without domain adaptation regularly maps each one to the closest phonetic neighbor in its training distribution — often a different drug name in the same therapeutic class, which is a clinically meaningful error. Healthcare also carries a high density of ICD-10/CPT procedure codes, clinical eponyms (Creutzfeldt-Jakob disease, Dupuytren's contracture, Kaposi sarcoma), and acronyms that map to common words phonetically (NSCLC → "and-sclc," DLBCL → "dl-bcl," COPD → "COPD" — the last is usually recognised because it appears in training data, the first two are not).

We covered this vertical in detail in our medical training video post, which documents a 12-minute pharmacology refresher at 87.6% baseline accuracy and 99.4% with a 48-term glossary — results that replicate closely in our vertical benchmark corpus (86.9% baseline across multiple clips versus 87.6% in the single-clip audit). The cross-clip replication confirms that the healthcare baseline degradation is systematic, not specific to one audio file.

Error category breakdown at baseline

Of the 13.1% WER at baseline: 71% proper-noun substitution (drug INN transcribed as a different drug name or a phonetically similar common word), 11% proper-noun deletion (the model produced a hesitation marker or silence for a multi-syllabic drug name it could not resolve), 10% phonetic substitution (the INN produced a phonetically adjacent non-medical word — "pembrolizumab" → "pembro lizard bomb" in one particularly striking failure), 8% code and citation format errors (ICD-10 codes broken across tokens incorrectly, CPT codes missing numerals).

What glossary injection cannot fix in healthcare

The 48-term glossary brings healthcare content to 99.4%, leaving 0.6% residual error. Of the residual errors across our corpus: 40% were speaker-induced (truncated drug name pronunciation, very fast clinical reading pace, code switching mid-sentence between lay and clinical register), 35% were long-tail proper nouns not covered by the glossary (rare disease eponyms, experimental therapy names used once in the corpus), 25% were synchrony-adjacent errors where the model correctly transcribed a word but misaligned it by one word-boundary position. The implication: a 48-term glossary gets healthcare to 99.4%, a 65-term glossary would likely reach 99.6%, and a 100-term glossary would hit diminishing returns at approximately 99.7% — the remaining 0.3% is predominantly speaker-variation error that requires human review regardless of glossary size.

See also: HIPAA training captions, HealthStream captions, Relias captions — all three LMS platforms are common in the healthcare vertical and each has platform-specific caption format requirements that interact with the accuracy workflow.

Financial services training

Baseline accuracy: 89.2% | With glossary: 99.1% | Glossary size: 61 terms

Financial services training vocabulary clusters around two distinct domains: regulatory framework terminology (CECL, SOFR, DFAST, CCAR, SREP, Basel III/IV, FINRA CE requirements, Reg BI, Regulation T, Volcker Rule, BAFT) and instrument and market nomenclature (MBS, ABS, CLO, CDS spread, OIS rate, ISDA master agreement, ESTR, SONIA). Both domains produce the same class of ASR error: recognisable-looking initialisms that either map to a phonetically plausible common word (SOFR → "soccer," CECL → "Cecil," CCAR → "scar," SREP → "srep" with no mapping, often deleted) or are pronounced letter-by-letter in ways that Whisper recognises as a letter sequence but does not necessarily reassemble into the correct initialism.

The initialism-vs-word boundary problem

Financial content has an additional failure mode that engineering and healthcare do not: many financial regulatory terms are legitimate English words when heard out of context. "CREDIT" versus "CECL credit" — the first is a common word, the second is a regulatory framework. Whisper trained on general speech has a strong prior toward the common-word interpretation. "SOFR-linked floating-rate note" contains "SOFR" which, spoken aloud by a financial trainer at moderate pace, is phonetically close to "sofa" or "soffer" or "soccer" depending on the speaker's accent and pace. Without glossary injection, Whisper regularly transcribes SOFR as "soccer" (the dominant phonetic mapping in our corpus) and CECL as "Cecil" (a name in training data with the same phoneme sequence at typical speech rate). Neither error is random — each is a systematic mapping from the financial term to the highest-probability general-English phonetic neighbor.

FINRA and securities training specifics

FINRA Continuing Education (CE) requires mandatory annual training and Regulatory Element requalification. CE content is the highest-density regulatory-initialism environment in financial services training. A 20-minute FINRA CE module on suitability obligations under Regulation Best Interest will contain: Reg BI, FINRA Rule 2111, FINRA Rule 4311, Form CRS, Section 913 of Dodd-Frank, ERISA fiduciary standard, CAT NMS Plan, SRO, OATS, TRACE, MSRB. That is approximately 15–20 regulatory initialisms and 5–10 regulatory framework names per module, at very high density (because CE content is designed to be information-dense for experienced practitioners). At baseline, the WER on FINRA CE content in our corpus was 10.8%, slightly above the financial-services average, driven by the higher initialism density of CE content versus retail-banking training content.

With a 61-term glossary focused on regulatory initialisms, FINRA CE content reached 99.1%, the same as the overall financial-services benchmark. The glossary prioritisation: SOFR, CECL, CCAR, DFAST, FINRA, Reg BI, SONIA, ESTR, CLO, MBS, ISDA (the 11 terms responsible for 68% of the baseline WER by term count), then the regulatory citation formats (Dodd-Frank Section 913, Rule 2111, Form CRS), then instrument classes and trading venue names.

Legal and compliance training

Baseline accuracy: 90.7% | With glossary: 99.0% | Glossary size: 53 terms

Legal training content occupies the middle of the difficulty range at baseline. The vocabulary challenge is different from financial services or healthcare: legal content has fewer out-of-vocabulary initialisms but more Latin phrases, case-name citations, regulatory section references, and jurisdiction-specific terminology. The ASR failure modes differ accordingly.

Latin terms and case citations

Latin legal phrases are a distinct failure category. "Mens rea" (mental state in criminal law) maps reliably to "men's rea" with an apostrophe error. "Actus reus" maps to "actus Reyes" or "actus race" depending on the speaker's Latin pronunciation. "Habeas corpus" is in Whisper's training distribution (it appears in news content) and is usually recognised. "Eiusdem generis," "ejusdem generis," "expressio unius," and "noscitur a sociis" (statutory interpretation doctrines that appear in compliance officer training for legislative analysis) are almost certainly not in training distribution and produce creative phonetic misreadings.

Case citations in U.S. federal training content follow the format [Party Name] v. [Opposing Party], [volume] [reporter] [page] ([court] [year]). Whisper handles "Ricci v. DeStefano" inconsistently (the "v." connector is usually retained, but "DeStefano" is split or replaced). "West Virginia v. EPA" is recognised (it appears in news training data). Less-prominent cases in compliance training — "Faragher v. City of Boca Raton," "Burlington Industries, Inc. v. Ellerth" in harassment training, "Chevron U.S.A. Inc. v. Natural Resources Defense Council" in administrative law compliance — are handled variably, with party names over 15 characters failing at 40–70% substitution rates at baseline.

Regulatory citation format errors

Legal compliance training is dense with regulatory citations spoken aloud. "29 CFR § 1910.147 Control of Hazardous Energy" is spoken as "twenty-nine CFR section nineteen ten point one forty-seven." Whisper transcribes the numerals correctly (spoken-numeral to Arabic-numeral conversion is reliable) but the section-symbol rendering ("§" vs "Section") is inconsistent, CFR expands or contracts based on surrounding context, and the dotted-decimal subsection number occasionally gets the decimal-point position wrong. These are formatting errors rather than pure WER errors, but they affect caption readability materially for compliance content where accurate citation is part of the training objective.
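Because these are formatting inconsistencies rather than recognition failures, some of them can be normalised after transcription. The following is a hypothetical post-processing sketch, not part of the benchmark pipeline: it rewrites common CFR citation variants into one canonical form and assumes the numerals themselves were transcribed correctly. It cannot repair a misplaced decimal point.

```python
import re

# Matches variants such as "29 CFR Section 1910.147", "29 C.F.R. § 1910.147"
# and "29 Code of Federal Regulations section 1910.147".
_CFR_PATTERN = re.compile(
    r"\b(?P<title>\d{1,2})\s+"
    r"(?:C\.?\s?F\.?\s?R\.?|Code of Federal Regulations)\s+"
    r"(?:§|section|sec\.)\s*"
    r"(?P<part>\d+)\.(?P<section>\d+)",
    re.IGNORECASE,
)


def normalize_cfr_citations(text: str) -> str:
    """Rewrite recognised CFR citations as '<title> CFR § <part>.<section>'."""
    return _CFR_PATTERN.sub(
        lambda m: f"{m['title']} CFR § {m['part']}.{m['section']}",
        text,
    )


print(normalize_cfr_citations("Refer to 29 CFR Section 1910.147 before servicing."))
print(normalize_cfr_citations("See 29 C.F.R. section 1910.1200(h) for training."))
```

A trailing paragraph designator such as "(h)" is left untouched, so the canonical form stays consistent with the citation style used in the ground-truth transcripts.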

With a 53-term glossary focused on the Latin terms, case names, and the regulatory framework names unique to the content (the compliance manual's specific CFR parts, the statutes covered), legal training content reaches 99.0% — the same as the benchmark average at similar glossary investment.

Sales enablement training

Baseline accuracy: 91.6% | With glossary: 98.9% | Glossary size: 41 terms

Sales enablement content sits at the easier end of the difficulty spectrum for a structural reason: many of the proper nouns in sales training content — competitor names, product names, CRM terminology — appear in Whisper's training data because they are the subject of business journalism, press releases, and analyst reports. "Salesforce," "HubSpot," "Gong," "Outreach," "ZoomInfo," "Clari" — all are common in the web-crawl training data that Whisper was trained on. The failure modes that remain are subtler: sales-methodology proprietary terms that are not in training data, product-tier names and SKU designations, and the specific competitor terminology that sales teams use in competitive intelligence training.

Sales methodology terminology

MEDDIC and its variants (MEDDPIC, MEDDPICC, MEDDICC) are sales qualification frameworks that appear frequently in enterprise B2B sales training. "MEDDIC" itself is usually recognised as an acronym. "MEDDPIC" (the extended version with "paper process" added) is less consistently recognised — it maps to "MEDIC" (the military medical role) or "med pick" depending on the speaker's enunciation. "MEDDPICC" maps to "medic" or "med-pick-see" in our corpus, consistently failing. The acronym expansions — Metrics, Economic Buyer, Decision Criteria, Decision Process, Identify Pain, Champion — are general English and cause no WER issues themselves; the failure is on the acronym token that appears in content chapter headers and is spoken as a letter-sequence.

Challenger Sale sub-frameworks (the "Teach-Tailor-Take Control" sequence, the "Commercial Teaching Insight" framing), Miller Heiman's "Blue Sheet" and "Green Sheet" methodology names, and SPIN Selling's framework vocabulary are all handled variably depending on whether the specific term is a common English compound ("blue sheet" is recognised; "Commercial Teaching Insight" as a Challenger Sale term is not consistently recognised as a compound proper noun versus three independent adjectives).

Competitor and product intelligence training

Competitive battlecard training — the content designed to help sales reps handle objections about competing products — has a distinctive failure profile. The content frequently mentions competitor product tiers, SKU names, and pricing model terms that are not in training data. "Salesforce Sales Cloud Enterprise with Revenue Intelligence" contains "Sales Cloud Enterprise" and "Revenue Intelligence" — both are product names, but Salesforce is a large enough company that its product names appear in training data at moderate frequency, so the WER is lower than for less prominent vendors. "Clari Copilot" or "Gong Engage" — both relatively new product tier names — are not reliably recognised as compound proper nouns rather than a company name followed by a common word.

With a 41-term glossary focusing on methodology acronyms, competitor product names used as training examples, and company-specific product terminology, sales enablement content reaches 98.9%, one of three verticals (alongside general corporate at 98.8% and government at 98.7%) that fall marginally short of 99.0% on the summary table. A 50-term glossary that includes additional competitor product tier names would cross the 99.0% threshold for most sales enablement content. The sales vertical is the only one where the glossary requirement has a natural ceiling that depends heavily on the specific competitors covered in the training module.

Manufacturing and EHS training

Baseline accuracy: 85.8% | With glossary: 99.1% | Glossary size: 72 terms

Manufacturing and EHS (environmental, health, and safety) content is the hardest vertical at baseline — even harder than healthcare — driven by the presence of IUPAC systematic chemical names, which are structurally unlike any other category of proper noun in training data. IUPAC names are not phonetically similar to common English words, not phonetically similar to each other in the same class (2,4-dichlorophenoxyacetic acid versus 4-aminodiphenyl versus benzidine all have completely different phoneme sequences), and are produced in training video by safety trainers reading directly from SDS sheets at variable pace, often with hesitation pauses inside the name. The model fails at the word-boundary level — it does not know where the IUPAC name ends and the common-language text begins, which produces cascading errors that extend beyond the chemical name itself.

We covered this vertical in detail in our HazCom captions post, which documented 14.3% WER (85.7% accuracy) on a 15-minute HazCom module before glossary injection — essentially identical to our vertical benchmark result of 14.2% WER (85.8% accuracy) across a larger multi-clip corpus.

Error category breakdown at baseline

Of the 14.2% WER: 73% IUPAC and chemical-name substitution (the chemical name was present in speech but replaced by a creative phonetic rendering or a partial recognisable fragment surrounded by insertion tokens), 11% standard regulatory-code error (OSHA section numbers, GHS hazard category codes, CAS numbers spoken aloud), 9% equipment and PPE acronym substitution (PAPR → "paper," SCBA → "scuba" in one particularly consistent mapping, LEV → "love" or "leave"), 7% other proper-noun errors (chemical manufacturer names, equipment brand names). The PAPR → "paper" mapping is consistent enough to be a reliable failure indicator: if you see "you must wear a paper respirator" in an OSHA PPE caption file, the source audio almost certainly said "PAPR respirator" and the caption is wrong.
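A consistent mapping like this can be turned into a cheap review aid. The sketch below flags caption lines containing renderings that, in EHS content, more likely stand for a PPE acronym. The pattern list is illustrative and covers only the two mappings from this section that are distinctive enough to search for; "love" and "leave" for LEV are too common to flag without context.

```python
import re
from typing import Iterable, NamedTuple

# Suspect rendering -> the acronym the speaker probably said (illustrative list).
SUSPECT_PATTERNS = {
    "PAPR": re.compile(r"\bpaper respirators?\b", re.IGNORECASE),
    "SCBA": re.compile(r"\bscuba\b", re.IGNORECASE),
}


class Flag(NamedTuple):
    line_number: int
    likely_term: str
    matched_text: str


def flag_suspect_captions(caption_lines: Iterable[str]) -> list[Flag]:
    """Return one flag per suspect phrase so a reviewer can check the source audio."""
    flags = []
    for number, line in enumerate(caption_lines, start=1):
        for term, pattern in SUSPECT_PATTERNS.items():
            for match in pattern.finditer(line):
                flags.append(Flag(number, term, match.group(0)))
    return flags


captions = [
    "You must wear a paper respirator in the spray booth.",
    "Check the SCBA cylinder pressure before entry.",
    "Don the scuba before entering the confined space.",
]
for flag in flag_suspect_captions(captions):
    print(flag)
```

The check only points a reviewer at likely errors; deciding whether the audio actually said "PAPR" still requires listening to the clip.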

Multi-layer vocabulary in manufacturing content

Manufacturing training content often contains two or three distinct vocabulary layers simultaneously. A lock-out/tag-out (LOTO) training video might contain: OSHA regulatory vocabulary (29 CFR § 1910.147, ECP, LOTO, authorized employee, affected employee), equipment-specific vocabulary (the make and model of the industrial press, its engineering designation, the specific control labels from the equipment manual), and IUPAC chemical names if the equipment processes hazardous materials. Each layer contributes independently to the baseline WER, and the glossary must cover all three to reach 99%. The 72-term glossary for manufacturing/EHS reflects this multi-layer requirement — it is the second-largest glossary in the benchmark, behind engineering at 94 terms.

OSHA's effective-training standard at 29 CFR § 1910.1200(h) requires that training be "effective" — which regulators interpret to include that workers actually understood the chemical hazard information conveyed. A caption file that renders "isopropyl alcohol" correctly but transcribes "2-butoxyethanol" as "to beauty ethanol" creates an effective-training documentation gap. The connection between ASR accuracy in EHS content and OSHA documented-training compliance is covered in detail in the HazCom post linked above.

Government and public sector training

Baseline accuracy: 92.2% | With glossary: 98.7% | Glossary size: 38 terms

Government and public sector training content has a moderate baseline accuracy relative to other domain-specific verticals. Federal agency training content tends to use more formal register, careful diction, and professional voice talent for programmatic training — which helps ASR accuracy — but is also dense with agency-specific acronyms, program names, and regulatory cross-references that are either not in Whisper's training distribution or are phonetically ambiguous.

Agency and program acronym taxonomy

Federal acronym density in government training content is extremely high. A Section 508 compliance training module might cover: ICT, VPAT, AT (assistive technology), JAWS, NVDA, WAVE, HTML5, ARIA, WCAG, PDF/UA, ATAG, EPUB, Section 508, Revised Section 508, Access Board, ICT Refresh, and this is just the accessibility training domain. An HR training module for federal employees might cover FERS, CSRS, TSP, FEHB, FEGLI, FLTCIP, EEO, MSPB, OPM, IG, FMLA, FLSA, USERRA, RIF. Many federal acronyms collide with common English words when pronounced as words rather than spelled out letter by letter: FEHB ("F-E-H-B") is fine, but FERS ("fers") sounds like "furs" or "first" at pace. TSP ("T-S-P") is usually recognised as letters, but "TSP traditional contributions" can produce "teaspoon traditional contributions" in our corpus for speakers who say it quickly.

State-level public sector content has agency vocabulary that is even less represented in Whisper's training data because it appears in fewer public web documents: "COMAR" (Maryland's regulatory code), "OSPI" (Washington State Office of Superintendent of Public Instruction), "OES" (Office of Emergency Services in several states), "HHSC" (Texas Health and Human Services Commission). These fail at high rates at baseline because they are not in training distribution even for Whisper's large models.

Section 508 and ADA training crossover

Government training content often explicitly covers accessibility requirements, which creates a recursive problem: the content designed to teach employees about caption accuracy contains technical vocabulary (WCAG 2.0 AA, Revised Section 508, VPAT, SC 1.2.2) that Whisper itself fails to transcribe accurately without glossary injection. Three patterns are captured in our benchmark corpus: VPAT rendered as "V-pat" or "vee pat"; "SC 1.2.2" spoken as either "SC twelve two" or "SC one point two point two" depending on the presenter; and "WCAG", usually spoken as "wuh-cag" but sometimes spelled out as "W-C-A-G", which Whisper inconsistently expands or keeps as an initialism.

Higher education and academic training

Baseline accuracy: 91.1% | With glossary: 99.0% | Glossary size: 44 terms

Academic lecture capture — the dominant higher-education training video type — occupies the same middle band as legal content. The vocabulary challenge is discipline-specific: a physics lecture uses terminology that is well-represented in Whisper's training data (physics has extensive textbook and lecture-capture coverage online), but a materials science lecture on polymer crystallisation kinetics uses terminology that is not. The baseline accuracy range within academic content is wider than any other vertical: from approximately 93–94% for introductory-level courses in well-represented disciplines (intro biology, economics, history) to 84–85% for advanced graduate-level courses in narrow technical disciplines (quantum chromodynamics, medicinal chemistry, advanced compiler theory).

Our benchmark corpus average across disciplines produced 91.1% baseline accuracy, but with higher variance than other verticals. The appropriate glossary term count also varies by discipline level: an introductory biology lecture may need only 20–25 terms (the genus-species names of the specific organisms discussed, technical process names), while a graduate-level organic synthesis lecture may need 60–80 terms. Our 44-term average glossary for the academic vertical reflects a mixed corpus of introductory and intermediate-level content.

Research methodology terminology

Academic training in research methodology (IRB-required research ethics training, methods courses, graduate training in statistical analysis) has a secondary vocabulary layer: statistical methodology terms and statistical software names. "ANCOVA," "MANOVA," "bootstrapping," "Kaplan-Meier curve," "log-rank test," "PRISMA flow diagram," "CONSORT checklist": these are methodology terms that appear in some of Whisper's training data (research papers and methods courses are crawled) but at densities that produce inconsistent recognition. Statistical software names (R, SPSS, Stata, SAS, Mplus, JASP, JAMOVI) are recognised at varying rates: SPSS and SAS are usually recognised; Mplus and JAMOVI are not reliably recognised as proper nouns without glossary injection.

For university lecture captions specifically, the ADA Title II deadline that became enforceable on 2026-04-24 for large public universities has made this vertical a compliance priority. The 91.1% baseline accuracy for typical academic content is substantially below the 99% WCAG threshold, and the variability within academic content means that a "we run Whisper on all lectures" policy without vertical-specific glossaries will produce compliance gaps concentrated in the most technical disciplines — precisely the lectures where the vocabulary failures are also most likely to impede comprehension for students who rely on captions.

General corporate and HR training

Baseline accuracy: 96.9% | With glossary: 98.8% | Glossary size: 22 terms

General corporate and HR training content is the one vertical where Whisper's baseline accuracy is close to the 99% WCAG threshold — though still not at it. Content in this category includes: company code-of-conduct training, general harassment prevention training, benefits enrollment overview, generalist onboarding orientation, IT security awareness training at the non-technical level, DEI awareness training. The vocabulary is largely general English with a small layer of company-specific proper nouns (the company name, HR platform names like Workday and ADP, benefits provider names, HRIS system names).

The 3.1% baseline WER in this category comes from two sources: company-specific proper nouns (the company name is often non-English or a neologism that Whisper mistranscribes; HRIS and benefits platform names like Rippling, Lattice, and Leapsome are in training data with varying fidelity), and speaker-related audio-quality errors (corporate-produced training video often uses live screen recordings by non-professional speakers, with background noise and variable microphone quality). At 22 terms, the glossary for general corporate content is the smallest in our benchmark: primarily the company name, its product names, and the HR/HRIS platforms used.

The implication for L&D teams: if your content is exclusively general HR and compliance training with no domain-specific technical vocabulary, you are starting from a higher baseline than domain-specific verticals, and a small glossary investment will get you close to the 99% threshold (98.8% in our benchmark), leaving a light review pass to close the remainder. But "general HR training" is rarely the only content type in a real L&D library: most organisations also produce technical onboarding, product training, and compliance training that falls into harder verticals. The 96.9% baseline does not apply to your whole catalogue; it applies only to the fraction of your catalogue that is genuinely general-vocabulary content.

Whisper model size effect across verticals

A common question from L&D teams evaluating captioning options: does it matter whether the vendor uses Whisper tiny, base, small, medium, or large-v3? The short answer is that model size matters (going from tiny to large-v3 is meaningful), but once you are on a large model, domain adaptation matters far more than any further model upgrade. On the three domain-specific verticals in the table below, adding a glossary to large-v3 gains 10.4–13.3 accuracy points. That is roughly four to six times the 2.0–3.1 points gained by the last model step from medium to large-v3, and about 65–85% of the entire 15.1–17.4 point gain from tiny to large-v3.

Model	Engineering baseline	Healthcare baseline	Manufacturing baseline	General corp baseline
Whisper tiny	73.2%	71.8%	68.4%	89.1%
Whisper base	78.9%	77.3%	73.6%	91.4%
Whisper small	82.6%	81.2%	78.1%	93.8%
Whisper medium	86.1%	84.9%	82.7%	95.6%
Whisper large-v3	88.8%	86.9%	85.8%	96.9%
large-v3 + glossary	99.2%	99.4%	99.1%	98.8%

The model-size accuracy gain from tiny to large-v3 is 15.6 percentage points for engineering content. The glossary accuracy gain from large-v3 baseline to large-v3 with glossary is 10.4 percentage points for engineering content. Model size closes the gap somewhat; glossary closes most of the remaining gap. The combined result (large-v3 + glossary) reaches 99.2% — what neither model size alone nor glossary on a smaller model achieves. The practical implication: always run the largest model you can afford for domain-specific content, and always use glossary injection on top of it. Running Whisper tiny with a glossary would produce approximately 86–88% on engineering content (we did not benchmark this combination directly, but the model-size floor effects are visible in the table).
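The gains quoted above can be recomputed directly from the table. The sketch below is only arithmetic on the published figures; the dictionary layout and the function name `gains` are illustrative choices, not part of the benchmark tooling.

```python
# Accuracy figures (percent) copied from the model-size table above.
BASELINE = {
    "engineering":   {"tiny": 73.2, "base": 78.9, "small": 82.6, "medium": 86.1, "large-v3": 88.8},
    "healthcare":    {"tiny": 71.8, "base": 77.3, "small": 81.2, "medium": 84.9, "large-v3": 86.9},
    "manufacturing": {"tiny": 68.4, "base": 73.6, "small": 78.1, "medium": 82.7, "large-v3": 85.8},
    "general_corp":  {"tiny": 89.1, "base": 91.4, "small": 93.8, "medium": 95.6, "large-v3": 96.9},
}
WITH_GLOSSARY = {"engineering": 99.2, "healthcare": 99.4, "manufacturing": 99.1, "general_corp": 98.8}


def gains(vertical):
    """Return accuracy gains in percentage points for one vertical."""
    b = BASELINE[vertical]
    return {
        "tiny_to_large": round(b["large-v3"] - b["tiny"], 1),      # full model-size effect
        "medium_to_large": round(b["large-v3"] - b["medium"], 1),  # last model step only
        "glossary": round(WITH_GLOSSARY[vertical] - b["large-v3"], 1),
    }


if __name__ == "__main__":
    for vertical in BASELINE:
        print(vertical, gains(vertical))
```

Comparing `glossary` against `medium_to_large` for each vertical shows why the glossary is the larger lever once a large model is already in use.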

For L&D teams evaluating captioning vendors: a vendor that runs Whisper small with a "proprietary vocabulary enhancement" is very likely worse than a vendor running large-v3 with glossary injection, even if the small-model vendor's marketing language sounds more sophisticated. Ask which model tier the vendor runs. Ask whether glossary injection is supported or whether vocabulary enhancement is applied only as a post-processing text substitution (which is inferior to decoder-side injection). The difference between post-processing substitution and decoder-side injection is 3–5 accuracy percentage points on domain-specific content: the difference between falling just short of compliance and clearly meeting it.

Glossary size versus accuracy gain curves

A question that matters practically: how many terms do you need in your glossary to reach 99% accuracy, and do you hit diminishing returns before that threshold? The answer varies by vertical and by the specific content's proper-noun density, but the general curve shape is consistent across our benchmark data.

Engineering content glossary curve

For engineering content, the accuracy-gain curve is steep from 0 to 30 terms (the first 30 terms close approximately 56% of the WER gap, from 88.8% to 94.6%), moderate from 30 to 60 terms (closes another 28% of the gap, 94.6% to 97.5%), and then flatter from 60 to 94 terms (closes the remaining 16%, 97.5% to 99.2%). The practical interpretation: if you can only maintain a 30-term glossary, you can get from 88.8% to approximately 94.6%, a meaningful improvement but not compliant. If you maintain a 60-term glossary, you reach approximately 97.5%, much closer but still not compliant. Getting to 99%+ requires the full 90+ term list for engineering content with typical proper-noun density.

Healthcare content glossary curve

Healthcare content has a sharper curve because the errors are more concentrated in a smaller number of high-frequency high-WER terms. The first 20 drug INNs in the glossary close approximately 70% of the WER gap (86.9% to 96.0%), the next 15 procedural and code terms close another 20% (96.0% to 98.7%), and the final 13 terms in our 48-term glossary close the remaining gap to 99.4%. The implication: for healthcare content, even a minimal 20-term drug-name glossary provides substantial improvement — you don't need 90+ terms to get most of the benefit. The first-term payoff is higher, and the diminishing-returns point comes earlier than in engineering.
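To see how the "share of the gap closed" figures relate to the accuracy checkpoints, the sketch below recomputes them for the engineering and healthcare curves. The checkpoint lists are taken from the prose; the function name `gap_shares` is an illustrative assumption, and the prose percentages are rounded approximations of what it returns.

```python
def gap_shares(checkpoints):
    """Share of the total accuracy gap closed by each glossary increment.

    checkpoints: list of (term_count, accuracy_percent), starting at 0 terms
    and ending at the full glossary.
    Returns a list of (from_terms, to_terms, percent_of_gap_closed).
    """
    total_gap = checkpoints[-1][1] - checkpoints[0][1]
    shares = []
    for (n0, a0), (n1, a1) in zip(checkpoints, checkpoints[1:]):
        shares.append((n0, n1, round(100 * (a1 - a0) / total_gap, 1)))
    return shares


# Checkpoints quoted in the engineering and healthcare curve descriptions.
ENGINEERING = [(0, 88.8), (30, 94.6), (60, 97.5), (94, 99.2)]
HEALTHCARE = [(0, 86.9), (20, 96.0), (35, 98.7), (48, 99.4)]

if __name__ == "__main__":
    print("engineering:", gap_shares(ENGINEERING))
    print("healthcare: ", gap_shares(HEALTHCARE))
```

Running the same function on your own checkpoints (for example, after testing 10, 20, and 40 terms) is a quick way to see whether your content follows the steeper healthcare shape or the flatter engineering one.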

Manufacturing / EHS content glossary curve

Manufacturing/EHS has the worst-shaped curve of any vertical for practical deployment, because the error sources are diverse (IUPAC names, equipment codes, PPE acronyms, regulatory sections) and each category requires separate glossary coverage. The first 20 terms (highest-frequency chemical names and OSHA codes) close approximately 50% of the gap. The next 30 terms (extending to medium-frequency chemical names, PPE acronyms, equipment names) close another 30%. The final 22 terms in our 72-term glossary close the last 20%. For EHS content with a large chemical inventory, you may need 80–100 terms to reach 99%, and the glossary must be refreshed when new chemicals are added to the workplace inventory — a maintenance requirement that engineering glossaries don't face as acutely.

The 224-token constraint and prioritisation

Whisper only considers the final 224 tokens of the initial_prompt parameter; a longer prompt is silently truncated from the front. At an average of 1.8–2.2 tokens per glossary term (compressed format, comma-separated), this accommodates roughly 100–125 terms. For engineering content at 94 terms, this constraint is tight but manageable. For EHS content that might require 100+ terms, prioritisation is required: the glossary must focus on the terms with the highest WER contribution rather than being exhaustive. The methodology for prioritisation: run a baseline transcription, count errors per term against the ground-truth transcript, rank terms by (frequency × error rate), and build the glossary from the top of that ranked list. The initial_prompt constraint means you will reach diminishing returns on engineering and EHS content at a glossary size that leaves some lower-frequency terms uncovered, but covering the top 94–100 terms captures the majority of WER reduction in both cases.
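The prioritisation method described above can be sketched in a few lines. This is a minimal illustration, not our production pipeline: the function names, the sample terms, and the `rough_token_count` estimate (about four characters per token) are all assumptions made for the example. For real budgeting, replace the estimate with the tokenizer of the Whisper build you deploy.

```python
import math

TOKEN_BUDGET = 224  # prompt tokens Whisper actually uses, per the text above


def rank_terms(term_stats):
    """Order terms by frequency x error rate, highest first.

    term_stats maps term -> (frequency, error_rate): how often the term
    occurs in your content, and the fraction of occurrences the baseline
    transcript got wrong.
    """
    return sorted(term_stats, key=lambda t: term_stats[t][0] * term_stats[t][1], reverse=True)


def rough_token_count(text):
    """Placeholder estimate only (about 4 characters per token)."""
    return math.ceil(len(text) / 4)


def build_prompt(ranked_terms, count_tokens=rough_token_count, budget=TOKEN_BUDGET):
    """Greedily add terms in priority order while the prompt stays in budget."""
    chosen = []
    for term in ranked_terms:
        candidate = ", ".join(chosen + [term])
        if count_tokens(candidate) <= budget:
            chosen.append(term)  # terms that would overflow the budget are skipped
    return ", ".join(chosen)


if __name__ == "__main__":
    # Hypothetical per-term statistics from a baseline run.
    stats = {"kubectl": (40, 0.9), "EKS": (25, 0.6), "Istio": (10, 0.8), "cluster": (100, 0.01)}
    print(build_prompt(rank_terms(stats)))
```

The resulting string is what you would pass as `initial_prompt` when transcribing. Note that a frequent but rarely misrecognised word such as "cluster" ranks last, which is the point of multiplying frequency by error rate rather than ranking on frequency alone.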

What these benchmarks mean for WCAG 2.1 AA compliance

WCAG 2.1 AA SC 1.2.2 requires that prerecorded synchronized media have captions. The WCAG 2.1 understanding document, the DCMP Captioning Key, and accumulated OCR and DOJ enforcement guidance have established that "captions" under SC 1.2.2 means captions at 99%+ accuracy — auto-generated captions that have not been reviewed and corrected do not meet the standard. See our post on the 99% accuracy threshold for the full legal and technical basis for this number.

Mapping the benchmark results to compliance implications:

General corporate and HR training (96.9% baseline): Auto-captions without correction are still non-compliant under SC 1.2.2, but the correction burden is modest. A reviewer can typically identify and fix the 3.1% error content in 10–15 minutes per hour of video — the lowest correction load of any vertical. With a 22-term glossary, the correction load drops further to approximately 4–5 minutes per hour of review.
Sales enablement (91.6% baseline): The 8.4% WER requires meaningful correction work — approximately 30–45 minutes of review per hour of video without glossary. With a 41-term glossary at 98.9%, correction load drops to approximately 5–8 minutes per hour. Still requires review, but the glossary investment pays for itself in reviewer time within 2–3 hours of content.
Engineering and financial services (88.8–89.2% baseline): The 10.8–11.2% WER without glossary represents approximately 45–60 minutes of correction work per hour of video — approaching the 1:1 ratio where auto-captions provide minimal workflow benefit over typing from scratch. With glossaries (94 terms for engineering, 61 for financial services), reaching 99%+ means approximately 3–5 minutes of review per hour of video. The glossary investment breaks even within the first video hour captioned.
Healthcare and manufacturing/EHS (85.8–86.9% baseline): The 13–14% WER without glossary means auto-captions are close to useless as a workflow starting point — the correction time exceeds the time to re-caption from scratch using a human captioner. With glossaries (48 terms for healthcare, 72 for EHS), reaching 99%+ means approximately 3–4 minutes of review per hour for healthcare and 4–6 minutes for EHS. The glossary investment is non-optional for these verticals; without it, the ASR output is not a useful draft.

The compliance implication for a mixed-content L&D library: you cannot apply a single captioning workflow to all content types. A workflow calibrated for general corporate content (spot-check review, small glossary) will produce non-compliant output for healthcare and engineering content. A multi-vertical L&D library requires either content-type routing to different glossary-equipped workflows, or a universal high-glossary workflow that applies vertical-appropriate term lists to each content type automatically. The caption compliance program post covers how to build this routing into a sustainable captioning workflow that maintains compliance across a mixed-content library.
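Content-type routing can be as simple as a lookup table keyed by vertical. The sketch below uses the glossary sizes and with-glossary review estimates listed above; the `Workflow` class, the vertical keys, and the function names are hypothetical, and a real implementation would also carry the glossary itself and the review policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    glossary_terms: int        # benchmark glossary size for the vertical
    review_min_per_hour: tuple  # (low, high) review minutes per video hour, with glossary


# Figures taken from the per-vertical compliance mapping above.
WORKFLOWS = {
    "general_corporate":  Workflow(22, (4, 5)),
    "sales_enablement":   Workflow(41, (5, 8)),
    "engineering":        Workflow(94, (3, 5)),
    "financial_services": Workflow(61, (3, 5)),
    "healthcare":         Workflow(48, (3, 4)),
    "manufacturing_ehs":  Workflow(72, (4, 6)),
}


def route(vertical):
    """Return the workflow for a vertical; refuse to guess for unknown ones."""
    try:
        return WORKFLOWS[vertical]
    except KeyError:
        # Falling back to the general workflow is exactly the failure mode
        # described above, so unknown content types are rejected instead.
        raise ValueError(f"no captioning workflow defined for {vertical!r}")


def review_budget(library_hours):
    """Estimate total review minutes (low, high) for {vertical: video_hours}."""
    low = sum(hours * route(v).review_min_per_hour[0] for v, hours in library_hours.items())
    high = sum(hours * route(v).review_min_per_hour[1] for v, hours in library_hours.items())
    return low, high


if __name__ == "__main__":
    print(review_budget({"engineering": 10, "healthcare": 5}))
```

Raising an error for an untagged video, rather than defaulting to the general corporate workflow, forces someone to classify the content before it is captioned.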

Building the right test for your specific content

The benchmark data in this post is a baseline for planning, not a substitute for testing your specific content. The accuracy numbers will differ from your actual production accuracy for two reasons: (1) your specific content's proper-noun density may be higher or lower than our benchmark corpus, and (2) your recording conditions (microphone quality, speaker clarity, background noise, room echo) may differ from our test corpus.

How to run your own accuracy test

Step 1: Select three representative audio clips from your existing content library — one from the highest proper-noun-density content type you produce (typically technical onboarding or product training), one from medium-density content (compliance training), one from low-density content (general HR). Each clip should be 5–10 minutes of continuous speech, not a highlight reel.

Step 2: Transcribe each clip manually to ground truth. This is the hardest step to do well — the transcription must be accurate, using the correct spelling of every technical term. Domain expertise is required; a general transcriptionist will make errors on technical terms that contaminate the accuracy measurement. Budget 3–5 hours per hour of audio for a domain-expert ground-truth transcription.

Step 3: Run Whisper large-v3 on each clip with no initial_prompt. Calculate WER against your ground truth. This is your content-specific baseline — more predictive of your actual production accuracy than any published benchmark.

Step 4: Identify your top 20 terms by error rate × frequency, build a test glossary, re-run, recalculate WER. This is your content-specific accuracy with minimal glossary investment. Extrapolate to a full glossary at your target term count using the curve shapes described above for your primary vertical.

Step 5: Calculate correction time from your baseline and glossary-enhanced WER. The rule of thumb: 1% WER requires approximately 2–3 minutes of correction time per hour of video for a reviewer familiar with the content vocabulary. At 13% WER (healthcare baseline), that is 26–39 minutes per hour, which is significant. At 0.6% (healthcare with glossary), that is 1–2 minutes per hour. Treat the linear rule as a lower bound: the per-vertical review estimates earlier in this post run higher. The correction time delta is the business case for the glossary investment.
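Steps 3 to 5 reduce to two small calculations: word error rate against your ground truth, and the rule-of-thumb correction time. The sketch below uses a standard word-level edit distance with only the Python standard library. The normalisation (lowercasing, dropping punctuation but keeping hyphens and apostrophes) is an assumption; whatever normalisation you choose, apply it identically to every run you compare.

```python
import re


def normalize(text):
    """Lowercase and split into words, keeping hyphens and apostrophes."""
    return re.findall(r"[\w'-]+", text.lower())


def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))  # edit distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def correction_minutes_per_hour(wer_fraction, low=2.0, high=3.0):
    """Rule of thumb from Step 5: 2-3 minutes per hour of video per 1% WER."""
    pct = wer_fraction * 100
    return pct * low, pct * high


if __name__ == "__main__":
    truth = "Store 2-butoxyethanol away from heat."
    draft = "Store to beauty ethanol away from heat."
    print(wer(truth, draft))
    print(correction_minutes_per_hour(0.13))
```

Because one mistranscribed chemical name becomes several wrong words in the draft, a single term can cost more than one error in the WER count, which is why term-level glossary fixes move the number so much.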

When your baseline is much worse than the benchmark

If your content-specific baseline is worse than our benchmark data for your vertical — for example, 80% baseline for engineering content versus our 88.8% benchmark — there are three likely causes: (1) audio quality issues (background noise, low-quality microphone, room echo that exceeds our test-corpus conditions), (2) speaker characteristics (heavily accented speech, very fast pace, extensive use of in-house jargon that is not in any public training corpus), or (3) content that is more domain-specific than our benchmark corpus (a deep technical vertical within engineering, such as semiconductor fabrication process training, with extremely dense domain vocabulary). Causes (1) and (2) are addressed by improving recording conditions and glossary coverage; cause (3) may require a larger glossary or, in extreme cases, fine-tuning consideration. See our prompting vs fine-tuning post for the decision framework.

Frequently asked questions

Our vendor says they use Whisper with "proprietary enhancement." How do I know which model tier they are running?

Ask directly and ask for specifics: "Which Whisper model tier do you deploy: tiny, base, small, medium, or large-v3? Is your vocabulary enhancement applied at the decoder level via the initial_prompt parameter, or as post-processing text substitution after transcription?" A vendor running small with post-processing substitution is measurably worse on domain-specific content than large-v3 with decoder-side injection; the difference is 5–8 accuracy points on engineering and healthcare content in our benchmarks. Post-processing substitution can only patch misrecognitions it already has a rule for, and cannot recover a term the model never produced; decoder-side injection biases the model toward producing the correct term in the first place. If a vendor cannot or will not answer these questions, treat that as a signal about their technical depth.
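A minimal sketch of what post-processing substitution amounts to makes the limitation concrete. The rule table below is hypothetical, built from two misrecognitions mentioned earlier in this post; a vendor's actual implementation may be more elaborate, but it works on the same principle of rewriting text after the model has finished.

```python
import re

# Hypothetical rule table: known wrong output -> intended term.
SUBSTITUTIONS = {
    "to beauty ethanol": "2-butoxyethanol",
    "teaspoon traditional": "TSP traditional",
}


def post_process(text, rules=SUBSTITUTIONS):
    """Replace known misrecognitions; anything without a rule passes through."""
    for wrong, right in rules.items():
        # A function replacement avoids backslash handling in the target term.
        text = re.sub(re.escape(wrong), lambda m, r=right: r, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    print(post_process("Avoid skin contact with to beauty ethanol."))
    print(post_process("Avoid skin contact with two butoxy ethanol."))
```

The first sentence is repaired because a rule exists for that exact error. The second, a different misrecognition of the same chemical, is left untouched. Every new variant needs a new rule, whereas a glossary term supplied via initial_prompt influences the transcription itself.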

Can I use a general ASR system instead of Whisper and avoid these problems?

The alternative general ASR systems — Google Cloud Speech-to-Text, AWS Transcribe, Microsoft Azure Speech, Rev.ai ASR, AssemblyAI, Deepgram — all face the same fundamental problem: they were trained on general-vocabulary audio, and domain-specific training content is out-of-distribution. The accuracy by vertical patterns would look similar with any large general-purpose ASR model. The relative ranking of verticals (healthcare and EHS hardest, general corporate easiest) reflects the structure of the problem, not Whisper's specific limitations. Domain adaptation — glossary injection, custom vocabulary models, or fine-tuning — is required regardless of the base ASR system. Whisper is highlighted in this post because it is the most widely deployed open-weight ASR model in the L&D captioning tooling ecosystem; the benchmark methodology and results apply to any comparable model.

What is the accuracy impact of non-native English speakers in training content?

A non-native English accent adds approximately 2–4 WER percentage points to the Whisper large-v3 baseline for most accents in our experience, with the penalty concentrated in phoneme sequences where L1 interference is pronounced. The accent penalty is generally smaller than the domain-vocabulary penalty: a native-English-speaker pharmacology trainer produces 86.9% baseline accuracy on clinical content; a non-native-English-speaker pharmacology trainer at the same content density produces approximately 83–85% baseline, not 75%. The glossary improvement curve is similar in shape for both speaker groups, because glossary injection addresses out-of-vocabulary terms rather than accent-driven recognition errors. The accent-related residual after glossary injection is approximately 0.5–1.0 additional WER percentage points, well within the range that a light review pass can address.

Our content has multiple presenters with different accents across sessions. Does that affect the glossary strategy?

Multi-speaker content affects the glossary only in cases where the same technical term is pronounced materially differently by different speakers, which happens occasionally with initialisms (some speakers say "SOFR" as "sofer," others say "S-O-F-R") and with borrowed proper nouns (some speakers anglicise European company names, others use the native pronunciation). The practical approach: include both forms in your glossary when you are aware of speaker variation, and flag low-confidence segments in the post-processing step for human review. The glossary injection mechanism works by biasing the decoder toward the listed terms; if the speaker's pronunciation is sufficiently divergent from what the model expects for the listed term, the bias may not be strong enough to override the model's default transcription. In our benchmark, multi-speaker variation added approximately 0.3–0.5 WER percentage points over single-speaker conditions for the same content, which is manageable with a light review pass.
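Flagging low-confidence segments for review can be sketched as a filter over the transcription result. The example assumes the result dictionary produced by the open-source openai-whisper package, where each segment carries `start`, `end`, `text`, and `avg_logprob`. The threshold of -0.8 is an illustrative value, not a benchmarked one, and should be calibrated against your own ground-truth clips.

```python
def flag_segments(result, min_avg_logprob=-0.8):
    """Return (start, end, text) for segments whose average log-probability
    falls below the threshold, as candidates for human review."""
    flagged = []
    for seg in result["segments"]:
        if seg["avg_logprob"] < min_avg_logprob:
            flagged.append((seg["start"], seg["end"], seg["text"].strip()))
    return flagged


if __name__ == "__main__":
    # Synthetic result shaped like a transcription output, for illustration only.
    sample = {
        "segments": [
            {"start": 0.0, "end": 4.2, "text": " Rates now reference SOFR.", "avg_logprob": -0.21},
            {"start": 4.2, "end": 9.0, "text": " The sofer fallback applies.", "avg_logprob": -1.05},
        ]
    }
    for start, end, text in flag_segments(sample):
        print(f"{start:.1f}-{end:.1f}s: {text}")
```

A reviewer then checks only the flagged time ranges first, which concentrates the light review pass on the places where speaker variation is most likely to have defeated the glossary.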

How often should we update our glossary?

The update cadence depends on how quickly your domain vocabulary changes. Engineering content glossaries need updates whenever you add a new cloud service, deploy a new tool, or change a product name — in a fast-moving SaaS company, this could mean monthly glossary maintenance. Healthcare content glossaries need updates when new drugs are introduced to the formulary or when treatment protocol names change — quarterly in a stable healthcare organisation, more frequently in an oncology centre tracking new therapies. Financial services glossaries need updates when regulatory frameworks change or new products are launched — typically quarterly to semi-annually. EHS glossaries need updates when the chemical inventory changes or when new regulatory codes take effect. The practical approach: maintain the glossary as a living document owned by your L&D team, reviewed as part of the quarterly content-library audit. The cost of a stale glossary is concentrated in whatever new vocabulary your content has introduced since the last update — typically a small incremental WER increase, not a sudden cliff. But if the new content includes a new product line or a major regulatory change, the stale-glossary penalty can be substantial.

What about transcription of non-English training content?

Our benchmarks cover English-language content only. Whisper large-v3 supports 99 languages and has been benchmarked on Common Voice and other multilingual corpora, but we have not run the vertical-by-vertical domain-specificity test for non-English languages. The general principle applies (domain-specific vocabulary will be out-of-distribution for any language), but the severity of the accuracy penalty and the effectiveness of glossary injection will differ by language. Languages with productive compounding or agglutinative morphology (German, Finnish, Turkish) have higher baseline WER on technical vocabulary regardless of domain because compound words and inflected forms of technical terms are combinatorially diverse. Whisper's multilingual performance also degrades more quickly from large-v3 to smaller models for lower-resource languages; for non-English L&D content, running large-v3 is even more important than for English.

Is there a list of the specific test audio clips you used?

We cannot publish the specific audio clips or ground-truth transcripts because they contain proprietary content from real L&D programmes. The clips were cleared for internal research use but not for public release. We are happy to run accuracy tests on audio samples you provide and report your content-specific baseline and glossary-enhanced accuracy — this is part of the GlossCap trial workflow. The methodology above gives you everything you need to run equivalent tests independently on your own content, using your own ground-truth transcriptions.

Test your content against the benchmarks

GlossCap builds glossary-biased captioning pipelines for L&D teams across all the verticals covered in this post — engineering, healthcare, financial services, manufacturing/EHS, legal, sales enablement, government, and academic. You bring the content and the glossary seed list; GlossCap builds the optimal glossary prompt, runs Whisper large-v3 with decoder-side injection, and delivers WCAG-grade captions at 99%+ accuracy on your domain-specific training video. Upload a five-minute sample of your hardest content and see your content-specific accuracy number — not a generic benchmark, your actual content against your actual vocabulary.
