Technical Architecture · Published 2026-06-03

How to build a customer glossary for AI captions: architecture, ingestion, term sourcing, and the compound accuracy effect

Every L&D team that processes more than a dozen training videos discovers the same thing sooner or later: the accuracy problem is not really a model problem. Whisper large-v3 achieves 96–98% word accuracy on LibriSpeech clean-speech benchmarks. The same model transcribing your engineering onboarding video about Kubernetes pod scheduling on EKS produces 88.8% accuracy before glossary injection. The difference is not a failure of the model — it is a vocabulary gap. The model has never seen "kubectl apply -f," "kube-apiserver," "eksctl nodegroup," or "HorizontalPodAutoscaler" with anywhere near the frequency it has seen "the meeting is scheduled." The fix is not a better model. The fix is a customer glossary — a structured, maintained vocabulary artifact that tells the transcription system exactly what words exist in your company's world and how they sound. This post is the definitive technical guide to building that artifact: where to source terms, how to structure them, how the injection architecture works under the hood, how to size your glossary for different verticals, and — most importantly — how each hour of captioned content compounds your glossary's accuracy over time into a durable switching cost that generalist transcription vendors cannot replicate.

TL;DR — what a customer glossary actually does

A customer glossary is not a word list. It is a structured vocabulary artifact with four components: the canonical term, its phonetic variants, context signals, and a priority weight. Each component performs a different function during decoder-side injection:

When all four components are populated correctly, a 94-term engineering glossary closes 93.8% of proper-noun errors and moves accuracy from 88.8% to 99.2%, above the 99% accuracy bar commonly used as the working benchmark for WCAG 2.1 SC 1.2.2 caption compliance (the success criterion itself is Level A and does not state a numeric threshold). A flat word list of the same 94 terms, without phonetic variants, context signals, or weights, closes roughly 60% of those errors. The architecture matters as much as the term selection.

Why most teams build the glossary wrong

When L&D teams are told they need a custom vocabulary, the instinct is to export the company's internal style guide or pull a list of product names from the marketing website. The result is a word list — a flat .txt or .csv file with one term per row. This is the wrong artifact, and it fails in predictable ways.

Problem 1: the model never sees your canonical form as a unit

A flat word list of "HorizontalPodAutoscaler" does not tell a Whisper-based system how that term sounds when a software engineer says it in a real training video. Engineers do not say "HorizontalPodAutoscaler" as a single monolithic token — they say "horizontal pod autoscaler," or "the HPA," or "the horizontal autoscaler." A word list provides the destination without the map. The decoder finds no phoneme-level bridge between the spoken audio and the canonical spelling, so the substitution error persists: "horizontal pod auto scaler" is written as four words, or "HPA" is written as "EPA," or the term is dropped entirely as low-confidence noise.

Problem 2: no disambiguation for homophone collisions

Your company uses "CAST" as an acronym for Customer Adoption Success Team. The word "cast" also exists in general English with a completely different meaning. Without context signals — the co-occurring words that indicate you are in a customer-success conversation, not a theatre or fishing conversation — the decoder cannot reliably choose your canonical form over the general-vocabulary interpretation. When a speaker says "I joined the CAST team this quarter," a context-unaware system transcribes it as "I joined the cast team this quarter" and the error is invisible — it passes a spell check and a casual human review, but it is semantically wrong in your internal documentation and will confuse employees searching the caption archive for "CAST."

Problem 3: uniform priority causes false positives

If every term in the glossary is weighted equally, the decoder will over-apply the glossary to phonetically similar general-vocabulary words. A medical glossary that contains "Apixaban" (a direct oral anticoagulant) will, with uniform weighting, occasionally write "Apixaban" where the speaker actually said "application," in a context that has nothing to do with anticoagulants, because the two words share their opening sounds. Priority weighting (lower for terms that sound like high-frequency words, higher for terms with no plausible general-vocabulary competitor) prevents this class of false positive. The architecture must distinguish between terms that need aggressive promotion (rare proper nouns with no homophones: "pembrolizumab," "eksctl") and terms that need gentle nudging (semi-familiar initialisms that exist in general vocabulary with different meanings: "PII," "API," "SLA").

Problem 4: no feedback loop

A word list shipped once and never updated degrades over time. Products are renamed. New technologies enter your vocabulary. Employees leave and new ones bring different terminology habits. The glossary needs a feedback mechanism — a channel through which observed caption errors in production content flow back into term updates. Without this feedback loop, the glossary's accuracy advantage erodes session by session as new terminology enters your training content without corresponding glossary updates. This is the core engineering distinction between a glossary artifact that compounds over time and one that decays.

The anatomy of a well-structured glossary entry

A complete glossary entry has five fields: the four core components plus an optional exclusion list. Each field serves a distinct function in the decoder injection pipeline.

Field 1: canonical_term

The correct written form exactly as it should appear in the caption output. This includes capitalisation, hyphenation, spacing, and punctuation. Examples:

The canonical form is what gets written into the output caption file. Every other field feeds the decision to substitute toward this form.

Field 2: spoken_variants

An array of phonetically plausible spoken forms, ranked by decreasing probability of occurrence in your audio. Variants should be written as plain English approximations of how a native speaker would say the term, not as IPA transcriptions (most injection systems accept plain-English phoneme hints, not formal phonological notation).

The spoken variants field is where most glossary-building effort is concentrated. A term with zero spoken variants has minimal injection effect. A term with three accurately ranked variants closes the majority of its error surface. For terms you are uncertain about, listen to five minutes of your actual training audio and write down exactly what you hear, not what you expect to hear. Speakers abbreviate, elide, and mispronounce in ways that vary significantly across trainers, regions, and recording conditions.

Field 3: context_signals

A list of co-occurring words that, when detected in the surrounding transcript window, increase the confidence weight assigned to this glossary entry. Context signals solve the homophone disambiguation problem and reduce false-positive over-application of ambiguous terms.

Context signals are optional for terms with no plausible homophones — "HorizontalPodAutoscaler" has no general-vocabulary competitor, so context signals add no disambiguation value. They become load-bearing for terms that sit at the intersection of your domain vocabulary and general English: acronyms, short initialisms, brand names that share phonemes with common words.

Field 4: priority_weight

A numeric value (typically 0.0–1.0) indicating how aggressively to bias the decoder toward this canonical form when a phoneme match is detected. Priority weights encode the cost asymmetry between false positives and false negatives for each term.

High-weight terms (0.85–1.0): terms where a miss is costly and a false positive is rare. Pharmaceutical INNs, regulatory citation codes, IUPAC chemical names, proprietary product names with no plausible general-vocabulary competitors. Missing these terms causes compliance documentation failure (a garbled drug INN in a pharmacology training caption creates a documented-training gap under OSHA and Joint Commission standards) or customer-facing errors (a misnamed product in a sales-enablement caption damages brand accuracy). False positives on these terms are rare because their phoneme sequences are too unusual to match common English accidentally.

Medium-weight terms (0.5–0.84): terms where misses are frustrating but not compliance-critical, and false positives are possible but unlikely with context signals enabled. Most software product names, tool names, internal project names, and team names fall here.

Low-weight terms (0.1–0.49): terms that are phonetically similar to common English words or that appear in contexts where general-vocabulary interpretation is sometimes correct. Use sparingly — a low-weight entry that fires incorrectly on general-vocabulary audio is worse than no entry at all.

Field 5: exclusion_contexts

A list of context signals that, when detected, suppress the glossary entry even if the phoneme pattern matches. Exclusion contexts are less commonly needed than context signals, but they are essential for multi-vertical training libraries where the same phoneme sequence means different things in different content categories.

Example: a technology company that also runs safety training for its facilities team has both "CAST" (Customer Adoption Success Team, tech context) and "cast" (applied to a broken arm, facilities first-aid training context). The CAST entry's exclusion context includes ["fracture", "splint", "first aid", "bone", "injury"]. When a facilities-team first-aid video is processed, the presence of "fracture" in the surrounding window suppresses the CAST acronym substitution and allows the general-vocabulary "cast" to pass through.

Most glossary builds do not need exclusion contexts until the library grows beyond 200 terms or the training content spans more than two distinct verticals. They are a refinement tool, not a first-pass requirement.
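As a rough illustration of the five fields, the sketch below models an entry as a Python dataclass. The field names follow this post; the class name, the validation rules, the `tier` helper, and the CAST context signals and weight are assumptions chosen for illustration, not GlossCap's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GlossaryEntry:
    """One structured glossary entry (hypothetical schema for illustration)."""
    canonical_term: str                                          # exact written form
    spoken_variants: List[str] = field(default_factory=list)     # most likely first
    context_signals: List[str] = field(default_factory=list)     # raise confidence
    priority_weight: float = 0.5                                 # 0.0-1.0 bias strength
    exclusion_contexts: List[str] = field(default_factory=list)  # suppress the entry

    def __post_init__(self) -> None:
        if not self.canonical_term.strip():
            raise ValueError("canonical_term must not be empty")
        if not 0.0 <= self.priority_weight <= 1.0:
            raise ValueError("priority_weight must be between 0.0 and 1.0")

    @property
    def tier(self) -> str:
        """Map the weight onto the high / medium / low bands described above."""
        if self.priority_weight >= 0.85:
            return "high"
        if self.priority_weight >= 0.5:
            return "medium"
        return "low"


cast = GlossaryEntry(
    canonical_term="CAST",
    spoken_variants=["cast", "the cast team"],
    context_signals=["customer", "adoption", "onboarding"],  # assumed values
    priority_weight=0.4,                                      # assumed value
    exclusion_contexts=["fracture", "splint", "first aid", "bone", "injury"],
)
```

Keeping validation in `__post_init__` means a malformed weight is rejected when the entry is created rather than when it misfires during transcription.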

Where to source your glossary terms

Glossary term sourcing is the most underestimated phase of the build. Teams that skip structured sourcing in favour of "just list the product names" consistently produce glossaries with 40–60% of the error surface uncovered. A thorough sourcing sweep covers eight distinct source types, each capturing a different class of term.

Source 1: product documentation and technical wikis

Your product documentation and internal wikis (Notion, Confluence, Google Sites) are the highest-density source for software product names, API surface names, feature names, and proprietary acronyms. A sweep of your Notion or Confluence space looking for capitalised compound nouns, camelCase identifiers, and defined acronyms will surface 40–70% of your highest-priority terms in a single pass.

Practical method: export the most-read pages from your Confluence space or Notion database, run a regex for fused capitalised compounds (\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b, which matches identifiers such as HorizontalPodAutoscaler) and defined acronyms (\b[A-Z]{2,6}\b), deduplicate, and manually review the list for terms that would be ambiguous or incorrect in auto-caption output.
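The sweep can be sketched in a few lines of Python. This is a minimal example, assuming the exported pages are already available as plain-text strings. It uses the two patterns with word boundaries on both so the acronym pattern does not fire inside longer tokens. The function name is hypothetical, and counting pages per term is a simple stand-in for frequency weighting.

```python
import re
from collections import Counter
from typing import List

# Fused capitalised compounds, e.g. HorizontalPodAutoscaler, ConfigMap
COMPOUND_RE = re.compile(r"\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b")
# All-caps acronyms of 2-6 letters, e.g. HPA, RBAC
ACRONYM_RE = re.compile(r"\b[A-Z]{2,6}\b")


def extract_candidates(pages: List[str]) -> Counter:
    """Count how many pages each candidate term appears in."""
    page_counts: Counter = Counter()
    for text in pages:
        # A set per page deduplicates repeats within the same page.
        found = set(COMPOUND_RE.findall(text)) | set(ACRONYM_RE.findall(text))
        page_counts.update(found)
    return page_counts


pages = [
    "The HorizontalPodAutoscaler (HPA) scales pods.",
    "Use the HPA together with a ConfigMap.",
]
candidates = extract_candidates(pages)
```

Note that lowercase tool names such as kubectl or eksctl are invisible to both patterns, so the manual review step still matters.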

GlossCap's Notion and Confluence glossary sync automates this sweep on the Team and Org plans — it connects to your workspace, scans the pages you designate as "source of truth" documents, and extracts candidate terms using the pattern above plus frequency weighting (terms appearing in more than five pages get higher initial priority weights). The resulting candidate list is reviewed in the GlossCap UI before being committed to the live glossary.

Source 2: existing training scripts and slide decks

If your training content is produced from scripts (common in enterprise L&D where the legal and compliance review requires a written record), those scripts are a gold mine. Every proper noun in the script is a potential glossary candidate. Every time a speaker reads a term aloud from a slide, that term's spoken form will be the slide text, not an approximation — meaning the spoken variant is trivially easy to derive from the canonical form.

Scripts and decks also reveal the vocabulary density of your content library before you have processed a single video. A medical device company producing scripts with 40 INN drug names per module needs a deeper healthcare glossary than one whose scripts contain five. Analyzing the term density per content category during sourcing tells you where to invest the most glossary-building effort and where a lighter-weight approach will be sufficient.

Source 3: past caption correction files

If you have been correcting auto-caption output manually — even a handful of videos corrected by an instructional designer — those correction files are the most operationally valuable source available to you. Every correction is a data point: the auto-caption produced word X, the correct word was Y. The delta between X and Y is a glossary entry waiting to be written. Y is your canonical term. X is your spoken variant. The surrounding words in the corrected segment are your context signals.

This is the feedback loop described above — and it is the reason that building a glossary before any manual correction work, while valuable, is less precise than building it after even a few hours of real correction data. The correction file is the closest thing to a supervised dataset for your specific audio characteristics, your specific speakers, and your specific vocabulary. No amount of documentation-based term sourcing replicates it.

If you have been using YouTube auto-captions for your training library and have been downloading and correcting those VTT files, those corrections are exactly what you need. If you have been using a captioning vendor with a human-review tier, ask whether they can export the pre-review and post-review file pairs — the before-and-after diff is a structured glossary extraction.
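One way to mine a before-and-after pair is a word-level diff. The sketch below uses Python's standard `difflib.SequenceMatcher` on plain transcript text, so it assumes timestamps and cue markup have already been stripped from the VTT files. The function name and the three-word context window are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def correction_candidates(
    auto_text: str, corrected_text: str, window: int = 3
) -> List[Tuple[str, str, List[str]]]:
    """Return (heard, canonical, context) for each span a reviewer replaced."""
    auto = auto_text.split()
    fixed = corrected_text.split()
    matcher = SequenceMatcher(a=auto, b=fixed, autojunk=False)
    results = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "replace":
            continue  # pure insertions and deletions are left for manual review
        heard = " ".join(auto[a0:a1])        # X: what the auto-caption produced
        canonical = " ".join(fixed[b0:b1])   # Y: what the reviewer wrote
        # Words on either side of the fix become candidate context signals.
        context = fixed[max(0, b0 - window):b0] + fixed[b1:b1 + window]
        results.append((heard, canonical, context))
    return results


pairs = correction_candidates(
    "we scale with the horizontal pod auto scaler on eks",
    "we scale with the HorizontalPodAutoscaler on EKS",
)
```

Each tuple maps directly onto a draft entry: the second element is the canonical term, the first is a spoken variant, and the third is a starting list of context signals to prune by hand.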

Source 4: LMS content metadata

Most LMS platforms store content metadata — course titles, learning objectives, tags, and category assignments. This metadata is written by instructional designers in the same vocabulary as the training content itself, and it is often available via API or export without touching the audio at all. A course titled "Advanced TalentLMS API Configuration for Multi-Tenant Deployments" tells you before you run a single transcription that "TalentLMS," "multi-tenant," and "API configuration" are terms likely to appear in the audio.

LMS metadata is particularly valuable for identifying categories of terms before individual terms. If your LMS has 40 courses tagged "Compliance" and 60 tagged "Product Training," you know the glossary needs two distinct sections with different priority weight profiles. The compliance section's terms need higher weights (regulatory vocabulary failures are documentable compliance gaps). The product training section needs medium weights with broader context signals (product name substitutions are embarrassing but not compliance-critical).

Source 5: product changelog and release notes

New product features, renamed APIs, and newly introduced acronyms appear in your training content before they appear anywhere else in your organization's written record. A product that shipped a feature called "Adaptive Sync Engine" last quarter will have training content using that term before your documentation team has finished the knowledge base article. Release notes and changelogs are a leading indicator of incoming vocabulary — they tell you what terms will enter your training content in the next one to three sprints.

A quarterly glossary review process (described in the maintenance cadence section below) should include a scan of release notes since the last review as one of its first steps. Terms introduced in the last three months of releases are the most likely source of new glossary gaps.

Source 6: subject matter expert interviews

For verticals where the vocabulary is highly specialized and not well-represented in written documents — clinical procedures, EHS field operations, legal proceedings — a 30-minute structured interview with a subject matter expert is one of the most efficient term-sourcing methods. The SME will use terms naturally that do not appear in any written record because they are oral-tradition vocabulary: the shorthand a surgical team uses for an instrument, the nickname a field technician gives to a piece of equipment, the way a compliance officer pronounces a regulatory citation code.

Structure the interview as a narration exercise: ask the SME to walk you through the procedure or topic as if they were training a new employee. Record the session. After transcription (which will contain errors — that is the point), extract the terms that were transcribed incorrectly or unusually. Those terms, plus the correct forms from the SME, are your glossary entries. This approach is particularly valuable for medical training content, where the gap between written pharmaceutical nomenclature and spoken clinical shorthand is wide.

Source 7: employee directory and organizational vocabulary

Personal names, team names, and organizational unit names are a consistently underestimated source of caption errors. When a senior engineer presents to a new hire cohort and mentions "talk to Priyanka Rajagopalan about onboarding access" or "the CISO's name is Jakub Wierzbicki," those names will be mangled in auto-caption output with near-certainty. In healthcare and university settings, faculty, physicians, and clinical departments are frequent speakers and reference subjects in training content, so their names recur constantly.

An employee directory sweep (HR exports a CSV, you extract names with non-anglophone phoneme patterns) is a 20-minute task that closes a class of errors that no domain-specific glossary work would otherwise address. Priority-weight these entries at medium — false positives are low-cost (a substituted name is obviously wrong and gets caught in review) but false negatives are visible and undermine trust in the caption quality.

Source 8: competitor and partner product names

Sales enablement and customer success training content frequently mentions competitor products and partner integrations by name. "We beat Gong on the multi-speaker identification use case" or "Our Salesforce integration is bidirectional unlike Outreach" are the kinds of sentences that appear in sales training videos. Competitor product names are among the most reliably mangled by general-vocabulary ASR because they are proper nouns that often have phoneme sequences the model has seen in other contexts: "Gong" → "gone," "Outreach" → "out reach," "Salesloft" → "sales loft."

For sales-enablement content specifically, building a two-tier competitor vocabulary — the canonical competitive name plus the standard comparison framing ("we vs. them," "our feature vs. their feature") — closes a category of errors that is directly visible to sales representatives reviewing their own training content and therefore has an outsized impact on perceived quality. A salesperson who sees their product named correctly and the competitor's product named incorrectly will trust the caption quality; the reverse damages trust immediately.

Glossary taxonomy by vertical: what the structure looks like in practice

The eight sourcing methods above produce different mixes of term types depending on the content vertical. Understanding the taxonomy specific to your vertical tells you which sourcing methods to prioritize, which fields to populate most carefully, and what priority weight profiles are appropriate.

Software engineering and SaaS

The engineering vertical has the widest term vocabulary of any L&D category — a mid-size SaaS engineering team produces training content touching infrastructure, security, observability, CI/CD, data engineering, and product-specific APIs simultaneously. The term density is high (90+ glossary entries for a comprehensive engineering corpus), the canonical forms are case-sensitive (camelCase identifiers must be preserved exactly), and the spoken variants are highly abbreviated (engineers truncate product names aggressively in speech).

Priority weight profile: infrastructure and orchestration terms (Kubernetes, Helm, Terraform resource names) at 0.85–0.95 (spoken abbreviations are widely used; a miss means the caption reads as noise to a junior engineer). Security vocabulary (OAuth, SAML, JWT, RBAC, OIDC) at 0.7–0.85 (spoken as a mix of letter-by-letter initialisms and pronounced words, and phonetically ambiguous either way). Internal product names and feature names at 0.6–0.8 (medium priority, context signals are reliable).

The engineering glossary is also the one that requires the most careful spoken-variant construction. "kubectl" is pronounced "kube control" or "kube cuttle" depending on the speaker and era; both variants need to be in the glossary to achieve consistent coverage across a multi-speaker training library. "gRPC" is pronounced "gee are pee see" by some teams and "grpc" (as a word) by others. "OAuth" is "oh auth" or "o-auth" depending on the speaker's background. Spoken variant coverage is the primary quality lever for engineering glossaries.

For a detailed worked example of glossary injection on engineering content, see our engineering glossary case study, which walks through a 94-term build for a Kubernetes-focused training library from sourcing to injection to accuracy measurement.

Healthcare and clinical

Healthcare glossaries are driven by pharmaceutical INNs, clinical procedure names, anatomical vocabulary, and regulatory citation codes. The term count for a comprehensive healthcare glossary is lower than engineering (48 terms for the pharmacology refresher corpus that achieved 99.4% accuracy) because healthcare vocabulary is more standardised — the INN naming conventions mean drug names follow patterns that can be generalised — but the individual terms are phonetically harder than almost any other vertical.

Priority weight profile: pharmaceutical INNs at 0.9–1.0 (no plausible general-vocabulary competitor; a missed drug name is a documented-training compliance gap under HIPAA and Joint Commission standards). Clinical procedure terms at 0.8–0.9 (high specificity, low false-positive risk). Anatomical terms at 0.6–0.8 (most anatomical terms are common enough to pass through without glossary; only the genuinely technical ones — "metatarsophalangeal joint," "chordae tendineae," "tunica albuginea" — need high-weight entries).

Healthcare glossaries need particular attention to the brand-name / INN relationship. The same molecule appears under two names in training content: "pembrolizumab" (INN) and "Keytruda" (brand). Treating these as two separate entries with cross-references in their context signals (the brand name context signals include the INN and vice versa) prevents the double-entry problem while ensuring both forms are captured correctly. For content produced before 2020 that uses older brand-name conventions, a third variant may be needed for the trade name the company was using at the time of the training.

The medical training captions post provides the full 48-term healthcare glossary template and the spoken-variant construction method for pharmaceutical INNs.

Sales enablement and revenue operations

Sales glossaries are dominated by three term categories: CRM vocabulary, competitive product names, and proprietary sales methodology terms. The vocabulary density is moderate (30–60 terms for a comprehensive sales enablement corpus) but the stakes are high — sales representatives play training videos repeatedly and notice every name error immediately.

CRM vocabulary is partially addressed by the platform-specific LMS pages (see Salesforce Trailhead captions, WorkRamp captions, Allego captions), but the most important CRM terms are internal to the company: the names of the specific Salesforce fields, Opportunity stages, and pipeline labels that your team has customised. These need to be sourced from the CRM admin and treated as medium-to-high priority based on how frequently they appear in training content.

Sales methodology terms deserve special attention. "MEDDIC" is transcribed as "medic." "BANT" as "band" or "bant" (correct but ambiguous). "SPICED" as "spiced" (the common English word). "Challenger Sale" is usually fine but "the Challenger model" can lose its capitalisation. These are medium-weight entries where context signals are particularly important — the presence of "qualification," "discovery," or "pipeline" in the surrounding window makes the sales-methodology interpretation unambiguous.

Environmental Health and Safety

EHS glossaries are characterised by IUPAC chemical names, OSHA regulatory citation codes, GHS hazard classification vocabulary, and industrial equipment nomenclature. As documented in our HazCom captions post, a 52-term EHS glossary achieved 99.1% accuracy on HazCom training content that sat at 87.7% baseline — the largest absolute accuracy gain of any vertical we have measured.

IUPAC systematic names require the most careful canonical-form construction. "2,4-dichlorophenoxyacetic acid" must be written exactly — with the comma, the hyphen, the correct positional numerals. The spoken variant is "two four dichloro phenoxy acetic acid" (with numerals spoken as cardinal numbers and the hyphens as pauses). Getting the canonical form wrong — writing "2-4 dichlorophenoxyacetic acid" instead of "2,4-dichlorophenoxyacetic acid" — is itself a compliance error in documentation that cites IUPAC nomenclature.

EHS glossaries benefit from the regulatory citation codes being treated as a separate term category with distinct handling. OSHA citations like "29 CFR § 1910.147" are spoken as "twenty-nine CFR section nineteen ten point one four seven," a spoken form roughly ten tokens long for a written form that contains four distinct notational elements (the title number, the CFR code abbreviation, the section symbol, and the dotted part-and-section number). The spoken variant must map to the canonical form through a post-processing normalisation step, not pure decoder injection. This is a case where the glossary entry works in combination with a text normalisation layer downstream of the decoder.
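A normalisation layer of this kind can be sketched with a regular expression and a small number-word table. The example below is a minimal illustration that handles only the "title CFR section part point section" reading pattern quoted above. The function names are hypothetical, and a production normaliser would need to cover more spoken patterns (subparagraph letters, "part" without "section", digits already emitted by the decoder).

```python
import re
from typing import List

UNITS = {"zero": 0, "oh": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
         "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

# Longest words first so "nineteen" is tried before "nine".
_WORD = "|".join(sorted([*UNITS, *TEENS, *TENS], key=len, reverse=True))
_RUN = rf"(?:{_WORD})(?:[ -](?:{_WORD}))*"
CITATION_RE = re.compile(
    rf"\b(?P<title>{_RUN}) cfr section (?P<part>{_RUN}) point (?P<section>{_RUN})\b",
    re.IGNORECASE,
)


def words_to_digits(words: List[str]) -> str:
    """Concatenate number words the way citations are read aloud."""
    out, i = "", 0
    while i < len(words):
        word = words[i]
        if word in TENS:
            value = TENS[word]
            # "twenty nine" reads as 29, not 20 followed by 9.
            if i + 1 < len(words) and UNITS.get(words[i + 1], 0) > 0:
                value += UNITS[words[i + 1]]
                i += 1
            out += str(value)
        elif word in TEENS:
            out += str(TEENS[word])
        elif word in UNITS:
            out += str(UNITS[word])
        else:
            raise ValueError(f"not a number word: {word!r}")
        i += 1
    return out


def normalise_cfr(text: str) -> str:
    """Rewrite spoken CFR citations into their canonical written form."""
    def repl(match: "re.Match") -> str:
        parts = {
            name: words_to_digits(re.split(r"[ -]", match.group(name).lower()))
            for name in ("title", "part", "section")
        }
        return f"{parts['title']} CFR § {parts['part']}.{parts['section']}"
    return CITATION_RE.sub(repl, text)
```

Running this after decoding keeps the glossary entry itself simple: the decoder only has to get the spoken words right, and the rewrite to "29 CFR § 1910.147" is deterministic.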

Legal and compliance

Legal content has the lowest term density but the highest canonical-form precision requirement of any vertical. A regulatory body name transcribed incorrectly — "FinCEN" as "FINCEN" or "fin-CEN" — is a documentation quality error that, while not technically changing the meaning, will fail a legal proofreading pass and undermine confidence in the compliance training program's accuracy. Capitalization consistency in legal content matters in ways that it does not in casual corporate communication.

The most important glossary entries for legal content are the regulatory body names (FinCEN, OFAC, PCAOB, SEC, CFPB, OCC, FDA, CMS, OCR), the specific Acts and Regulations that training content cites by name (the Bank Secrecy Act, Regulation Z, the Fair Credit Reporting Act, 21 CFR Part 11), and the financial instrument and concept names that sit at the intersection of legal terminology and general vocabulary (SOFR → "soccer," CECL → "Cecil," CCAR → "scar").

Legal glossaries are the most stable of any vertical, because regulatory vocabulary does not change rapidly. The major exception is newly enacted regulations or enforcement priorities: when the DOJ activates a new enforcement framework, the acronym for that framework will appear in compliance training content immediately. For legal content, an event-driven process that adds terms whenever a new rule or enforcement guidance is published matters more than the calendar-driven quarterly sweep that serves other verticals.

The ingestion architecture: how glossary injection works under the hood

For teams that need to understand what happens technically when a glossary is applied to a Whisper-based transcription, this section explains the pipeline from structured glossary entry to corrected caption output. For teams that only need to build and maintain the glossary, the next section (glossary sizing) is more operationally relevant.

Stage 1: pre-processing and phoneme alignment

Before a glossary entry can influence the decoder, the system needs to establish a phoneme-level bridge between the canonical form and the spoken variants. This is the step that a flat word list skips entirely — and the reason flat word lists underperform structured glossary artifacts.

For each spoken variant in the glossary entry, the system runs a phoneme alignment: it converts the spoken variant string to a sequence of phoneme tokens using a pronunciation lexicon (CMU Pronouncing Dictionary is a common baseline; domain-specific extensions are added for terms that do not appear in the base lexicon). The resulting phoneme sequence is stored alongside the canonical form and the priority weight. During transcription, when the audio decoder generates a phoneme sequence that matches a stored variant above a confidence threshold, the decoder substitutes the canonical form instead of the general-vocabulary winner.
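The lookup-and-match step might be sketched as follows. Everything here is an assumption made for illustration: the three-word lexicon is hand-written in ARPAbet style (stress markers omitted) rather than loaded from the CMU dictionary, and the similarity score is `difflib.SequenceMatcher.ratio()`, a generic stand-in for whatever confidence measure a real decoder exposes.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

# Hand-written, illustrative pronunciations; not copied from a real lexicon.
MINI_LEXICON: Dict[str, List[str]] = {
    "kube": ["K", "Y", "UW", "B"],
    "control": ["K", "AH", "N", "T", "R", "OW", "L"],
    "cuttle": ["K", "AH", "T", "AH", "L"],
}


def to_phonemes(variant: str, lexicon=MINI_LEXICON) -> Optional[Tuple[str, ...]]:
    """Convert a spoken variant to phonemes, or None if a word is unknown."""
    phonemes: List[str] = []
    for word in variant.lower().split():
        if word not in lexicon:
            return None  # needs a domain-specific lexicon extension
        phonemes.extend(lexicon[word])
    return tuple(phonemes)


def build_index(entries: Iterable[Tuple[str, List[str], float]],
                lexicon=MINI_LEXICON) -> dict:
    """Map each variant's phoneme sequence to (canonical_term, weight)."""
    index = {}
    for canonical, variants, weight in entries:
        for variant in variants:
            seq = to_phonemes(variant, lexicon)
            if seq is not None:
                index[seq] = (canonical, weight)
    return index


def best_match(decoded: Sequence[str], index: dict, threshold: float = 0.8):
    """Return the (canonical, weight) whose variant best matches, if any."""
    best, best_score = None, threshold
    for seq, payload in index.items():
        score = SequenceMatcher(a=list(seq), b=list(decoded), autojunk=False).ratio()
        if score >= best_score:
            best, best_score = payload, score
    return best


index = build_index([("kubectl", ["kube control", "kube cuttle"], 0.9)])
```

Storing both "kube control" and "kube cuttle" under the same canonical form is what lets a multi-speaker library resolve to a single spelling.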

Stage 2: context window scoring

Context signals are applied at the context window scoring stage. As the transcription proceeds, the system maintains a rolling context window of the last N words (typically 20–40, depending on the implementation). For glossary entries with context signals, the system checks whether any context signal words appear in this window. Each match adds a confidence multiplier to the glossary entry's current score. If no context signals match, the glossary entry's priority weight is applied unadjusted; if multiple context signals match, the entry's effective weight is boosted proportionally.

This two-signal architecture — phoneme match × context confirmation — is what separates glossary injection from simple text replacement. Text replacement (find "soccer" and replace with "SOFR") produces false positives whenever "soccer" legitimately appears in the transcript. Phoneme-weighted context-confirmed injection only fires when the audio contains the right phoneme sequence in a context consistent with the canonical form.
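The context-scoring rule can be sketched as a small function. The multiplier formula (a fixed boost per matching signal, capped at 1.0), the boost value, and the hard suppression on any exclusion term are assumptions for illustration. The post specifies only that matches boost the weight proportionally and that exclusion contexts suppress the entry.

```python
from collections import deque
from typing import Iterable, List


def effective_weight(
    priority_weight: float,
    context_signals: List[str],
    exclusion_contexts: List[str],
    window_words: Iterable[str],
    boost_per_match: float = 0.15,  # assumed value
) -> float:
    """Adjust a glossary entry's weight using the rolling context window."""
    cleaned = [w.lower().strip(".,;:!?\"'") for w in window_words]
    # Pad with spaces so multi-word signals like "first aid" match on word edges.
    window_text = " " + " ".join(cleaned) + " "

    def present(term: str) -> bool:
        return f" {term.lower()} " in window_text

    if any(present(term) for term in exclusion_contexts):
        return 0.0  # suppress the entry entirely
    matches = sum(1 for signal in context_signals if present(signal))
    return min(1.0, priority_weight * (1.0 + boost_per_match * matches))


# A deque with maxlen keeps only the most recent N words as decoding proceeds.
window = deque(maxlen=30)
window.extend("I joined the CAST team in customer adoption this quarter".split())
weight_in_context = effective_weight(
    0.4, ["customer", "adoption"], ["fracture", "first aid"], window
)
```

With no matching signals the function returns the priority weight unchanged, which mirrors the unadjusted case described above.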

Stage 3: beam search bias

Modern Whisper-based implementations apply the glossary at beam search time, not as a post-processing text substitution. During beam search, the decoder maintains multiple candidate hypotheses for the current segment. A glossary entry's priority weight directly increases the log-probability of the canonical form in any beam where the current phoneme sequence matches a variant above threshold. This means the glossary influences the overall hypothesis selection, not just the local token — a segment where the glossary fires will typically produce a different span of surrounding text than the same segment without the glossary, because the decoder is exploring a different region of the token-probability space.

The practical implication: a high-priority-weight glossary entry that fires on an ambiguous phoneme sequence can shift four to eight surrounding tokens in addition to the primary canonical substitution. This is almost always beneficial (the surrounding context is more consistent with the canonical domain meaning when the primary term is correctly identified) but it means that the glossary's effect on accuracy is not purely local to the glossary terms — it has second-order effects on adjacent vocabulary.
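True in-beam biasing requires hooks inside the decoder, which is beyond a short example. The sketch below shows a simplified cousin of the idea: rescoring finished hypotheses by adding a weight-scaled bonus to the log-probability of any hypothesis containing a canonical term. It omits the phoneme-match condition, and the function name, `bias_scale` value, and scores are invented for illustration.

```python
from typing import Dict, List, Tuple


def pick_hypothesis(
    hypotheses: List[Tuple[str, float]],
    glossary_weights: Dict[str, float],
    bias_scale: float = 2.0,  # assumed tuning constant
) -> Tuple[str, float]:
    """Return the best (text, score) after adding glossary bonuses."""
    rescored = []
    for text, log_prob in hypotheses:
        tokens = text.split()
        # Each canonical term present adds a bonus proportional to its weight.
        bonus = sum(
            bias_scale * weight
            for term, weight in glossary_weights.items()
            if term in tokens
        )
        rescored.append((text, log_prob + bonus))
    return max(rescored, key=lambda hyp: hyp[1])


beams = [
    ("the loan is indexed to soccer", -4.0),  # general-vocabulary winner
    ("the loan is indexed to SOFR", -5.2),    # correct but less probable
]
with_glossary = pick_hypothesis(beams, {"SOFR": 0.8})
without_glossary = pick_hypothesis(beams, {})
```

Because the bonus is added in log space, a weight of 0.8 only overturns the ranking when the competing hypotheses are already close, which is the behaviour you want for ambiguous audio.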

Stage 4: the initial prompt injection method

The most accessible implementation of glossary injection for teams building on Whisper is the initial_prompt parameter of the open-source transcribe() call (OpenAI's hosted transcription API exposes the same mechanism as prompt): a short text string that seeds the decoder context before the audio begins. The initial prompt is effectively a one-shot context window that tells the decoder "this audio is about these topics and uses this vocabulary."

An initial prompt for an engineering training video might look like: "This training covers Kubernetes pod scheduling on Amazon EKS. Terms used: HorizontalPodAutoscaler, kubectl, kube-apiserver, eksctl, PodDisruptionBudget, StatefulSet, DaemonSet, ConfigMap, PersistentVolumeClaim." The decoder uses this context to assign higher probability to these terms throughout the transcription.

The initial prompt method is less precise than beam-search-level glossary injection — it influences all tokens uniformly rather than applying phoneme-confirmed weights — but it is significantly more accessible. It requires no custom Whisper build, just an API call with an additional parameter. For teams with fewer than 50 terms and a tightly scoped content category, the initial prompt method closes 60–75% of proper-noun errors and is a valid starting point before investing in full phoneme-alignment infrastructure.

The limitation: the initial prompt is truncated to roughly 224 tokens in the Whisper architecture. A 200-word prompt fills this window and leaves no headroom. For glossaries with 50+ terms, the initial prompt method requires term selection (you cannot inject the full glossary; you must select the highest-priority 20–30 terms for each individual transcription job) or a chunking approach (multiple transcription passes with different initial prompts, merged at the segment level). Neither is as effective as segment-level context-signal scoring.
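Term selection under the token limit can be sketched as a greedy fill. Several things here are assumptions: the four-characters-per-token estimate is a crude stand-in for the real Whisper tokenizer (technical terms often split into more tokens than it predicts), the function names are hypothetical, and the file name in the commented usage is a placeholder. The commented call uses the open-source whisper package's transcribe with its initial_prompt argument.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption): about 4 characters per token.
    Swap in the real tokenizer before relying on the budget."""
    return max(1, (len(text) + 3) // 4)

def build_initial_prompt(topic: str, terms: list[tuple[str, float]],
                         budget: int = 224) -> tuple[str, list[str]]:
    """Greedily add terms in descending priority while the prompt fits the budget.
    Returns the prompt and the terms that did not fit."""
    prefix = f"This training covers {topic}. Terms used: "
    chosen, skipped = [], []
    for term, _weight in sorted(terms, key=lambda t: t[1], reverse=True):
        candidate = prefix + ", ".join(chosen + [term]) + "."
        if estimate_tokens(candidate) <= budget:
            chosen.append(term)
        else:
            skipped.append(term)
    return prefix + ", ".join(chosen) + ".", skipped

terms = [("kubectl", 0.9), ("HorizontalPodAutoscaler", 0.8), ("eksctl", 0.7)]
prompt, skipped = build_initial_prompt("Kubernetes pod scheduling on Amazon EKS", terms)
print(prompt)

# Hypothetical usage with the open-source whisper package:
# import whisper
# model = whisper.load_model("large-v3")
# result = model.transcribe("onboarding.mp4", initial_prompt=prompt)
```

The skipped list is what a chunking approach would carry into a second pass with a different prompt.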

Stage 5: the feedback collection interface

The ingestion architecture is not complete without a feedback mechanism. Every corrected caption output — whether corrected in GlossCap's edit UI, exported and corrected in a separate tool, or corrected in the LMS — contains error data that should flow back into glossary updates. The feedback collection interface extracts the delta between the auto-generated transcript and the corrected version, identifies segments where a substitution occurred, and presents these as candidate glossary additions to the glossary manager.

In GlossCap's architecture, this feedback loop runs automatically for every video processed and corrected in the platform. The net effect is that each corrected video slightly improves the glossary for all future videos in the same content category. The improvement is not linear (there are diminishing returns, because the most common error classes are covered first), but for a library of 200+ training hours with consistent subject matter, the compounding effect is measurable. A glossary that starts with 48 terms will, after 200 hours of healthcare training content have passed through the loop, perform at a higher accuracy level than the same 48-term glossary did at the outset, because the feedback loop has progressively refined the spoken variants, context signals, and priority weights based on real transcription observations rather than sourcing-time estimates.
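The delta-extraction step can be approximated with the standard library. This is a sketch under assumptions, not GlossCap's pipeline: it diffs at the word level with difflib, and the min_count filter for promoting a substitution to a glossary candidate is an illustrative choice.

```python
import difflib
from collections import Counter

def extract_substitutions(auto: str, corrected: str) -> list[tuple[str, str]]:
    """Return (auto_span, corrected_span) pairs where an editor replaced words."""
    a, b = auto.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(" ".join(a[i1:i2]), " ".join(b[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag == "replace"]

def candidate_additions(pairs, min_count=2):
    """Keep substitutions seen at least min_count times as glossary candidates."""
    counts = Counter(pairs)
    return [(heard, fixed, n) for (heard, fixed), n in counts.most_common()
            if n >= min_count]

auto = "run cube control apply to deploy the pod"
corrected = "run kubectl apply to deploy the pod"
pairs = extract_substitutions(auto, corrected)
print(pairs)  # [('cube control', 'kubectl')]
```

Each pair gives a candidate spoken variant ("cube control") for a canonical form ("kubectl"). Counting pairs across many corrected videos separates recurring errors from one-off slips.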

Glossary sizing: how many terms do you actually need?

The question teams ask most frequently after understanding the glossary architecture is: how big does it need to be? The answer depends on the vertical, the content vocabulary density, and the accuracy target. Based on our benchmark data across eight verticals (detailed in the accuracy benchmarks post), the relationship between glossary size and accuracy gain follows the same diminishing-returns pattern in every vertical.

The 80/20 principle for glossary coverage

In every vertical we have measured, approximately 20% of glossary terms account for 80% of the accuracy improvement. This is a consequence of vocabulary frequency distributions: a small number of high-frequency domain terms appear in many segments across a training library, and closing errors on those terms improves accuracy across a large fraction of the total caption output. The remaining 80% of glossary terms are long-tail terms that appear rarely, often in a single module or topic, and contribute individually small accuracy improvements.

This principle has a direct operational implication: starting with a 10–20-term "high-impact core" glossary and refining from there is more efficient than attempting to build a comprehensive 100-term glossary from the first session. The high-impact core is built by identifying the terms that appear most frequently in your training scripts or most frequently produce errors in your current auto-caption output. Getting those 15 terms right will produce more accuracy improvement than getting the next 50 terms approximately right.
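One way to pick the high-impact core is to rank terms by how many errors they cause and stop once a target share of errors is covered. The sketch below assumes you already have per-term error counts from corrected captions; the counts shown are made-up illustrative numbers, and the function name is hypothetical.

```python
def high_impact_core(error_counts: dict[str, int], coverage: float = 0.8) -> list[str]:
    """Return the smallest prefix of terms, by descending error count, whose
    errors add up to at least `coverage` of all observed errors."""
    total = sum(error_counts.values())
    core, covered = [], 0
    for term, count in sorted(error_counts.items(), key=lambda kv: kv[1], reverse=True):
        if total == 0 or covered / total >= coverage:
            break
        core.append(term)
        covered += count
    return core

# Illustrative counts only, not benchmark data.
counts = {"kubectl": 50, "EKS": 30, "eksctl": 10, "DaemonSet": 5, "ConfigMap": 5}
print(high_impact_core(counts))  # ['kubectl', 'EKS']
```

Here two of five terms account for 80% of the errors, so they form the first-session core and the long tail waits for later iterations.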

Diminishing returns by vertical

The diminishing returns curve is not identical across verticals. Healthcare content achieves most of its glossary benefit from the first 20–25 terms because the high-frequency pharmaceutical terms cluster tightly — a pharmacology refresher video uses the same 15 drug INNs repeatedly across all 40 minutes. Adding the 26th through 48th terms improves accuracy on the edge cases but has minimal effect on the per-video average.

Engineering content has a much flatter diminishing returns curve. The technology vocabulary is broader — an engineering library spanning infrastructure, security, data engineering, and product-specific features may need 90+ terms before the curve flattens. The 48-term threshold that produces near-ceiling improvement in healthcare achieves only about 70% of the available improvement in engineering. The engineering glossary needs depth because the vocabulary breadth is genuinely wider.

EHS content sits between these extremes. The regulatory citation codes and GHS classifications are a relatively compact vocabulary (20–30 terms), but the IUPAC chemical names for a manufacturing facility with 50+ chemicals on the chemical inventory list push the required glossary size to 60–80 terms before diminishing returns become significant.

Practical sizing guidelines by vertical

Vertical | Minimum effective glossary | Comprehensive glossary | Accuracy at minimum | Accuracy at comprehensive
General corporate / HR | 10–15 terms | 20–30 terms | 98.2% | 98.8%
Software engineering / SaaS | 25–35 terms | 80–120 terms | 93.5% | 99.2%
Healthcare / clinical | 15–20 terms | 40–60 terms | 98.4% | 99.4%
Sales enablement / RevOps | 20–30 terms | 50–70 terms | 96.8% | 98.9%
EHS / manufacturing | 20–30 terms | 60–90 terms | 94.1% | 99.1%
Legal / compliance | 15–20 terms | 30–50 terms | 97.6% | 99.0%
Financial services | 20–25 terms | 45–65 terms | 95.2% | 99.1%

The accuracy figures above are population averages across multiple companies in each vertical. Individual companies with unusual vocabulary density (a pharmaceutical company with 150+ drugs in clinical training) or broad topic scope (a technology company producing training that spans infrastructure, security, product, and customer success) will need larger glossaries to approach the comprehensive-tier accuracy. The table is a starting point for capacity planning, not a ceiling.

The relationship between glossary size and accuracy also interacts with audio quality. Poor recording conditions, non-native speaker accents, and heavy background noise all shift the diminishing returns curve: a larger glossary is needed to achieve the same accuracy target in difficult audio conditions, because the phoneme-level confidence scores are lower across the board and the glossary's boost needs to be larger to overcome the baseline uncertainty.

The compound accuracy effect: why each captioned hour makes the next one better

The compound accuracy effect is the central architectural claim behind GlossCap's product design and the reason the glossary is described as a switching cost, not just a feature. Understanding the mechanism explains why the compounding effect is real, why it is not replicable by generalist transcription vendors, and what the long-run accuracy trajectory looks like for a training library processed over multiple quarters.

How the feedback loop compounds

Every corrected video processed through a glossary-equipped transcription system generates three types of feedback data:

Each feedback cycle improves the glossary for subsequent transcriptions. The improvement is not merely that more terms are covered — it is that the existing terms are applied more precisely (better phoneme alignment, tighter context signals) and new terms at the vocabulary frontier are captured before they generate large error volumes. The glossary becomes progressively more attuned to the specific speakers, recording conditions, and vocabulary patterns of your organization's training library.

The speaker effect

A subtle component of the compound effect is speaker-specific: different trainers have different pronunciation habits, different speaking speeds, different regional accents, and different term-frequency patterns. As more hours of audio from a specific trainer are processed through the feedback loop, the glossary's spoken variant coverage for that trainer's idiosyncrasies improves. A trainer who consistently says "terraform" as "terra-form" (two distinct words) rather than "Terraform" (one compound) will generate a spoken variant addition after the first video correction, and subsequent videos featuring that trainer will be transcribed more accurately.

A generalist transcription vendor that processes each of your videos in isolation cannot accumulate this speaker-specific knowledge. A per-customer glossary system that persists the feedback across all audio from your organization captures the speaker effect automatically. After 20 hours of audio from a trainer with distinctive pronunciation patterns, the glossary has incorporated 15–20 trainer-specific variants that no static model, however large, would produce from training data alone.

The vocabulary frontier effect

Training content sits at the vocabulary frontier of an organization — it is produced when new products, frameworks, or processes are being introduced to employees. The frontier nature of training content means new terms appear in training video before they appear in any other internal written record, including the documentation sources used to build the initial glossary. The feedback loop is the only mechanism that keeps the glossary current with the vocabulary frontier without requiring manual intervention after every product launch.

When a software company ships a new infrastructure feature in March, the engineering team produces a training video about it in April. The feature name appears in that April video and is transcribed incorrectly. If the correction is fed back to the glossary in April, the May training update video is transcribed correctly. If the correction is not fed back (because the team is using a static glossary built in January), the May video also transcribes the feature name incorrectly, and the June video, and every subsequent video until someone does a manual glossary update.

The difference between an actively compounding glossary and a static word list is the difference between a caption corpus that gets more accurate over time and one that gets progressively less accurate as the organization's vocabulary evolves away from the glossary's snapshot.

Quantifying the compound effect

In our data across multi-quarter training libraries, the compound effect produces the following accuracy trajectory for engineering-content companies:

The trajectory depends heavily on the initial glossary quality and the consistency of the correction feedback loop. A team that corrects every video and feeds corrections back promptly reaches 99%+ faster than a team that corrects sporadically or uses a static glossary. But even with imperfect feedback, a per-customer glossary system outperforms a static generic transcription service over any time horizon longer than a few months.

Glossary maintenance cadence: how to keep accuracy compounding

A glossary that is built once and never maintained will gradually lose its advantage as the organization's vocabulary evolves. The maintenance cadence has three layers: continuous feedback (automated), quarterly sweeps (manual), and event-triggered updates (reactive).

Continuous feedback (automated)

The continuous layer is handled by the feedback loop described above — every corrected video generates candidate updates that are reviewed and incorporated automatically or on approval, depending on the confidence threshold. This layer requires no calendar scheduling; it runs as part of the normal caption correction workflow.

The key operational requirement for continuous feedback to function is that the correction workflow must actually happen. If L&D teams export captions and correct them in external tools without the corrections being fed back to the glossary system, the feedback loop breaks. This is why the LMS caption ingestion workflow and the correction workflow are architectural siblings: they need to be designed together to ensure the feedback path is preserved regardless of where the correction occurs.

Quarterly sweeps

Four times per year, a 2–4-hour structured glossary review should cover:

The quarterly sweep should be owned by the same L&D operator who owns the caption quality standard, not delegated to the captioning vendor. The L&D operator has the domain knowledge to distinguish between a valid term update and a noise signal in the correction data. A captioning vendor who does not understand your internal product vocabulary cannot reliably distinguish "this correction represents a real new term" from "this correction represents a one-off speaker error."

Event-triggered updates

Three event types should trigger an immediate glossary update outside the normal quarterly cycle:

Version control and deprecation

A glossary that is actively maintained needs version control. Terms are added, modified, weight-adjusted, and occasionally removed. Without a version history, it is impossible to diagnose why a transcription that was accurate three months ago is now producing errors — the glossary may have been updated in a way that inadvertently removed a context signal that was suppressing a false positive.

Treat the glossary as a code artifact: store it in version control (a git repository, a versioned document in Notion or Confluence, or the versioning built into your captioning platform), commit with messages that describe the reason for each change, and never delete entries without archiving them. Archived entries are useful for two reasons: retroactive reprocessing (if a batch of older content needs to be reprocessed, the glossary version active at the time of original production may produce more accurate results than the current version for that specific content batch) and audit trail (for regulated industries where the documentation of captioning quality decisions has legal significance).
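When diagnosing a regression, the first question is usually what changed between two glossary versions. A minimal sketch, assuming each version is loaded as a dict keyed by canonical term; the field names inside each entry are hypothetical.

```python
def diff_glossaries(old: dict[str, dict], new: dict[str, dict]) -> dict[str, list[str]]:
    """Summarize which entries were added, removed, or modified between versions."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}

v1 = {
    "SOFR": {"context_signals": ["rate", "benchmark"], "weight": 0.8},
    "kubectl": {"context_signals": [], "weight": 0.9},
}
v2 = {
    "SOFR": {"context_signals": ["rate"], "weight": 0.8},  # a signal was dropped
    "eksctl": {"context_signals": [], "weight": 0.7},
}
print(diff_glossaries(v1, v2))
# {'added': ['eksctl'], 'removed': ['kubectl'], 'changed': ['SOFR']}
```

A "changed" entry whose context signals shrank is the kind of edit that can quietly reintroduce a false positive, which is why each commit message should record the reason for the change.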

Conflict resolution and edge cases

As glossaries grow larger and content libraries span more topics, conflict cases emerge where two glossary entries compete for the same phoneme sequence in the same context. Handling these cases correctly is the difference between a glossary that stays stable at scale and one that requires constant manual intervention.

Homophone conflicts between domain terms

The most common conflict type is two legitimate domain terms that share phoneme sequences. A medical device company that produces training on both "Apixaban" (anticoagulant) and "application performance" will eventually have an audio segment where a speaker says something that sounds like "app-ih-KAN" in a context that is genuinely ambiguous. The resolution strategy depends on the relative frequency of the two terms in the content library:

Temporal conflicts: old terminology versus new

When a product is renamed, the glossary needs both the old and new canonical forms for a transition period — older training content that has not been reprocessed was captioned under the old name, and reprocessing it with the new name may introduce incorrect substitutions if the legacy audio was produced before the rename. The resolution is time-bounded glossary entries: terms with an effective date range that tells the system when to apply the old form versus the new form, with the cutover date aligned to the product rename date.

For companies doing a bulk back-catalogue retrofit, the temporal conflict is especially relevant. Content produced before the rename should use the old canonical form in the caption output; content produced after should use the new form. Applying the current glossary uniformly to the entire back catalogue without time-bounding the rename will produce anachronistic captions — content from 2023 referring to a product by its 2025 name — which is confusing and potentially misleading to employees reviewing historical training records.
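Time-bounded entries can be modeled as an entry with an optional effective date range, filtered by each video's recording date before transcription. This is a sketch under assumptions: the class and field names are hypothetical, and "DataHub" and "InsightHub" are invented product names used only to illustrate a rename.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class TimeBoundedEntry:
    canonical: str
    effective_from: Optional[date] = None  # None means no lower bound
    effective_to: Optional[date] = None    # None means still current

    def applies_on(self, recorded: date) -> bool:
        after_start = self.effective_from is None or recorded >= self.effective_from
        before_end = self.effective_to is None or recorded <= self.effective_to
        return after_start and before_end

def active_entries(entries, recorded: date):
    """Select the glossary entries valid for content recorded on a given date."""
    return [e for e in entries if e.applies_on(recorded)]

# Hypothetical rename with a cutover on 2025-01-15.
glossary = [
    TimeBoundedEntry("DataHub", effective_to=date(2025, 1, 14)),
    TimeBoundedEntry("InsightHub", effective_from=date(2025, 1, 15)),
    TimeBoundedEntry("kubectl"),  # unbounded entry, always active
]

print([e.canonical for e in active_entries(glossary, date(2023, 6, 1))])
print([e.canonical for e in active_entries(glossary, date(2025, 3, 1))])
```

A back-catalogue retrofit would call active_entries once per video, using the recording date rather than the reprocessing date.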

Multi-speaker environments

Training content produced in workshop or panel formats has multiple speakers who may use the same term with different spoken variants. A panel discussion about Kubernetes orchestration featuring a speaker from a background where "kubectl" is "kube-cee-tee-ell" and another where it is "kube-control" requires both variants to be in the glossary. Live session recordings uploaded to an LMS are particularly prone to this — a live webinar with five panelists from different companies may introduce five different pronunciation patterns for the same technical terms in a single 60-minute recording.

The multi-speaker case is also the case where speaker diarization interacts with glossary injection. If the transcription system performs speaker diarization (identifies and labels different speakers' segments), it becomes possible to maintain speaker-specific spoken variant weights — Speaker A always says "kubectl" as "kube-control," Speaker B always says it as "kube-cee-tee-ell." Speaker-specific variant weights are an advanced feature that is rarely needed for internal training content (where speakers are usually known) but becomes important for customer-facing educational content and certification programs with external subject matter experts.
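Speaker-specific variant weights can be stored as a lookup with a speaker-independent fallback. The sketch assumes diarization has already produced a speaker label per segment; the "*" wildcard key, the function name, and the weight values are illustrative assumptions.

```python
GLOBAL = "*"  # assumed key for speaker-independent weights

def variant_weight(weights: dict[tuple[str, str], float], speaker: str,
                   variant: str, default: float = 0.0) -> float:
    """Look up a spoken-variant weight for a diarized speaker, falling back to
    the speaker-independent weight, then to `default`."""
    if (speaker, variant) in weights:
        return weights[(speaker, variant)]
    return weights.get((GLOBAL, variant), default)

weights = {
    ("speaker_a", "kube-control"): 0.9,
    ("speaker_b", "kube-cee-tee-ell"): 0.9,
    (GLOBAL, "kube-control"): 0.5,
    (GLOBAL, "kube-cee-tee-ell"): 0.5,
}

print(variant_weight(weights, "speaker_a", "kube-control"))  # 0.9
print(variant_weight(weights, "speaker_c", "kube-control"))  # 0.5
```

An unknown panelist falls back to the global weights, so both variants remain available until enough audio from that speaker has been corrected.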

The 30-day glossary build plan

For an L&D team starting from zero — no existing glossary, no previous structured caption correction workflow — the following 30-day plan builds a functional first-version glossary and establishes the operational infrastructure for continuous compounding.

Week 1: audit and sourcing (days 1–7)

Day 1–2: Content inventory. List all training content categories in your library (engineering, HR, sales, compliance, etc.) and estimate the number of hours per category. Identify the three highest-volume or highest-compliance-risk categories as the primary scope for the initial glossary build. The other categories can be addressed in subsequent quarters.

Day 3–4: Documentation sweep for the three priority categories. Extract candidate terms from the eight source types described above, focusing on: product wikis and Confluence pages (highest-density source), training scripts for the five most-used courses in each category, and LMS course metadata. Aim for 100–200 raw candidate terms per category before filtering.

Day 5–7: SME interviews. One 30-minute interview per priority category with the subject matter expert who produces or reviews the most training content in that category. Focus the interview on terms that the documentation sweep missed: the oral-tradition vocabulary that does not appear in written records. As a secondary sourcing method, record the session, run the recording through auto-captioning, and note which terms are transcribed incorrectly.

Week 2: structure and prioritization (days 8–14)

Day 8–9: Canonical form verification. For each candidate term, verify the correct written form (capitalisation, hyphenation, spacing) against the authoritative source. For pharmaceutical INNs, verify against the WHO INN database. For IUPAC names, verify against the IUPAC nomenclature guidelines. For product names, verify against the company's brand style guide. Incorrect canonical forms are worse than missing terms — they permanently encode an error into your caption archive.

Day 10–11: Spoken variant construction. For each term in the filtered candidate list, write the most likely spoken variant by pronouncing the term naturally and transcribing what you hear. For terms with multiple common pronunciations, add all significant variants. Flag terms where you are uncertain of the spoken form — these need review by a subject matter expert who uses them regularly.

Day 12–13: Priority weight assignment and context signal construction. Assign priority weights using the profiles described in the vertical-specific sections above. For ambiguous terms (those with plausible general-vocabulary competitors), write context signals from the surrounding vocabulary in the training content. Test context signals against actual training scripts — are the signals present when the term is used? Are they absent when general-vocabulary usage would be correct?

Day 14: First glossary version committed. The initial glossary is now ready for production use. Depending on the vertical, it should contain 20–50 terms for a minimum effective glossary. Document the sourcing decisions and term rationale in the version commit message — this is the audit trail for future changes.
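The Week 2 output can be represented as one record per term with the four components, plus a quick check of context signals against training scripts (the Day 12–13 test). This is a sketch: the field names, the 0-to-1 weight scale, and the substring-based check are assumptions rather than a prescribed format.

```python
from dataclasses import dataclass, field

@dataclass
class GlossaryEntry:
    canonical: str
    spoken_variants: list[str] = field(default_factory=list)
    context_signals: list[str] = field(default_factory=list)
    priority_weight: float = 0.5  # the 0.0 to 1.0 scale is an assumption

def signal_coverage(entry: GlossaryEntry, sentences: list[str]) -> float:
    """Of the sentences that use the term, the fraction that also contain at
    least one context signal (simple case-insensitive substring matching)."""
    using = [s.lower() for s in sentences if entry.canonical.lower() in s.lower()]
    if not using:
        return 0.0
    confirmed = sum(any(sig.lower() in s for sig in entry.context_signals)
                    for s in using)
    return confirmed / len(using)

sofr = GlossaryEntry("SOFR", ["soccer", "so-fur"], ["rate", "benchmark"], 0.8)
script = [
    "SOFR is the benchmark for new loans.",
    "We switched to SOFR last year.",
    "Soccer practice is at noon.",
]
print(signal_coverage(sofr, script))  # 0.5
```

A coverage of 0.5 means half the real uses of the term would go unconfirmed, which suggests adding signals (for example "loans") before committing the first version.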

Week 3: production pilot and feedback collection (days 15–21)

Day 15–17: Process a pilot batch of 5–10 training videos with the first-version glossary. Select videos across the three priority categories, including some that you have previously corrected by hand (so you have a benchmark of the pre-glossary error rate) and some that are new (so you can measure first-pass accuracy without correction bias).

Day 18–19: Review and correct the pilot batch outputs. For each correction made, note the original term, the corrected term, and the surrounding context. These corrections are the raw material for glossary update round 1.

Day 20–21: Apply corrections to the glossary. Add new terms, update spoken variants where the first version was incorrect or incomplete, tighten context signals where false positives occurred, adjust priority weights where the initial weighting was too high or too low. Commit the updated version.

Week 4: systematic coverage expansion (days 22–30)

Day 22–25: Extend the glossary to the remaining content categories identified in the Week 1 audit. Apply the same sourcing and structuring process for categories 4 through N. These categories are lower priority (lower volume or lower compliance risk) than the three initial categories, so a minimum effective glossary (10–20 terms) is sufficient for now rather than a comprehensive glossary.

Day 26–28: Employee directory sweep. Export the employee directory, identify names with non-anglophone phoneme patterns that appear in training content (as presenters, as named contacts for follow-up, as organizational leaders referenced in context), and add them to the glossary as medium-weight entries.

Day 29–30: Maintenance process documentation. Write the quarterly sweep procedure, the event-triggered update criteria, and the version control conventions. Assign ownership (which role in the L&D team is responsible for each maintenance layer). Schedule the first quarterly review for 90 days out. The glossary build is complete when the operational infrastructure for maintaining it is in place — a one-time glossary build without a maintenance plan is a depreciating asset; a glossary with a maintenance plan is a compounding one.

FAQ

How is a customer glossary different from just using a better ASR model?

A larger model improves accuracy on general English and moderately specialized vocabulary, but it cannot know your company's proprietary product names, internal acronyms, or the specific clinical vocabulary your training content uses. As the benchmark data shows, moving from Whisper medium to Whisper large-v3 (three model tiers of scale) improves engineering-content accuracy by approximately 2–3 percentage points. A well-built 94-term engineering glossary applied to Whisper large-v3 improves accuracy by 10.4 percentage points (from 88.8% to 99.2%), roughly three to five times the gain from the model upgrade. The glossary investment outperforms model scale for domain-specific vocabulary because it addresses the actual source of errors (unknown terms) rather than general model capacity.

Can I use the same glossary for all content categories, or do I need separate glossaries per category?

In most cases, a single multi-category glossary works for an organization's training library, with context signals doing the disambiguation work when the same phoneme sequence could belong to different categories. Separate category-scoped glossaries are worth considering when: (a) the organization has genuinely contradictory vocabulary across categories (the same acronym meaning different things in IT versus compliance); (b) the total glossary size exceeds 500 terms and performance of the injection pipeline becomes a concern; or (c) the organization has regulatory requirements to segment training content by category with separate quality documentation for each. For most L&D teams with under 200 terms, a single shared glossary with well-constructed context signals is simpler to maintain and equally effective.

What happens to existing caption files when the glossary is updated?

Existing caption files are not automatically updated when the glossary changes — they reflect the glossary state at the time of transcription. Whether to retroactively reprocess existing content when the glossary is updated depends on the compliance context and the nature of the change. For a minor variant addition or priority weight adjustment, reprocessing is not typically necessary. For a major rename (a product that was captioned under its old name across 40 hours of training content) or the correction of an error in a compliance-critical term, selective reprocessing is warranted. The decision rule: reprocess content where the captioned term is materially incorrect in a way that could affect a compliance audit or confuse an employee relying on the caption as a written record. For stylistic changes or minor accuracy improvements, queue the reprocessing for the next scheduled bulk maintenance cycle rather than disrupting the current library's consistency.

Does a customer glossary help with non-English training content?

Yes, and the improvement can be even larger than for English content. Whisper's language coverage is good for major European languages (Spanish, French, German, Portuguese, Dutch) but degrades for lower-resource languages, and domain-specific vocabulary in any language falls outside the model's training distribution. A Spanish medical training glossary with pharmaceutical INNs in their Spanish-market brand names, a German engineering glossary with compound technical terms, and a French compliance glossary with French-language regulatory citations all benefit from glossary injection on the same architectural principles as English content. The phoneme alignment for non-English content uses language-specific pronunciation lexicons, but the overall architecture is unchanged. One important nuance: for bilingual content (code-switching between two languages in the same audio, common in some multilingual organizations), the context signal layer needs to include language-indicator signals — terms that reliably appear in English-language segments versus terms that appear in French-language segments — to guide the glossary toward the correct canonical form for the current language context.

How do I measure whether the glossary is working?

The most direct measurement is word error rate (WER) on a held-out test set: process the same 5–10-minute audio clip without the glossary and with the glossary, compare both outputs against a human-transcribed gold standard, and compute WER for each. The glossary is working if the WER difference is meaningful (typically ≥ 2 percentage points, representing closure of 20–50% of domain-specific errors). For compliance purposes, the relevant threshold is whether the glossary brings accuracy above 99%, the accuracy floor this guide treats as the bar for WCAG 2.1 AA SC 1.2.2. A more operational measure that does not require computing WER is the correction rate: how many manual corrections does an editor make per 1,000 words of auto-generated transcript, and how does that rate change before and after the glossary is applied? A 50% reduction in the per-1,000-word correction rate after glossary introduction is a strong indicator that the glossary is doing its job.
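Both measures are easy to compute without extra dependencies. The sketch below uses a word-level Levenshtein distance for WER; the lowercasing and whitespace tokenization are simplifying assumptions (a production comparison would also normalize punctuation and number formatting).

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref) if ref else 0.0

def corrections_per_1000_words(corrections: int, word_count: int) -> float:
    """Operational proxy: manual corrections per 1,000 transcript words."""
    return 1000 * corrections / word_count

gold = "run kubectl apply to deploy the pod"
without_glossary = "run cube control apply to deploy the pod"
with_glossary = "run kubectl apply to deploy the pod"

print(round(wer(gold, without_glossary), 3))  # 0.286
print(wer(gold, with_glossary))               # 0.0
```

In this toy clip a single misheard term costs two word errors out of seven reference words, which illustrates why a handful of domain terms can dominate the error rate of a technical video.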

Can a captioning vendor build the glossary for me, or does it have to be internal?

Some enterprise captioning vendors (Verbit, 3Play, AI-Media) offer terminology management services where a human transcription specialist reviews existing caption corrections and builds a terminology list. These lists are typically shallower than a structured glossary — they are word lists without phonetic variants, context signals, or priority weights — and they are usually maintained by the vendor rather than by the L&D team. The practical consequence is that the feedback loop is broken: the vendor's terminology list updates at the vendor's review cadence (typically monthly at the earliest), not at the correction event. For a library producing new training content weekly, this is too slow to capture the vocabulary frontier effect. The decision framework post covers when vendor terminology management is adequate versus when a per-customer compounding glossary architecture is necessary for your volume and compliance requirements.

What is the biggest mistake teams make when building their first glossary?

Trying to build the comprehensive glossary first, before processing any real audio. Teams that spend two weeks building a 200-term word list from documentation sources and ship it as a complete glossary will consistently find that the terms they agonized over are not the terms generating errors in production. The terms generating errors in production are the ones that do not appear in any documentation: the oral-tradition shorthand, the abbreviation one trainer uses that no one else does, the phoneme sequence that is ambiguous in your specific recording conditions. The better approach is to build a 20–30-term high-impact core glossary, process 10–20 hours of real audio, review the corrections, and use that correction data to drive the next 30–50 terms. The first-pass glossary's job is not to be comprehensive; it is to open the feedback loop so that the real error data can drive subsequent iterations. A glossary built from real caption error data generally outperforms a glossary built from documentation alone, even if both have the same number of terms.

Putting it together

A customer glossary is not a deliverable — it is an infrastructure layer. The deliverable is a training library where every caption file meets the WCAG 2.1 AA accuracy threshold without requiring a full-time editor, where new training content is captioned correctly on first pass without manual intervention, and where the accuracy gap between your content and a generalist auto-caption tool grows larger over time rather than smaller. The glossary is what makes that trajectory possible.

The architecture described in this post (structured entries with canonical forms, phonetic variants, context signals, and priority weights; sourcing from documentation and correction data; sizing calibrated to your vertical's vocabulary density; continuous feedback with quarterly sweeps and event-triggered updates) is the same architecture that powers GlossCap's per-customer glossary model. Every team and every Org plan customer gets this infrastructure, not a flat word list. The result is what the benchmark data shows: accuracy at or near 99% across every vertical with a comprehensive glossary, sustained over time, and improving rather than degrading as the training library grows.

If you are starting from scratch, the 30-day build plan in this post is a practical starting point. If you are evaluating a captioning vendor and want to understand whether their glossary offering is a flat word list or a compound-accuracy architecture, the four-question buyer evaluation framework in the prompting vs glossary vs fine-tuning decision framework post will surface the distinction in one conversation.

Ready to see the glossary in action on your content? The GlossCap embed widget lets you run a 5-minute sample through the system and compare the output with and without your glossary terms applied.
