Technical · Published 2026-04-25

Glossary-biased captioning: how a Whisper prompt beats YouTube auto-captions on engineering terms

In the previous post we showed that scoring a nine-minute engineering onboarding clip under the DCMP Captioning Key produced 91.4% accuracy for a vanilla Whisper-large run and 99.2% for the same model conditioned on a 36-term glossary. That 7.8-point lift is the difference between failing and passing a WCAG 2.1 AA audit. This post is the implementation companion: what the "glossary prompt" actually is, what it does to Whisper's decoder, how to build a useful one for engineering content, and the boring practical limits — token budgets, ordering, casing, and the small set of error classes prompting cannot fix. There is a working Python snippet at the bottom; if you only want the snippet, scroll there.

TL;DR

Whisper exposes a "previous-text" prompt that is fed to its decoder as if the model had just transcribed it, biasing beam search toward token sequences that look like a continuation of that text. You do not need to fine-tune anything. Pass a comma-separated list of in-domain terms through that slot (the open-source Python package calls the parameter initial_prompt; the hosted API calls it prompt). The whole prompt has to fit in 224 tokens, which is roughly 30–50 short technical terms. Order matters less than you would expect; presence and casing matter more. The technique closes most "the model heard 'cooper Netty's' instead of 'kubectl'" errors, because in those cases the audio was genuinely ambiguous and the language prior was breaking the tie. It does not help when the audio is clean and the model just happens to be wrong, when the speaker introduces a new acronym mid-talk, or when two glossary terms collide acoustically (e.g. VPC and VPN). Build the glossary from the script if you have it, from the slide deck if you do not, and re-use it forever.

How Whisper actually decides what word it heard

Whisper is an encoder-decoder transformer. The encoder turns 30 seconds of audio (resampled to a log-mel spectrogram) into a sequence of vectors. The decoder is autoregressive: it produces the transcript one token at a time, where each token is conditioned on the audio embeddings from the encoder and on every token it has already produced. At each step, the decoder produces a probability distribution over its vocabulary, and a beam search keeps the top-N most likely partial sequences open until the end of the segment.

That last sentence is where glossary biasing lives. The decoder does not output one word at a time — it outputs a token, and a token in Whisper's vocabulary can be a whole word, a sub-word, a punctuation mark, or a special control token. When the audio for the word "kubectl" arrives, the decoder evaluates many candidate tokenisations: k_u_b_e_c_t_l as a sequence of byte-pair tokens, but also cooper followed by Netty's, plus dozens of other plausible sequences. Each candidate gets scored by combining the acoustic posterior (how well the candidate matches the audio) with the language-model posterior (how plausible the candidate is in English).
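The per-token search can be made concrete with a toy beam search. This is a sketch with invented, fixed per-step distributions — in real Whisper the distribution at each step comes from the decoder and depends on the audio and the tokens produced so far:

```python
import math

def beam_search(step_logprobs, beam_size=2):
    """Keep the top-N partial token sequences at each decoding step.

    step_logprobs: list of dicts mapping token -> log-probability.
    The distributions here are fixed toy values; Whisper's change
    at every step as a function of audio and prior tokens.
    """
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for dist in step_logprobs:
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in dist.items()
        ]
        # Prune to the N highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# Two decoding steps, three candidate tokens each.
steps = [
    {"kube": math.log(0.5), "coo": math.log(0.4), "qu": math.log(0.1)},
    {"ctl": math.log(0.6), "per": math.log(0.3), "be": math.log(0.1)},
]
best_seq, best_score = beam_search(steps)[0]
print(best_seq)  # highest-scoring sequence under these toy distributions
```

The point of the pruning step is why biasing works at all: a candidate only survives if its cumulative score stays in the top N, so a small shift in per-token scores can change which sequences are alive at the end of the segment.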

For words inside the model's training distribution — common English, audiobook narration, conversational speech — the acoustic posterior is sharp enough that the language prior barely matters. For words outside the training distribution — Kubernetes-flavoured shell commands, drug names, internal product codenames — the acoustic posterior is fuzzy. Two or three candidates have similar acoustic scores. The language prior breaks the tie. And the language prior was learned on the open internet, where "cooper Netty's" is more probable than "kubectl" because the open internet is mostly not Kubernetes documentation.
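The tie-break has a simple arithmetic shape. The numbers below are invented for illustration (Whisper does not expose its internal scores this way): two candidates with near-tied acoustic scores, decided entirely by which prior you combine them with.

```python
import math

def combined_score(acoustic_logp, lm_logp, lm_weight=1.0):
    # Candidates are ranked by acoustic fit plus a weighted language
    # prior; the additive form and unit weight are illustrative.
    return acoustic_logp + lm_weight * lm_logp

# Invented log-probabilities: the audio is genuinely ambiguous
# (near-tied acoustic scores) and the open-web prior strongly
# favours "cooper Netty's" over "kubectl".
acoustic = {"kubectl": math.log(0.48), "cooper Netty's": math.log(0.52)}
prior_web = {"kubectl": math.log(0.001), "cooper Netty's": math.log(0.01)}
prior_prompted = {"kubectl": math.log(0.05), "cooper Netty's": math.log(0.01)}

def winner(prior):
    return max(acoustic, key=lambda w: combined_score(acoustic[w], prior[w]))

print(winner(prior_web))       # the web prior breaks the tie the wrong way
print(winner(prior_prompted))  # the glossary-shifted prior flips it
```

Note that the acoustic scores never change between the two calls — only the prior does. That is the whole mechanism.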

Glossary biasing is the trick of nudging the language prior toward your domain without touching the acoustic model. You change which tied candidates win without changing which candidates the model considered.

What the "previous-text" prompt actually is

Whisper was trained on long-form audio, and during training its decoder regularly received the previous segment's transcript as a context window — that is how the model maintains continuity across 30-second segment boundaries. The training objective therefore encourages the decoder to produce text that looks like a continuation of whatever it was just shown. Both the open-source openai-whisper Python package and the hosted OpenAI Audio API expose this slot to callers. The Python package calls the parameter initial_prompt; the hosted API calls it prompt. Both feed the string straight into the decoder's context window before transcription starts.

Critically, the model does not obey the prompt the way an instruction-tuned LLM obeys a system prompt. It treats it as plausible-prior text. If you write "Please transcribe carefully," Whisper does not become more careful — it just thinks the speaker has been talking about transcription and being careful, and may produce small biases around words like "transcribe" and "careful." This is why advice like "tell Whisper to use proper punctuation" works inconsistently: the model is not following an instruction, it is being primed.

The right mental model for the prompt is: pretend the speaker just said this sentence two seconds before the audio starts. What words become more probable as a continuation? That framing tells you exactly what to put in: nouns and noun phrases the speaker is likely to use, in the casing and punctuation you want them rendered in, in a style that resembles continued speech rather than a list of metadata.

The 224-token budget, and what to spend it on

The Whisper decoder reserves at most 224 tokens for the previous-text context. (The full context is 448 tokens; the other half is reserved for the segment being transcribed.) That is roughly 150–200 English words, or 30–60 short technical terms with light glue. Use as much of it as you have good terms for, but do not exceed it: the open-source library silently truncates from the front, and the hosted API keeps only the final 224 tokens of whatever you send.
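An exact count requires Whisper's own tokenizer; for a quick pre-check, a words-to-tokens heuristic is close enough to catch a glossary that has grown too large. The 1.3 tokens-per-word ratio below is an assumption that works reasonably for short English technical terms, not a property of the model:

```python
def approx_token_count(prompt: str) -> int:
    """Rough token estimate for an English glossary prompt.

    Whisper uses a byte-pair vocabulary, so the real count needs the
    model tokenizer; ~1.3 tokens per whitespace-separated word is a
    serviceable heuristic for short technical terms.
    """
    return int(len(prompt.split()) * 1.3)

def check_budget(prompt: str, budget: int = 224) -> bool:
    est = approx_token_count(prompt)
    if est > budget:
        print(f"~{est} tokens: likely over the {budget}-token window")
        return False
    print(f"~{est} tokens: fits the {budget}-token window")
    return True

check_budget("kubectl, EKS, ECS, ConfigMap, RBAC, IAM, VPC, CIDR")
```

When the heuristic says you are near the boundary, count properly with the tokenizer before shipping; acronyms and unusual casings tokenise less predictably than plain words.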

The most efficient packing is a single comma-separated list of the in-domain terms you want preserved, in the casing you want them rendered in, with an optional closing phrase such as "and similar terms." A real example for an AWS-flavoured engineering onboarding clip:

kubectl, EKS, ECS, ConfigMap, ConfigMaps, Fargate,
kubelet, ingress, ingress controller, RBAC, IAM,
IAM role, IAM policy, VPC, CIDR, subnet, NAT gateway,
ALB, NLB, Route 53, Pod, Pods, Deployment, Helm,
Helm chart, kustomize, Argo CD, Prometheus, Grafana,
node group, taints, tolerations, autoscaler, KEDA,
EBS, EFS, S3, and similar AWS and Kubernetes terms.

Three things are doing work here. First, the casing — RBAC not rbac, ConfigMap not config map. Whisper will copy the casing of the prompt onto the transcript when it picks the prompted token. Second, both singular and plural where they differ — ConfigMap, ConfigMaps, Pod, Pods — because the decoder treats them as different tokens. Third, the trailing "and similar terms" sentence, which is the cheapest way to make the prompt parse as natural English; without it the prompt looks like a CSV file and the model treats it less as a continuation.

You do not need a sentence per term. You do not need to explain what each term means. You do not need to write "the speaker may say:" — just list the terms. The empirical wins come from the simplest possible packing.
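That packing can be generated from a plain term list. A sketch — the trailing sentence and the rough per-word token estimate are the same heuristics as above, and the drop-from-the-end policy is an assumption (it presumes you ordered the list most-important-first):

```python
def build_glossary_prompt(terms, suffix="and similar terms.", budget=224):
    """Pack a term list into a comma-separated Whisper prompt.

    Terms keep the casing given (the transcript copies it) and are
    dropped from the end of the list once a rough 1.3-tokens-per-word
    estimate exceeds the budget. The closing sentence keeps the prompt
    reading as natural English rather than a CSV file.
    """
    kept = []
    for term in terms:
        candidate = ", ".join(kept + [term]) + ", " + suffix
        if len(candidate.split()) * 1.3 > budget:
            break
        kept.append(term)
    return ", ".join(kept) + ", " + suffix

prompt = build_glossary_prompt(["kubectl", "EKS", "ConfigMap", "ConfigMaps"])
print(prompt)  # kubectl, EKS, ConfigMap, ConfigMaps, and similar terms.
```

The result is exactly the shape of the AWS example above, which means the glossary can live as a plain list in a config file and the prompt string is derived, never hand-edited.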

The Python snippet, end to end

Here is the full code that produced the 99.2% transcript in the previous post. It runs on CPU; on a modern desktop it finishes a nine-minute clip in roughly 4–6 minutes wall-clock with the large-v3 model. On GPU it is real-time or faster.

import whisper

model = whisper.load_model("large-v3")

GLOSSARY = (
    "kubectl, EKS, ECS, ConfigMap, ConfigMaps, Fargate, "
    "kubelet, ingress, ingress controller, RBAC, IAM, "
    "IAM role, IAM policy, VPC, CIDR, subnet, NAT gateway, "
    "ALB, NLB, Route 53, Pod, Pods, Deployment, Helm, "
    "Helm chart, kustomize, Argo CD, Prometheus, Grafana, "
    "node group, taints, tolerations, autoscaler, KEDA, "
    "EBS, EFS, S3, and similar AWS and Kubernetes terms."
)

result = model.transcribe(
    "onboarding-9min.wav",
    initial_prompt=GLOSSARY,
    condition_on_previous_text=True,
    temperature=0.0,
    no_speech_threshold=0.6,
)

from whisper.utils import get_writer

# get_writer returns a callable that formats segments in the requested
# subtitle format and writes <audio basename>.srt into the output dir;
# the options dict keys are the subtitle writer's layout knobs.
writer = get_writer("srt", ".")
writer(
    result,
    "onboarding-9min.wav",
    {"max_line_width": None, "max_line_count": None, "highlight_words": False},
)

Three flags besides initial_prompt are worth understanding. condition_on_previous_text=True tells the decoder to keep using the previous segment's output as the prompt for the next segment, so the glossary continues to influence later passages even after the initial-prompt window has been overwritten. temperature=0.0 makes decoding deterministic; the library's default is actually a tuple of increasing temperatures that it retries through when a segment decodes badly, so pinning the value to 0.0 trades that recovery mechanism for reproducibility. no_speech_threshold=0.6 filters out segments the model thinks are silence or noise — useful for narration with long pauses, less useful for back-to-back speech.

If you are using the hosted OpenAI Audio API instead, the equivalent call is identical in spirit: client.audio.transcriptions.create(file=..., model="whisper-1", prompt=GLOSSARY, response_format="srt"). Same prompt, same effect, no local model. The pricing trade-off is documented in our Rev vs GlossCap walkthrough.

Why this is not the same as fine-tuning

The natural follow-up question is: why not fine-tune Whisper on your domain content and skip the prompt? The answer is that fine-tuning is the right move at one specific scale, and glossary-biased prompting is the right move everywhere else.

Fine-tuning Whisper-large requires a labelled dataset of audio-transcript pairs — typically 50+ hours of in-domain audio with hand-corrected transcripts to see meaningful gains, and 200+ hours to see them robustly. Producing 50 hours of labelled audio is a multi-week project for a vendor with a captioning team and a multi-month project for an L&D team without one. The compute to actually fine-tune a Whisper-large is non-trivial (on the order of a single A100-day per epoch, multiple epochs to converge); doable but a real spend. And the resulting model is locked to that domain — adding a single new product name later means a fresh fine-tune.

Glossary-biased prompting requires zero labelled audio, zero training compute, and zero per-update fine-tune. Adding a new product name is a single edit to the prompt string. The trade-off is that prompting only biases the language prior, which means it can resolve ambiguous audio toward the right token but cannot teach the model new acoustic shapes — if the speaker pronounces an internal codename so unusually that the acoustic posterior puts the right token at probability 0.001 even with bias, prompting will not save you. Fine-tuning would. This corner case shows up rarely in practice for engineering content; the underlying token shapes (kubectl, EKS, ConfigMap) are spelt and pronounced in a way that the model already considers plausible, just not most plausible. The ones the model has never seen at all are where fine-tuning earns its keep — fictional product codenames, niche internal jargon, languages the base model under-supports.
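That limit case also has a simple arithmetic shape, with invented numbers again: a realistic prompt-sized boost to the prior flips a near-tie, but cannot rescue a candidate whose acoustic probability is orders of magnitude behind.

```python
import math

def score(p_acoustic, p_prior):
    # Same illustrative combination as before: log acoustic fit
    # plus log language prior.
    return math.log(p_acoustic) + math.log(p_prior)

# Ambiguous audio: a modest prior boost (0.01 -> 0.05) flips the winner.
assert score(0.48, 0.05) > score(0.52, 0.01)

# Unambiguous-but-wrong audio: the codename's acoustic probability is
# 0.001, so even a 50x prior boost leaves it far behind the competitor.
assert score(0.001, 0.05) < score(0.9, 0.001)
print("prior boosts flip ties, not landslides")
```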

For the engineering onboarding and medical training verticals we focus on, prompting handles 90%+ of the closeable error categories. We use fine-tuning only for the largest customers whose vocabulary is genuinely outside the base model's distribution — which, for the 50–500-employee target ICP, is approximately none of them.

Where prompts cannot help

Glossary-biased decoding is not a universal fix. The error classes it does not close, in roughly the order you will encounter them:

Clean-audio misses: the audio is unambiguous and the model is simply wrong. There is no acoustic tie for the prompt's bias to break.

New terms introduced mid-talk: an acronym the speaker coins during the recording is not in the glossary, so it receives no bias.

Acoustic collisions between glossary terms: two prompted terms that sound alike (VPC and VPN, for example) are both boosted, and the prompt cannot choose between them.

Together, these residual errors are the difference between 99.2% and the theoretical 100% in our example. They are the kinds of errors that need either a human-in-the-loop review tier (which all incumbents charge for; see the 3Play vs GlossCap comparison for what that price step looks like) or simply being lived with as the rounding error of an audit-passing pipeline.

Building the glossary the first time

The glossary you ship with should be drawn from the customer's own materials, not invented. The fastest order of preference, from highest signal to lowest, is: existing terminology pages on the customer's wiki (Notion, Confluence, Google Docs); the internal style guide if one exists; the slide deck of the actual training video being captioned; the script if it survived the recording; and as a last resort, the first transcription pass with the unknown nouns extracted programmatically.

For training videos specifically, the slide deck is usually the highest-yield source per minute of effort. Slide titles and bullet points contain almost every term the speaker will say in domain-correct casing, often with surrounding parenthetical expansions for acronyms. Pulling the slide deck through a quick "extract unique nouns and acronyms" pipeline gives you a 30–60-term glossary in under five minutes, which is roughly the upper bound of what fits in the 224-token budget anyway. We default to this in our pipeline; the customer drops their slide deck into the upload alongside the video, and the glossary is built before transcription runs.
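The extraction pass can be sketched in a few lines. This is a heuristic illustration, not our production pipeline: it assumes you already have the slide text as a string (a pptx/pdf extractor runs first), and the regex catches acronym-shaped and CamelCase-shaped tokens only — all-lowercase identifiers like kubectl have to come from a wordlist instead.

```python
import re

def extract_candidate_terms(slide_text: str) -> list[str]:
    """Pull acronym- and identifier-shaped tokens out of slide text.

    Heuristic: all-caps runs of 2+ letters/digits (EKS, RBAC, S3) and
    words with two or more capitals (ConfigMaps). Plain capitalised
    English words (Deploying, Review) are excluded; all-lowercase
    identifiers (kubectl) are missed and need a separate wordlist.
    First-seen order is preserved, so slide order roughly ranks terms.
    """
    pattern = re.compile(
        r"\b(?:[A-Z][A-Z0-9]{1,}|[a-z]+[A-Z]\w*|[A-Z]\w*[A-Z]\w*)\b"
    )
    seen = []
    for match in pattern.findall(slide_text):
        if match not in seen:
            seen.append(match)
    return seen

slides = "Deploying to EKS: ConfigMaps, RBAC, and IAM roles. Review RBAC policies."
print(extract_candidate_terms(slides))  # ['EKS', 'ConfigMaps', 'RBAC', 'IAM']
```

Dedup-with-order matters here: terms that appear on many slides surface early and survive the token budget when the list is trimmed from the end.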

The glossary is the moat. It is not a one-time build; it accumulates per customer as more videos are processed, and it gets sharper as the pipeline learns which terms a given customer actually uses (versus which terms appeared on one slide once). Two months in, a customer's glossary covers their vocabulary so thoroughly that the marginal new asset transcribes correctly on the first try, and the only hand-corrections are the residual error classes above. That compounding is what makes this technique a product, not a one-shot tutorial.

FAQ

Does this only work with OpenAI Whisper, or do other STT engines support glossary biasing too?

Most modern STT engines expose some form of vocabulary or hint slot. Google Speech-to-Text has SpeechAdaptation phrase sets with per-phrase boost weights, AssemblyAI has word_boost, Deepgram has keywords with per-keyword intensities. The shapes differ — Whisper's prompt is text and biases the language prior implicitly; the others are explicit boost lists with explicit weights — but the underlying purpose is the same: nudge ambiguous audio toward your domain vocabulary. The Whisper approach scales the most cleanly for short prompts because text is dense; the explicit-boost approaches give you finer control at the cost of more configuration.
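For comparison, the explicit-boost shapes look roughly like this. These fragments are indicative only — field names follow each vendor's public documentation at the time of writing, and you should check the current docs before copying:

```json
{
  "google_speech_adaptation": {
    "phrase_sets": [
      { "phrases": [ { "value": "kubectl", "boost": 10 } ] }
    ]
  },
  "assemblyai": {
    "word_boost": ["kubectl", "ConfigMap"],
    "boost_param": "high"
  },
  "deepgram_query_string": "keywords=kubectl:2&keywords=ConfigMap:2"
}
```

Note what is absent from the Whisper column of this comparison: there are no weights at all. The "boost" is whatever the decoder's continuation training implies, which is both the convenience and the limitation.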

Should I include phonetic spellings in the prompt?

No, in almost every case. Whisper tokenises the prompt the same way it tokenises a transcript, and a phonetic spelling like "kube cuttle" becomes the tokens for "kube cuttle," not the tokens for "kubectl." If you put phonetic spellings in, the model becomes more likely to output the phonetic spelling. The exception is when a term is consistently mispronounced by the speaker in a way that does not match its written form — but at that point you are usually better off post-processing the transcript with a regex than fighting the prompt.
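The regex post-processing route mentioned above is only a few lines. A sketch, assuming the speaker's mispronunciation is consistent enough that the model's wrong output is stable — the correction pairs here are illustrative, and a real map should be built from observed errors, not guesses:

```python
import re

# Stable mis-transcriptions -> canonical term. Word boundaries keep
# the patterns from firing inside longer words.
CORRECTIONS = {
    r"\bcooper\s+netty'?s\b": "kubectl",
    r"\bcube\s+cuttle\b": "kubectl",
}

def post_process(transcript: str) -> str:
    for pattern, replacement in CORRECTIONS.items():
        transcript = re.sub(pattern, replacement, transcript, flags=re.IGNORECASE)
    return transcript

print(post_process("Then run cooper Netty's get pods to check."))
# Then run kubectl get pods to check.
```

Run this after transcription and before SRT writing, so the corrected text inherits the original segment timestamps unchanged.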

How big a quality lift should I expect on non-engineering content?

The lift scales with the gap between the model's training distribution and your content vocabulary. Engineering-onboarding captions show the largest gains because the technical-term density is high and the terms are rare in the open web. Medical training shows similar gains, dominated by drug and procedure names. Sales-enablement content shows mid-range gains, dominated by product and competitor names. Generic conversational content (a fireside chat, a leadership all-hands) shows the smallest gains, because the model's defaults are already close to right. Compliance training is mid-range — the acronyms (SOX, HIPAA, GDPR) help, but the underlying narrative is mostly plain English.

Is glossary-biased decoding the same as Retrieval-Augmented Generation (RAG)?

Conceptually similar, mechanically different. Both inject task-relevant context into a model's input to bias its output. RAG retrieves passages from a corpus at query time and inserts them into a generation prompt; glossary biasing pre-loads a curated terminology list into the decoder's context. RAG works at the document level over an LLM; glossary biasing works at the token level over a transcription decoder. The shared insight is that biasing a strong general-purpose model with the right context is usually cheaper and faster than training a domain-specific model from scratch.

Can I see this run on my own video before subscribing?

Yes. The embed preview shows the auto-caption-vs-glossary-caption side-by-side on a built-in dictionary; for a real run on your own asset, the Solo plan at $29/mo covers 5 hours per month and a paste-in glossary, and the Team plan at $99/mo covers 30 hours and Notion/Confluence/Docs glossary sync. Full pricing on the homepage. Or read the longer story of why we built this.

Why don't you publish a single open-source repo with the prompt and the scoring code?

The prompt template here is reproducible from this post — copy the GLOSSARY string, drop it into a Whisper call. The DCMP scoring is a thin wrapper over jiwer for word-level diff plus a manual formatting-error pass; we will publish that wrapper alongside the next post in this series. The non-trivial parts of our pipeline are the slide-deck-to-glossary extraction, the per-customer glossary versioning, and the WCAG 2.1 AA accessibility-statement generation — those are the product, not the prompt itself.

Further reading