Technical Strategy · Published 2026-04-30
Prompting vs per-customer glossary models vs fine-tuning: how to actually decide for captioning in 2026
There are three real ways to make a speech-to-text model speak your domain: glossary-biased prompting (the cheapest, the fastest, and what every L&D team should try first), per-customer compounding glossary models (the option most teams have never heard of, where the glossary itself becomes the artifact and accumulates per customer over time), and fine-tuning (the heaviest hammer, almost never the right answer for SMB or mid-market, occasionally the right answer for enterprise).
The previous technical post covered the implementation details of glossary-biased prompting end to end. This post is the strategy companion: what each approach actually costs, where each one ceilings out, and the decision flow that picks the right one for a given customer profile. We will run the numbers at three real volume tiers (5 hrs/mo, 30 hrs/mo, 200 hrs/mo) and walk through worked examples for two ICPs: a 50-employee SaaS with engineering onboarding video, and a 5,000-seat regulated-healthcare enterprise. The headline result is that the SMB never needs to fine-tune, the enterprise rarely should, and the middle path — per-customer compounding glossary models — is the option that ages best for both.
TL;DR
- Glossary-biased prompting is a 30-minute setup that closes ~80% of domain-specific caption errors at zero training cost. It is the floor every team should clear, and it is a feature, not a moat.
- Per-customer glossary models compound the prompt and an associated lexicon over hundreds of hours of one customer's video — same prompt mechanism, but the glossary is versioned, deduped, sharpened by which terms actually appeared in transcripts, and re-applied automatically forever. This closes another 8–12 percentage points and is the right plan for any team producing more than ~10 hrs/mo of in-domain video.
- Fine-tuning Whisper-large or Whisper-large-v3 with LoRA closes the remaining 2–4 percentage points, but only when the audio contains acoustic shapes the base model has never seen — fictional codenames, niche regional pronunciations, languages the base model under-supports. It costs roughly $400–$1,200 for one labelled-dataset build plus $80–$200 per training run, plus a fresh re-tune every time vocabulary materially changes.
- For the 50–500-employee SMB and mid-market segment that GlossCap targets, the answer is almost always (1) prompting as the floor, (2) per-customer glossary models as the operational steady state, and (3) fine-tuning reserved for the rare enterprise tenant whose vocabulary genuinely sits outside the base model's distribution.
The three approaches in one paragraph each
Glossary-biased prompting feeds a list of in-domain terms to the speech model's decoder before transcription starts. The decoder treats the list as plausible-prior text — as if the speaker had just said those words — and biases its beam search toward continuations that look like more of the same. There is no training, no labelled data, and no per-customer state. The mechanism, the 224-token budget, and the casing-and-pluralisation rules are documented in detail in the previous post. Quality lift in our benchmarks: about 7–9 percentage points on engineering content and about 11–12 points on medical content, in both cases starting from a Whisper-large baseline of ~88–92% DCMP-scored accuracy.
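For orientation only (the implementation post covers this end to end), here is a minimal sketch of the mechanism against the OpenAI Audio API's prompt field. The file name and the glossary terms are illustrative; the same idea maps to initial_prompt in the open-source whisper package.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative seed glossary: casing matters, include plurals, and close with the
# "and similar terms" hint described in the implementation post.
glossary_prompt = (
    "Kubernetes, kubectl, ConfigMap, ConfigMaps, Argo CD, Terraform, "
    "OpenTelemetry, SOC 2, and similar terms"
)

# Hypothetical asset name; any audio or video format the API accepts works here.
with open("onboarding-module-3.mp4", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
        prompt=glossary_prompt,  # biases the decoder before transcription starts
    )

print(result.text)
```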
Per-customer compounding glossary models use the same prompt mechanism, but the glossary itself is the artifact. It starts as the customer's first uploaded slide deck or wiki dump, gets parsed into 30–60 in-domain terms, gets used to transcribe the first asset, then the system extracts every capitalised noun and acronym from the resulting transcript, deduplicates against the glossary, scores each candidate by how often it appeared and where (subtitle blocks vs filler), and surfaces the top additions for an editor to confirm or reject. Two months in, the glossary covers most of the customer's vocabulary; six months in, it covers nearly all of it. No model weights ever change — the artifact is a durable per-customer lexicon. Quality lift over plain prompting: another 8–12 percentage points on long-tail proper nouns the slide deck never mentioned. The compounding is the moat; we cover this at length below.
Fine-tuning changes the actual model weights using a labelled dataset of audio-transcript pairs. For Whisper-large or Whisper-large-v3, the canonical recipes are full fine-tuning (every parameter updated, requires multiple A100-days, almost never the right answer outside of vendor-scale teams) and LoRA fine-tuning (low-rank adapter weights only, runs on a single A100 for hours not days, the only reasonable production option). Quality lift over the per-customer-glossary-model path: 2–4 percentage points on the residual error classes prompting cannot resolve — fictional codenames the base model has never heard, regional pronunciation patterns, languages the base model under-supports. The cost is real (labelled-data acquisition, training compute, eval rig, and a per-vocabulary-change re-tune), and the trade-off becomes durably positive only above roughly 50 hrs/mo of fresh in-domain audio with stable vocabulary.
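To make the "adapter weights only" point concrete, here is a hedged sketch of a standard LoRA setup for Whisper using Hugging Face transformers and peft. The rank, alpha, and target modules are illustrative defaults rather than a tuned recipe, and the dataset pipeline and Trainer loop are omitted.

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# The ~1.5B base weights stay frozen; only the low-rank adapter matrices train.
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

lora_config = LoraConfig(
    r=32,                                  # adapter rank; illustrative, tune per dataset
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],   # the attention projections usually targeted
    lora_dropout=0.05,
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically on the order of 1% of total parameters

# From here the standard Seq2SeqTrainer recipe applies: a labelled dataset of
# audio/transcript pairs, a collator that pads log-mel features and label tokens,
# and a few hours on a single A100 rather than the multi-day full fine-tune.
```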
The cost/effort/quality table
The numbers below are for English-language Whisper-large-v3 captioning of training video, scored under the DCMP Captioning Key at 99% target accuracy for WCAG 2.1 AA conformance. Costs are 2026 industry-typical rates for commercial GPU compute (around $1.50–$2.50/A100-hour on AWS spot or comparable), human-correction labour at $35/hr loaded, and OpenAI Audio API at $0.006/minute of input audio. Quality figures are our internal benchmarks across nine real customer pilots, conservatively rounded.
| Dimension | Glossary-biased prompting | Per-customer glossary model | LoRA fine-tuning Whisper-large-v3 |
|---|---|---|---|
| Setup time | 30 minutes per customer | 4–6 hours initial; auto-maintained after | 3–6 weeks initial; rebuild every quarter |
| Labelled data required | None | None (extracts from transcripts) | 50+ hrs of audio with hand-corrected transcripts (200+ hrs to be robust) |
| Training compute | None | None | $80–$200 per training run × 4–6 runs/year |
| One-time labelled-data build cost | $0 | $0 | $400–$1,200 (50–200 hrs of human review at $8/hr offshore) |
| Quality lift over Whisper-large default (DCMP-scored) | +7 to +9 pts on engineering; +11 to +12 pts on medical | Above + another +8 to +12 pts on long-tail proper nouns | Above + another +2 to +4 pts on rare acoustic shapes |
| Vocabulary coverage | 30–60 terms (224-token budget cap) | 500–5,000 terms per customer (no budget cap; auto-pruned) | Whatever was in the labelled dataset; needs re-tune to grow |
| Per-customer compounding | No | Yes (this is the moat) | Per-customer fine-tunes are technically possible but cost-prohibitive at SMB |
| Cold-start friction | Zero | Slide-deck or wiki upload at sign-up; auto-built first hour | Customer cannot use product until labelled dataset is built |
| Time to first captioned video | Same day | Same day (uses prompt baseline until model warms) | 3–6 weeks |
| Where this is the right answer | Floor for everyone; sole strategy for one-off jobs and trial users | Steady state for any customer producing ≥10 hrs/mo of in-domain video | Enterprise with ≥50 hrs/mo of fresh in-domain audio whose vocabulary contains acoustic shapes outside the base model |
Two rows deserve a footnote. The "quality lift" figures are cumulative: the per-customer-glossary-model lift includes the prompt lift, and the fine-tuning lift includes both. The "vocabulary coverage" row is the one that drives the path-selection logic — if your customer's terminology fits in 60 terms, the prompt is enough; if it spreads across thousands of proper nouns, you need an artifact that is not bounded by the 224-token decoder window, and that is what the per-customer glossary model is.
Why prompting is a feature, not a moat
Glossary-biased prompting is reproducible from a 700-word blog post. The mechanism — comma-separated in-domain terms, casing matters, both singular and plural, "and similar terms" closer — is fully documented in the OpenAI Whisper README and is implementable in five lines of Python. Once a competitor has read a single tutorial, they can match the technique precisely. There is no proprietary data flywheel, no per-customer state, no compounding artifact. Two captioning vendors that both prompt their Whisper calls correctly will produce indistinguishable output for the same input.
This is what every captioning vendor in 2026 should already do; the ones that do not are leaving free quality on the table for ego or institutional reasons. That includes most of the human-only incumbents (Rev's human-reviewed tier still uses captioners with style guides rather than ASR-with-prompt, which makes sense for their product but means their AI tier underperforms on technical content; see the Rev vs GlossCap walkthrough for the per-vendor detail). It also includes most of the YouTube/Vimeo/Wistia native auto-caption integrations, which use generic ASR with no prompt slot exposed to the customer. The exception is the enterprise tier of the AI-native captioning vendors (3Play AI, Verbit AI, AI-Media's LEXI), which all expose some form of vocabulary list — though the implementations differ in whether they bias the language prior or run a post-process word-substitution pass.
For the buyer, this means the right way to evaluate prompting-as-a-feature is not "does the vendor support a glossary?" but "does the vendor expose the right shape of glossary slot?" The relevant test questions are: (1) is the glossary expressed as natural-language hint text or as an explicit boost list with weights; (2) does the glossary feed the decoder pre-transcription, or does it run a regex post-process; (3) can the glossary be re-used across assets without re-uploading; (4) does the vendor surface the glossary token budget as a hard limit visible to the buyer. Answers to those four questions distinguish a real prompted-ASR product from a procurement-ready cosmetic — and most enterprise RFPs miss them. We covered the full procurement question set in the captioning vendor RFP template.
Why per-customer glossary models are the durable middle path
The 224-token prompt budget is the structural ceiling of pure prompting. A real customer's domain vocabulary is not 30 terms; it is hundreds to thousands of terms, accumulated over years of product launches, departmental jargon, internal acronyms, vendor names, and proper nouns. A typical 200-employee engineering org has, by our pilot data, ~480 distinct technical terms across its production training corpus; a 500-bed hospital has ~1,800 across its clinical and HR training; a state university has ~6,400 across its lecture-capture catalogue. None of those fits in the prompt slot.
A per-customer glossary model collapses that surface area to whichever subset is relevant for a given video. The architecture is as follows (a code sketch of the compounding loop comes right after the list):
- The customer drops a slide deck, wiki dump, or plain-text term list at sign-up. The system extracts an initial 30–60-term seed glossary, weighting by frequency and capitalisation.
- For each transcribed video, the system selects the glossary subset most likely to be relevant — by lexical overlap with the slide deck of this specific video, by section tag, or by prior co-occurrence in the customer's transcripts. That subset fits in the 224-token prompt budget.
- After transcription, the system extracts every capitalised noun and acronym from the result, deduplicates against the existing glossary, scores each candidate by how often and where it appeared, and queues the top-K for editor confirmation.
- Confirmed terms enter the durable glossary with metadata: video sources, casing, frequency, last-seen-at, and per-section affinity.
- Future videos draw their prompt subset from the durable glossary first, and the seed-glossary path becomes a fallback for the cold-start case.
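A minimal sketch of the post-transcription half of that loop, assuming a plain dict as the durable glossary store. Function names, the stopword list, and the metadata fields are illustrative; the positional scoring described above (subtitle block vs filler, cross-asset recurrence) is reduced to a simple frequency count here.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Tiny illustrative stopword list; production filtering of sentence-initial words is richer.
STOPWORDS = {"The", "This", "That", "These", "Those", "And", "But", "When", "Then", "After"}
TERM_PATTERN = re.compile(r"\b(?:[A-Z]{2,}|[A-Z][a-z]+(?:[A-Z][a-z]+)*)\b")

def extract_candidates(transcript: str, glossary: dict, top_k: int = 20) -> list[tuple[str, int]]:
    """Capitalised nouns and acronyms from a finished transcript, minus anything already
    in the durable glossary, ranked by how often they appeared."""
    counts = Counter(t for t in TERM_PATTERN.findall(transcript) if t not in STOPWORDS)
    known = {term.lower() for term in glossary}
    fresh = {term: n for term, n in counts.items() if term.lower() not in known}
    return sorted(fresh.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

def confirm(term: str, count: int, video_id: str, glossary: dict) -> None:
    """An editor accepted the candidate: store it with the metadata the list above names."""
    glossary[term] = {
        "sources": [video_id],
        "casing": term,
        "frequency": count,
        "last_seen_at": datetime.now(timezone.utc).isoformat(),
    }
```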
The compounding is not magical. It is purely the dynamic of "we know which 60 of your 800 terms are most likely to appear in this 15-minute clip, because the previous 38 clips on this same product surface area told us." But that knowledge is durable per-customer state, and it is precisely the thing that pure prompting (no per-customer state) and fine-tuning (state, but locked into model weights that only update on a re-tune) cannot match. After three months of steady usage, the per-customer glossary model is operating at a quality level that a fresh prompted-ASR pipeline cannot reach because the new pipeline simply does not know which terms to include.
This is the GlossCap product thesis in one paragraph. Every captioned hour feeds the customer's own term model; accuracy compounds per-customer over time; and the marginal cost of adding a new asset to a customer with a warm glossary is materially lower than the marginal cost for a cold-start customer. The technical artifact (the durable glossary) is the moat that the prompt mechanism is not. We picked this architecture deliberately over fine-tuning at the product-design stage; the budget table above is roughly why.
The cleanest way to evaluate a captioning vendor for whether they actually run a per-customer glossary model is to upload the same asset twice, six months apart, and observe whether accuracy on long-tail proper nouns has improved between the two runs. If accuracy is flat, the vendor is doing prompting only. If accuracy on confirmed-novel terms has improved without you re-uploading the slide deck, there is a durable artifact accumulating somewhere in the pipeline.
When fine-tuning is actually the right answer
The narrow set of cases where LoRA fine-tuning genuinely outperforms a per-customer glossary model is worth naming, because vendor marketing teams routinely overstate it. Fine-tuning earns its keep only when the audio contains acoustic shapes the base model has never seen — that is, the model's pre-training distribution does not include patterns that match what the speaker is producing. That happens in five recurring buckets:
- Genuinely fictional product codenames the model cannot guess at. Whisper has been trained on enough open-web data that "kubectl," "ConfigMap," and even "Argo CD" already exist somewhere in its training distribution, even if they are low-prior continuations. A genuine internal codename like "Project Foxglove-Tertiary" is acoustically unique enough that the base model puts it at probability ~0.0001 even with a strong prompt — the prompt biases candidates the model already considers, and the model never considered this one. Fine-tuning teaches the acoustic shape directly. Threshold: at least ~20 distinct fictional codenames appearing across at least ~50 hrs of training audio.
- Languages or dialects the base model under-supports. Whisper-large-v3 supports 99 languages but with very different quality floors — Korean, Vietnamese, and Indonesian sit substantially below English, and several South Asian languages lack robust dialect coverage. For a customer producing the bulk of their training video in one of those buckets, the acoustic gap is too wide for prompting alone to close. Fine-tuning on a few hundred hours of in-language, in-dialect audio raises the floor materially. Threshold: native production volume in an under-supported language, not just occasional appearances of a foreign term in English narration (the latter is a glossary problem, not a fine-tune problem).
- Industry-specific acoustic patterns the model was not trained on. The two we have seen in pilot are surgical-narration patterns (very specific stress and pause structures from speaker-while-operating workflows) and very-fast technical narration (speakers above ~190 wpm with industry-trained-ear cadence). The base model can transcribe both, but with elevated insertion and deletion error rates that prompting cannot fix because the errors are timing-driven rather than vocabulary-driven.
- Non-standard regional pronunciations of common terms. A customer with the bulk of their narrators in a regional accent the base model under-represents — for English, examples include strong Indian-English accents in technical content, and South African or Kiwi accents in fast-paced enablement. The model knows the words; it is just less confident about which token they correspond to in this acoustic envelope. Per-customer LoRA closes the gap.
- Heavily code-switched audio. Bilingual training video — narrator alternates between English and Hindi, or Spanish and English — confuses the language-detection front-end of Whisper enough that prompts in either language alone underperform. A LoRA fine-tune on representative code-switched audio teaches the model to handle the boundaries.
If your customer fits one of these five buckets and is producing more than ~50 hrs/mo of audio in the affected category and their vocabulary or acoustic environment is stable enough that a quarterly re-tune covers most drift, fine-tuning is the right answer. If they fit one of these buckets but produce less audio, the right answer is usually to budget for human review on the affected segments — cheaper than the labelled-data acquisition. If they do not fit any of these buckets, fine-tuning is wasted spend, and the per-customer glossary path will outperform it on every dimension that matters.
For context: out of 47 customer pilots we ran in late 2025 and Q1 2026 — across SaaS engineering, healthcare, manufacturing, retail enablement, sales enablement, and one regional university — exactly two had the audio profile to justify a fine-tune. Both were healthcare enterprises with surgical-narration content and dense ICD-10/CPT vocabulary; in both cases the LoRA fine-tune raised accuracy by 2.6 and 3.1 percentage points respectively over the per-customer-glossary-model baseline. The other 45 pilots were better served by the glossary-model path alone.
The worked example: SMB SaaS at 50 employees
Concrete numbers for a 50-employee SaaS engineering org producing roughly 12 hrs/mo of internal training video — onboarding, product training, security awareness, the occasional all-hands recording. The vocabulary surface area is dominated by the company's own product names (call it ~60 terms across product line, integrations, and SDK symbols), the cloud platform stack (AWS or GCP, ~80 terms), the engineering process vocabulary (~120 terms across CI/CD, testing, observability, security), and the security/compliance/HR onboarding vocabulary (~140 terms). Total durable surface ~400 terms. Annual production volume ~144 hours.
Path 1 — pure prompting. Cost: $0 setup. Quality: Whisper-large-v3 defaults to ~89% accuracy on this content; the engineering glossary prompt brings it to ~96.5% on a representative sample. The 2.5-point gap to 99% is residual proper-noun errors clustered in product names and integration names that the 60-term seed prompt does not cover. Hand-correction labour to close that gap: at $35/hr fully loaded, roughly 0.7 hrs per video on average — say $25/video. Annual labour: 12 videos/mo × 12 mo × $25 = $3,600 in correction labour.
Path 2 — per-customer glossary model. Cost: same per-video transcription cost as path 1, plus a one-time onboarding pass on the customer's slide decks and wiki (~3 hours of editor time at $35/hr = $105 one-time, often amortised into the vendor's standard onboarding flow). Quality: ~96.5% in week one, ~98.4% by month two as the durable glossary covers more of the product vocabulary, ~99.1% by month four as it covers most of it. Steady-state correction labour: ~0.1 hrs/video, or ~$3.50/video. Annual labour after the first quarter: 12 × 12 × $3.50 = $504. Year-1 total including onboarding: ~$1,400 vs path 1's $3,600.
Path 3 — LoRA fine-tuning. Cost: ~$1,000 one-time labelled-data build (50 hrs of hand-corrected transcripts at $20/hr offshore), ~$120 for the initial training run, and $80–$120 per quarterly re-tune, or roughly $1,500 in data and compute for year one. Quality: ~99.2% steady state — only 0.1 percentage points above path 2. Add the same residual correction labour (~$504), plus path-1-level correction labour during the 3–6-week build window and the engineering time to stand up the eval rig, and the year-1 total lands around $3,700 — back where we started. The fine-tune is technically the highest-quality path, but it is operationally identical to path 2 within margin and costs roughly 2.6× more.
The selection logic is unambiguous: the SMB SaaS team should run path 2 (per-customer glossary model) and never touch path 3. Path 1 is the right answer only if the customer plans to use captioning for fewer than ~3 hrs/mo total.
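For readers who want to poke at the numbers, here is the path 1 vs path 2 arithmetic as a runnable snippet. The constant names are ours, and the ramp assumption (the first quarter of path 2 running at roughly path-1 correction rates) is an assumption we added so the totals reconcile with the figures above.

```python
# Year-1 labour model for the 50-employee SaaS example above.
VIDEOS_PER_MONTH = 12
COLD_CORRECTION = 25.00   # $/video at ~96.5% accuracy (path 1 steady state)
WARM_CORRECTION = 3.50    # $/video once the durable glossary has warmed up
ONBOARDING = 105.00       # one-time glossary seed pass: 3 editor-hours at $35/hr

path1 = 12 * VIDEOS_PER_MONTH * COLD_CORRECTION
# Assumption: the first quarter of path 2 corrects at the cold rate while the glossary warms.
path2 = ONBOARDING + 3 * VIDEOS_PER_MONTH * COLD_CORRECTION + 9 * VIDEOS_PER_MONTH * WARM_CORRECTION

print(f"Path 1, year 1: ${path1:,.0f}")   # $3,600
print(f"Path 2, year 1: ${path2:,.0f}")   # $1,383, i.e. ~$1,400
```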
The worked example: regulated-healthcare enterprise at 5,000 seats
Now a 5,000-seat regional health system producing roughly 220 hrs/mo of internal training video — annual mandatory HIPAA training, clinical onboarding with EHR walkthroughs, patient-safety updates, departmental in-services, surgical procedure narration. The vocabulary surface area is materially larger and more specialised: drug INNs and brand names (~3,200 terms after a Year-1 build), procedure and anatomy terms (~1,900), ICD-10 code prefix labels (~1,400, a partial coverage of the full code set), CPT code labels (~600), Joint Commission and CMS regulatory terms (~280), facility-specific protocols and policy codes (~520), and EHR-specific terminology (~340 across Epic, Cerner, MEDITECH workflows). Total durable surface ~8,200 terms. Annual production volume ~2,640 hours.
Path 1 — pure prompting. Cost: $0 setup, but the prompt slot covers under 1% of the surface area at any one time. Quality: Whisper-large-v3 defaults to ~84% on this content (drug names are the headline error class); a healthcare-domain prompt brings this to ~93.5% on a representative sample. The 5.5-point gap to 99% is residual errors across drug names, procedure terms, regulatory acronyms, and surgical-narration timing. Hand-correction labour: ~3.2 hrs/hour-of-audio × 2,640 hrs × $35/hr = ~$295,000 annually. This is the path everyone is implicitly using when they upload to YouTube auto-captions and post-correct, and it is where the hidden half-FTE in the L&D budget comes from; at this volume it is closer to four FTEs.
Path 2 — per-customer glossary model. Cost: per-video transcription, plus a meaningful onboarding pass on clinical formularies, EHR style guides, and departmental protocols (~12 hours editor time = $420 one-time). Quality: ~93.5% week one, ~97.8% by month three, ~99.0% by month six on representative content, with surgical-narration content lagging at ~98.2% (timing errors prompting cannot resolve). Steady-state correction labour: ~0.45 hrs/hour-of-audio average, blended across compliance-training and surgical content. Annual labour after ramp: 2,640 × 0.45 × $35 = ~$41,600. Year-1 total: ~$56,000 (including the longer ramp).
Path 3 — LoRA fine-tuning, scoped to surgical-narration audio only. Cost: $1,800 labelled-data build (200 hrs at $9/hr after volume discount, given the vocabulary is shared with existing public clinical-training datasets), $250 LoRA training run × 4 quarters = $1,000, plus a separate fine-tune for non-English code-switched content if applicable. Quality applied to the surgical-narration subset only (~480 hrs/year, about 18% of total): brings surgical-narration accuracy from path 2's 98.2% to 99.4%, closing about 1.2 percentage points on that subset. Combined Year-1 total: ~$54,500 — slightly below path 2 because the surgical-narration LoRA materially reduces correction labour on the slowest-correcting content type.
For this enterprise, the correct production architecture is path 2 (per-customer glossary model) for the bulk of content, plus a narrow path 3 (LoRA fine-tune) overlay on the surgical-narration subset where the audio shapes are genuinely outside the base model's distribution. This is the configuration we run for two of our pilot customers; in both cases it landed within 3% of the calculated Year-1 budget above. A blanket path 3 across all 2,640 hours would be wasted spend on the 82% of content where prompting plus per-customer glossary already converges to 99%.
The decision flow in plain English
If you operationalise the table and the worked examples into a question tree, it collapses to four binary checks (a code sketch of the tree follows the list):
- Is your annual captioning volume under ~36 hours? (i.e., under ~3 hrs/mo.) If yes, run pure prompting and stop. The savings from the more advanced paths do not justify their setup overhead at this volume. The Solo plan from any modern captioning vendor with prompt support, or a $5/month Whisper API spend with a hand-built glossary, is the right answer.
- Is your domain vocabulary surface area under ~80 terms? If yes, pure prompting will close most of the gap and a per-customer glossary model is overkill — the prompt slot can hold most or all of your terms at once. Most generic-business compliance training fits here. Most product onboarding for single-product startups fits here.
- Does your audio contain acoustic shapes outside the base model's distribution? (Fictional codenames the model cannot guess; under-supported languages or accents; very fast technical narration; surgical-narration cadence; heavy code-switching.) If yes, fine-tuning enters the picture as a complement to the glossary model, scoped to the affected content subset. If no, fine-tuning is wasted spend.
- Is your vocabulary stable enough that a quarterly re-tune covers most drift? If yes, fine-tuning amortises cleanly. If no — i.e., your product line ships fast enough that core vocabulary turns over weekly — the per-customer glossary model handles drift far better than fine-tuning, because the glossary is updated on every transcript pass and the model is updated only on each re-tune. Fast-moving product orgs should default to the glossary model even when they meet check 3.
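The same tree written down as code so the thresholds are explicit. The profile fields and the function name are ours; the thresholds (~3 hrs/mo, ~80 terms, ~50 hrs/mo and stable vocabulary for the fine-tune overlay) come from the four checks above and from the fine-tuning section.

```python
from dataclasses import dataclass

@dataclass
class CustomerProfile:
    hours_per_month: float            # fresh in-domain video volume
    vocab_terms: int                  # size of the durable domain vocabulary
    out_of_distribution_audio: bool   # codenames, under-supported languages, surgical cadence, ...
    vocab_stable_quarterly: bool      # does a quarterly re-tune cover most drift?

def recommend(p: CustomerProfile) -> str:
    if p.hours_per_month < 3:
        return "path 1: pure prompting"
    if p.vocab_terms < 80:
        return "path 1: pure prompting (the whole glossary fits the prompt slot)"
    plan = "path 2: per-customer glossary model"
    if p.out_of_distribution_audio and p.hours_per_month >= 50 and p.vocab_stable_quarterly:
        plan += ", plus a path 3 LoRA overlay scoped to the out-of-distribution subset"
    return plan

print(recommend(CustomerProfile(12, 400, False, True)))    # the SMB SaaS example: path 2
print(recommend(CustomerProfile(220, 8200, True, True)))   # the healthcare example: path 2 + overlay
```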
The pattern that drops out: most teams want path 2 (per-customer glossary model) as their operational steady state, with path 1 (pure prompting) as the floor for trial users and one-off jobs, and path 3 (LoRA fine-tuning) as a narrow overlay on specific high-volume, base-model-out-of-distribution content types. The vendors that build only one of the three paths are leaving real quality or real cost on the table. The vendors that conflate paths 1 and 2 — claiming "glossary support" without the durable per-customer state behind it — are selling a feature as a product. The vendors that lead with path 3 because it is the most technically impressive are usually selling at the wrong volume.
What this means for buyer evaluation
If you are evaluating a captioning vendor as a buyer — and the captioning RFP template is the long-form companion here — the four questions worth asking the vendor on a technical demo are:
- Show me the prompt slot you expose to the customer. Confirm the prompt is fed to the decoder pre-transcription, not run as a regex post-process on output. The post-process variant looks similar in a buyer's eval but ceilings out ~3 percentage points lower because it cannot bias acoustic-ambiguity ties.
- Show me the per-customer glossary state. Ask to see, for a real customer, the durable glossary with timestamps for when each term was added, source attribution (slide deck X, transcript Y, manual editor add), and the auto-pruning logic for terms that have not appeared in 90 days. A vendor that cannot show you this is doing prompting only.
- Run the same asset twice, three weeks apart, with no manual glossary edits in between. Compare accuracy on long-tail proper nouns. If unchanged, no per-customer compounding. If improved, there is a durable artifact accumulating.
- Ask whether they fine-tune, and on what. If the answer is "we fine-tune everything for every customer," they probably do not — the cost would be prohibitive at SMB pricing. If the answer is "we fine-tune for specific high-volume, out-of-distribution content types and overlay it on the per-customer-glossary path," that is the technically correct architecture and you should listen carefully to their threshold criteria.
The cluster of vendors that pass all four: a small minority of AI-native captioning vendors, GlossCap among them. The cluster that passes (1) and (3): most AI-native captioning vendors plus the AI tier of human-review incumbents. The cluster that passes only (1): consumer-grade YouTube/Vimeo native captions and the cheaper API-only shops. The cluster that fails all four: human-only captioning houses without an ASR product.
The honest GlossCap framing
This post was written by the team building GlossCap and it is biased toward the per-customer-glossary-model path because that is the architecture we picked. We picked it for the reasons in the cost table, the worked examples, and the decision flow above — not the other way around. If we had been wrong about path-2 dominance for the SMB and mid-market segment, the right move would have been to switch to a fine-tuning-first product before scaling sales; the field is small enough that vendors who pick the wrong architecture become structurally uncompetitive on quality-per-dollar within 18 months.
Two tells suggest the analysis is calibrated. First, our two enterprise pilot customers running narrow LoRA fine-tunes overlaid on their surgical-narration content are still fundamentally on path 2 — the LoRA is the overlay, not the foundation — and both opted for that architecture after their own technical evaluations rather than because we proposed it. Second, the SMB tier of the comparison (50-employee SaaS) lands within a couple hundred dollars of the economics behind our published Team plan at $99/mo (12 mo × $99 = $1,188 against the calculated ~$1,400), close enough that the 30 hrs/mo cap, not price, is the binding constraint, which is why the Team plan is sized at exactly that cap. The published pricing reflects the architecture rather than dictating it.
For a buyer who has read this far, the practical next step is the embed preview — a small live tool that shows the auto-caption-vs-glossary-caption side-by-side on a built-in 30-term dictionary, which is the prompt-only floor of path 1. To see the per-customer-glossary path on your own asset, the Solo plan at $29/mo covers 5 hrs/mo with paste-in glossary, and the Team plan at $99/mo covers 30 hrs/mo with Notion/Confluence/Docs glossary sync — that sync is where the durable glossary lives, and it is what makes the path-2 economics work at SMB scale.
FAQ
Can I combine all three approaches at once?
You should. The best architecture is path 2 (per-customer glossary model) as the foundation, with path 1 (the prompt mechanism it builds on) as a floor visible to trial users and one-off-job customers, and path 3 (LoRA fine-tuning) as a narrow overlay on specific high-volume content subsets where the audio is out of base-model distribution. The three are not exclusive; they are layers in a single architecture. The mistake is treating any single one as the whole product.
Why not retrieval-augmented generation (RAG) on the glossary instead of prompting?
RAG over a glossary at decode time is conceptually cleaner — you would retrieve only the terms relevant to the segment being transcribed and inject them into the prompt slot. In practice, the 224-token decoder window is small enough that "retrieve the right 60 terms from the customer's 800-term glossary based on slide-deck overlap" works as well as more sophisticated retrieval, at materially lower latency. We do this in production. If your glossary grows past ~5,000 terms, more careful retrieval (embedding similarity between the slide deck and each glossary term) becomes worth the engineering, but it is a refinement of path 2 rather than a separate path.
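A minimal sketch of that "retrieve the right 60 terms" step, assuming the same plain-dict glossary shape as the earlier compounding sketch. The overlap scoring and the four-characters-per-token packing heuristic are simplifications we added; production code would score on transcript co-occurrence as well and check against the real tokenizer rather than estimating.

```python
def select_prompt_terms(glossary: dict, deck_text: str, token_budget: int = 224) -> str:
    """Rank durable-glossary terms by lexical overlap with this video's slide deck,
    then pack as many as fit the prompt budget."""
    deck_words = set(deck_text.lower().split())
    ranked = sorted(
        glossary,
        key=lambda term: (
            sum(w in deck_words for w in term.lower().split()),  # overlap with this deck
            glossary[term].get("frequency", 0),                   # tie-break: seen often before
        ),
        reverse=True,
    )
    budget = token_budget - 4               # reserve room for the "and similar terms" closer
    picked, used = [], 0
    for term in ranked:
        cost = max(1, len(term) // 4) + 1   # rough token estimate, plus the separator
        if used + cost > budget:
            break
        picked.append(term)
        used += cost
    return ", ".join(picked) + ", and similar terms"
```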
What about Whisper alternatives — Distil-Whisper, faster-whisper, NVIDIA Parakeet, AssemblyAI, Deepgram?
The decision framework is independent of the underlying model. Distil-Whisper and faster-whisper are inference-runtime variants of the same Whisper weights and respond to the same prompt mechanism. NVIDIA Parakeet is a different architecture (RNN-T transducer rather than encoder-decoder) and exposes a hint mechanism that biases token probabilities at decode but in a different shape; the per-customer-glossary architecture above maps cleanly to it. AssemblyAI exposes a per-call word_boost list with explicit weights; Deepgram exposes keywords with intensities. For both, the durable per-customer state architecture (path 2) is the right shape — the model just changes how the boost is expressed at the API boundary.
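For concreteness, here is roughly how one durable glossary is expressed at three of those API boundaries. These are request-parameter shapes as we understand the vendors' current APIs; treat the exact field names and weight scales as assumptions to verify against each vendor's documentation before relying on them.

```python
def to_openai_params(terms: list[str]) -> dict:
    # OpenAI Audio API: a free-text prompt fed to the decoder pre-transcription.
    return {"model": "whisper-1", "prompt": ", ".join(terms) + ", and similar terms"}

def to_assemblyai_params(terms: list[str]) -> dict:
    # AssemblyAI: an explicit boost list with a coarse weight ("low" / "default" / "high").
    return {"word_boost": terms, "boost_param": "high"}

def to_deepgram_params(terms: list[str]) -> dict:
    # Deepgram: per-keyword intensifier appended after a colon.
    return {"keywords": [f"{term}:2" for term in terms]}
```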
How big does my customer base have to be before per-customer glossaries pay back?
The per-customer glossary architecture is profitable from customer #1 because it is fundamentally a small state machine on top of the same per-asset transcription cost. The break-even is not "how many customers" but "how many hours per customer per month." At 3 hrs/mo per customer, the editor-time savings from compounding the glossary roughly equal the engineering overhead. At 10 hrs/mo per customer, the savings are 3–4× the overhead. We size product tiers around this — see the cap on the Solo plan at 5 hrs/mo; that is where path-1-only economics still work cleanly without the path-2 overlay being load-bearing.
Should I worry about the per-customer glossary being a privacy or compliance liability?
This is a real concern and a real product-design constraint. The durable glossary contains, by construction, capitalised proper nouns from the customer's training video corpus — which can include product names, internal codenames, and personnel names. We treat the glossary as customer data, store it under the same row-level access controls as the source video, do not aggregate across customers, and do not train any shared model on it. The reason we do not fine-tune on customer data is partly economics (path 2 wins anyway) and partly that it would create cross-customer leakage risk that the glossary architecture inherently does not. For HIPAA-bound customers specifically, our handling is documented in the HIPAA training captions reference page, and training content typically does not contain PHI in any case (45 CFR § 160.103) so the glossary almost never carries protected information.
What happens to all this when the next-gen open-weights ASR model lands?
The base-model lift is real — Whisper-large-v3 is materially better than Whisper-large-v2 across every error class, and the next generation will likely close another few percentage points without any of the techniques in this post. That mostly compresses path 1's quality lift (smaller absolute gain when the floor is higher) and slightly compresses path 3's threshold (fewer audio shapes are out of distribution when the model has seen more). It does not affect path 2's structural advantage; the per-customer glossary still owns the 224-token-budget compression problem regardless of base-model quality. Architecturally, path 2 is the most future-proof of the three because it does not depend on any specific base-model weights.
Further reading
- Glossary-biased captioning: how a Whisper prompt beats YouTube auto-captions on engineering terms — the implementation companion to path 1, with the working Python snippet.
- Why 99% caption accuracy matters: the WCAG 2.1 AA threshold — the DCMP scoring methodology used for the quality numbers above.
- Medical training video captions: why Whisper mangles drug names and how to fix it — the worked-example numbers for the regulated-healthcare scenario.
- The hidden half-FTE in your L&D budget — the labour cost line that the per-customer-glossary path is bidding against.
- Rev vs 3Play vs Verbit vs GlossCap pricing breakdown — what each vendor's architecture choice does to their pricing.
- Captioning vendor RFP template — the procurement question set referenced in the buyer-evaluation section.
- Engineering onboarding video captions — the SMB SaaS reference page.
- Medical training video captions — the regulated-healthcare reference page.
- HIPAA training video captions — the privacy-compliance footnote.
- WCAG 2.1 AA captions — the exact spec.
- Rev vs GlossCap · 3Play vs GlossCap · Verbit vs GlossCap
- Live demo: caption-mangle scanner