Product Operations · Published 2026-06-04
The caption feedback loop: how training teams compound glossary accuracy from 91% to 99% over 6 months
Most L&D teams think of captions as a one-time event: you upload a video, you get a caption file, you move on. The first time this works, accuracy is whatever it is — typically 88–93% for domain-specific training content before any vocabulary customization. The team reviews the file, makes a few corrections, uploads it to the LMS, and considers the task done. A month later, another batch of training videos comes in, and the same thing happens. Each video starts from scratch. Accuracy fluctuates based on audio quality and speaker style. The correction burden stays constant. Nothing compounds.
The feedback loop is what changes this. It is not a workflow optimization or a manual quality-control procedure — it is a structural feature of how a per-customer glossary architecture is supposed to work. Every correction event generated during a caption review session is a data point: a specific phoneme sequence was decoded incorrectly in a specific context, a human reviewer recognized the error and marked the canonical form, and that correction — if captured and routed correctly — should update the glossary so the same phoneme sequence is decoded correctly in every future session. When this loop runs consistently, the first hour of captioned content for a new topic starts at 91%. The sixth hour, after two correction cycles, starts at 95%. The hundredth hour, after a mature glossary has accumulated six months of correction data, starts at 99.1%. The accuracy is not staying flat. It is compounding. That is the feedback loop.
This post is the operational guide to running that loop in an L&D team. It covers what the six correction signal types are and how each one maps to a glossary update action, how accuracy progresses through three distinct phases over six months, what the feedback pattern looks like in five different verticals, who on the team owns which part of the loop, what tooling the loop requires to function, how a six-month-trained glossary creates a switching cost that generalist transcription vendors cannot match, and what the eight failure modes are that cause teams to stall at 94–96% when they should be at 99%+. The predecessor to this post, the glossary architecture guide, covers what a well-structured glossary entry looks like and why the architecture matters as much as the term selection. This post picks up where that one leaves off: you have a glossary. Now here is how you run the improvement system that makes it compound over time.
TL;DR — three things that matter about the feedback loop
- The loop runs on six signal types, not one. Most teams capture only direct substitution corrections ("this word was wrong, here is the right word"). The other five signals — phoneme pattern rejections, confidence-score distributions, context-collision patterns, vocabulary drift events, and session-level WER trends — are equally important and are what drive accuracy improvement past the 95–96% ceiling that direct-correction-only systems hit.
- The compounding is non-linear and front-loaded. Months 1–3 produce the steepest accuracy gains because the phoneme coverage of high-frequency proper nouns saturates rapidly. By month 4, 80–85% of the original error volume has been closed. The remaining 15–20% is harder: context-ambiguous terms, audio-quality-limited transcriptions, and vocabulary drift from new content. Accepting that the curve flattens after month 4, and shifting maintenance effort accordingly, is what separates teams that reach 99%+ from teams that plateau at 96–97%.
- At 200+ captioned hours, the glossary represents 12–18 months of accumulated domain intelligence. This is not a lock-in tactic. It is a real data accumulation effect: 1,500–4,000 correction events, 200–400 glossary terms with phonetic variants and context signals, and a WER trajectory that proves compliance. Teams that understand this plan their content migration and vendor evaluation decisions around it — the switching cost is real and it grows with every session.
What the feedback loop is (and is not)
The feedback loop is not periodic manual glossary maintenance. Manual glossary maintenance is when an L&D manager or accessibility lead opens a spreadsheet, reviews a list of terms they suspect are problematic, adds some new ones, removes some stale ones, and hands the updated list to the captioning system. That is useful, but it does not compound: the next session starts at the same accuracy level as the last one unless the human reviewer happened to catch the terms that are actually generating errors in production. The manual approach scales with team effort, not with content volume. It also misses the signals that are not obvious to a reviewer working from a term list: the confidence-score distribution that reveals a term is being decoded correctly on average but incorrectly when it appears after a specific preceding word, or the vocabulary drift signal that shows a new product name introduced six weeks ago is already generating 30 correction events per week across the training library.
The feedback loop is also not model fine-tuning. Fine-tuning updates the parameters of the underlying ASR model on domain-specific data. It can improve accuracy, but it requires substantial labeled audio data, takes weeks to run and validate, is expensive at training time, and produces a model that may perform better on the fine-tuning domain but worse on general vocabulary — a real concern for L&D teams whose training library spans engineering onboarding, compliance training, HR policy, and sales readiness within the same organization. The decision framework post on prompting vs glossary vs fine-tuning covers when fine-tuning is the right choice. For most L&D teams at under 10,000 hours of domain audio, it is not.
What the feedback loop actually is: a structured capture of correction signal from caption review sessions, routed into glossary entry updates — new terms added, existing terms' phonetic variants extended, context signals strengthened, priority weights adjusted — so that the next session starts with a glossary that reflects everything the previous sessions taught the system about the organization's vocabulary. It runs at the speed of the review cadence (weekly for active content producers, biweekly for smaller teams), not at the speed of manual maintenance cycles. And critically, it runs on six distinct signal types, not just the most obvious one.
The six correction signal types
Each signal type maps to a specific glossary update action. Teams that capture all six improve at the full compounding rate. Teams that capture only the first one plateau 2–4 percentage points below where the loop should take them.
Signal 1 — Direct substitution correction. The reviewer sees a wrong word and marks the right word. "The elastic search cluster" → "The Elasticsearch cluster." This is the most obvious signal and the most widely captured. The glossary update action: add "Elasticsearch" to the glossary (or update its phonetic variants) with the decoded form ("elastic search") registered as a phoneme variant. Priority weight: set to high if this is the first time the correction appears; increase further if it has appeared more than three times.
Signal 2 — Phoneme pattern rejection. The reviewer accepts a correction but the corrected form includes a phoneme sequence that appears in multiple wrong decodings across different sessions. Example: every time a speaker says "kubectl" it is decoded differently — "cube control," "cube cuttle," "cube cuddle," "kube control" — with no single wrong form dominating. The glossary update action: register multiple phoneme variants under the canonical form "kubectl" rather than treating each wrong form as a separate correction event. This is why the phonetic variant structure in a well-built glossary entry matters — a flat word list cannot represent "this term has four different wrong decodings and all of them should map to the same canonical form."
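The difference between Signal 1 and Signal 2 is largely a data-shape question, and a small sketch can make it concrete. The Python below is an illustration under stated assumptions, not the product's glossary format: `GlossaryEntry` and `fold_corrections` are hypothetical names, and an entry here holds only a canonical form plus a count of each wrong decoding observed.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryEntry:
    """Hypothetical entry shape: one canonical form and its observed wrong decodings."""
    canonical: str
    variants: dict = field(default_factory=dict)  # wrong form -> times seen


def fold_corrections(events):
    """Group (decoded_form, canonical_form) pairs under their canonical form."""
    glossary = {}
    for decoded, canonical in events:
        entry = glossary.setdefault(canonical, GlossaryEntry(canonical))
        key = decoded.lower().strip()  # treat "Elastic Search" and "elastic search" as one variant
        entry.variants[key] = entry.variants.get(key, 0) + 1
    return glossary


# Correction events taken from the examples in the text above.
events = [
    ("elastic search", "Elasticsearch"),
    ("cube control", "kubectl"),
    ("cube cuttle", "kubectl"),
    ("cube cuddle", "kubectl"),
    ("kube control", "kubectl"),
    ("Elastic Search", "Elasticsearch"),
]
glossary = fold_corrections(events)
print(sorted(glossary["kubectl"].variants))
print(glossary["Elasticsearch"].variants)
```

A flat word list would store four unrelated "kubectl" corrections; the entry shape above stores one canonical form with four variants, which is the structure Signal 2 depends on.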
Signal 3 — Confidence-score distribution. The caption system assigns a decode confidence score to each token in the transcript. A term that is decoded "correctly" on average but has a wide confidence distribution — sometimes 0.92, sometimes 0.34, for the same phoneme sequence in different contexts — is a term that is at risk of being wrong in the low-confidence sessions. The glossary update action: inspect the low-confidence sessions for this term. If the decoder is uncertain, it often means the context signals for the glossary entry are insufficient. Adding a context signal ("when the term 'Kubernetes' or 'deployment' appears in the same sentence, this phoneme sequence is almost certainly 'kubectl'") narrows the confidence distribution and eliminates the low-confidence misdecode path.
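As a rough sketch of how a team might surface Signal 3 candidates, the Python below flags terms whose average confidence looks healthy but whose spread is wide. The function name `flag_unstable_terms` and both thresholds (`min_mean`, `max_spread`) are assumptions chosen for illustration; the right cutoffs depend on how a given caption system scales its confidence scores.

```python
from statistics import mean, pstdev


def flag_unstable_terms(scores_by_term, min_mean=0.7, max_spread=0.15):
    """Return terms that look fine on average but vary widely across sessions.

    scores_by_term maps a glossary term to the decode confidence scores
    observed for it. Thresholds are illustrative assumptions.
    """
    flagged = []
    for term, scores in scores_by_term.items():
        if len(scores) < 2:
            continue  # a single observation has no spread to measure
        if mean(scores) >= min_mean and pstdev(scores) > max_spread:
            flagged.append(term)
    return flagged


scores_by_term = {
    "kubectl": [0.92, 0.34, 0.91, 0.88, 0.93],   # one low-confidence session
    "Elasticsearch": [0.95, 0.93, 0.96, 0.94],   # consistently confident
}
unstable = flag_unstable_terms(scores_by_term)
print(unstable)
```

Terms returned by a check like this are the ones worth opening in the low-confidence sessions to see which context signal is missing.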
Signal 4 — Context-collision pattern. Two glossary terms share the same or similar phoneme sequence and the decoder has to choose between them based on context. Example: "cloud" (generic) vs "CloudFormation" (AWS service name) vs "Cloud Run" (Google Cloud service name). In an engineering onboarding video that discusses both AWS and GCP simultaneously, the decoder cannot reliably distinguish these without strong context signals. The glossary update action: strengthen the context signals for each term — "CloudFormation" is preceded by "AWS," "stack," or "template"; "Cloud Run" is preceded by "GCP," "container," or "serverless" — so the decoder can choose correctly even when the base phoneme sequence is shared. This is the signal most responsible for the accuracy improvement in months 3–4, when most of the high-frequency substitution corrections have already been captured and the remaining errors are disproportionately context-collision misdecodes.
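To illustrate what a context signal does, here is a deliberately simplified Python sketch. It is not how a decoder applies glossary bias internally: it scores candidate canonical forms after the fact by counting context words in the same sentence, whereas the text above describes signals tied to preceding terms. `CONTEXT_SIGNALS` and `pick_canonical` are hypothetical names, and the signal words are the ones listed in the paragraph above.

```python
import re
from typing import Optional

# Context words from the examples above; the structure itself is an assumption.
CONTEXT_SIGNALS = {
    "CloudFormation": {"aws", "stack", "template"},
    "Cloud Run": {"gcp", "container", "serverless"},
}


def pick_canonical(sentence: str, signals=CONTEXT_SIGNALS) -> Optional[str]:
    """Pick the candidate with the most context-word hits, or None if unclear."""
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    scores = {term: len(ctx & words) for term, ctx in signals.items()}
    best = max(scores, key=scores.get)
    top = scores[best]
    # No hits, or a tie between candidates, means the context is not decisive.
    if top == 0 or list(scores.values()).count(top) > 1:
        return None
    return best


print(pick_canonical("Deploy the AWS stack template with cloud formation"))
print(pick_canonical("Ship the container to GCP cloud run"))
print(pick_canonical("The AWS stack and the GCP container"))
```

The third call returns `None` because both candidates score equally, which is the situation in a video that discusses AWS and GCP together: the remedy is more specific signals, not a guess.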
Signal 5 — Vocabulary drift event. A new term appears in the training content that was not in the glossary at the time of upload: a new product name, an acquired company's product line, a new regulatory citation, a new internal acronym. The signal is not a correction per se — it is a first-occurrence detection. The glossary update action: add the new term immediately, before the next session, rather than waiting for the next manual maintenance cycle. Teams that capture vocabulary drift events in real time keep pace with their vocabulary frontier. Teams that batch them into quarterly glossary reviews fall two to six months behind the content they are producing.
Signal 6 — Session-level WER trend. Each review session produces an aggregate WER figure: total word errors divided by total words in the session. Tracking WER at the session level over time gives the team a leading indicator of glossary health. If session WER is trending down steadily, the loop is working. If session WER plateaus or rises — even slightly, from 3.1% to 3.8% over four consecutive sessions — it is a signal that vocabulary drift has outpaced the glossary update cadence, or that a new category of content (a new training domain, a new instructor whose speech patterns are different) is introducing error types the current glossary does not cover. The glossary update action: investigate the sessions driving the WER uptick. Are they from a specific instructor? A specific content category? A new topic that requires a targeted glossary extension?
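The session-level check can be sketched in a few lines of Python. This is a minimal illustration, assuming the team already has per-session error and word counts; `session_wer` and `wer_is_drifting` are hypothetical names, and the four-session window and 0.5-point minimum rise are assumptions loosely based on the 3.1% to 3.8% example above.

```python
def session_wer(word_errors: int, total_words: int) -> float:
    """Session WER as defined above: total word errors divided by total words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return word_errors / total_words


def wer_is_drifting(history, window=4, min_rise=0.005):
    """True if WER never fell across the last `window` sessions and rose by at least `min_rise`.

    Pass min_rise=0.0 to also treat a flat run (a plateau) as a warning.
    """
    if len(history) < window:
        return False
    recent = history[-window:]
    never_fell = all(later >= earlier for earlier, later in zip(recent, recent[1:]))
    return never_fell and (recent[-1] - recent[0]) >= min_rise


history = [0.045, 0.031, 0.033, 0.036, 0.038]  # last four sessions: 3.1% -> 3.8%
print(wer_is_drifting(history))
print(wer_is_drifting([0.050, 0.045, 0.040, 0.035]))
```

A `True` result is only the trigger; the investigation described above (which instructor, which content category, which topic) is still a human task.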
The six-month accuracy trajectory
The trajectory is non-linear. It is steep in months 1–3, moderates in months 4–5, and approaches a ceiling in month 6 that represents the fundamental accuracy limit of the combined glossary + model system for that organization's vocabulary density and recording quality. Understanding the shape of the curve is important for two reasons: it tells the team when to expect the most significant correction load (early), and it explains why a team that stops running the loop in month 2 will plateau 2–4 percentage points below where they would end up if they ran it for six months.
The table below shows representative six-month accuracy trajectories across five verticals. Starting WER is measured on the first month of captioned content before any glossary entries beyond the initial seed set (typically 15–25 terms based on documentation sources). The figures are word accuracy rates (1 − WER), with WER measured against human-reviewed gold-standard transcripts on a consistent held-out test set. The vertical benchmark post covers the baseline figures for each vertical in more detail.
| Month | Engineering onboarding | Healthcare / pharma training | Sales enablement | Financial services compliance | Manufacturing / EHS |
|---|---|---|---|---|---|
| 0 (seed glossary only) | 88.8% | 91.3% | 90.7% | 91.8% | 92.4% |
| 1 (first correction cycle) | 93.2% | 93.9% | 93.5% | 93.7% | 94.1% |
| 2 (high-frequency phonemes covered) | 96.1% | 96.5% | 95.9% | 95.6% | 96.2% |
| 3 (context disambiguation active) | 97.8% | 98.2% | 97.4% | 96.9% | 97.5% |
| 4 (compounding plateau approach) | 98.8% | 99.0% | 98.4% | 97.8% | 98.3% |
| 5 (maintenance mode) | 99.1% | 99.2% | 98.9% | 98.3% | 98.9% |
| 6 (mature glossary) | 99.2% | 99.3% | 99.1% | 98.5% | 99.0% |
Several patterns in this table are worth examining. Engineering content starts the lowest (88.8%) because engineering vocabulary has the highest density of phonemically novel proper nouns — compound technical terms, command-line tool names, API endpoint paths — that Whisper's training corpus contains at low frequency. But it also compresses fastest in months 1–2: the vocabulary is well-defined and stable, and the first 40–60 glossary terms capture a high proportion of the error volume. Healthcare training starts higher (91.3%) because clinical vocabulary is better represented in ASR training data, but requires higher precision — a 1% error rate on drug names in a clinical training context is categorically different from a 1% error rate on general text — and it reaches the highest six-month ceiling (99.3%) because drug names, once correctly glossarized, are stable.
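The "compresses fastest" observation can be checked directly against the table. The Python sketch below copies the table's figures into a dictionary (the names `ACCURACY` and `gain` are illustrative) and computes the accuracy gain in percentage points between any two months.

```python
# Word accuracy (%) by month 0..6, copied from the table above.
ACCURACY = {
    "Engineering onboarding": [88.8, 93.2, 96.1, 97.8, 98.8, 99.1, 99.2],
    "Healthcare / pharma training": [91.3, 93.9, 96.5, 98.2, 99.0, 99.2, 99.3],
    "Sales enablement": [90.7, 93.5, 95.9, 97.4, 98.4, 98.9, 99.1],
    "Financial services compliance": [91.8, 93.7, 95.6, 96.9, 97.8, 98.3, 98.5],
    "Manufacturing / EHS": [92.4, 94.1, 96.2, 97.5, 98.3, 98.9, 99.0],
}


def gain(series, start_month, end_month):
    """Accuracy gain in percentage points between two months, rounded to 0.1."""
    return round(series[end_month] - series[start_month], 1)


# Gain over the first two correction cycles (month 0 to month 2).
early_gain = {vertical: gain(series, 0, 2) for vertical, series in ACCURACY.items()}
fastest = max(early_gain, key=early_gain.get)
print(early_gain)
print(fastest)
```

The same helper works for any pair of months, for example `gain(series, 4, 6)` to see how little the curve moves once a vertical is in maintenance mode.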
Financial services compliance reaches the lowest six-month ceiling (98.5%) not because the glossary architecture fails but because regulatory citation codes ("FINRA Rule 4511(c)," "Form ADV Part 2A," "Reg BI") contain numerals and abbreviations that do not have consistent phoneme sequences across speakers. "Form ADV" spoken by one instructor sounds different from the same phrase spoken by another, and there is no phonemic context signal that can reliably resolve the ambiguity. This is a category of error the feedback loop narrows but cannot fully close; it requires human review of the specific citation-dense segments.
The 99% word accuracy threshold cited for WCAG 2.1 AA SC 1.2.2 compliance is the line these trajectories are measured against. With a structured feedback loop running, healthcare reaches it at month 4, engineering crosses it at month 5, and sales enablement and manufacturing reach it at month 6; financial services ends month 6 at 98.5%, just below the threshold, because of the citation-code residual described above. Without the feedback loop (running the same content volume through a static glossary that is not updated from correction data), typical six-month accuracy figures for these same verticals land in the 92–95% range: well below the compliance threshold and well below what a feedback-loop-driven glossary achieves.
The three phases of the feedback loop
The six-month trajectory breaks into three operationally distinct phases. Each phase has a different dominant signal type, a different correction load for the review team, and a different set of failure modes. Teams that run all three phases correctly reach 99%+. Teams that exit the loop early — typically by treating the month-2 plateau as "good enough" — leave 2–4 percentage points on the table permanently.
Phase 1: Seed phase (months 1–2)
The seed phase is characterized by high correction volume, high signal value per correction, and rapid accuracy gains. In month 1, a typical team producing 10–15 hours of new training content per month generates 2,000–4,000 correction events. These events are disproportionately concentrated on 20–40 high-frequency proper nouns: product names, platform names, technology acronyms, and whatever specialized vocabulary appears most frequently in the specific training domain.
This concentration is not random. Proper-noun frequency distributions in organizational training content follow a Zipf-like pattern: the 40 most frequently occurring proper nouns account for roughly 80% of the proper-noun error volume. This is the leverage point of the seed phase. If the team correctly identifies and glossarizes these 40 terms — using real correction data, not documentation-sourced term lists — they close 80% of the initial error gap within two months. This is what drives the steep accuracy gain in months 1–2: it is not a smooth improvement across all vocabulary, it is the rapid closure of a concentrated, high-frequency error cluster.
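A team can test whether its own correction data shows this concentration before committing to a 40-term seed list. The Python sketch below assumes the review tool can export a count of correction events per canonical term; `top_n_coverage` and `terms_needed_for` are hypothetical helper names, and the four-term sample is toy data for illustration only.

```python
from collections import Counter


def top_n_coverage(correction_counts: Counter, n: int) -> float:
    """Share of all correction events accounted for by the n most-corrected terms."""
    total = sum(correction_counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in correction_counts.most_common(n))
    return top / total


def terms_needed_for(correction_counts: Counter, target: float = 0.8) -> int:
    """Smallest number of top terms whose corrections reach the target share."""
    total = sum(correction_counts.values())
    running = 0
    for rank, (_, count) in enumerate(correction_counts.most_common(), start=1):
        running += count
        if running / total >= target:
            return rank
    return len(correction_counts)


# Toy data: correction events per canonical term.
counts = Counter({"kubectl": 50, "Elasticsearch": 30, "CloudFormation": 15, "eksctl": 5})
print(top_n_coverage(counts, 2))
print(terms_needed_for(counts, 0.8))
```

If `terms_needed_for` returns a number far above 40 on real data, the error volume is flatter than the Zipf-like pattern described above and the seed list may need to be larger.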
The practical implication for the team is that the seed phase requires active engagement with the correction data. The review cadence should be weekly. Every correction event should be captured (not just the worst errors). The 20–40 highest-frequency corrections should be analyzed for phoneme variant patterns — if a term is being decoded wrong in three different ways, all three wrong forms should be registered as phoneme variants, not just the most common one. The implementation guide for glossary-biased decoding covers the technical mechanism here in detail.
The failure mode in the seed phase is what might be called the "documentation fallacy": building the initial glossary from documentation sources (the product glossary page, the internal wiki, the company style guide) rather than from real correction data. Documentation-sourced glossaries capture terms the organization thinks are important, not the terms the ASR system is actually getting wrong. The terms generating the most errors in production are frequently not the most prominent terms in any documentation — they are the oral-tradition shorthand, the lab-specific abbreviation, the phonemically ambiguous compound that a senior engineer says 30 times per training video. The only way to find these terms is from the correction data itself.
The other seed-phase failure mode is over-expansion: adding 200 terms in month 1 before any correction data is available. A large pre-loaded glossary without correction data has two problems. First, the phoneme variants for terms that have not yet appeared in real audio are incomplete — they are based on the editor's guess about how the word sounds, not on the actual decode sequence the ASR system produces. Second, a 200-term glossary in month 1 creates context-collision risk before the context signal layer is built: all 200 terms compete for phoneme sequences that overlap, and the decoder is more likely to make a wrong disambiguation in the absence of strong context signals. The right seed-phase approach is a 20–40 term core glossary, rapidly expanded by real correction data, not a pre-built comprehensive list.
Phase 2: Compounding phase (months 3–4)
The compounding phase begins when the high-frequency substitution error volume drops significantly, typically by roughly 65–70% from the month-1 peak. At this point, the seed-phase corrections have been absorbed, the 40 highest-frequency proper nouns are decoded correctly in most contexts, and the team's correction workload has fallen from 2,000–4,000 events per month to 700–1,200 events per month.
The error distribution has also shifted. The seed phase was dominated by direct substitution errors on high-frequency terms. The compounding phase is dominated by context-collision errors: terms that are in the glossary and have correct phoneme variants, but are being decoded to the wrong canonical form in specific contexts. The Elasticsearch vs "elastic search" example has been resolved. The new errors are things like: "Amazon S3 bucket" being decoded as "Amazon ES bucket" in a segment where "Elasticsearch" was mentioned three sentences earlier (the decoder is still in an "Elasticsearch" context and misapplies it). Or "Kubernetes pod" being decoded as "Kubernetes pod scheduling" as one string instead of "Kubernetes pod" as the noun and "scheduling" as the next word — a timing artifact.
The dominant signal type in the compounding phase is Signal 4 — context-collision patterns. The glossary update action is context signal construction: adding "preceding terms" and "following terms" constraints to disambiguation-sensitive glossary entries so the decoder's context window is used to choose between phonemically similar options. This is more sophisticated than phoneme variant registration, and it is more time-consuming per correction event — but the accuracy return per event is high in months 3–4 because these context-sensitive errors account for a disproportionate share of the remaining WER.
The compounding phase is also when the team begins seeing the first vocabulary drift events (Signal 5). Three to four months into a training library's production cycle, the organization has almost certainly released a new product version, updated a compliance framework citation, or introduced a new internal project name. These terms do not appear in the existing glossary. If they are added promptly (within one week of first appearing in captioned content), the decoder picks them up on the first review cycle. If they are missed and accumulate in the uncaptured vocabulary frontier, they become the source of a WER plateau in months 5–6.
The correction workload in the compounding phase is lower than in the seed phase, but the per-correction analysis depth is higher. A seed-phase correction takes 10–30 seconds: mark the wrong form, type the right form, submit. A compounding-phase context-collision analysis takes 3–5 minutes: identify the context collision, review the two competing glossary entries, determine which context signals should distinguish them, add the signals to both entries, verify that the signals are specific enough to avoid false positives. Teams that treat the compounding phase like the seed phase, making quick surface corrections without analyzing the context-collision pattern, will not extract the full accuracy value from this phase.
Phase 3: Plateau management (months 5+)
By month 5, the glossary has accumulated 150–250 terms with phonetic variants and context signals. Session WER for most verticals is between 0.8% and 1.1%, which puts word accuracy at or near the 99% WCAG 2.1 AA threshold. The dominant correction signal shifts to Signal 6 (session-level WER trend) and Signal 5 (vocabulary drift). The team's monthly correction workload has dropped to 200–500 events, largely vocabulary drift captures and a small residue of audio-quality-limited transcriptions that the glossary cannot improve.
The plateau management phase is not passive. The glossary needs maintenance at three levels. First, quarterly sweeps: a structured review of all glossary entries to identify stale terms (products that have been renamed or discontinued), weight adjustments for terms whose context signal patterns have shifted over time, and explicit removal of terms that are generating false-positive context collisions with newer terms. Second, event-triggered updates: immediate glossary additions when a vocabulary drift event is detected (a new product launch, an acquisition, a regulatory citation update). Third, annual quality audits: a full WER measurement on a held-out test set to confirm the glossary is maintaining its accuracy against any model updates or corpus shifts in the underlying ASR system.
The plateau management phase has one structural failure mode: complacency. Teams that reach 99%+ accuracy in month 5 sometimes stop running the weekly review cadence, fall to monthly, then quarterly, then "whenever someone notices a problem." Vocabulary drift continues on the content side whether or not the review cadence is running. New products, new team members, new regulatory citations, and new training topics add to the vocabulary frontier every week. A team that ran weekly reviews for months 1–4 and then stopped will find their WER rising from 0.8% to 1.5% to 2.8% over months 6–12, as the vocabulary frontier extends beyond the glossary boundary. The 99%+ accuracy that the feedback loop produced is not a permanent achievement. It requires maintenance.
Vertical-specific feedback patterns
The feedback loop runs on the same six signal types across all verticals, but the distribution of signal types, the vocabulary drift rate, and the ceiling accuracy differ by vertical in ways that should inform how the team calibrates their review cadence, their per-correction analysis depth, and their glossary expansion strategy.
Engineering / technical training
Engineering training content has the highest phoneme novelty density of any L&D vertical. Command-line tool names, library names, API endpoint paths, infrastructure service names, and version strings all produce phoneme sequences that Whisper encountered only at low frequency in its training data. "kubectl," "eksctl," "kube-apiserver," "HorizontalPodAutoscaler," "PersistentVolumeClaim," and "CloudFormation" all produce consistent wrong-form decodings that the seed phase needs to capture explicitly. For the engineering vocabulary failure modes in detail, see the implementation guide for engineering glossary-biased decoding.
The engineering feedback loop has two characteristics that distinguish it from other verticals. First, it is front-loaded: the vocabulary is well-defined and stable (platform names do not change every quarter), so the seed-phase correction investment pays off at scale and then requires only vocabulary drift maintenance. The glossary for an engineering team that has run the loop for six months typically has 80–120 terms. After month 4, correction volume drops to fewer than 200 events per month. Second, engineering vocabulary has a compound-word structure that requires explicit phoneme variant registration for word-boundary positions — "kubectl" is decoded as two words ("kube ctrl") or three words ("kube cut l") as often as it is decoded as one wrong word, and the phoneme variant list for this entry needs to cover all the word-boundary variants, not just the most common one.
The engineering onboarding captions page covers the platform-specific vocabulary surfaces (AWS, Azure, GCP, Kubernetes, Terraform, CI/CD toolchains) that most engineering L&D teams will need to address in their seed-phase glossary build.
Healthcare and pharmaceutical training
Healthcare training content has a precision requirement that other verticals do not: a drug name substitution error is not just a compliance risk. It is a potential patient safety risk if the captioned training video is the reference a nurse or technician cites during a procedure. The feedback loop for healthcare content therefore needs to operate at a lower WER tolerance than the 1% WER ceiling that the WCAG 2.1 AA threshold implies. Teams running healthcare training content typically target 0.3–0.5% WER rather than 1.0%, which means the plateau phase for healthcare is earlier but the maintenance phase is more demanding.
The dominant healthcare feedback signal in months 1–3 is Signal 2 (phoneme pattern rejection), not Signal 1 (direct substitution). Drug names that are phonemically similar — "metformin" vs "metronidazole," "lisinopril" vs "labetalol," "tirzepatide" vs "semaglutide" — frequently share phoneme segments in ways that cause the decoder to collapse them under the wrong canonical form across multiple wrong-form variants. Registering all the wrong forms is not sufficient: the context signal construction (Signal 4) is essential, using co-occurring clinical vocabulary ("type 2 diabetes," "A1C," "insulin resistance" for metformin; "bacterial infection," "IV infusion," "anaerobic" for metronidazole) to prevent cross-term collisions. The healthcare glossary case study covers the morpheme-boundary substitution pattern in detail, including why "tirzepatide" is decoded as "tier zip a tide" without a phoneme variant entry and how the INN naming convention creates systematic phoneme families that can be pre-populated in the glossary.
Healthcare training also has one of the most stable vocabulary drift profiles of any vertical. Drug names are governed by the WHO International Nonproprietary Name process, which changes slowly. New drugs enter the clinical training curriculum on a predictable schedule tied to approval timelines. This makes healthcare one of the verticals where the quarterly sweep maintenance cadence is most reliable: new vocabulary arrivals can be anticipated and prepared for in advance rather than discovered reactively via correction events. Teams using HealthStream or HIPAA compliance training platforms will find that the quarterly drug formulary update cycle is a natural trigger for glossary review.
Sales enablement
Sales enablement training has the highest vocabulary drift rate of any vertical. New product names, new feature names, new competitor product names, new pricing structure names, updated SKU designations, and the specific jargon of a new sales methodology all enter the training content every quarter — or in high-velocity sales organizations, every month. The feedback loop for sales enablement must run at a weekly cadence not because the correction volume is high (it typically moderates after month 3) but because the vocabulary drift detection window needs to be short enough to catch new terms before they accumulate 30+ errors in unreviewed sessions.
The dominant sales enablement feedback pattern after month 3 is Signal 5 (vocabulary drift) rather than Signal 4 (context collision). Sales enablement vocabularies do not have the dense phoneme-similarity collisions that engineering or healthcare vocabularies have — "Salesforce" and "HubSpot" are not phonemically ambiguous in the way "kubectl" and "kube control" are. But new terms enter the vocabulary frontier so frequently that the gap between the glossary boundary and the actual content vocabulary widens faster than in any other vertical if the drift-detection cadence is not maintained. A sales enablement team at a SaaS company releasing a new product tier every quarter needs to add the new tier name, its associated feature names, and the revised pricing vocabulary to the glossary within the first week of those names appearing in training content — not the first month.
Sales enablement training content also has the highest phonetic abbreviation density after engineering. "MEDDIC," "BANT," "SPICED," "ARR," "NRR," "CAC," "LTV" — these abbreviations are used as words by sales trainers, and their phoneme sequences are ambiguous without context signals. "ARR" spoken fast sounds like "are" or "our" without a preceding context of "annual" or a following context of "growth." The sales enablement captions page covers the SKU-name problem and the methodology-acronym pattern in the context of WorkRamp, Allego, and other sales readiness platforms.
Financial services compliance
Financial services compliance training has the most numerically dense vocabulary of any vertical — not numerical in the sense of numbers as words, but in the sense of alphanumeric citation codes that are used as vocabulary items: "FINRA Rule 4511(c)," "Form ADV Part 2A," "Reg BI," "SR 14-1," "CCAR submission," "Basel III Pillar 2." These citation codes combine letters, numbers, and Roman numerals in sequences that no ASR model handles well without explicit glossary entries, and the phoneme sequence for "Basel III" spoken by a Boston-accented instructor is different from the same phrase spoken by a Southern-accented colleague.
The feedback loop for financial services reaches a lower ceiling than other verticals (98.3–98.5% at six months) for this structural reason: citation codes contain embedded numerals whose phoneme sequences cannot be made consistent through glossary injection alone. The practical consequence: financial services L&D teams running compliance training content should plan for a residual human review requirement for citation-code-dense segments (regulatory exam prep content, disclosure obligation walkthroughs, recordkeeping requirement summaries) even after the feedback loop matures. The glossary investment still reduces correction volume by 75–80% compared to a no-glossary baseline, but the final 1.5–2.0% of errors in citation-code segments is not closeable by the loop. Document this as a known residual in the compliance audit record: accuracy documentation that shows 98.3% measured word accuracy plus explicit identification of the citation-code error category is a stronger compliance position than accuracy documentation that claims 99%+ without acknowledging the residual category.
Manufacturing, EHS, and safety training
Manufacturing and EHS training content has the most stable vocabulary of any vertical. OSHA chemical names under HazCom (IUPAC nomenclature, common names, CAS registry designations), NFPA ratings, SDS section headers, and OSHA CFR citation codes change infrequently. The vocabulary drift rate for a manufacturing or EHS training library is typically 3–8 new terms per quarter versus 15–30 for sales enablement. This means the feedback loop for EHS content reaches its plateau phase faster (often by month 3) and requires less maintenance effort per unit of content volume than other verticals.
The distinctive feature of the EHS feedback loop is its documentation value. In an OSHA inspection scenario, the inspector may ask for evidence that caption quality was reviewed and that specific safety-critical terms were verified. A timestamped correction event log — "2026-04-12: 'sodium hydroxide' decoded as 'sodium oxide' in SDS handling training; corrected to canonical form; phoneme variant registered; glossary version 3.7 applied to reprocessed file" — is exactly the kind of documentation that demonstrates an active quality review process rather than a passive auto-caption deployment. The EHS feedback loop, run correctly, produces this documentation automatically as a side effect of the review cadence. See the HazCom captioning post for the IUPAC chemical name glossary architecture and the OSHA 1910.1200(h) documentation requirement that makes this audit trail relevant in enforcement contexts.
The manufacturing training captions page covers the platform context (most manufacturing L&D teams use Cornerstone OnDemand, TalentLMS, or a custom LMS built on SCORM), and the safety training captions page covers the OSHA and NFPA vocabulary surfaces that dominate the EHS correction signal in the seed phase.
Who owns the feedback loop: roles, responsibilities, and RACI
The feedback loop fails most often not because the technology does not work but because no one owns the specific activities that keep it running. In most L&D teams, caption correction is treated as an ad hoc task assigned to whoever has bandwidth rather than as a defined operational role with a weekly cadence, clear deliverables, and accountability to a WER metric. The following role definitions are based on what successful feedback loop implementations look like across L&D teams at 50 to 5,000-seat organizations. Not all organizations will have a dedicated person for each role — in smaller teams, one person often owns multiple roles — but the function of each role needs to be explicitly assigned rather than implicitly assumed.
Role definitions
Caption reviewer: Executes the weekly correction review session. Reviews caption files from the prior week's processing sessions, marks substitution errors, flags phoneme pattern rejections, notes vocabulary drift events, and submits the correction log. This role does not require technical knowledge of the glossary architecture — it requires familiarity with the organization's vocabulary and the judgment to recognize when a decoded form is wrong versus unconventional. Time commitment: 30–90 minutes per week in the seed phase, 15–30 minutes per week in the compounding phase, 5–15 minutes per week in the plateau phase.
Glossary maintainer: Takes the correction log from the caption reviewer and executes the glossary update actions. For Signal 1 corrections, this is a 30-second task. For Signal 4 context-collision analysis, it is a 3–5 minute task per entry. The glossary maintainer also runs the quarterly sweep and the event-triggered vocabulary additions. This role requires familiarity with the glossary entry structure — canonical form, phonetic variants, context signals, priority weights — but does not require deep technical background. In most L&D teams, the glossary maintainer role is owned by the same person as the caption reviewer or by the L&D manager. Time commitment: 30–60 minutes per week in the seed phase, 15–30 minutes per week in the compounding phase, 10–20 minutes per week in the plateau phase plus quarterly sweep (1–2 hours).
Content owner: Approves new glossary terms for their domain before they are added. The content owner for engineering training is typically the senior engineer or the engineering L&D program manager. The content owner for compliance training is typically the compliance officer or legal counsel. The content owner's function is to prevent the glossary from drifting toward informal or incorrect canonical forms — if a reviewer marks "Kubernetes" as the correct form but the organization's style guide uses "k8s" as the canonical written form in training documentation, the content owner needs to make that call before the glossary entry is committed. In practice, content owner approval is needed for new terms, not for phoneme variant additions or context signal updates to existing terms. Time commitment: 10–20 minutes per week in the seed phase, 5–10 minutes per week thereafter.
LMS or platform administrator: Handles caption file replacement when a reprocessed file (with the updated glossary applied) needs to be re-delivered to the LMS. In most LMS platforms, replacing a caption file requires a manual upload step — the LMS admin needs to be notified when a reprocessed file is available and perform the replacement before the next learner access event for that video. On platforms that support SCORM or xAPI packages with embedded caption tracks, the replacement may require re-packaging and re-uploading the entire SCORM object. For Kaltura, Cornerstone OnDemand, TalentLMS, and Docebo, the LMS ingestion workflow post covers the platform-specific file replacement procedures. Time commitment: 15–30 minutes per week on average, with spikes when batch retroactive reprocessing is triggered by a major glossary update.
Accessibility lead (or compliance officer): Owns the WER metric and the compliance audit record. Reviews the monthly WER trend report, confirms that the organization is meeting or approaching the 99% accuracy target it uses for WCAG 2.1 SC 1.2.2 caption conformance, and maintains the accuracy documentation that would be produced in a DOJ or OCR compliance investigation. The accessibility lead does not execute correction reviews; they monitor the output of the loop and intervene when the trend report shows a WER plateau or regression. This is also the role that flags when a content category reaches the compliance threshold and can be moved to a lighter review cadence. Time commitment: 20–30 minutes per month for trend review, with additional time when an audit is triggered.
RACI by feedback loop activity
| Activity | Caption Reviewer | Glossary Maintainer | Content Owner | LMS Admin | Accessibility Lead |
|---|---|---|---|---|---|
| Weekly correction review session | R | C | — | — | I |
| Glossary entry update (Signal 1–4) | I | R | C (new terms only) | — | — |
| Vocabulary drift detection and add | C | R | A | — | I |
| Monthly WER trend review | I | C | — | — | R/A |
| LMS caption file replacement | — | I | — | R | I |
| Quarterly glossary sweep | C | R | A | — | I |
| Annual WER audit (held-out test set) | — | C | — | — | R/A |
R = Responsible (executes), A = Accountable (approves), C = Consulted (input before action), I = Informed (notified after action).
For small L&D teams where one person owns multiple roles, the most common consolidation is Caption Reviewer + Glossary Maintainer (one person does both the review and the glossary update) with the L&D Manager doubling as Accessibility Lead for the compliance documentation. Content Owner remains separate even in small teams — the subject-matter expert who approves engineering glossary terms is rarely the same person who runs the review session, and that separation matters for canonical form accuracy.
Tooling the feedback loop
The feedback loop requires four data persistence layers to function correctly. Teams that try to run the loop without all four will find that one of the signal types is being captured correctly while others are being lost, producing partial compounding rather than full compounding.
The four data layers
Layer 1 — Glossary state file. The current version of the glossary: a structured artifact with canonical terms, phoneme variant lists, context signals, priority weights, and version metadata. This should not be a flat text file or a spreadsheet — it needs to support the four-component structure described in the glossary architecture post. JSON or YAML format with a schema that enforces the presence of all four components is appropriate. Version-controlled in git or a comparable system so that the WER trend can be correlated with specific glossary versions. The version number should be embedded in the processed caption file metadata so that a reviewer can always determine which glossary version produced a specific output.
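As a rough illustration of what "a schema that enforces the presence of all four components" could look like, here is a minimal Python sketch. The field names (`canonical`, `phoneme_variants`, `context_signals`, `priority_weight`), the JSON layout, and the sample weight of 0.8 are assumptions made for this example, not the format of any particular captioning product.

```python
import json
from dataclasses import dataclass

# Hypothetical field names for the four components of a glossary entry.
REQUIRED_FIELDS = ("canonical", "phoneme_variants", "context_signals", "priority_weight")


@dataclass(frozen=True)
class GlossaryEntry:
    canonical: str
    phoneme_variants: list
    context_signals: list
    priority_weight: float


def load_glossary(text):
    """Parse a glossary state file and reject entries missing any component."""
    doc = json.loads(text)
    entries = []
    for raw in doc["entries"]:
        missing = [name for name in REQUIRED_FIELDS if name not in raw]
        if missing:
            raise ValueError(f"entry {raw.get('canonical', '?')!r} is missing {missing}")
        entries.append(GlossaryEntry(**{name: raw[name] for name in REQUIRED_FIELDS}))
    # The version travels with the entries so it can be stamped on caption files.
    return doc["version"], entries


# Illustrative state file; the values are made up for the example.
SAMPLE = """
{
  "version": "3.7",
  "entries": [
    {
      "canonical": "sodium hydroxide",
      "phoneme_variants": ["sodium oxide"],
      "context_signals": ["SDS", "handling"],
      "priority_weight": 0.8
    }
  ]
}
"""

version, entries = load_glossary(SAMPLE)
print(version, entries[0].canonical)
```

The point of rejecting incomplete entries at load time is that a spreadsheet-style row with only a canonical term fails loudly instead of silently entering the glossary.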
Layer 2 — Correction event log. A timestamped, append-only record of every correction event: session ID, video ID, timecode, decoded form, corrected form, signal type (1–6), and the glossary entry that was updated as a result. The correction event log is the raw material for the compounding phase context-collision analysis and for the compliance documentation the Accessibility Lead maintains. It is also the audit trail that proves the review process is active rather than ceremonial — in a DOJ accessibility investigation, the question is not just "what is your WER?" but "how do you know, and how is it maintained?" A correction event log that goes back six months with weekly entries answers that question directly.
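One simple way to get a timestamped, append-only log is a JSON Lines file that is only ever opened in append mode. The sketch below assumes that storage choice; the class name, field names, and file name are hypothetical and simply mirror the fields listed above.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class CorrectionEvent:
    session_id: str
    video_id: str
    timecode: str
    decoded_form: str
    corrected_form: str
    signal_type: int        # 1-6, matching the six signal types
    glossary_entry: str     # canonical form of the entry that was updated
    recorded_at: str = ""   # filled in at append time when left empty


def append_event(log_path, event):
    """Append one event as a JSON line; existing lines are never rewritten."""
    if not 1 <= event.signal_type <= 6:
        raise ValueError("signal_type must be between 1 and 6")
    record = asdict(event)
    record["recorded_at"] = record["recorded_at"] or datetime.now(timezone.utc).isoformat()
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def read_events(log_path):
    """Read the full log back as a list of dicts, oldest first."""
    with Path(log_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Opening the file with mode `"a"` is what keeps the sketch append-only at the application level; a real deployment would likely add storage-level protections as well.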
Layer 3 — Session-level WER trend data. A time series of WER measurements, one per processing session, stored with the session metadata (date, video count, total minutes, content category, glossary version). The WER trend visualization — even as a simple spreadsheet chart — is what makes the compounding curve visible to the Accessibility Lead and to whoever manages the L&D team's compliance posture. Without the trend data, the team cannot distinguish between "we are at 97.8% because we have been running the loop for three months" and "we dropped from 98.2% to 97.8% over the last four sessions and vocabulary drift is accelerating." The trend direction is more actionable than the point estimate.
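To make the distinction between a point estimate and a trend concrete, the following sketch computes word error rate as word-level edit distance divided by reference length, then labels the direction of the last few sessions. The four-session window and the 0.1-point tolerance are assumptions chosen for illustration, not recommended values.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def trend_direction(accuracies, window=4, tolerance=0.1):
    """Label the recent direction of per-session accuracy (in percent)."""
    recent = accuracies[-window:]
    if len(recent) < 2:
        return "insufficient data"
    change = recent[-1] - recent[0]
    if change > tolerance:
        return "improving"
    if change < -tolerance:
        return "regressing"
    return "flat"


wer = word_error_rate("add sodium hydroxide slowly", "add sodium oxide slowly")
print(f"accuracy: {100 * (1 - wer):.1f}%")
print(trend_direction([96.9, 97.2, 97.5, 97.8]))
print(trend_direction([98.2, 98.1, 98.0, 97.8]))
```

Both example series end at 97.8%, which is the situation described above: the same point estimate can sit on a rising curve or a falling one.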
Layer 4 — Processed caption file provenance. Each caption file delivered to the LMS should carry metadata linking it to the glossary version that produced it and the correction event that last verified or corrected it. This provenance record is the backbone of the compliance audit trail: it allows the organization to demonstrate that a specific caption file for a specific training video was produced under a specific glossary version that had passed a specific accuracy standard at a specific date. Without this layer, the WER trend data floats free of the individual files it applies to, and the compliance documentation cannot answer "was the caption file currently serving this video produced after the glossary reached the 99% threshold, or before?"
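The "after the threshold, or before?" question becomes a simple lookup once each file carries a provenance record. This is a minimal sketch under assumed names: the `CaptionProvenance` fields and the example dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CaptionProvenance:
    video_id: str
    glossary_version: str
    processed_on: date
    last_correction_event_id: str  # links back to the correction event log


def produced_before_threshold(files, threshold_crossed_on):
    """Return video IDs whose serving caption file predates the threshold date."""
    return [f.video_id for f in files if f.processed_on < threshold_crossed_on]


library = [
    CaptionProvenance("vid-03", "1.4", date(2026, 1, 20), "evt-0112"),
    CaptionProvenance("vid-07", "3.7", date(2026, 4, 12), "evt-2290"),
]
# Assumed date on which the glossary first met the accuracy threshold.
print(produced_before_threshold(library, date(2026, 3, 1)))
```

The same records can also answer which glossary version produced a given file, which is what ties the WER trend data back to individual caption files.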
Tooling to avoid
Several common tooling choices break the feedback loop by preventing one or more signal types from being captured correctly.
Spreadsheet-based glossary management is the most common failure point. A spreadsheet can represent canonical terms and perhaps a single "alternative spelling" column, but it cannot represent the phoneme variant list, context signal structure, or priority weight that a fully structured glossary entry requires. Teams that manage glossaries in spreadsheets typically capture Signal 1 (direct substitution) but not Signals 2–4, which means they close 60–70% of the error gap instead of the 85–90% that a fully structured approach achieves. If your team is currently managing the glossary in a spreadsheet, the architecture post has a migration path to the structured format.
Offline or batch correction workflows break the correction event log by introducing delays between the correction event and the glossary update. If a reviewer marks errors in a review session on Monday and the glossary is not updated until Friday, four days of new content have been processed under the stale glossary. In the seed phase, where vocabulary is changing fast and correction volume is high, a 4-day update lag can represent 30–50 correction events that compound on each other. The loop should be structured so that correction events from a review session trigger glossary updates before the next processing session begins.
Vendor-managed terminology lists break the vocabulary drift capture. Several enterprise captioning vendors (Rev, 3Play Media, Verbit) offer human-managed terminology services where the vendor's transcription team maintains a term list on the organization's behalf. The term list is updated on the vendor's review schedule — typically monthly at best. Vocabulary drift events that occur within the review interval are not captured until the next vendor review cycle. For organizations with weekly or biweekly new content production, this means the glossary is always 2–6 weeks behind the vocabulary frontier, and the Signal 5 drift capture that drives the plateau-maintenance phase does not run at the cadence the feedback loop requires.
The switching cost that compounds
At the beginning of this post, the feedback loop was described as a structural feature of how a per-customer glossary architecture is supposed to work — not a lock-in tactic. The distinction matters. The switching cost that accumulates over six months of feedback loop operation is a real data accumulation effect, not a contractual barrier. Understanding what it consists of makes it possible to plan for it honestly rather than discovering it unexpectedly when evaluating alternatives.
At 200 captioned hours with a structured feedback loop running since month 1, a typical mature glossary has accumulated:
- 175–350 glossary entries with canonical forms, phoneme variant lists (average 2.8 variants per entry), context signals (average 3.1 signals per disambiguation-sensitive entry), and priority weights calibrated from real correction data.
- 3,500–8,000 correction events in the event log, spanning 12–24 months of weekly review sessions, timestamped and linked to specific video IDs and glossary versions.
- A WER trajectory from the starting baseline to current state, measured on a consistent held-out test set, demonstrating the compounding curve and the compliance-threshold crossing date.
- Vocabulary drift capture history showing every new term that entered the content vocabulary and was added to the glossary, with the date of first occurrence and the date of glossary addition. This history is the organizational memory of what terms matter in training content — it is not reconstructable from documentation alone, because many of the terms in it were discovered from correction data rather than from any source document.
If the organization migrates to a different captioning vendor at this point, what does the rebuild cost look like? The new vendor starts from their baseline model accuracy — typically 88–93% for the same content before any vocabulary customization. They do not have access to the phoneme variant list, context signal structure, or correction event history that the current glossary encodes. They can request a term list from the organization, but a flat term list is not a structured glossary: it has the 175–350 canonical terms but not the 490–980 phoneme variants, not the 540+ context signals, and not the priority weight calibration that took 3,500–8,000 correction events to build. Re-establishing accuracy from a flat term list through correction and feedback with the new vendor takes 3–5 months to reach the same accuracy level the current glossary already delivers — and it requires the same correction workload the team already invested in months 1–4 of the original loop.
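The variant and signal counts in that comparison follow directly from the per-entry averages quoted earlier. A quick check of the arithmetic, assuming for the lower bound that every entry is disambiguation-sensitive:

```python
ENTRY_RANGE = (175, 350)        # glossary entries at 200 captioned hours
AVG_VARIANTS_PER_ENTRY = 2.8
AVG_SIGNALS_PER_ENTRY = 3.1     # quoted per disambiguation-sensitive entry

# Phoneme variants missing from a flat term list, low and high estimate.
variant_range = tuple(round(n * AVG_VARIANTS_PER_ENTRY) for n in ENTRY_RANGE)

# Context signals at the low entry count, assuming every entry carries signals.
signals_low = int(ENTRY_RANGE[0] * AVG_SIGNALS_PER_ENTRY)

print(variant_range)
print(signals_low)
```

A flat term list carries only the first number in each pair; everything computed here is what the new vendor would have to rediscover through correction events.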
The migration cost also includes re-delivering corrected caption files to the LMS for all content processed under the prior glossary. If the new vendor starts at 91% and the organization is running a compliance program that requires documented 99%+ accuracy, the existing caption file inventory needs to be reprocessed and re-uploaded. For a 200-hour content library across a platform like Cornerstone OnDemand or Kaltura, that is a non-trivial LMS administration task. The LMS caption audit methodology post covers the scope of a reprocessing project of this type — including the per-platform time estimates for caption file replacement.
None of this is an argument against ever switching vendors. Legitimate reasons to switch exist: pricing, customer support quality, feature requirements, platform integrations, contractual terms. The point is that switching has a real cost that is proportional to the maturity of the feedback loop that has been running. Teams that understand this cost — and have a clear record of the glossary data they have accumulated — are in a position to negotiate data portability with their current vendor as a contract term (request the structured glossary export in a standard format as part of contract renewal) rather than discovering at offboarding that the glossary lives inside the vendor's system and cannot be exported in a usable form.
Eight common failure modes
The feedback loop produces the accuracy trajectory in the table above when it runs correctly. The following eight failure modes are the most common reasons teams stall before reaching 99%+, listed roughly in order of frequency.
1. Declaring victory at month 2
The most frequent failure mode. The seed phase closes 60–70% of the initial error gap, correction volume drops significantly, and the team interprets the drop in correction events as a signal that the glossary is "done." In reality, the glossary has closed the high-frequency substitution errors but has not yet built the context signal layer that addresses context-collision misdecodes. Accuracy is typically 95–96% at month 2, which feels satisfactory but is still below the 99% level this post treats as the compliance threshold for WCAG 2.1 AA caption conformance. The fix: treat the drop in correction volume as a phase transition signal rather than a completion signal. When weekly correction events fall by 50%+ from the seed-phase peak, start the compounding-phase context-collision analysis work rather than reducing review frequency.
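The phase-transition rule can be checked mechanically from weekly correction counts. This sketch encodes only the 50% drop described above; the function name and the example counts are invented for illustration.

```python
def seed_phase_ended(weekly_corrections, drop_fraction=0.5):
    """True when the latest week is at least drop_fraction below the peak week."""
    if len(weekly_corrections) < 2:
        return False
    peak = max(weekly_corrections)
    latest = weekly_corrections[-1]
    return peak > 0 and latest <= peak * (1 - drop_fraction)


# Made-up weekly correction counts from a review log.
print(seed_phase_ended([40, 62, 55, 38]))
print(seed_phase_ended([40, 62, 55, 38, 29]))
```

A `True` result is the cue to begin context-collision analysis, not to cut back on review sessions.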
2. Documentation-first glossary building
This failure mode was introduced above under the seed-phase failure modes. Building the initial glossary from documentation sources (style guides, product glossaries, internal wikis) before any real audio has been processed produces a glossary with the wrong terms, because the terms generating the most errors in production are not the terms that appear in documentation. Start with 20–30 terms from documentation as a seed set, process real audio, and use correction data to drive the expansion. This approach reaches 96%+ accuracy 4–6 weeks faster than a documentation-first 200-term glossary, because the correction data is more information-dense than the documentation source.
3. Glossary bloat in months 1–2
The opposite of the documentation-first failure but equally common: the team processes real audio, sees many correction events, and responds by adding every corrected term to the glossary immediately at maximum priority weight. A 300-term glossary in month 2 with all terms at high weight creates context collisions between terms that share phoneme subsequences, causing the decoder to choose the wrong high-weight term when the correct term is also in the glossary. The fix: build in order of correction frequency, not in order of term importance to the organization. The terms that appear in the correction log most often get added first, at calibrated weights. Terms that appear only once or twice wait for the compounding phase.
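Building in order of correction frequency amounts to counting corrected forms in the event log and deferring the rare ones. A minimal sketch follows: the three-occurrence floor reflects the "once or twice wait" rule above, while the batch limit of 20 is an arbitrary assumption.

```python
from collections import Counter


def terms_to_add(corrected_forms, min_occurrences=3, limit=20):
    """Rank corrected terms by frequency and defer those seen only once or twice."""
    counts = Counter(corrected_forms)
    frequent = [(term, n) for term, n in counts.most_common() if n >= min_occurrences]
    return frequent[:limit]


# Corrected forms pulled from a hypothetical correction log.
log = ["Kubernetes"] * 5 + ["MEDDIC"] * 3 + ["NRR"] * 2 + ["Basel III"]
print(terms_to_add(log))
```

Terms below the floor stay in the log rather than being discarded, so they can be picked up in the compounding phase if they keep recurring.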
4. Review without rejection
The feedback loop requires correction events — marked errors — to produce signal. A review session where the reviewer accepts all captions without marking any errors produces no correction data, even if the captions contain genuine errors. This happens most often when the reviewer is inexperienced, under time pressure, or working in a domain they are not an expert in (a generalist L&D coordinator reviewing engineering content they do not have the technical vocabulary to evaluate). The fix: assign reviewers who have the domain expertise to recognize errors, and structure the review session so that marking errors is fast and low-friction rather than requiring extensive justification. If the right reviewer is not available, it is better to flag the session for deferred expert review than to accept it with zero corrections.
5. Quarterly update cadence in the seed phase
The feedback loop's compounding rate is proportional to how quickly correction events reach the glossary. Teams that run the glossary update on a quarterly cycle (common in organizations that treat glossary maintenance as an IT change management task requiring a ticket, a review board, and a scheduled deploy window) flatten the compounding curve: the correction data accumulates for three months before it is applied, the glossary is always one quarter behind the vocabulary frontier, and the seed phase never fully closes because new content is processed under the stale glossary faster than the quarterly updates can catch up. The fix: separate the operational glossary update (correction event → phoneme variant addition → context signal update) from the IT change management process. The operational glossary update needs to run at weekly cadence. If change management requires a process, scope it to major glossary releases (new domain additions, structural changes) rather than to the weekly correction integration.
6. Domain inconsistency in canonical forms
When multiple content owners are approving glossary terms in the same domain with different style conventions, the correction event log accumulates conflicting signals. "AI-powered" vs "AI powered" vs "AI Powered" (three different reviewers, three different style preferences) sends three contradictory correction signals for the same phoneme sequence. The decoder cannot resolve the contradiction and produces inconsistent output — sometimes correct by one convention, sometimes by another. The fix: establish a canonical form convention for each domain before the seed phase begins, document it (even if the documentation is just a one-page style note), and make all canonical form approvals go through the Content Owner role who enforces the convention. This is especially important for compound terms (hyphenated vs unhyphenated) and capitalization conventions (product names vs common nouns).
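Conflicting canonical forms are easy to surface before they reach the glossary. The sketch below groups corrected forms that differ only in case, spacing, or hyphenation; that normalization rule is an assumption and would need extending for other style differences.

```python
import re
from collections import defaultdict


def normalize(form):
    """Collapse case, whitespace, and hyphens so style variants compare equal."""
    return re.sub(r"[\s\-]+", "", form).lower()


def conflicting_forms(corrected_forms):
    """Return groups of corrected forms that differ only by style convention."""
    groups = defaultdict(set)
    for form in corrected_forms:
        groups[normalize(form)].add(form)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


print(conflicting_forms(["AI-powered", "AI powered", "AI Powered", "Kubernetes"]))
```

Any group this returns is a decision for the Content Owner to settle once, rather than three competing signals left in the correction log.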
7. Forgetting to reprocess the historical library
The feedback loop improves accuracy for new content processed after each glossary update. It does not automatically improve accuracy for caption files already in the LMS library: those files were processed under an earlier glossary version and will carry that version's errors until they are reprocessed. Teams that have been running the loop for six months and reach 99%+ on new content sometimes discover that 40% of their LMS library was last processed under a glossary version from month 2 or earlier (91–96% accuracy) because no one initiated retroactive reprocessing. The fix: schedule quarterly retroactive reprocessing sweeps for high-priority content categories (compliance training, onboarding, customer-facing content) rather than only processing new content going forward. The LMS caption audit methodology provides a prioritization framework for which content to reprocess first based on compliance risk, access frequency, and error density.
8. Confusing audio quality errors with vocabulary errors
Not all errors in a caption file are vocabulary errors that the glossary can fix. Some errors are audio quality errors: low signal-to-noise ratio, distant microphone placement, room echo, overlapping speech, or a speaker who habitually drops final consonants in a way that makes "Kubernetes" sound like "Kubernete." The glossary cannot fix audio quality errors. When these errors appear in the correction event log, adding phoneme variants for them creates glossary entries that are specific to one speaker's vocal characteristics and one recording setup — entries that may actually hurt accuracy when the same term is spoken by a different instructor in a normal recording environment. The fix: when a correction event appears multiple times for the same video but not for other videos with the same term, investigate the audio quality of that video before adding a phoneme variant. If the error is audio-quality-driven, flag the video for re-recording or for human transcription rather than trying to close the error through glossary adjustment.
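The "same video, not other videos" test can be run against the correction log before any phoneme variant is added. In this sketch, `corrections` is a list of `(video_id, term)` pairs and `videos_with_term` is the set of videos where the term is spoken; both inputs, and the two-repeat minimum, are assumptions for illustration.

```python
from collections import Counter


def likely_audio_issue(term, corrections, videos_with_term, min_repeats=2):
    """Flag videos where a term is corrected repeatedly while every other
    video containing the same term has no corrections for it."""
    per_video = Counter(video for video, t in corrections if t == term)
    flagged = []
    for video, count in per_video.items():
        others = videos_with_term - {video}
        if count >= min_repeats and others and all(per_video[o] == 0 for o in others):
            flagged.append(video)
    return sorted(flagged)


corrections = [("vid-07", "Kubernetes")] * 3
print(likely_audio_issue("Kubernetes", corrections, {"vid-03", "vid-07", "vid-11"}))
```

A flagged video is a candidate for an audio check, re-recording, or human transcription; the heuristic only narrows where to look and does not prove the cause.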
FAQ
What is the minimum content volume needed to start seeing the feedback loop compound?
The feedback loop begins producing signal with the first correction event, but meaningful compounding requires enough correction volume to identify pattern-level signals rather than one-off occurrences. In practice, this means approximately 3–5 hours of processed audio in the seed phase — enough to see the high-frequency proper-noun errors repeat across multiple sessions and confirm that they are consistent misdecodes rather than single-occurrence anomalies. Teams with very low production volume (one or two videos per month) can still run the loop, but the compounding rate is slower because there are fewer correction events per unit time. A team producing 2 hours of new content per month will take 9–12 months to reach the same glossary maturity that a team producing 10 hours per month reaches in 4–6 months. The loop works at any volume — it just runs faster at higher volume.
Can we run the feedback loop on content processed by a different captioning vendor?
The feedback loop in its full form — six signal types including confidence-score distributions and context-collision detection — requires access to the decode outputs and confidence scores from the transcription system, not just the final caption text. If a different vendor is producing the caption files and only the final SRT or VTT file is available (not the decode-level metadata), only Signals 1 and 5 (direct substitution and vocabulary drift) can be captured. This partial loop still produces value — closing 60–70% of the error gap — but the context-collision and confidence-distribution signals that drive compounding past 97% require decode-level access. For content that needs to reach 99%+, running it through a system that provides decode-level feedback data is necessary. For content with a lower accuracy target, partial loop operation on vendor-provided files may be sufficient.
How does the feedback loop handle content that gets updated or replaced?
When a training video is updated — new version recorded, same topic — the caption file for the new version starts from the current glossary state rather than from scratch. If the video is updated after the glossary has reached month-4 maturity, the new version will typically caption at 97–98%+ accuracy on first pass for the vocabulary the glossary already covers. Terms specific to the updated content (new product features, updated procedure steps) need to be captured through correction events in the first review cycle for the new version, but the base vocabulary carries over. When a video is replaced rather than updated — a fundamentally different topic replaces the old one — the new content's vocabulary may require a targeted glossary extension for domain-specific terms not already in the glossary. The vocabulary drift detection process handles this: the first session with the new content generates correction events that identify the new terms, which are then added in the standard way. The key discipline is not confusing "replaced content" with "completed glossary" — the glossary needs to continue reflecting the current content vocabulary, not the historical vocabulary of content that has been retired.
What happens during a product rebrand — do we lose the glossary value we built?
A product rebrand is a vocabulary drift event at scale. All entries for the old product name need to be updated (the canonical form changes, the phoneme variant list may need revision if the new name has different phoneme sequences, and the context signals need to reflect the updated name in co-occurrence patterns). This is an event-triggered update that the glossary maintainer executes in one session rather than discovering through correction events over several weeks. The glossary value is not lost — the architecture of the entry, the context signal relationships, and the priority weight calibrations carry over to the new canonical form. What does need updating: the canonical form string, any phoneme variants where the old and new names differ phonetically, and any context signals that were specific to the old name. A rebrand affecting 20–30 product names requires approximately 2–4 hours of glossary update work from the maintainer. The WER impact on content processed before the rebrand entry is updated is typically limited to the specific rebrand terms — the rest of the glossary continues to function normally.
How do we know when we have reached the accuracy ceiling — is it possible to overshoot 99%?
Word accuracy cannot exceed 100%, so "overshooting 99%" is not a risk in the sense of going too far. The practical ceiling question is whether there is a point where adding more glossary entries or more phoneme variants starts hurting accuracy rather than improving it. There is, and it is the glossary bloat failure mode described above. An over-dense glossary (too many terms with insufficient context signal differentiation) creates false-positive disambiguation, where the decoder selects a glossary term over a general-vocabulary word that was actually the correct decode. The signal that you are approaching this ceiling is that WER stops decreasing despite continued correction volume and new glossary additions. At that point, the next step is not adding more terms; it is improving the context signal specificity for the most frequently colliding glossary entries. The architectural ceiling for well-structured glossary-biased decoding on domain-specific L&D content is approximately 99.2–99.5% word accuracy, with the residual errors driven primarily by audio quality limitations rather than vocabulary gaps. Once that ceiling is reached, additional glossary investment returns near zero and the effort is better directed to audio quality improvement (better recording environments, better microphones, speaker training on diction) or to human review of specific high-risk segments.
Does running the feedback loop require a dedicated technical role, or can an L&D manager own it?
The correction review, glossary update, and vocabulary drift detection activities described in this post do not require technical skills beyond familiarity with the glossary entry structure. An L&D manager who has read the glossary architecture post has enough background to execute the weekly review session and the standard glossary update actions. The more technical activities, namely confidence-score distribution analysis (Signal 3), context-collision pattern investigation (Signal 4), and held-out test set WER measurement (annual audit), are supported by tooling in a well-implemented system rather than requiring manual technical work. An L&D manager who owns the feedback loop in a small team spends 30–60 minutes per week on it in the seed phase and 15–20 minutes per week in the plateau phase: enough time to require explicit scheduling, but not so technical as to require a dedicated engineer. The most common role structure for small teams: the L&D manager owns Caption Reviewer + Glossary Maintainer + Accessibility Lead, with Content Owners assigned per domain (a senior engineer for engineering content, a compliance officer for compliance content). Even for teams producing more than 20 hours of new training content per month, the combined role is typically sufficient through the seed and compounding phases, and once the glossary reaches plateau phase the maintenance load is light enough that the combined role is sustainable indefinitely.
What happens if we pause caption uploads for 3–6 months — does the glossary decay?
The glossary data does not decay in the technical sense: the phoneme variants, context signals, and priority weights remain unchanged in the glossary state file during a pause. What does change is the vocabulary frontier. The organization's content vocabulary keeps evolving (new product releases, policy updates, new team members with domain-specific jargon) even when no new caption uploads are running. When production resumes after a 3–6 month pause, the first batch of new content will produce a vocabulary drift spike: new terms that entered the organizational vocabulary during the pause will generate correction events at seed-phase volume rather than plateau-phase volume. The fix is to treat the first post-pause session as a mini seed phase: increase review frequency for the first two to three weeks after production resumes, run drift detection aggressively, and do not expect plateau-phase WER from the first post-pause batch. The return to plateau-phase accuracy after a 3–6 month pause typically takes 3–6 weeks of active feedback loop operation, significantly faster than the original six-month journey, because the architectural knowledge in the glossary (phoneme variants, context signals, entry structure) does not need to be rebuilt; only the vocabulary drift needs to be caught up.
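One way to make post-pause drift detection concrete is to surface the corrected terms that are not yet in the glossary and rank them by how often they recur. The sketch below assumes corrections are available as a flat list of canonical forms and the glossary as a list of terms. Both shapes, the function name `drift_candidates`, and the `min_count` threshold are hypothetical and stand in for whatever your tooling actually exports.

```python
from collections import Counter


def drift_candidates(corrected_terms, glossary_terms, min_count=2):
    """Rank corrected terms that are missing from the glossary.

    corrected_terms: canonical forms marked by reviewers (assumed input shape).
    glossary_terms: terms already present in the glossary (assumed input shape).
    Returns (term, count) pairs, most frequent first, compared case-insensitively.
    """
    known = {term.casefold() for term in glossary_terms}
    counts = Counter(
        term.casefold()
        for term in corrected_terms
        if term.casefold() not in known
    )
    return [(term, n) for term, n in counts.most_common() if n >= min_count]
```

Terms that recur across the first post-pause batch are the likeliest glossary additions; one-off corrections are filtered out by `min_count` so the glossary does not bloat.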
Putting it together
The caption feedback loop is not a feature. It is an operational system — a set of defined activities, data persistence requirements, and role responsibilities that transforms caption processing from a per-video task into a compounding asset. The per-video perspective produces flat accuracy: every video starts at the baseline WER for the current glossary, and the correction work on one video does not improve the next. The feedback-loop perspective produces compounding accuracy: every correction event improves the glossary, every glossary improvement raises the starting accuracy for all subsequent sessions, and six months of consistent loop operation produces accuracy levels that no static captioning approach can match for domain-specific L&D content.
The compliance case for 99%+ accuracy exists independently of how you achieve it. But the operational case — reduced correction burden, demonstrable WER trajectory for compliance documentation, switching cost that grows with organizational investment — only exists if the feedback loop is running. The 99% accuracy standard is not achievable on an ongoing basis without a system that actively closes the gap between the model's general-vocabulary performance and your organization's specific vocabulary. That system is the feedback loop.
The prerequisites are covered in companion posts: the glossary architecture post covers the entry structure that makes Signals 2–4 capturable; the glossary-biased decoding implementation guide covers the mechanism that makes glossary entries affect decode outputs rather than just serving as a post-processing word list; the LMS audit methodology post covers how to assess the accuracy state of your existing caption library before starting the loop. If you are starting from scratch, the natural sequence is: audit → architecture → seed build → first correction cycle → loop operation. If you are already running some form of glossary maintenance but not a structured feedback loop, the gap is almost always in the signal-capture layer: moving from Signal 1 only to all six signals is where the accuracy difference between 95% and 99%+ lives.
Ready to see what the feedback loop produces on your specific content? The GlossCap embed preview lets you process a 5-minute audio sample with and without a glossary applied and measure the WER difference directly — the starting point for understanding where your specific vocabulary frontier is.