Procurement Operations · Published 2026-06-27

How to design a caption vendor pilot: test corpus, scoring protocol, parallel run period, and the go/no-go threshold that protects you before signature

There is a step in caption vendor procurement that sits between shortlisting from an RFP and signing the contract. Most L&D teams either skip it entirely or execute it in a way that produces no usable data. The step is the controlled pilot: a structured accuracy test on your actual content, scored by an agreed methodology, with a documented pass/fail threshold set before a single file is submitted.

When the pilot is skipped, the first real accuracy data you get is from production, which means you discover failure modes after the contract is signed and after employee training has been delivered. When the pilot is run badly (representative content only, no reference transcript, threshold set after seeing the results), it functions as a ritual rather than a test. The vendor scores well because you gave them your easiest content. You approve because the results look acceptable and the budget conversation is already closed. Six months into the production engagement, accuracy on your technical compliance content is 84% and your correction labour cost looks exactly like the model in the hidden half-FTE post.

A well-designed pilot prevents this. It produces a documented score on your diagnostic content, measured by the same protocol used in ongoing QA, with a pre-committed threshold that either justifies signature or gives you a contractual basis to walk away.

This post covers every design decision in that pilot: how to build the test corpus, what the reference transcript is and why it is the step most teams skip, how to apply DCMP scoring to pilot output, how to design the parallel run period, how to commit to a threshold before the pilot starts, and what to carry from the pilot into the contract SOW.

TL;DR

A caption vendor pilot is a controlled accuracy test on your diagnostic content, not a free trial of the service. The goal is documented accuracy data on the content types most likely to expose failure — technical, regulatory, and specialist vocabulary — not a positive experience with the vendor’s interface or support team. Vendor demos and sample output are marketing artefacts. A pilot produces procurement evidence.
The reference transcript is the step most pilots skip — and without it you cannot measure accuracy. The DCMP Captioning Key formula requires a word-for-word ground-truth transcript produced before the vendor receives your content. Producing one takes 3–4 hours per 10 minutes of audio. Teams that skip this step cannot apply a consistent error-counting methodology and default to subjective “looks right” review, which inflates perceived accuracy by 8–15 percentage points against the DCMP standard.
Set your go/no-go threshold in writing before the pilot starts, not after you see the results. A threshold set after the results is a rationalisation. The three components of a defensible threshold: an aggregate accuracy floor (at minimum 97% aggregate DCMP score), a vocabulary accuracy floor on specialist content (97% on content with ≥50 technical proper nouns per hour), and an outlier threshold (no single video below 93%). Any one of these not met is a no-go without a contractual remediation clause.
The test corpus must over-represent your hardest vocabulary, not your typical production mix. A corpus that is 70% soft-skills content will produce a score 8–12 percentage points higher than a corpus drawn from your hardest technical and regulatory content. The pilot corpus should be at minimum 40% technical or specialist vocabulary content. Representative sampling is for QA audits of ongoing production, not for vendor selection pilots where the goal is to find the accuracy floor.
Pilot data feeds directly into contract SOW language. The accuracy floor from the pilot becomes the contracted SLA. The glossary configuration used in the pilot becomes the minimum baseline. The scoring methodology (DCMP Captioning Key) becomes the measurement standard specified in the contract. A pilot without documented methodology produces no leverage at the contract stage.

What a pilot is and isn’t

The first source of confusion in caption vendor pilots is what the pilot is for. Teams often frame it as a “trial period” — a chance to see whether the platform is easy to use, whether the vendor’s support team responds promptly, and whether the output looks good on the videos they chose to send. This is not a pilot. It is an extended demo.

A pilot in the procurement sense is a controlled experiment designed to produce a specific type of evidence: documented accuracy data on your diagnostic content, measured by a specified methodology, compared against a pre-committed threshold. It produces a binary output — pass or no-go — that either justifies contract signature or requires renegotiation of terms. The experience of using the platform is irrelevant to the pilot output. The vendor’s support quality is irrelevant to the pilot output. What the pilot produces is a score, applied consistently to content that represents the hardest accuracy challenge your programme will face.

Why the distinction matters for procurement

When a team runs a “trial period” rather than a controlled pilot, two things happen. First, the vendor learns which content types you are testing and, if they are honest, will tell you that their service performs differently on different content. If they are less honest, they will learn from the feedback loop what accuracy level produces approval and optimise toward it. Second, the evaluation process produces no contractual leverage. If the vendor delivered 94% accuracy on your soft-skills content during the trial and you approved the contract on that basis, you have no grounds for remediation when they deliver 81% accuracy on your FINRA compliance modules in production.

A controlled pilot prevents both problems. The vendor knows what they will be scored on (the methodology) but not what content will be in the corpus (they don’t get to select it). The result is a documented score that either meets the pre-committed threshold or does not. If it does not meet the threshold, you either walk away or negotiate a remediation clause into the SOW before signature. If it does meet the threshold, the pilot result becomes the baseline accuracy commitment in the contract.

Where the pilot sits in the procurement sequence

The pilot occurs after the RFP evaluation and shortlisting and before contract signature. The RFP playbook post covers the process of building a shortlist from vendor proposals; the vendor SLA and contract review checklist covers what to negotiate in the contract once you have a chosen vendor. The pilot is the step between those two: it converts a shortlist position into a documented accuracy evaluation that either supports or undermines the contract terms the vendor is proposing.

For a straightforward procurement (one vendor on the shortlist, replacement of an incumbent whose performance is already known, or a low-stakes content library with soft-skills content only), a formal pilot may be disproportionate. For any procurement involving specialist vocabulary content — technical training, regulatory compliance, healthcare, financial services, legal — a pilot is the only mechanism that produces data before you are committed to a production engagement.

Pilot vs. vendor sample output

Vendors routinely offer sample output as part of the RFP or sales process. A vendor sample is not a pilot result. The vendor selects which content to demonstrate on, optimises their configuration for that content, and may apply human review to the sample output before delivering it to you. The vendor accuracy evaluation methodology post covers how to evaluate vendor claims independently; the pilot is the mechanism that produces the independent evidence by submitting your content, not the vendor’s selected sample.

The key distinction: in a pilot, you control the corpus. In a sample demonstration, the vendor controls the corpus. Any accuracy number produced by a vendor-controlled corpus is a marketing figure, not a procurement figure.

Building the test corpus

The test corpus is the set of content submitted to the vendor during the pilot. It is the single most important variable in whether the pilot produces useful data. A corpus weighted toward your easiest content will produce a score that does not predict accuracy on your hardest modules. A corpus that is too short will not provide a statistically stable result. A corpus without speaker diversity will not expose the accuracy floor on your hardest recordings.

Content type distribution

The pilot corpus should not be a representative sample of your production library. It should be a diagnostic sample — weighted toward the content types most likely to expose accuracy failure. The practical implication:

Pilot corpus content type distribution
Content type	Recommended share	Why it belongs in the corpus
Technical / specialist vocabulary	40% minimum	Engineering, compliance, medical, financial, legal content. The highest ASR failure rate and the most consequential errors. If the vendor fails here, they fail where it matters most.
Procedural / regulatory compliance	30%	Step-by-step procedure video, regulatory citations, process documentation. Medium vocabulary density but high accuracy stakes — a mislabelled step in a LOTO procedure or a misquoted regulation in a compliance module is not a minor error.
General L&D content	20%	Management, leadership, onboarding content without specialist vocabulary. Provides a baseline to confirm that the vendor meets the minimum threshold on content where any modern ASR performs well. If they fail here, the shortlist decision was wrong.
Soft skills / interpersonal	10% maximum	Present in the corpus only to confirm that the general-purpose capability is intact. Not meaningful as an accuracy signal because every modern ASR service scores 92–96% on soft-skills content regardless of vendor configuration.

The reason for the 40% technical / specialist floor: the LMS auto-caption accuracy comparison post documents that the spread between ASR performance on soft-skills content and technical content is 8–15 percentage points for general-purpose models. A corpus that is 70% soft-skills content will produce a score 8–12 points above what the same vendor will achieve on your engineering or compliance modules in production. The pilot corpus must be weighted toward diagnostic content to produce a result that predicts production accuracy.
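The distribution in the table can be checked mechanically before the corpus is locked. The sketch below is illustrative only: the category labels and the `distribution_problems` helper are hypothetical names, and it assumes share is measured by audio duration rather than by file count, which the table does not specify.

```python
from collections import defaultdict

# Floor and cap taken from the distribution table; the keys are illustrative labels.
MIN_SHARE = {"technical": 0.40}      # technical / specialist vocabulary: 40% minimum
MAX_SHARE = {"soft_skills": 0.10}    # soft skills / interpersonal: 10% maximum


def corpus_shares(files):
    """Return each content type's share of total corpus duration.

    `files` is a list of (content_type, duration_minutes) tuples.
    """
    totals = defaultdict(float)
    for content_type, minutes in files:
        totals[content_type] += minutes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {ctype: minutes / grand_total for ctype, minutes in totals.items()}


def distribution_problems(files):
    """List every way the corpus violates the floor or the cap."""
    shares = corpus_shares(files)
    problems = []
    for ctype, floor in MIN_SHARE.items():
        share = shares.get(ctype, 0.0)
        if share < floor:
            problems.append(f"{ctype} is {share:.0%} of the corpus, below the {floor:.0%} floor")
    for ctype, cap in MAX_SHARE.items():
        share = shares.get(ctype, 0.0)
        if share > cap:
            problems.append(f"{ctype} is {share:.0%} of the corpus, above the {cap:.0%} cap")
    return problems


# A 120-minute corpus that matches the recommended shares exactly.
recommended = [("technical", 48), ("procedural", 36), ("general", 24), ("soft_skills", 12)]
print(distribution_problems(recommended))
```

An empty list means the corpus meets both constraints; anything else is a reason to swap files before submission.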

Duration and volume minimums

The corpus must be large enough to produce a statistically stable result. The minimum for a caption vendor pilot:

Total content duration: At least 2 hours of audio content (ideally 3–4 hours). Below 2 hours, a single outlier video in either direction can move the aggregate score by 2–3 percentage points in a way that does not reflect the vendor’s steady-state accuracy on your content type.
Number of distinct files: At least 15 videos, ideally 20–30. Testing 15 videos of 8–10 minutes each produces both an aggregate score and a per-video distribution that lets you see the outlier floor — the lowest single-video score, which is a critical component of the go/no-go threshold.
File length range: Include files from 3 minutes to 25 minutes. Very short files (under 3 minutes) do not allow a timing error assessment. Very long files (over 30 minutes) should also be included if your production library contains them, since some vendors’ accuracy degrades on long-form content due to context window limits or acoustic model reset behaviour.

Speaker diversity

The corpus must include at least 5 distinct speakers. Speaker diversity is the second most common gap in pilot corpus design (after content type distribution). Including a range of speakers exposes the vendor’s accuracy floor across the speakers your programme actually records:

Non-native English speakers: If your programme includes content from non-native English speakers — which is common in global L&D programmes — include at least 2–3 recordings from speakers with non-native accents. ASR accuracy on non-native speech varies substantially by vendor depending on their acoustic model training data.
Fast speakers: Some subject-matter experts who contribute to training content speak faster than the conversational average. Include at least one recording from a speaker who delivers content at above-average speed.
Technical experts with domain-dense delivery: Speakers who are deep domain experts in the content they are presenting often use specialist terminology at a density that general speakers do not. The accuracy failure mode here is not speed — it is the combination of specialist vocabulary and the speaker’s assumption that the terminology is self-evident, so they deliver it without the redundant context that helps ASR recover from uncertainty.

Audio condition diversity

Include recordings from at least three distinct audio environments:

Professional studio recording: Controlled acoustic environment, close-field microphone, no background noise. Every vendor scores highest on this condition. It is a useful baseline but should not dominate the corpus.
Conference room or office recording: Moderate room reverb, variable distance from microphone, occasional background noise. Represents the most common production condition for internally produced training content.
Home office or remote recording: Variable audio quality, HVAC noise, microphone positioning differences. The remote and hybrid workforce captioning post documents that home-office audio introduces an additional 3–8 percentage point accuracy penalty on top of content type difficulty. If your programme includes remote-recorded content — which is now common — the pilot corpus must include it.
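The volume, speaker, and audio-environment minimums from the preceding sections can be combined into a single pre-submission check. This is a minimal sketch under stated assumptions: `PilotFile` and `corpus_minimum_gaps` are hypothetical names, and the environment labels are free-text tags you assign yourself.

```python
from dataclasses import dataclass


@dataclass
class PilotFile:
    name: str
    minutes: float
    speaker: str
    environment: str  # e.g. "studio", "conference_room", "home_office"


def corpus_minimum_gaps(files):
    """Return a list of minimums the corpus does not yet meet."""
    gaps = []
    total_minutes = sum(f.minutes for f in files)
    if total_minutes < 120:  # at least 2 hours of audio
        gaps.append(f"total duration is {total_minutes:.0f} min, under the 2-hour minimum")
    if len(files) < 15:  # at least 15 distinct files
        gaps.append(f"{len(files)} files, under the 15-file minimum")
    speakers = {f.speaker for f in files}
    if len(speakers) < 5:  # at least 5 distinct speakers
        gaps.append(f"{len(speakers)} distinct speakers, under the minimum of 5")
    environments = {f.environment for f in files}
    if len(environments) < 3:  # at least three audio environments
        gaps.append(f"{len(environments)} audio environments, under the minimum of 3")
    return gaps


# Fifteen 8-minute files spread over five speakers and three environments.
ENVIRONMENTS = ["studio", "conference_room", "home_office"]
corpus = [
    PilotFile(f"video_{i:02d}", 8.0, f"speaker_{i % 5}", ENVIRONMENTS[i % 3])
    for i in range(15)
]
print(corpus_minimum_gaps(corpus))
```

The check covers only the countable minimums. Whether the corpus includes non-native, fast, and domain-dense speakers still needs a human judgement.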

The diagnostic content principle

The test corpus should contain your hardest content, not your typical content. This is counterintuitive — teams often feel that using “representative” content is the fairer evaluation. The problem is that representative content inflates the vendor’s score relative to what they will achieve on your hardest modules, and those hardest modules are exactly the ones where inaccurate captions create compliance exposure.

A useful framing: the pilot corpus is not designed to predict average production accuracy. It is designed to predict the accuracy floor. If the vendor passes the pilot at 97% aggregate on diagnostic content, you can have reasonable confidence they will meet the 99% threshold on general production content. If the vendor passes at 97% on your soft-skills content but you have no diagnostic data, you have no prediction for what they will achieve on the content where failure matters most.

The reference transcript: the step most pilots skip

The reference transcript is the word-for-word ground-truth transcription of the audio content in your pilot corpus. It is the document you compare the vendor’s caption output against to count errors. Without a reference transcript, you cannot apply the DCMP Captioning Key formula. Without the DCMP formula, you cannot produce a consistent accuracy score. Without a consistent accuracy score, the pilot produces no useful data — only subjective impressions.

This is the step that most caption vendor pilots skip, and it is the reason most pilots produce no actionable output. The teams that skip it typically default to “read through the captions while watching the video and see if anything looks wrong.” This is not accuracy measurement. It is human review, which is subject to attention fatigue, familiarity bias (you become desensitised to errors in content you helped produce), and anchoring bias (if the first three videos look good, you read the remaining ones more charitably). Human review overestimates caption accuracy by 8–15 percentage points against the DCMP standard, based on the same phenomenon documented in the accuracy evaluation methodology post.

What the reference transcript must contain

The reference transcript is a verbatim document. It must contain:

Every spoken word: Including filler words (um, uh, you know), restarts, and false starts. Under the DCMP Captioning Key protocol, omitted filler words that do not affect meaning do not count as errors. But the reference transcript must include them so that the error-counting reviewer can make the category determination.
Speaker identification: When multiple speakers are present in a video (an interview format, a moderated panel, a dialogue-based scenario), the reference transcript must identify each speaker at each speaker turn. Speaker identification errors are a distinct error category under DCMP.
Technical terms spelled correctly: This is where the reference transcript pays for itself. When you produce the reference transcript, you must decide how to spell every technical term, regulatory citation, product name, and proper noun in your content. The decisions you make in the reference transcript become the ground truth against which the vendor’s output is scored. A reference transcript that spells FINRA correctly, spells Reg-BI correctly, and spells your organisation’s LMS product names correctly gives the error-counting reviewer an unambiguous basis for counting substitution errors in the vendor’s output.
Timing marks at paragraph breaks: Not required at the word level, but useful at the paragraph or major topic break level. Timing marks in the reference transcript help the error-counting reviewer locate the corresponding position in the vendor’s SRT file when scoring timing errors.

How long it takes and who produces it

Producing a reference transcript for audio content takes approximately 3–4 hours per 10 minutes of audio for a competent transcriptionist working on domain-unfamiliar content. For content in a domain the transcriptionist knows well (a medical transcriptionist producing a reference transcript for healthcare training content), this drops to 2–3 hours per 10 minutes. For a 2-hour pilot corpus, budget 36–48 hours of transcription time.
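The budgeting arithmetic is simple enough to script when you are sizing several candidate corpora. A small sketch, with `transcription_hours` as a hypothetical helper name and the hours-per-10-minutes rates taken from the paragraph above:

```python
def transcription_hours(corpus_minutes, hours_per_10_min=(3, 4)):
    """Return the (low, high) transcription budget in hours.

    `hours_per_10_min` is the effort range per 10 minutes of audio:
    (3, 4) for domain-unfamiliar content, (2, 3) for a familiar domain.
    """
    blocks = corpus_minutes / 10
    low, high = hours_per_10_min
    return blocks * low, blocks * high


# A 2-hour corpus transcribed by someone unfamiliar with the domain.
print(transcription_hours(120))           # (36.0, 48.0)
# The same corpus transcribed by a domain-familiar transcriptionist.
print(transcription_hours(120, (2, 3)))   # (24.0, 36.0)
```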

Who should produce it:

Internal SME + transcriptionist combination: A domain expert reviews the transcript for specialist terminology accuracy while a transcriptionist handles the verbatim production. This is the most practical approach for content with dense specialist vocabulary. The SME does not need to produce the full transcript — they review the terminology-heavy sections and certify that proper nouns, regulatory citations, and specialist terms are spelled as the organisation uses them.
Professional transcriptionist: For content without dense specialist vocabulary, a professional transcriptionist can produce a reference transcript at a quality sufficient for DCMP scoring without SME review. Services that offer verbatim transcription (not clean-read or edited transcription) are appropriate; clean-read transcription omits filler words and corrects disfluencies in a way that modifies the ground truth.
The pilot vendor’s transcript: Do not use the vendor’s ASR output as the reference transcript. This creates circular validation — you are scoring the vendor’s output against the vendor’s output. The reference transcript must be produced independently of the vendor.

When to produce it

The reference transcript must be produced before the pilot corpus is submitted to the vendor. There are two reasons for this. First, producing the reference transcript before submission means you cannot be influenced by what the vendor delivered when you make your ground-truth decisions on terminology spelling. If you produce the reference transcript after reviewing the vendor’s output, you may unconsciously adopt the vendor’s rendering of ambiguous terms rather than applying the organisation’s preferred spelling. Second, producing the reference transcript before submission forces you to identify the technically demanding content in your corpus before the pilot begins, which is itself valuable preparation for the vocabulary configuration conversation with the vendor.

Two-person sign-off for the reference transcript

For content with regulatory accuracy requirements — financial compliance, healthcare, safety procedures — the reference transcript should have a two-person sign-off: one person to produce the verbatim transcript and one domain expert to certify that all specialist terms are spelled correctly. The caption glossary maintenance workflow post covers the two-person sign-off requirement for glossary term deprecations; the same principle applies to reference transcripts produced for regulatory compliance content.

The two-person sign-off produces an auditable document. If the pilot result is disputed — if the vendor challenges a substitution error count on a term they claim they spelled according to the industry standard rather than the organisation’s preferred spelling — a certified reference transcript with a domain expert sign-off provides a defensible basis for the error count.

DCMP scoring protocol applied to pilot output

The caption quality error rate calculator post covers the DCMP Captioning Key formula in detail. The key points for applying it to pilot output:

The four error categories

DCMP counts four categories of caption errors, each of which affects the accuracy calculation:

Substitution (S): A word in the vendor’s caption that is different from the corresponding word in the reference transcript. “FINRA” transcribed as “Finnra” is a substitution. “Reg-BI” transcribed as “regby” is a substitution. One substitution counts as one error regardless of how short or long the substituted word is.
Deletion (D): A word present in the reference transcript that is absent from the vendor’s caption. If the speaker said “the registered representative must,” and the caption reads “the representative must,” that is one deletion error for the missing word “registered.”
Insertion (I): A word present in the vendor’s caption that is absent from the reference transcript. Insertions are common when ASR hallucinates words during low-energy audio passages or speaker pauses.
Timing error (T): A caption cue that is presented more than 2 seconds before or after the corresponding audio. Timing errors are the category most frequently omitted from informal pilot evaluations. The why 99% caption accuracy matters post notes that timing errors count as full errors under DCMP, and a vendor with excellent word accuracy but poor timing synchronisation can still fall below the 99% DCMP threshold.
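For the three word-level categories, a first-pass count can be produced by aligning the vendor text against the reference transcript with a standard word-level edit-distance alignment. The sketch below is not DCMP tooling; it is the generic alignment used for word error rate, and `count_word_errors` is a hypothetical name. It assumes that case and edge punctuation are ignored, which is a simplification: a human reviewer still has to make the final category calls, and timing errors are counted separately.

```python
def normalise(text):
    """Lowercase and strip edge punctuation so trivial differences are ignored."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def count_word_errors(reference, hypothesis):
    """Align two texts word by word and count S, D, and I against the reference."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    rows, cols = len(ref) + 1, len(hyp) + 1

    # cost[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = int(ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion (word missing from caption)
                cost[i][j - 1] + 1,             # insertion (extra word in caption)
            )

    # Walk back through the table to attribute each edit to a category.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = int(i > 0 and j > 0 and ref[i - 1] != hyp[j - 1])
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            subs += mismatch
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return {"S": subs, "D": dels, "I": ins, "N": len(ref)}


# The deletion example from the list above: "registered" is missing.
print(count_word_errors("the registered representative must", "the representative must"))
```

Because the alignment ignores case, it will not flag a term that is correct except for capitalisation. If your style requirements treat that as an error, compare the original tokens instead of the normalised ones.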

What does not count as an error

Under the DCMP Captioning Key protocol, the following are not counted as errors:

Trivial punctuation differences that do not affect meaning (a comma vs. no comma, a period vs. an ellipsis at a mid-sentence break)
Omission of filler words (um, uh) where the filler word’s omission does not affect meaning
Caption line-break decisions (how captions are split across two display lines)
Caption display timing within the 2-second synchronisation window

Knowing what counts and what does not count matters for the go/no-go decision. A vendor whose output has many stylistic differences in punctuation and line-break decisions but few word errors and few timing errors may score well under DCMP despite appearing “different” from your internal style guide. The DCMP score is the accuracy signal for pilot evaluation; style conformance is a separate criterion addressed in the contract SOW.

The formula and how to apply it across a batch

The DCMP formula is: WER (Word Error Rate) = (S + D + I + T) / N × 100, where N is the total number of captionable words in the reference transcript. Accuracy = 100 − WER.

For a pilot batch, calculate two scores:

Aggregate score: Sum all errors (S, D, I, T) across all files in the corpus. Divide by the total word count across all files. This is the aggregate pilot accuracy percentage. The aggregate score is what you compare against the aggregate accuracy floor in your go/no-go threshold.
Per-video scores: Calculate the DCMP accuracy for each individual file. The distribution of per-video scores is what you compare against the outlier threshold in your go/no-go threshold. A vendor who achieves 97% aggregate but has three videos below 89% has a systematic failure mode on a specific content type or speaker — that is the signal the outlier threshold is designed to surface.
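The two scores can be computed from the same per-file error counts. The sketch below applies the formula as stated above; `dcmp_accuracy` and `score_batch` are hypothetical names, and the counts in the example are invented for illustration.

```python
def dcmp_accuracy(s, d, i, t, n):
    """Accuracy = 100 - WER, where WER = (S + D + I + T) / N * 100."""
    return 100 - (s + d + i + t) / n * 100


def score_batch(counts):
    """Return (aggregate_accuracy, per_video_accuracy) for a pilot batch.

    `counts` maps a file name to a dict with keys S, D, I, T, and N.
    """
    per_video = {
        name: dcmp_accuracy(c["S"], c["D"], c["I"], c["T"], c["N"])
        for name, c in counts.items()
    }
    # The aggregate pools errors and words across files; it is NOT the
    # mean of the per-video scores, because files differ in length.
    total_errors = sum(c["S"] + c["D"] + c["I"] + c["T"] for c in counts.values())
    total_words = sum(c["N"] for c in counts.values())
    aggregate = 100 - total_errors / total_words * 100
    return aggregate, per_video


# Invented counts: one long clean file and one short file with many errors.
example = {
    "onboarding.mp4": {"S": 6, "D": 2, "I": 1, "T": 1, "N": 1000},
    "loto_procedure.mp4": {"S": 30, "D": 10, "I": 5, "T": 5, "N": 500},
}
aggregate, per_video = score_batch(example)
print(round(aggregate, 1), {k: round(v, 1) for k, v in per_video.items()})
```

In this example the aggregate is 96.0 while the two files score 99.0 and 90.0, which is why the per-video distribution has to be reported alongside the aggregate.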

Who does the error counting

The error counting should be done by someone who did not produce the reference transcript (to avoid anchoring) and who is not the person who will have to defend the vendor selection to leadership (to avoid motivated reasoning). The QA methodology post covers the RACI for caption quality review; the pilot error counting is a specialised application of the same QA skill set. At minimum, the error-counting reviewer should be the accessibility coordinator or QA lead — the person in the role described in the accessibility coordinator playbook — rather than the L&D director or budget owner who is commercially motivated toward a specific outcome.

The timing accuracy gap: the check most pilots miss

Most informal pilot evaluations check word accuracy (whether the right words appear) but do not systematically check timing accuracy (whether caption cues appear at the right time). Timing errors are counted under DCMP, and they can be significant for vendors who use batch ASR with asynchronous alignment rather than audio-synced real-time captioning. A vendor whose captions are consistently accurate at the word level but whose cues run 3–5 seconds behind the audio will produce a DCMP score below 99% even if every word is correct, because the timing displacement is counted as a timing error on every affected cue.

To check timing accuracy in a pilot: open the vendor’s SRT file alongside the reference transcript, play the video at each major speaker turn, and note the offset between when the audio begins and when the corresponding caption cue begins. A 2-second threshold means that a cue that begins 2.1 seconds after the audio onset is a timing error. Check a random sample of 50 cues per file — not every cue, but not just the first few cues, which vendors often optimise manually for demonstration purposes.
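Part of the timing check can be scripted: reading cue start times from the SRT file, drawing the random sample, and counting offsets beyond the 2-second window. The sketch below assumes standard SRT timestamps (`HH:MM:SS,mmm --> HH:MM:SS,mmm`) and assumes the reviewer records the audio onset for each sampled cue by hand; all function names are hypothetical.

```python
import random
import re

# Matches the start timestamp of an SRT cue, e.g. "00:01:02,500 -->".
SRT_START = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->")


def srt_cue_starts(srt_text):
    """Return the start time, in seconds, of every cue in an SRT document."""
    starts = []
    for match in SRT_START.finditer(srt_text):
        hours, minutes, seconds, millis = (int(g) for g in match.groups())
        starts.append(hours * 3600 + minutes * 60 + seconds + millis / 1000)
    return starts


def sample_cue_indices(cue_count, sample_size=50, seed=None):
    """Pick a random sample of cue indices (all cues if the file has fewer)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(cue_count), min(sample_size, cue_count)))


def count_timing_errors(cue_starts, audio_onsets, tolerance=2.0):
    """Count sampled cues that start more than `tolerance` seconds off the audio.

    `audio_onsets` maps a cue index to the onset time (seconds) the reviewer noted.
    """
    return sum(
        1
        for index, onset in audio_onsets.items()
        if abs(cue_starts[index] - onset) > tolerance
    )


srt = """1
00:00:01,000 --> 00:00:04,000
Welcome to the module.

2
00:00:05,500 --> 00:00:08,000
The registered representative must
"""
starts = srt_cue_starts(srt)
# Reviewer-noted onsets: cue 0 is 0.5 s late, cue 1 is 2.1 s late.
print(count_timing_errors(starts, {0: 0.5, 1: 3.4}))
```

Passing a fixed `seed` makes the sample reproducible, which helps if the vendor later disputes which cues were checked.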

Parallel run design

The test corpus pilot produces a point-in-time accuracy score. A parallel run produces a steady-state accuracy profile across a period of live production. The two serve different purposes: the corpus pilot reveals whether the vendor can meet your threshold on your diagnostic content; the parallel run reveals whether they maintain that threshold under production conditions over time.

When to run parallel vs. sequential evaluation

A parallel run is appropriate when:

You are replacing an incumbent vendor (parallel run = new vendor vs. incumbent on the same content over the same period)
You have two vendors on the final shortlist and want a direct comparison under identical conditions
The production programme is high-stakes enough that steady-state accuracy data matters, not just point-in-time pilot data

A parallel run is not required when you are selecting a first vendor (no incumbent to run against) and the corpus pilot data alone meets the threshold. For first-vendor selections in non-specialist content environments, the corpus pilot is typically sufficient. For high-stakes content environments — healthcare, financial services, legal, manufacturing — a parallel run provides a risk reduction worth the operational complexity of running two vendors simultaneously.

Duration and volume minimums for a parallel run

Duration: Minimum 4 weeks, ideally 8 weeks. A 4-week parallel run is long enough to capture a production cadence across different content types and speakers. An 8-week run captures seasonal variation and includes enough volume to see whether the vendor’s accuracy degrades over time as they move from an optimised pilot configuration to a standard production configuration.
Volume: Minimum 30 videos OR minimum 10 hours of content, whichever is larger. The volume minimum ensures that the parallel run produces enough data points for the per-video distribution to be meaningful rather than dominated by a few outliers.

The same-period, same-content principle

In a parallel run against an incumbent or between two candidates, both vendors must receive the same content during the same time period. This seems obvious, but it is frequently violated in practice: teams submit the new vendor’s test content in one month and evaluate it against the incumbent’s production output from the prior month. Content produced in different time periods has a different speaker mix, different audio conditions, and different subject matter, which makes the comparison meaningless.

The same-period, same-content principle means:

Every file submitted to the new vendor during the parallel run is also submitted to the incumbent (or to the second shortlisted vendor)
The submission happens at the same time (not with a lag)
The output from both vendors is scored against the same reference transcripts
The scoring reviewer is the same person or team for both vendors

If the incumbent refuses to participate in a parallel evaluation — a position some vendors take because they know their accuracy will not compare favourably — this is itself a signal. A vendor confident in their production accuracy should welcome a parallel evaluation. A vendor who resists it is telling you something about what the comparison would show. The vendor transition playbook post covers how to manage the incumbent relationship during a vendor change process, including the parallel run period.

Glossary configuration parity

Both vendors in a parallel run must be given the same glossary configuration. A parallel run where the incumbent has had your glossary for three years and the new vendor is running without a glossary does not measure vendor capability — it measures the value of three years of glossary compounding, which is not what the parallel run is for. Configure the new vendor with a complete copy of your current glossary before the parallel run begins. The glossary architecture post covers how to structure the export; the caption feedback loop post covers the accuracy implications of glossary transfer.

Setting the go/no-go threshold before the pilot starts

The go/no-go threshold is the specification that converts pilot results into a binary procurement decision. It must be set in writing before a single file is submitted to the vendor. A threshold set after the results are known is not a threshold — it is a rationalisation that works backward from the result you want to approve toward a specification that makes the approval look principled.

Threshold drift is the most common failure mode in caption vendor pilots among teams that do run a pilot. They start with a vague commitment to “see how the accuracy looks” and then, when the results come in at 93% aggregate, decide that 93% is “pretty good for our content” and approve. Ninety-three percent aggregate accuracy on a diagnostic corpus is not close to the 99% WCAG 2.1 AA threshold for production content: a 7% error rate is roughly 70 errors per 1,000 captionable words, which at a standard correction rate produces the correction labour cost detailed in the hidden half-FTE post. But teams that have already emotionally committed to a vendor selection and a budget approval tend to approve 93% as good enough, because the alternative is restarting the procurement process.

The three-component threshold

A defensible go/no-go threshold has three components, all of which must be met:

Component 1: Aggregate accuracy floor

The minimum acceptable DCMP accuracy score across all files in the pilot corpus, calculated as the aggregate (sum of all errors divided by total word count). The recommended aggregate accuracy floor for a pilot is 97% aggregate DCMP accuracy. This is 2 percentage points below the 99% WCAG 2.1 AA production threshold for a reason: pilot content is diagnostic (harder than average production content), and a vendor who achieves 97% on your diagnostic pilot corpus can reasonably be expected to achieve 99% on the general production mix when properly configured. If the pilot corpus is representative rather than diagnostic, the aggregate floor should be set at 99%, because the pilot is not applying the diagnostic penalty.

If the aggregate score is below the floor, the pilot fails regardless of the per-video distribution. A vendor who scores 96% aggregate on your diagnostic corpus is predicted to deliver below 98% in production, which is not sufficient for a programme that needs to meet WCAG 2.1 AA SC 1.2.2 across the full content library.

Component 2: Vocabulary accuracy floor on specialist content

The minimum acceptable DCMP accuracy score on the specialist vocabulary content in the corpus — content with at least 50 technical proper nouns per hour of audio. Calculate this score separately from the aggregate: take only the files classified as “technical / specialist vocabulary” content and compute a DCMP score for that subset alone.

The recommended vocabulary accuracy floor is 97% DCMP accuracy on specialist content. This is the same floor as the aggregate, but it must be met on the subset of content most likely to expose failure. A vendor who achieves 98.5% aggregate but only 89% on your engineering compliance modules has passed the aggregate threshold while failing the vocabulary threshold — which means they will underperform on exactly the content where inaccurate captions create compliance exposure.

The vocabulary accuracy floor is particularly important for verticals with regulatory documentation requirements. A financial services firm whose FINRA compliance training content scores 89% on the pilot has a documentation problem, not just a quality problem, as described in the financial services captioning post. A healthcare organisation whose clinical procedure training scores 87% has a patient safety training documentation problem, as described in the medical training captions post.

Component 3: Outlier threshold

The minimum acceptable DCMP accuracy score for any single video in the pilot corpus. The recommended outlier threshold is 93% minimum per-video DCMP accuracy. No video in the pilot corpus should score below 93% — not as an aggregate, but as an individual file score.

The outlier threshold serves a different purpose from the aggregate floor. A vendor can achieve a 97% aggregate with a smooth distribution of scores (all between 95% and 99%) or with an uneven distribution (most files at 99% and a few files at 88–90%). The aggregate hides the latter pattern. The outlier threshold surfaces it. A vendor who produces three files at 87% during the pilot has a systematic failure mode on a specific content type, speaker, or audio condition. That failure mode will not disappear in production — it will recur on any similar content you submit.
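The three components can be checked mechanically once every file has a reviewer-counted error total. The following is a minimal Python sketch, not a prescribed tool: the `FileScore` record, the `evaluate_pilot` function, and the rule that a corpus with no specialist files cannot pass Component 2 are all illustrative assumptions. The default floors are the recommended values from this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileScore:
    """One scored pilot file (hypothetical record layout)."""
    name: str
    words: int        # captionable words in the reference transcript
    errors: int       # DCMP errors counted by the independent reviewer
    specialist: bool  # True if classified as technical / specialist vocabulary


def accuracy(errors: int, words: int) -> float:
    """Accuracy as a percentage: 100 * (1 - errors / words)."""
    return 100.0 * (1.0 - errors / words)


def evaluate_pilot(files, aggregate_floor=97.0, vocab_floor=97.0, outlier_floor=93.0):
    """Apply the three-component go/no-go threshold to scored pilot files."""
    # Component 1: aggregate = sum of all errors divided by total word count.
    aggregate = accuracy(sum(f.errors for f in files), sum(f.words for f in files))

    # Component 2: the same calculation on the specialist subset only.
    spec = [f for f in files if f.specialist]
    specialist: Optional[float] = None
    if spec:
        specialist = accuracy(sum(f.errors for f in spec), sum(f.words for f in spec))

    # Component 3: any single file below the per-video minimum.
    outliers = [f.name for f in files if accuracy(f.errors, f.words) < outlier_floor]

    go = (
        aggregate >= aggregate_floor
        # Assumption: a corpus with no specialist files cannot pass Component 2.
        and specialist is not None
        and specialist >= vocab_floor
        and not outliers
    )
    return {"aggregate": aggregate, "specialist": specialist, "outliers": outliers, "go": go}
```

Returning the three measurements alongside the verdict keeps the evidence visible: a no-go caused by a single outlier file reads differently from a no-go caused by the aggregate, and the escalation path differs accordingly.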

Documenting the threshold before the pilot starts

The threshold should be documented in a signed addendum to the pilot agreement, or at minimum in a written email exchange with the vendor, before the first file is submitted. The documentation should specify:

The aggregate accuracy floor (e.g., 97% aggregate DCMP)
The vocabulary accuracy floor (e.g., 97% DCMP on files classified as technical/specialist content)
The outlier threshold (e.g., no single file below 93% DCMP)
The scoring methodology (DCMP Captioning Key, as documented in the error rate calculator post)
Who will perform the scoring (the internal QA lead or accessibility coordinator)
What happens if the threshold is not met (walk away, or negotiate a remediation clause before proceeding)

The vendor’s written acknowledgement that these are the evaluation criteria is important. It prevents a vendor from disputing the evaluation methodology after they receive a no-go result. A vendor who refuses to acknowledge the evaluation criteria in writing before the pilot starts is also telling you something about how contract disputes will be handled later.

Escalation: when results land between thresholds

Define before the pilot what happens if the results are ambiguous. Two common scenarios:

Aggregate passes but outlier fails: One or more files are below the 93% outlier threshold, but the aggregate is at or above 97%. This typically indicates a systematic failure mode on a specific content type or speaker. The appropriate response is to identify the failure mode, give the vendor an opportunity to address it (by adding terms to the glossary, adjusting the acoustic configuration, or handling that content type with a different approach), and rerun the pilot on those specific files before proceeding.
Vocabulary floor fails by a small margin (1–2 percentage points): The specialist content subset scores 95–96% rather than 97%. The appropriate response depends on the size of your specialist vocabulary content relative to the total library. If specialist content is 15% of your library and the general content scores 98.5%, the weighted production accuracy is still about 98%, within roughly a point of the 99% target. If specialist content is 60% of your library, a 95% specialist score pulls the weighted figure below 97%, which is unacceptable production accuracy. Define this escalation path before the pilot rather than improvising after seeing the numbers.
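The library-share reasoning in the second scenario is a weighted average. The sketch below is illustrative only: `blended_accuracy` is a hypothetical helper, and it assumes that each content type's share of the library roughly matches its share of captionable words.

```python
def blended_accuracy(specialist_share: float, specialist_acc: float, general_acc: float) -> float:
    """Weighted accuracy (percent) across specialist and general content.

    specialist_share is the fraction of the library (0 to 1) that is
    specialist content. Assumption: library share approximates word share.
    """
    if not 0.0 <= specialist_share <= 1.0:
        raise ValueError("specialist_share must be between 0 and 1")
    return specialist_share * specialist_acc + (1.0 - specialist_share) * general_acc


if __name__ == "__main__":
    # Compare a library that is 15% specialist with one that is 60% specialist.
    for share in (0.15, 0.60):
        print(f"{share:.0%} specialist: {blended_accuracy(share, 95.0, 98.5):.2f}%")
```

With a 95% specialist score and 98.5% on general content, the 15% library lands just under 98%, while the 60% library lands at about 96.4%. The same near-miss has very different consequences depending on library mix.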

Pilot mistakes that inflate vendor scores

The following mistakes appear in pilot designs frequently enough to warrant a specific taxonomy. Each one inflates the vendor’s apparent accuracy relative to what they will achieve in production.

Mistake 1: Vendor selects the test content

Some vendors offer to run a “free pilot” where they ask you to send them some representative content. The “representative content” you choose, guided by a vendor sales team that has seen hundreds of evaluations, will be the easiest content in your library. Even without intentional optimisation on the buyer’s side, the content that buyers tend to select for a positive first impression is smoother, more professionally recorded, and more general in vocabulary than the hardest content in the production programme.

The rule: you select the pilot corpus, not the vendor. The vendor may suggest content categories to include (to ensure you test configurations they are confident in), but the specific files must be selected by your team based on the diagnostic content principle described above.

Mistake 2: Vendor pre-corrects output before delivery

Human-assisted caption services — services that route ASR output through a human correction step before delivery — may apply additional correction to pilot deliveries beyond what they do in standard production. This is not necessarily dishonest: it may reflect the vendor’s standard highest-accuracy tier, which they offer during the pilot to demonstrate their ceiling. The problem occurs when the tier used in the pilot is not the tier specified in the contract.

The rule: require the vendor to specify in writing which service tier (ASR-only, ASR + human review, fully human-captioned) is being used in the pilot, and confirm that this tier is what is priced in the proposed contract. A pilot run at the “certified human review” tier and a contract at the “AI-assisted” tier produces misleading pilot data.

Mistake 3: Soft-skills-only or generally-easy test corpus

As documented above, any modern ASR service — including the models powering LMS native auto-captioning — achieves 92–96% DCMP accuracy on soft-skills, management, and interpersonal communication content. A pilot corpus that is 70% soft-skills content will produce aggregate scores of 94–97% from almost any vendor, regardless of their capability on technical content. This error does not require bad faith from the vendor: the buyer may simply not know what diagnostic content looks like.

The rule: classify each file in the proposed corpus by content type before submission. If less than 40% of the corpus by duration falls in the technical / specialist vocabulary category, rebuild the corpus before the pilot begins.
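The 40% rule is simple to check once each file has a duration and a content-type label. This is a minimal sketch under stated assumptions: the `(duration_minutes, category)` tuple layout and the `"technical"` label are hypothetical, so substitute whatever classification scheme your corpus inventory uses.

```python
def specialist_share_by_duration(files, specialist_label="technical"):
    """Fraction of total corpus duration classified as specialist content.

    files: iterable of (duration_minutes, category) pairs (hypothetical layout).
    """
    files = list(files)
    total = sum(duration for duration, _ in files)
    if total <= 0:
        raise ValueError("corpus has no duration")
    specialist = sum(duration for duration, category in files if category == specialist_label)
    return specialist / total


def corpus_needs_rebuild(files, minimum_share=0.40, specialist_label="technical"):
    """True when less than the minimum share of duration is specialist content."""
    return specialist_share_by_duration(files, specialist_label) < minimum_share
```

Measuring by duration rather than by file count matters: ten short soft-skills clips and two long engineering modules can look balanced by count while being heavily skewed by minutes of audio, or the reverse.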

Mistake 4: No timing accuracy check

Timing errors are the error category most frequently omitted from informal pilot evaluations. A timing error, under DCMP, is a caption cue that starts more than 2 seconds before or after the corresponding audio. A vendor whose captions are accurate in word content but systematically delayed by 3–5 seconds — which can happen with batch ASR processing that does not use audio-synced alignment — will produce a DCMP score below the threshold on timing errors alone.

The rule: include a timing check in the scoring protocol. For each file, check a random sample of 50 cue start times against the reference video. Record the offset in seconds and flag any cue with an offset > 2 seconds as a timing error.
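The sampling step can be scripted so that the reviewer does not choose which cues to check. The sketch below assumes the reviewer has already recorded, for each cue, the caption start time and the time the corresponding speech actually begins in the reference video. The function name, the tuple layout, and the optional `seed` parameter are illustrative assumptions.

```python
import random


def sample_timing_errors(cues, sample_size=50, tolerance_s=2.0, seed=None):
    """Check a random sample of cues and flag timing errors.

    cues: list of (caption_start_s, audio_start_s) pairs (hypothetical layout).
    Returns (number_checked, flagged), where flagged is a list of
    (caption_start_s, audio_start_s, offset_s) for offsets beyond the tolerance.
    A positive offset means the caption appears late.
    """
    rng = random.Random(seed)  # pass a seed to make the sample reproducible
    sample = list(cues) if len(cues) <= sample_size else rng.sample(list(cues), sample_size)
    flagged = [
        (caption, audio, caption - audio)
        for caption, audio in sample
        if abs(caption - audio) > tolerance_s  # strictly more than 2 seconds
    ]
    return len(sample), flagged
```

Keeping the signed offset in the result helps distinguish a systematic delay (all offsets positive and similar in size) from scattered alignment errors, which usually point to different causes.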

Mistake 5: Accepting the vendor’s self-reported accuracy score

Some vendors include an accuracy report with their pilot delivery — a document claiming that the output achieved 97.8% or 98.4% accuracy. This report is based on the vendor’s own internal measurement methodology, which may not be DCMP Captioning Key, may have been applied to a sample of the output rather than the full corpus, and may have been produced before human correction passes. A vendor’s self-reported score is a marketing figure, as documented in the accuracy evaluation methodology post.

The rule: do not accept vendor-provided accuracy scores as the pilot result. The pilot result is the DCMP score produced by your independent error-counting reviewer against your reference transcript. The vendor’s self-reported score is supplementary information, not the evaluation output.

Mistake 6: Glossary not configured before the pilot

A vendor pilot without glossary configuration tests the vendor’s base ASR model, not the vendor’s configured service that you would actually use in production. If your production programme will use a glossary — which it should, as documented in the glossary architecture post — the pilot must be run with the glossary configured. A pilot without glossary configuration will produce a score 5–15 percentage points below what the configured service will achieve, depending on your content’s specialist vocabulary density.

The rule: provide the vendor with a complete initial glossary before the pilot begins. If you are evaluating a first vendor and have no existing glossary, seed one with your 50–100 highest-frequency specialist terms before the pilot. The glossary architecture post covers the initial seeding methodology; the glossary maintenance workflow post covers ongoing management.

Mistake 7: Threshold set after seeing the results

Described above in the go/no-go threshold section, but worth restating here as a distinct mistake category: teams that evaluate pilot results without a pre-committed threshold almost always approve results that would fail any reasonable specification. The retrospective framing — “the results came in at 93%, and 93% is within our acceptable range” — is applied after the emotional and commercial investment in the vendor has already been made. Pre-commitment to the threshold eliminates this bias.

Data handling and BAA during the pilot

The pilot period introduces the same data handling requirements as the production engagement. Employee audio is processed by the vendor, which may involve transmission to third-party servers, temporary storage in the vendor’s infrastructure, and human review of the audio content. These handling steps have legal implications that apply equally during the pilot and during production.

Do not waive the BAA for the pilot period

The most common data handling mistake in caption vendor pilots is waiving the Business Associate Agreement (BAA) requirement “just for the pilot.” The logic applied is that the pilot is short-term, exploratory, and uses a small content sample. The legal reality is that if the pilot content contains any spoken references to Protected Health Information — which healthcare training content almost certainly does, given that clinical procedure videos, EMR documentation training, and medication administration training routinely include patient care scenarios — the pilot vendor is a business associate under HIPAA from the moment they receive the first file.

If the pilot vendor is not willing to execute a BAA before the pilot begins, they are not eligible as a production vendor for healthcare content. Discovering this disqualifier after the pilot rather than before wastes the pilot period and the reference transcript investment.

GDPR Article 28 data processing agreement

For organisations processing content that involves EU data subjects — which includes any training content produced by employees in EU member states or any training programme delivered to employees in EU jurisdictions — the vendor must execute a Data Processing Agreement (DPA) under GDPR Article 28 before the pilot begins. The DPA must cover the purposes of processing (producing captions), the categories of data subjects (employees), the categories of personal data (voice recordings, which are personal data under GDPR Article 4(1)), the duration of processing, and the vendor’s obligations as a data processor.

A vendor that provides a BAA template but does not have a GDPR Article 28 DPA template is a vendor whose compliance infrastructure may not be adequate for EU-regulated operations. Request both documents at the start of the procurement process, not during the pilot period.

Synthetic content as a pre-agreement pilot option

Some teams use synthetic content — demonstration videos produced specifically for evaluation purposes, without real employee audio — as a way to run a preliminary pilot before executing the BAA or DPA. Synthetic content pilots are appropriate when:

The content type allows accurate synthetic production (instructional narration by a professional voice actor, for example)
The synthetic content includes the same specialist vocabulary density as the production content it represents
The team understands that synthetic content produced by a professional narrator may score differently from authentic content produced by subject-matter experts

Synthetic content pilots are not appropriate as the sole pilot mechanism for content with high acoustic variability (diverse speaker populations, varied recording conditions). They are useful as a screening step before legal agreements are executed, not as a substitute for a pilot on real production content.

Data retention and deletion after the pilot concludes

Include a data retention and deletion clause in the pilot agreement. Specify:

How long the vendor retains pilot content (typically 30 days post-pilot)
The deletion process (confirmed in writing)
Whether sub-processors receive pilot content (and if so, which sub-processors and their deletion timelines)

If the pilot does not result in a contract, the vendor should not retain your diagnostic content indefinitely. That content represents your hardest vocabulary, which is also some of your most sensitive training material. The deletion requirement should be written into the pilot agreement regardless of outcome.

Vertical-specific pilot design variations

The general pilot design applies across content types. Specific verticals have additional requirements that change the pilot corpus composition, the reference transcript production process, or the threshold calibration.

Healthcare

Healthcare training video contains three vocabulary failure categories that must be represented in the pilot corpus: drug names (generic and brand, including newer biologics and novel drug classes), clinical procedure names (surgical procedures, diagnostic tests, imaging modalities, therapeutic interventions), and regulatory citation vocabulary (CMS condition codes, Joint Commission standard references, ICD-10 codes in clinical documentation training). The medical training captions post covers the specific vocabulary failure patterns.

For a healthcare pilot, at minimum 50% of the corpus should come from content covering these three categories: medication administration training, clinical procedure training, and documentation compliance training. The reference transcript for healthcare content must be reviewed by a clinical SME with domain knowledge of the specific drug names, procedure names, and regulatory citations in the content. The vocabulary accuracy floor should be applied separately to each of the three vocabulary categories to identify whether failure is concentrated in drug names, procedure names, or regulatory citations.

Financial services

Financial services training has the highest proper-noun ASR failure rate of any corporate L&D vertical, as documented in the financial services captioning post. The pilot corpus must include Series licensing content (Series 7, Series 63/66, Series 65), AML/BSA training content, Reg-BI training content, and ERISA/fiduciary content for wealth management programmes. The vocabulary accuracy floor for financial services pilots should be applied to the regulatory citation category specifically — citations like FINRA Rule 3110, SEC Rule 17a-4(f), and 23 NYCRR 500.14 must be verbatim accurate, and the pilot should document the error rate on regulatory citation strings separately from the general word error rate.

For broker-dealers subject to FINRA Rule 3110, the pilot documentation becomes part of the procurement record for the supervisory system. Retain the pilot corpus content list, reference transcripts, vendor output, and DCMP scoring spreadsheet as part of the Rule 3110 supervisory review file for the caption programme.
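One way to document the error rate on regulatory citation strings separately is to count verbatim occurrences of each citation in the reference transcript and in the vendor output. The following is a rough sketch, not a DCMP-defined measure: `citation_accuracy` is a hypothetical helper, matching is case-sensitive with whitespace normalised, and a citation counts as missed whenever the vendor output contains fewer verbatim occurrences than the reference.

```python
import re


def _squash(text: str) -> str:
    """Collapse runs of whitespace so line breaks in caption files do not matter."""
    return re.sub(r"\s+", " ", text).strip()


def citation_accuracy(citations, reference_text, vendor_text):
    """Compare verbatim citation counts between reference and vendor transcripts.

    Returns (report, miss_rate): report maps each citation to its expected,
    found, and missed counts; miss_rate is missed / expected across all citations.
    """
    ref = _squash(reference_text)
    out = _squash(vendor_text)
    report = {}
    expected_total = 0
    missed_total = 0
    for cite in citations:
        key = _squash(cite)
        expected = ref.count(key)
        found = out.count(key)
        missed = max(0, expected - found)  # assumption: shortfall = missed citations
        report[cite] = {"expected": expected, "found": found, "missed": missed}
        expected_total += expected
        missed_total += missed
    miss_rate = missed_total / expected_total if expected_total else 0.0
    return report, miss_rate
```

Counting occurrences is a coarse check: it will not catch a citation dropped in one place and duplicated in another, so treat the output as a screening figure to be confirmed by the reviewer's line-by-line scoring.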

Engineering and manufacturing

Engineering and manufacturing training vocabulary includes LOTO (lockout/tagout) procedure steps, equipment model names and serial number formats, safety standards citations (OSHA 29 CFR 1910.147, ANSI Z244.1, ISO 14120), and manufacturing process vocabulary (statistical process control terminology, tolerance specifications, ISO quality standard references). For manufacturing safety content (LOTO procedures, confined space entry, PPE requirements), a single-word substitution error in a procedure step can create a safety training documentation problem: the captioned procedure may differ from the audio-delivered procedure, and the discrepancy goes uncorrected for any employee relying on captions.

Manufacturing pilot corpus composition should include at least 30% equipment operation procedure video (LOTO, confined space, PPE fitting) and at least 20% quality/compliance procedure video (ISO 9001, Six Sigma, statistical process control). The vocabulary accuracy floor applies to equipment and procedure vocabulary specifically.

Cybersecurity

Cybersecurity training vocabulary includes framework acronyms (MITRE ATT&CK, NIST CSF, ISO 27001, SOC 2, FedRAMP), threat terminology (APT, C2, lateral movement, privilege escalation, phishing, spear phishing, watering hole), tool names (Metasploit, Wireshark, Burp Suite, Nessus), and certification names (CISSP, CISM, CompTIA Security+, CEH). The cybersecurity training captioning post covers this vocabulary profile. For cybersecurity pilot corpus design, include content covering at least three distinct vocabulary categories: framework/standard citations, threat actor TTPs, and tool names.

Eight failure modes in caption vendor pilot programme design

Corpus is not diagnostic. The test corpus is majority soft-skills or general management content, producing aggregate scores 8–12 percentage points above what the vendor will achieve on technical and compliance content in production. The pilot passes; production accuracy fails to meet the threshold.
No reference transcript produced before submission. Scoring is done by subjective read-through while watching the video, which overestimates accuracy by 8–15 percentage points compared to DCMP Captioning Key scoring. The pilot data is not comparable across vendors or across pilot periods and cannot be carried into the contract SOW as a documented accuracy baseline.
Go/no-go threshold set after seeing the results. The threshold is written to rationalise an approval that has already been made on commercial or relationship grounds. The threshold that would have prevented approval on fair criteria is never articulated.
Vendor pre-corrects output for the pilot. The service tier used in the pilot (e.g., certified human review) is not the tier priced in the contract (e.g., AI-assisted). Production accuracy is materially below pilot accuracy from the first week of the engagement.
Glossary not configured before the pilot. The unconfigured base ASR model produces scores 5–15 points below the configured production service. The pilot either fails unfairly (a vendor who would meet the threshold with glossary configuration fails without it) or passes for a vendor who cannot maintain the threshold on production content with the same glossary.
BAA or DPA not executed before the pilot begins. Healthcare or EU-regulated content is processed by the vendor without a Business Associate Agreement or GDPR Article 28 Data Processing Agreement. The organisation is in breach of HIPAA or GDPR from the first file submitted. The pilot reveals a vendor who cannot or will not execute the required legal agreements before delivering service.
Timing accuracy not checked. The pilot evaluation counts word errors only. The vendor has excellent word accuracy but systematic timing displacement of 3–5 seconds. The DCMP score, if timing errors had been counted, would have been below the aggregate floor. The timing problem surfaces in production when employees report that captions are out of sync with the speaker, but by then the contract is signed.
Pilot-to-SOW handoff is informal. The pilot produces documented accuracy data but this data is not explicitly carried into the contract SOW as an accuracy SLA. The vendor’s contract proposes a softer accuracy commitment than the pilot result, and the buyer signs without reconciling the gap. In production, the vendor meets their contracted SLA (which is below the pilot threshold) while the buyer expected the pilot result to be the de-facto standard.

Pilot-to-contract bridge

The pilot produces several types of data that should be explicitly carried into the contract SOW. This is the mechanism that converts a pilot from an exploratory exercise into procurement leverage. The vendor SLA and contract review checklist post covers the full set of contract terms; this section focuses specifically on the terms that derive from pilot data.

Accuracy SLA derived from the pilot aggregate score

If the vendor achieves 97.4% aggregate DCMP accuracy on the pilot corpus, the contract SLA should specify 97% aggregate DCMP accuracy as the minimum production threshold. Do not negotiate from the pilot result upward (asking for a 99% SLA when the pilot showed 97.4%) — this creates a contractual commitment the vendor demonstrated they cannot reliably meet. Negotiate from the pilot result downward by a small margin (1–1.5 percentage points) to account for production variability, and confirm that the methodology (DCMP Captioning Key) and the content-type sampling requirement (quarterly QA spot-check on diagnostic content, not representative content) are specified in the SLA.

Glossary baseline derived from the pilot configuration

The glossary configuration used in the pilot is the minimum baseline for the production engagement. Include a clause in the contract specifying that the vendor will maintain the glossary version used during the pilot as the minimum configuration, and that new terms submitted by the organisation must be incorporated within a defined SLA (typically 48 hours for urgent terms, 5 business days for standard terms). The glossary maintenance workflow post covers the operational process for ongoing term management; the contract clause specifies the vendor’s response obligations.

Vocabulary accuracy floor on specialist content

If the pilot included a vocabulary accuracy floor assessment on specialist content, include the vocabulary accuracy SLA as a separate contract term from the aggregate accuracy SLA. A contract that specifies only an aggregate accuracy floor allows the vendor to offset poor performance on specialist content with high scores on easy content and still claim SLA compliance. A separate vocabulary accuracy SLA prevents this.

Scoring methodology specification

Specify DCMP Captioning Key as the accuracy measurement methodology in the contract. The SOW clause should read approximately: “Accuracy shall be measured using the DCMP Captioning Key error-rate formula (WER = (S+D+I+T)/N × 100, where S = substitutions, D = deletions, I = insertions, T = timing errors >2 seconds, and N = total captionable words in the reference transcript), applied to a random sample of content submitted in the preceding measurement period.” A contract that specifies an accuracy percentage without specifying the measurement methodology gives the vendor a contractual escape on any accuracy dispute.
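The formula quoted in the clause translates directly into code. This is a minimal sketch of that arithmetic only; the function names are illustrative, and the error counts are assumed to come from a reviewer who has already classified each error against the reference transcript.

```python
def dcmp_error_rate(substitutions: int, deletions: int, insertions: int,
                    timing_errors: int, total_words: int) -> float:
    """Error rate as a percentage: (S + D + I + T) / N * 100."""
    if total_words <= 0:
        raise ValueError("total_words (N) must be positive")
    return (substitutions + deletions + insertions + timing_errors) / total_words * 100


def dcmp_accuracy(substitutions: int, deletions: int, insertions: int,
                  timing_errors: int, total_words: int) -> float:
    """Accuracy as a percentage: 100 minus the error rate."""
    return 100.0 - dcmp_error_rate(substitutions, deletions, insertions,
                                   timing_errors, total_words)


if __name__ == "__main__":
    # Hypothetical counts: 12 substitutions, 5 deletions, 3 insertions,
    # 6 timing errors in a 1,000-word reference transcript.
    print(f"{dcmp_accuracy(12, 5, 3, 6, 1000):.1f}%")
```

Because N is the word count of the reference transcript rather than of the vendor output, the same function gives comparable results across vendors scored against the same reference.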

Remediation trigger and process

If the pilot identified specific failure modes — a particular content type, speaker, or audio condition where accuracy fell below the outlier threshold — include a remediation clause in the contract that triggers automatically if the same failure mode recurs in production QA. The caption programme annual review post covers the QA cadence that surfaces these failure modes; the contract clause specifies what the vendor must do when they are surfaced.

FAQ

How long should a caption vendor pilot last before we can make a contract decision?: The answer depends on whether you are running a corpus pilot, a parallel run period, or both. A corpus pilot (a controlled test on a pre-selected diagnostic content set) has no fixed duration: it is complete when the vendor delivers output for all files in the corpus and your team has completed DCMP scoring. This can be as short as two weeks from corpus submission to scored results. A parallel run period, which tests steady-state production accuracy over time, should run for at least 4 weeks and ideally 8 weeks to capture production volume sufficient for statistical stability. For most procurement decisions, the most efficient sequence is to run the corpus pilot first (2 weeks), use the results to either proceed or identify a need for configuration adjustment, and then run a 4–6-week parallel run if the corpus pilot passes. Total elapsed time from first file submission to contract decision: 6–8 weeks for that sequence, or up to 10 weeks if the parallel run is extended to the full 8 weeks.
Can we use the vendor’s sample output package instead of running our own pilot?: No. A vendor sample is marketing material, not procurement evidence. The vendor selects which content to demonstrate on, optimises their configuration for that content, and may apply additional human review before delivery. The resulting accuracy figure is not predictive of what they will achieve on your diagnostic content under standard production conditions. This distinction is documented in detail in the vendor accuracy evaluation methodology post. If a vendor’s proposal includes impressive sample accuracy figures and they resist the idea of a controlled pilot on your content, that resistance is itself a meaningful signal about what a pilot would show.
What if the vendor scores 94% on our pilot corpus and we really need to move forward due to a compliance deadline?: A 94% aggregate DCMP score on a diagnostic pilot corpus predicts production accuracy below 97% on your technical and compliance content — which is materially below the 99% WCAG 2.1 AA threshold. Moving forward under deadline pressure is a risk decision, not a procurement decision. If you proceed, you should do three things. First, negotiate a remediation SLA into the contract that gives the vendor a defined period (typically 60–90 days) to bring accuracy to 97% on a retest, with a contract exit clause if they do not. Second, document the decision as a time-limited exception with a specific retest date, per the exception procedure described in the compliance programme build post. Third, include a glossary optimisation sprint in the first 30 days of the engagement specifically targeting the vocabulary failure modes identified in the pilot, per the feedback loop post. A 94% pilot result with these three mechanisms in place is a managed risk; a 94% pilot result that is simply approved and forgotten is a production accuracy problem waiting to be discovered by an OCR investigation.
Should we tell both vendors in a parallel run that they are being evaluated against each other?: Yes, for both practical and ethical reasons. Practically: if you do not tell the new vendor that an incumbent is also processing the same content, they may notice format inconsistencies or discover the parallel run another way, creating a trust problem. Ethically: a pilot where one vendor knows they are being evaluated and another does not is not a controlled comparison. Tell both vendors that you are running a parallel evaluation, specify that both will receive the same content over the same period, and that the results will be scored by the same methodology. Most vendors accept this transparently; a vendor who refuses to participate in a formally disclosed parallel evaluation is telling you something about their confidence in their accuracy relative to alternatives.
We don’t have time to produce a reference transcript for the full pilot corpus. Can we score a subset?: Scoring a subset is acceptable with two conditions. First, the subset must be at least 60 minutes of content across at least 10 distinct files — below this, the per-video distribution is too thin to support the outlier threshold assessment. Second, the subset selection must apply the same diagnostic content principle as the full corpus: at minimum 40% technical/specialist vocabulary content in the subset. If you are constrained on time for reference transcript production, prioritise producing complete reference transcripts for the specialist vocabulary files rather than splitting transcript effort across the full corpus. The specialist vocabulary subset is where the go/no-go decision will actually be made; a pilot that has full DCMP scores on the specialist content and only subjective review for the soft-skills content produces a more useful decision basis than one that applies DCMP scoring to soft-skills content and subjective review to specialist content.
How do we handle PHI in healthcare training content during the pilot?: The correct approach is to execute the BAA before the pilot begins, as described in the data handling section above. If the BAA negotiation will delay the pilot by more than two weeks, a practical alternative is to produce a synthetic version of the most vocabulary-intensive healthcare content for the pilot: use a professional narrator to record a scripted version of a medication administration or clinical procedure training module that contains the same vocabulary density as the authentic production content but no actual patient care references. The synthetic version can be processed by the pilot vendor before BAA execution. Once the BAA is executed, run a supplementary corpus of authentic content to confirm that the synthetic pilot result transfers to authentic audio conditions. This two-stage approach adds 1–2 weeks to the pilot timeline but resolves the compliance tension between pilot speed and BAA requirement without waiving either.
At what point in the procurement cycle should the pilot occur relative to price negotiation?: The pilot should occur after the vendor has provided a pricing proposal but before the price is finalised in a contract. The sequencing: (1) RFP shortlisting and vendor selection (the RFP playbook covers this), (2) preliminary pricing proposal from the chosen shortlist, (3) pilot on your diagnostic corpus, (4) contract negotiation using the pilot result as leverage on both accuracy SLA and price, (5) contract signature. Running the pilot after preliminary pricing but before the price is finalised means the pilot data informs both the SLA and the price negotiation. Running the pilot before preliminary pricing means you may invest in a pilot for a vendor whose price is outside your budget; running it after contract signature means you have no leverage. The vendor’s knowledge that a pilot result showing below-threshold accuracy will delay or prevent contract signature is itself a quality incentive during the pilot period, which is exactly the structure you want.