Vendor Evaluation · Published 2026-06-14

How to test a captioning vendor's accuracy before signing: test corpus design, content type sampling, DCMP-protocol scoring, and the red flags that predict live-environment failure

There is a gap in the captioning procurement process that most L&D teams discover after they have signed a contract. The RFP identifies a shortlist of vendors that meet the procurement criteria — glossary capability, BAA availability, LMS integrations, pricing within budget, references from comparable organizations. The contract review establishes the legal framework: SLA terms, accuracy guarantees, data handling provisions, termination conditions. What neither step provides is empirical evidence that the vendor's output will actually achieve the accuracy threshold your compliance framework requires when processing your training content, in your industry vertical, with your organizational vocabulary. The gap between "we selected this vendor through a rigorous RFP" and "we know this vendor will produce 99% DCMP-accurate captions on our HIPAA compliance training" is the accuracy evaluation gap — and it is the gap where most captioning programme failures originate.

The accuracy evaluation problem is harder than it looks. The obvious approach (ask each shortlisted vendor to process a sample video, then read through the output and judge whether it "looks right") fails in predictable ways. Human judgment of caption quality without a structured measurement protocol systematically misses the error categories that compliance auditors measure. A reviewer reading a caption file will notice obvious errors (a caption that reads "kidney" where the speaker said "kidney dialysis") but will miss low-frequency deletions (a technical term present in the audio but absent from the caption), insertion errors (filler words that are not in the audio appearing in the caption), and synchronization drift that is only visible when you check the caption timestamps against the audio. The DCMP Captioning Key uses a word-level measurement protocol specifically because informal review consistently overestimates caption quality by 5–12 percentage points. An evaluator who reads through a caption file for an engineering onboarding video and concludes "this is about 95% accurate" is typically looking at a file that scores 84–87% on a DCMP-protocol measurement. The accuracy evaluation methodology in this post closes the perception gap with a structured process.

The methodology is also harder than it looks because vendor accuracy varies not just between vendors but across content types within a single vendor's output. A vendor that produces 97% DCMP accuracy on soft-skills training content may produce 83% accuracy on the same organization's technical onboarding content — because the accuracy gap on technical content is driven by organizational vocabulary that is not in the vendor's base model, not by general ASR competence. An evaluation that tests only soft-skills training will return a score that says "this vendor meets our threshold" for a vendor that will fail on every technical training module published after contract signature. The test corpus design — the choice of what content to include in the pre-contract evaluation — is the foundational skill, and it is the step that most procurement evaluations get wrong by selecting representative content (what the organization publishes most) rather than diagnostic content (what will reveal the accuracy ceiling on the hardest vocabulary the organization uses).

This post covers the full pre-contract accuracy evaluation methodology: why accuracy testing is a distinct workflow from the RFP and from ongoing QA, how to design a test corpus that diagnoses real accuracy ceiling rather than confirming comfortable averages, the content-type sampling protocol that produces statistically defensible comparisons across vendors, the DCMP-protocol scoring process step by step, how to submit content to vendors without creating conditions that inflate evaluation results, how to interpret scores across content types and weight results by compliance relevance, the eight red flags in evaluation output that predict live-environment failure even when the headline accuracy score passes, eight failure modes in the evaluation process itself, and a seven-question FAQ on the decisions that L&D procurement teams face most often. The RFP playbook covers the full procurement process from trigger to shortlist; the contract review checklist covers what to sign; this post covers the accuracy evaluation that happens between shortlist and signature.

TL;DR — three things that determine whether your vendor evaluation is valid

The test corpus must contain your hardest content, not your most representative content. An evaluation designed to confirm a preferred vendor will use content the vendor is likely to handle well — conversational training, soft-skills instruction, executive messaging with mainstream vocabulary. An evaluation designed to measure real accuracy ceiling will include the content where technical vocabulary failure occurs: engineering onboarding with platform-specific terminology, clinical compliance training with drug names and regulatory citations, product certification with current SKU names, safety training with OSHA citation codes. If the hardest content in your library is not in the test corpus, the evaluation does not tell you what you need to know. The vendor's score on soft-skills content predicts nothing about the vendor's score on your HIPAA compliance module or your engineering onboarding series. The evaluation is only as diagnostic as the content it tests.
DCMP-protocol scoring is the only measurement method that produces compliance-relevant results. Qualitative review ("this looks about right"), corpus-level word error rate (WER on the full file), and subjective accuracy rating ("I'd call this 90%") all systematically overestimate caption quality for technical training content. The DCMP Captioning Key specifies word-level measurement on sampled passages: count every word in the audio, count every error in the caption (substitution, insertion, deletion, formatting), and compute accuracy as (correct / total audio words) × 100. This measurement method produces the number that OCR enforcement applies — and it consistently produces lower scores than informal review on the same files. Running DCMP-protocol scoring on your test corpus requires a reference transcript (a ground-truth word-for-word record of what the speaker says), which must be prepared before sending content to vendors. Without a reference transcript, you cannot run DCMP scoring. Preparing the reference transcript is the most time-consuming step of the evaluation and the one most frequently skipped — which produces qualitative review in place of measurement and invalidates the evaluation.
Headline accuracy score is necessary but not sufficient — the red flags are what vendors don't control. A vendor that scores 97% on your general training content and 91% on your technical training content has demonstrated an accuracy cliff on domain vocabulary that will expand as your content library grows and your vocabulary becomes more specific. A vendor that scores 96% on accuracy but produces synchronization drift of 3–4 seconds on fast-paced technical content fails the DCMP synchronization requirement even if the words are right. A vendor that cannot explain their measurement methodology, that asks you to use their sample content instead of yours, or that delivers test results faster than the stated production SLA has revealed more about their actual production behaviour than their accuracy score has. The red flag analysis runs alongside the scoring and often produces the most useful vendor differentiation.

Why vendor accuracy testing is a distinct skill from RFP and ongoing QA

Three distinct evaluation activities occur at different points in the captioning vendor lifecycle, and they are frequently conflated to the detriment of all three. The RFP process — covered in detail in the captioning RFP playbook — is a structured procurement process that identifies which vendors qualify for the shortlist based on capability, compliance posture, pricing, and references. The RFP process does not produce empirical accuracy data on your content; it produces capability attestations and references that tell you the vendor can, in principle, meet the accuracy standard. The ongoing QA process — covered in the caption QA methodology post — is a recurring post-deployment measurement process that verifies caption accuracy on content published to your production LMS after contract signature. Ongoing QA tells you whether the vendor is maintaining the accuracy standard on your live content library. The pre-contract accuracy evaluation is neither of these. It is a structured empirical test of vendor output on your specific content types, run before contract commitment, using the same measurement protocol (DCMP word-level scoring) that compliance auditors apply.

What the RFP does and does not tell you

The RFP process answers qualification questions: Does this vendor have a Business Associate Agreement for healthcare customers? Does their model support glossary customization? Do they integrate with Kaltura, Panopto, or your LMS via API? Do their reference customers have comparable content profiles? Have they passed SOC 2 Type II? Can they meet your turnaround SLA at your video volume? These are necessary questions whose answers determine whether a vendor belongs on the shortlist. They do not tell you anything empirical about what the vendor's output will look like on your engineering onboarding module or your OSHA safety training. A vendor with strong RFP responses may produce poor accuracy on your technical content because their base model has not been exposed to your domain vocabulary and their glossary customization workflow is not as effective as their RFP response implies. RFP responses are capability attestations; pre-contract accuracy evaluation is capability verification.

What ongoing QA does and does not tell you

Ongoing QA — running DCMP-protocol spot-checks on caption files that have been published to your production LMS — provides accurate measurement of vendor performance on live content. The problem is timing: ongoing QA provides data after you have signed a contract, paid for onboarding, integrated the vendor into your production workflow, and begun publishing content. If the QA data at 90 days reveals that the vendor is producing 84% DCMP accuracy on your technical training content — below the 99% threshold — you are now in a difficult position: you have contractual obligations, a production workflow dependency, and learners who have been consuming inaccurate captions for the past 90 days. The cost of discovering accuracy failure through ongoing QA is vastly higher than discovering it through pre-contract evaluation. Pre-contract evaluation exists precisely to shift the discovery point from post-deployment to pre-commitment.

The evaluation window in the procurement timeline

Pre-contract accuracy evaluation typically occurs between RFP scoring (which produces the shortlist) and contract negotiation (which produces the signed agreement). The timing matters: you need a shortlist to know which vendors to evaluate, but you need evaluation results before you can negotiate an accuracy guarantee with any confidence. The evaluation window is typically two to four weeks — long enough to design a test corpus, prepare reference transcripts, submit to vendors, receive results, run DCMP scoring, and interpret findings, but short enough to keep the procurement timeline from stalling. Compressed evaluations that skip reference transcript preparation or reduce the test corpus to one or two clips produce unreliable results that are not worth the time spent collecting them. If the evaluation window is too short to run a proper test, it is better to ask vendors for a formal pilot after contract signature with defined go/no-go criteria than to run a compressed pre-contract evaluation that creates false confidence.

How evaluation connects to the accuracy guarantee in the contract

Pre-contract evaluation results have a second function beyond vendor comparison: they establish the empirical baseline that makes an accuracy guarantee in the contract enforceable. A contract clause that says "vendor will provide captions at 99% word accuracy as measured by the DCMP Captioning Key" is only meaningful if there is a documented measurement protocol and a reference content set against which accuracy can be verified. Organizations that have run a pre-contract evaluation have both: the DCMP-protocol process they used to measure evaluation results is the same process that defines the accuracy guarantee, and the evaluation corpus (or a comparable set) provides the reference content for future QA measurement. Organizations that sign contracts with accuracy guarantee language but no documented measurement methodology cannot enforce the guarantee because they cannot demonstrate, using a mutually agreed measurement protocol, that the vendor's output fails the standard. The vendor contract review checklist covers the specific contract language that makes accuracy guarantees enforceable; pre-contract evaluation is what gives you the measurement methodology evidence to negotiate that language from a position of knowledge.

Designing the test corpus

The test corpus — the set of video files you send to vendors for evaluation — is the most consequential design decision in the accuracy evaluation process. A poorly designed corpus produces results that are accurate but not diagnostic: accurate in the sense that DCMP scores reflect real accuracy on the content tested, but not diagnostic in the sense that the results don't tell you whether the vendor will work on the content that matters most for your compliance risk profile. The test corpus design principle is: build the corpus to diagnose, not to confirm. The evaluation should reveal the accuracy ceiling on your hardest content, not the comfort zone performance on your most common content.

The foundational design decision: diagnostic vs. representative

Most L&D teams, when asked to "send a sample of your training content" for an evaluation, send a representative sample: the most popular videos, the most recent content, the content they feel best represents the library. Representative sampling is appropriate for some purposes (estimating average QA burden, pricing per-minute volume agreements). It is the wrong design for accuracy evaluation because representative sampling overweights the content where vendor accuracy is highest (soft-skills training, general corporate communications, management development) and underweights the content where vendor accuracy failures occur (technical certification, domain-specific compliance training, product onboarding with current SKU vocabulary, clinical procedure training with drug names).

Diagnostic sampling inverts this. The diagnostic corpus is weighted toward the content types where vocabulary failure is most likely and where compliance risk is highest. The reason is not that these videos represent the volume of content the vendor will process, but that they reveal the accuracy ceiling. A vendor that scores 97% on your soft-skills corpus and 93% on your OSHA safety training corpus is revealing, through that gap, a domain vocabulary problem that will manifest on all your technical and regulated content. You will not discover this with a representative sample that mirrors your library's ratios (80% soft-skills content and 20% technical content, say), because the soft-skills clips dominate the average.

Content category selection

A well-designed test corpus includes at least one clip from each of the distinct content categories your library contains, with heavier representation for the highest-risk categories. The following framework organizes content by ASR difficulty and compliance risk:

Content category framework for test corpus design
Category	ASR difficulty	Compliance risk	Recommended clips in corpus	Why include it
Technical / engineering onboarding	High — platform names, command syntax, version strings, architecture terms	High (ADA Title I / Section 508)	2–3	Reveals domain vocabulary ceiling; proper noun failure rate
Clinical / medical compliance training	Very high — drug names, dosages, procedure codes, anatomy, regulatory citations	Very high (Section 504 / HIPAA-adjacent)	2–3 if applicable to your vertical	Hardest ASR problem in L&D; DCMP score on clinical content is the worst-case floor
Regulatory / compliance training	High — OSHA citation codes, chemical names, regulatory body names, legal definitions	Very high (mandatory content — highest obligation)	2	Mandatory content with highest compliance obligation; vocabulary dense
Product / sales enablement training	High — current product names, SKU identifiers, platform-specific terms, competitor names	Medium-high (ADA Title I)	1–2	Fastest-changing vocabulary category; reveals how quickly the vendor's model goes stale
Soft-skills / leadership development	Low–medium — conversational English, organizational vocabulary varies	Medium	1	Baseline control; if vendor fails here, disqualify immediately
Executive communications / all-hands	Medium — organizational vocabulary, names, strategy language	Medium (ADA Title I employer communications)	1	Common in large organizations; organizational proper noun density reveals name handling
Onboarding / company-specific orientation	High — internal names, role titles, process names, system names	Medium-high	1	Maximum organizational proper noun density; tests vendor's handling of your specific naming conventions

The "hardest clip" and "easiest clip" anchors

Every test corpus should include two anchor clips that bookend the accuracy spectrum. The "hardest clip" is the video in your library most likely to break an ASR model: highest technical vocabulary density, most organizational proper nouns, fastest speech pace, any significant background noise or multiple speakers. For a financial services L&D team, this might be a derivatives product certification video with regulatory citation codes and instrument names. For an engineering org, it might be a Kubernetes deployment walkthrough with command-line syntax read aloud. Include this clip in every vendor evaluation. The vendor's score on the hardest clip is the floor prediction for your live production environment as the library grows into more technical content. The "easiest clip" is the video most likely to produce high accuracy from any vendor: a professionally recorded, single-speaker soft-skills training video with no jargon. Include this clip too. If a vendor scores below 97% on your easiest clip, the evaluation is over — do not advance to contract discussion regardless of their RFP performance. A vendor that cannot achieve near-ceiling accuracy on general conversational training content will not achieve 99% on your technical content.

Clip length and the "within-clip" variation problem

Individual clips for evaluation should run 8–15 minutes each. Clips shorter than 5 minutes provide insufficient sample size for DCMP scoring — you need at least 1,000 words per clip to get a statistically meaningful accuracy measurement, and short clips may not contain enough technical vocabulary for the diagnostic signal to emerge. Clips longer than 20 minutes increase the reference transcript preparation burden without proportional accuracy information gain. The 8–15 minute range is the sweet spot for pre-contract evaluation. One technical concern with longer clips: accuracy often varies significantly within a clip based on vocabulary density. A 15-minute engineering onboarding video may have 95% accuracy during the introduction (conversational, low vocabulary density) and 81% accuracy during the technical walkthrough section (high vocabulary density). Including a variety of clip lengths and noting within-clip variation patterns provides more diagnostic signal than averaging across the full clip.
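The clip-length guidance above can be checked mechanically before any reference transcript work begins. The following is a minimal sketch, assuming each clip's duration is already known and that a draft transcript is available as plain text; the function names and the bracketed-marker convention are illustrative assumptions, not part of any DCMP tooling.

```python
import re

MIN_MINUTES, MAX_MINUTES = 8, 15   # recommended clip length range from the text
MIN_WORDS = 1000                   # minimum reference words per clip from the text


def count_words(transcript: str) -> int:
    """Count whitespace-delimited words, ignoring bracketed markers
    such as [00:05:00] or [INTERVIEWER]: (an assumed transcript convention)."""
    without_markers = re.sub(r"\[[^\]]*\]:?", " ", transcript)
    return len(without_markers.split())


def check_clip(duration_minutes: float, transcript: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the clip fits."""
    problems = []
    if not MIN_MINUTES <= duration_minutes <= MAX_MINUTES:
        problems.append(
            f"duration {duration_minutes} min is outside {MIN_MINUTES}-{MAX_MINUTES} min"
        )
    words = count_words(transcript)
    if words < MIN_WORDS:
        problems.append(f"only {words} words; at least {MIN_WORDS} recommended")
    return problems
```

A clip that returns an empty list is a reasonable candidate; a clip that fails on word count alone may still be worth keeping if it is the only example of a high-risk category, but its score should then be read with more caution.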

Content you should not include in the evaluation corpus

Several content types that might seem obvious to include should be excluded or treated as supplementary rather than primary evaluation criteria. Previously captioned content (videos that already have a reviewed SRT file) should not be included, because vendors with glossary capability may use your existing caption file as a vocabulary reference and produce artificially inflated accuracy scores. Confidential or proprietary content that you would not be comfortable sharing with a vendor whose contract you have not signed should be excluded; draw the evaluation corpus from content you are willing to share either without a signed NDA or under an NDA that leaves you no discomfort about the specific material. Synthetic or scripted content that reads from a prepared text should be treated cautiously: some vendors produce better results on read-aloud scripted content than on naturally recorded instruction, and evaluating only scripted content may overestimate accuracy on the naturally delivered lecture and demonstration content that is common in L&D libraries.

Content-type sampling protocol

Having selected the content categories and individual clips for the evaluation corpus, the next design decision is how to organize the submission for multi-vendor comparison. The goal of the sampling protocol is to ensure that every vendor processes identical content so that accuracy differences in the results reflect vendor performance, not corpus differences. This sounds obvious but is violated more often than not: evaluations where each vendor receives a slightly different subset of the corpus, where some vendors receive clips with audio quality issues and others receive professionally produced content, or where vendors receive clips at different stages of the selection process produce results that cannot be compared across vendors.

Identical corpus, identical conditions

Every vendor on the shortlist must receive the same set of clips, exported from the same source with the same audio codec and bit rate, with the same file naming convention, and submitted through the same submission mechanism (either all via the vendor's API, or all via the vendor's file upload interface, or all via email — but not a mix). Audio quality variation is the most common source of artificial vendor differentiation in multi-vendor evaluations: if one vendor receives a clip that was exported at higher bitrate or with better noise reduction, their accuracy score will be higher for reasons unrelated to their model quality. Standardize the export settings before submitting to any vendor. For training content that was originally recorded at multiple quality levels, use the median quality standard — not the best-quality clips, which may not represent live production conditions.
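One way to prove that every vendor received byte-identical files is to hash the exported corpus once and compare that manifest against each submission folder. This is a minimal sketch, assuming one folder per vendor submission; the folder layout and function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large media files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(corpus_dir: Path) -> dict[str, str]:
    """Map each file name in a folder to its SHA-256 digest."""
    return {p.name: sha256_of(p) for p in sorted(corpus_dir.iterdir()) if p.is_file()}


def mismatched_files(master: dict[str, str], sent: dict[str, str]) -> list[str]:
    """List files that are missing, extra, or different between two manifests."""
    names = set(master) | set(sent)
    return sorted(name for name in names if master.get(name) != sent.get(name))
```

An empty result from `mismatched_files` for every vendor folder is evidence that score differences come from the vendors and not from re-exported or substituted clips. Keeping the master manifest alongside the evaluation records also documents exactly what was tested.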

Blind submission vs. labeled submission

The question of whether to label the content ("this is engineering onboarding," "this is compliance training") before submission is a genuine evaluation design decision. Labeled submission allows vendors with glossary capability to pre-build domain glossaries for the evaluation content — which produces better evaluation scores but may not reflect what would actually happen in a production onboarding where the glossary needs to be built from scratch. Unlabeled submission tests the vendor's base model performance without glossary augmentation, which reveals the accuracy floor but underestimates what the vendor can achieve with a properly configured glossary. The recommended approach for a two-phase evaluation: run an unlabeled submission first (tests base model accuracy), then run a labeled submission with time for glossary configuration (tests optimized accuracy). The gap between the two scores reveals how much glossary customization is contributing to the vendor's accuracy and how quickly they can build and configure that glossary. A vendor that jumps from 84% to 97% with glossary configuration has demonstrated that the glossary is doing critical work — which means their glossary onboarding quality and timeline are now critical evaluation dimensions.
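The gap between the two phases is simple arithmetic, but computing it per clip keeps the comparison honest when a vendor improves on some content types and not others. A minimal sketch, assuming scores are already available as percentages keyed by a clip label of your choosing (the labels here are hypothetical):

```python
def glossary_lift(unlabeled: dict[str, float], labeled: dict[str, float]) -> dict[str, float]:
    """Percentage-point gain per clip from the unlabeled (base model) phase
    to the labeled (glossary-configured) phase. Clips missing from either
    phase are skipped rather than guessed."""
    return {
        clip: round(labeled[clip] - unlabeled[clip], 2)
        for clip in unlabeled
        if clip in labeled
    }


# Using the example figures from the paragraph above:
lift = glossary_lift({"engineering_onboarding": 84.0}, {"engineering_onboarding": 97.0})
# lift == {"engineering_onboarding": 13.0}
```

A large lift is not a bad result in itself. It indicates that the glossary is carrying the accuracy, so the time the vendor needed to build it belongs in the comparison alongside the score.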

Volume calibration

The evaluation corpus should total 60–120 minutes of audio across all clips. Below 60 minutes, you have insufficient sample size to detect systematic patterns across content types: a vendor with a specific failure mode on proper nouns in regulatory content may not hit that vocabulary enough times for the failure to register in the DCMP score. Above 120 minutes, the reference transcript preparation burden becomes large enough to cause teams to skip or abbreviate the process, which eliminates the DCMP scoring and reduces the evaluation to qualitative review. The 60–120 minute range provides enough sample size for statistically meaningful results while keeping reference transcript preparation and scoring manageable within a two-week evaluation window. For the multi-vendor shortlist evaluation, this volume applies once: every vendor receives the same 60–120 minutes of content, not 60–120 minutes per vendor.

Controlling for audio quality variation within the corpus

Within a single organization's training library, audio quality varies significantly. Professionally produced content recorded in a studio or professionally equipped office with noise-isolated microphones will score 8–15 percentage points higher on any ASR model than content recorded with a laptop microphone in a home office with background noise, air conditioning, or echo. The home-office audio problem is documented in the remote and hybrid workforce captioning post — it is a pervasive problem for post-2020 training libraries where content production is distributed. For accuracy evaluation, include one clip with professional recording quality and one clip with typical distributed-production quality (home office audio). The gap between the two clips' scores reveals how the vendor's model handles real-world audio quality variation, which matters because your live production content will include both quality levels. A vendor whose score drops 18 percentage points between professional and home-office audio will struggle with the real-world content mix. A vendor who uses audio enhancement pre-processing and drops only 7 percentage points is a different risk profile.

Preparing reference transcripts

The reference transcript is a word-for-word written record of everything the speaker says in each evaluation clip, prepared independently of any vendor output and before the content is submitted for evaluation. It is the ground truth against which vendor caption output is measured in DCMP scoring. Without a reference transcript, DCMP scoring cannot be performed. With a reference transcript, DCMP scoring is a structured counting exercise: compare the vendor's caption output word by word to the reference transcript, mark each error type (substitution, insertion, deletion, formatting), count total audio words, compute accuracy. The reference transcript preparation step is the most time-consuming part of the evaluation process and the step most frequently eliminated — which is also why most pre-contract evaluations produce qualitative impressions rather than DCMP measurements.
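The word-by-word comparison can be approximated in software as a cross-check on the manual count. The sketch below uses a standard minimum-edit-distance alignment to count substitutions, insertions, and deletions. It rests on several assumptions that are not from the DCMP Captioning Key: both texts are lowercased and stripped of punctuation before comparison, formatting errors are not detected at all, and "correct" words are taken to be total audio words minus counted errors. Treat it as a sanity check on the hand-scored number, not a replacement for it.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and keep letters, digits, and apostrophes (an assumed normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def align_counts(reference: list[str], hypothesis: list[str]) -> dict[str, int]:
    """Count S/I/D errors along one minimum-edit-distance alignment."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cost[i][j] = fewest edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            diff = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + diff,  # match or substitution
                cost[i - 1][j] + 1,         # deletion: audio word missing from caption
                cost[i][j - 1] + 1,         # insertion: caption word not in audio
            )
    counts = {"S": 0, "I": 0, "D": 0}
    i, j = len(reference), len(hypothesis)
    while i > 0 or j > 0:  # walk back from the end along an optimal path
        diff = 0 if i > 0 and j > 0 and reference[i - 1] == hypothesis[j - 1] else 1
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + diff:
            counts["S"] += diff
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["D"] += 1
            i -= 1
        else:
            counts["I"] += 1
            j -= 1
    return counts


def word_accuracy(reference: list[str], hypothesis: list[str]) -> float:
    """Accuracy as (audio words - errors) / audio words x 100."""
    errors = sum(align_counts(reference, hypothesis).values())
    return 100.0 * (len(reference) - errors) / len(reference)
```

For a reference of "Apply the Kubernetes manifest to the cluster" and a caption of "Apply the cooper netties manifest to cluster", the alignment reports one substitution, one insertion, and one deletion against seven audio words. If the automated figure and the hand-scored figure diverge by more than a point or two, the usual causes are formatting errors (which this sketch ignores) or a normalization difference.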

How to prepare a reference transcript

The reference transcript should be prepared by a human reviewer who listens to the audio and types a verbatim record of what is spoken. Verbatim means every word the speaker says, including false starts, corrections, and filler words (um, uh) that the speaker does not clearly intend as meaningful words — because the DCMP measurement requires you to count every audio word to compute the accuracy denominator, and any word you exclude from the reference transcript will reduce your denominator incorrectly. Common sources of reference transcript error that invalidate the measurement: editing out pauses and filler words (reduces denominator and inflates apparent accuracy), paraphrasing instead of transcribing verbatim (changes the comparison baseline), correcting speaker errors (the vendor is measured against what the speaker actually said, not what they intended to say), and skipping technical terms that are hard to spell (precisely the terms that will fail in vendor output — they must be in the reference transcript with correct spelling).

The reference transcript should include speaker identification markers when the clip contains multiple speakers (e.g., "[INTERVIEWER]:" and "[SUBJECT MATTER EXPERT]:"). It should include timestamps at defined intervals — every 5 minutes is sufficient for synchronization checking during DCMP scoring. It should not include caption formatting or line-break notation — those are irrelevant to word-level accuracy measurement and add preparation overhead without evaluation benefit. The reference transcript format is a plain text document, one line per speaker turn (or one line per sentence for single-speaker content), with timestamps every 5 minutes. Total preparation time estimate: 3–4 hours per 10 minutes of evaluation content for a fast transcriber, 5–6 hours for a careful transcriber unfamiliar with the technical vocabulary. For a 90-minute evaluation corpus, budget 30–55 hours of reference transcript preparation. This is why the evaluation window needs to be at least two weeks and why the reference transcript preparation must begin before the vendor submission step, not after.
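The preparation budget scales linearly with corpus length, so it is worth recomputing for your actual corpus rather than reusing the 90-minute figure. A minimal sketch using the per-10-minute rates quoted above; the function name is hypothetical.

```python
def prep_hours(corpus_minutes: float, hours_per_10_min: tuple[float, float]) -> tuple[float, float]:
    """Return the (low, high) estimate of transcription hours for a corpus."""
    blocks = corpus_minutes / 10
    low, high = hours_per_10_min
    return (blocks * low, blocks * high)


fast = prep_hours(90, (3, 4))      # fast transcriber: (27.0, 36.0)
careful = prep_hours(90, (5, 6))   # careful transcriber new to the vocabulary: (45.0, 54.0)
```

The unrounded span for a 90-minute corpus is 27 to 54 hours, which is in line with the rounded budget stated above. Running the same calculation for a 60-minute or 120-minute corpus shows quickly whether one person can finish the transcripts inside the evaluation window.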

Using auto-transcription to accelerate reference transcript preparation

Reference transcript preparation can be accelerated (not replaced) by using an auto-transcription service to generate a first-pass transcript that a human reviewer then corrects. The correction pass is mandatory and cannot be abbreviated: the purpose of the reference transcript is to be a reliable ground truth, and an uncorrected auto-transcription contains the same kinds of errors that you will be measuring in vendor output. But the correction pass is typically faster (2–3 hours per 10 minutes for an experienced reviewer) than building the transcript from scratch. Important constraint: do not prepare the reference transcript with the same ASR model or service that you are planning to evaluate as a vendor, because the reference transcript errors will correlate with the vendor output errors and artificially inflate that vendor's accuracy score. If you are evaluating Whisper-based vendors, use a different source for the first pass (Rev Human, another human transcription service, or Google Speech-to-Text). If you are evaluating multiple competing services, prepare the reference transcript with a human-only transcription approach to avoid any correlation artifact.

Vocabulary preparation alongside the transcript

While preparing the reference transcript, compile a vocabulary list: every technical term, proper noun, acronym, organizational name, product name, role title, and regulatory citation that appears in the evaluation corpus. This list serves two purposes. First, it provides the term inventory for the labeled evaluation phase (vendors with glossary capability will use this list to pre-configure their model). Second, it allows you to run a term-specific accuracy analysis after scoring: for each vendor, compute accuracy on the vocabulary list terms separately from accuracy on the general vocabulary. Term-specific accuracy is the diagnostic signal that explains why one vendor's general accuracy score may be 94% while their accuracy on the terms that matter most for compliance risk is 78%. A vendor that scores 94% overall but 78% on domain vocabulary terms has a structural problem that will worsen as your content library grows and domain vocabulary density increases. The term-specific analysis is the differentiation that headline scores do not reveal.
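A rough first pass at the term-specific analysis can be automated by counting how often each vocabulary term appears in the reference transcript and how many of those occurrences survive in the vendor's caption text. This is a count-based proxy under stated assumptions: it lowercases and strips punctuation, it does not check that a term appears at the right position, and the function names are illustrative. A term the caption gets right once and wrong once will show as one of two captured.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and reduce the text to letters, digits, and single spaces."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def count_phrase(text: str, phrase: str) -> int:
    """Count whole-word occurrences of a (possibly multi-word) phrase in normalized text."""
    pattern = rf"(?<![a-z0-9]){re.escape(phrase)}(?![a-z0-9])"
    return len(re.findall(pattern, text))


def term_accuracy(reference: str, caption: str, terms: list[str]) -> dict:
    """Per-term (captured, spoken) counts plus an overall percentage."""
    ref, cap = normalize(reference), normalize(caption)
    per_term = {}
    spoken = captured = 0
    for term in terms:
        key = normalize(term)
        in_ref = count_phrase(ref, key)
        if in_ref == 0:
            continue  # term is on the list but never spoken in this clip
        in_cap = min(count_phrase(cap, key), in_ref)  # never credit more than was spoken
        per_term[term] = (in_cap, in_ref)
        spoken += in_ref
        captured += in_cap
    overall = 100.0 * captured / spoken if spoken else None
    return {"per_term": per_term, "term_accuracy": overall}
```

Comparing this figure with the clip's overall accuracy makes the gap described above visible per vendor. The per-term counts also show which specific terms each vendor misses, which is useful input for the labeled glossary phase.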

Running the DCMP-protocol score

The DCMP scoring process converts vendor caption output and your reference transcript into a numeric accuracy measurement using the same protocol that compliance auditors apply. The output of the process is a percentage accuracy figure per vendor per content clip, comparable across vendors and defensible in a compliance context. The process requires: the vendor's caption file (SRT or VTT), your reference transcript for the same clip, a media player that allows you to play the audio while reading the caption file simultaneously, and a structured error-counting form.

Step 1: Prepare the scoring form

Before beginning the scoring session, create a structured form for tracking each error. The minimum error tracking fields are: timestamp (where in the clip the error occurs), error type (S for substitution, I for insertion, D for deletion, F for formatting), the audio word or phrase (what the speaker said), and the caption word or phrase (what the caption shows). A simple spreadsheet with one row per error works well. For the post-scoring analysis, the timestamp and type fields allow you to identify patterns (errors concentrated in the first two minutes suggest the model is calibrating; errors concentrated in proper-noun-heavy passages confirm vocabulary failure; errors distributed uniformly suggest general model degradation). Total audio word count must be tracked separately — count the total words in your reference transcript for the clip. This becomes the denominator in your accuracy calculation.
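As a minimal sketch of such a form, the fields above map onto one record per error. The ErrorRow class, the helper names, and the sample entries are all hypothetical, chosen only for illustration; a plain spreadsheet serves the same purpose.

```python
import csv
import io
from collections import Counter
from dataclasses import asdict, dataclass

ERROR_TYPES = {"S", "I", "D", "F"}  # substitution, insertion, deletion, formatting
FIELDS = ["timestamp", "error_type", "audio_text", "caption_text"]


@dataclass
class ErrorRow:
    timestamp: str     # position in the clip, e.g. "02:15"
    error_type: str    # one of S, I, D, F
    audio_text: str    # what the speaker said ("" for an insertion)
    caption_text: str  # what the caption shows ("" for a deletion)

    def __post_init__(self):
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {self.error_type!r}")


def to_csv(rows):
    """Serialize the form to CSV text, one row per error."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


def count_by_type(rows):
    """Tally errors per type for the post-scoring pattern analysis."""
    return Counter(row.error_type for row in rows)


# Hypothetical entries for one clip.
form = [
    ErrorRow("00:48", "S", "PHI", "file"),
    ErrorRow("02:15", "D", "minimum-necessary", ""),
    ErrorRow("02:31", "F", "Kubernetes", "kubernetes"),
]
reference_word_count = 1450  # tracked separately: the accuracy denominator
```

Keeping the reference word count beside the form, rather than inside it, mirrors the point that the denominator is counted from the reference transcript and not from the error rows.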

Step 2: Listen and mark errors simultaneously

Set up the scoring session with three elements running simultaneously: the audio clip playing at normal speed, the vendor's caption file displayed (ideally in a media player that shows the caption track in sync, or in a side-by-side text comparison), and the reference transcript open for word-by-word comparison. For each sentence or caption line, pause after listening and compare the spoken words (from the reference transcript) to the captioned words (from the vendor's output). Mark each discrepancy on the scoring form with the error type. For scoring efficiency, it is easier to work in 30-second intervals than to stop at every word: pause at each 30-second mark, check the reference transcript for that interval against the caption output, mark all errors in that interval, then advance. The 30-second interval approach reduces the cognitive load of simultaneous listening and reading while maintaining word-level accuracy measurement.

Error type definitions

The four DCMP error categories have precise definitions that must be applied consistently across all vendors and clips to produce comparable scores:

Substitution (S): A word that should appear in the caption (it is in the audio and the reference transcript) has been replaced with a different word. "Kubernetes" → "kubernetes" is a formatting error, not a substitution. "Kubernetes" → "Cubernetes" is a substitution. "PHI" → "file" is a substitution. Substitutions are the most common error type for technical vocabulary and proper nouns. Each substitution counts as one error toward the total error count.
Insertion (I): A word appears in the caption that is not in the audio. These are relatively rare in modern ASR output but occur when the model hallucinates a word or phrase, when it misidentifies ambient sound as speech, or when it adds sentence-opening connectives ("And," "But," "So") that the speaker did not say. Each inserted word counts as one error.
Deletion (D): A word that is in the audio (and in the reference transcript) is absent from the caption. Common in fast speech, at sentence boundaries, and for low-frequency technical terms that the model was uncertain about. Deletions on technical terms are compliance-relevant because the omitted word is often the critical information — a drug name, a regulatory citation code, a safety procedure identifier. Each deleted word counts as one error.
Formatting (F): The word is correctly transcribed but a formatting error reduces comprehension. Categories include: incorrect speaker identification in multi-speaker content, missing sound effect description when a sound is essential to meaning, incorrect punctuation that changes meaning (missing period after a sentence that changes the meaning of the adjacent sentence), and incorrect capitalization that changes identification of a proper noun as a proper noun. Formatting errors are counted but typically weighted less heavily than substitution and deletion in practical evaluation — a correctly transcribed word with the wrong capitalization is better than a substituted word.
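
A rough first pass over these categories can be automated before the human scoring session. The sketch below uses Python's difflib.SequenceMatcher to align reference and caption words and propose S, I, and D candidates, plus capitalization-only F candidates. It is an aid under stated assumptions, not part of the DCMP protocol: the tokenizer is simplistic, speaker-identification and punctuation errors are not detected, and every candidate still needs confirmation against the audio. The function names and sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation (a simplification)."""
    words = (w.strip(".,;:!?\"'()") for w in text.split())
    return [w for w in words if w]


def candidate_errors(reference, caption):
    """Return (error_type, audio_words, caption_words) tuples for human review."""
    ref, cap = tokenize(reference), tokenize(caption)
    # Align case-insensitively so capitalization differences surface as F, not S.
    matcher = SequenceMatcher(
        None, [w.lower() for w in ref], [w.lower() for w in cap], autojunk=False
    )
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for r, c in zip(ref[i1:i2], cap[j1:j2]):
                if r != c:  # same word, different capitalization
                    errors.append(("F", [r], [c]))
        elif tag == "replace":
            errors.append(("S", ref[i1:i2], cap[j1:j2]))
        elif tag == "delete":
            errors.append(("D", ref[i1:i2], []))
        elif tag == "insert":
            errors.append(("I", [], cap[j1:j2]))
    return errors


# Hypothetical reference and vendor caption for one sentence.
reference = "Deploy the service to Kubernetes before handling PHI records today"
caption = "Deploy the service to kubernetes and before handling file records"
candidates = candidate_errors(reference, caption)
for error_type, audio_words, caption_words in candidates:
    print(error_type, audio_words, caption_words)
```

A replace span that covers several words is returned as a single candidate; the reviewer decides how many individual word errors it represents when transferring it to the scoring form.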

Step 3: Synchronization check

Run the synchronization check separately from word accuracy scoring. The DCMP standard requires captions to appear within two seconds of the corresponding audio. To check synchronization, use the timestamps in your reference transcript as anchor points. At each 5-minute reference transcript timestamp, compare the timestamp in the vendor's caption file for the same passage. If the vendor's caption for content at 5:00 in the audio appears at 5:04 in the caption file, you have a 4-second synchronization drift that fails the DCMP requirement. Record synchronization errors separately with their timestamp and magnitude. Synchronization drift of 2–4 seconds is detectable to viewers and makes captions functionally difficult to use — the spoken word has passed before the caption appears. Synchronization drift greater than 4 seconds produces captions that are effectively meaningless as an accessibility accommodation; the viewer cannot match caption to speech.

For vendor output in SRT format, synchronization checking is straightforward: the SRT file includes start and end timestamps for each caption line. Compare these timestamps to your reference transcript timestamps at each 5-minute anchor. For vendor output in VTT format, the same process applies. For vendor output delivered via LMS integration (the video in the LMS has the caption track embedded), synchronization checking requires watching the video in the LMS and noting any visible lag or lead between audio and caption. The manual observation method is less precise than timestamp comparison but is sufficient for flagging obvious synchronization failures.
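
For SRT and VTT files, the timestamp comparison can be sketched in a few lines. The example below assumes each anchor is a distinctive phrase from the reference transcript paired with the audio time (in seconds) at which it is spoken; matching cues by phrase is an illustrative choice, and the function names and sample cues are hypothetical.

```python
import re

# Matches the start time of a cue line; hours are optional, as VTT allows mm:ss.ttt.
CUE_TIME = re.compile(r"(?:(\d{2,}):)?(\d{2}):(\d{2})[.,](\d{3})\s*-->")


def cue_starts(caption_text):
    """Return (start_seconds, cue_text) for each cue in SRT or VTT text."""
    cues = []
    lines = caption_text.splitlines()
    for index, line in enumerate(lines):
        match = CUE_TIME.match(line.strip())
        if not match:
            continue
        hours, minutes, seconds, millis = match.groups()
        start = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000
        text = []
        for following in lines[index + 1:]:
            if not following.strip():
                break  # a blank line ends the cue
            text.append(following.strip())
        cues.append((start, " ".join(text)))
    return cues


def drift_report(caption_text, anchors, limit=2.0):
    """For each anchor phrase, report (phrase, drift_seconds, within_limit)."""
    cues = cue_starts(caption_text)
    report = []
    for phrase, audio_time in anchors.items():
        for start, text in cues:
            if phrase.lower() in text.lower():
                drift = round(start - audio_time, 3)  # positive: caption is late
                report.append((phrase, drift, abs(drift) <= limit))
                break
    return report


# Hypothetical vendor output and reference anchors.
sample_srt = """1
00:04:59,500 --> 00:05:03,000
The minimum necessary standard applies here.

2
00:10:04,000 --> 00:10:07,250
Report every breach within sixty days.
"""
anchors = {"minimum necessary standard": 300.0, "report every breach": 600.0}
report = drift_report(sample_srt, anchors)
```

The default limit of 2.0 seconds reflects the two-second requirement described above; anchors whose phrase is not found in any cue are simply omitted from the report and should be checked by hand.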

Step 4: Compute accuracy

Accuracy = ((total audio words − total error count) / total audio words) × 100

Where total audio words = word count of the reference transcript for the clip (every word the speaker said), and total error count = sum of all substitution, insertion, deletion, and formatting errors marked during the scoring session. Insertions do not change the denominator: an inserted word is absent from the audio, so it adds one error to the count while the total audio word count stays fixed. Written out by error type, the same calculation is: accuracy = (correct words / total audio words) × 100, where correct words = (total audio words) − (substitutions) − (deletions) − (formatting errors) − (insertions). One consequence is that a clip with many insertions scores lower than its count of correctly transcribed audio words alone would suggest; in most evaluations insertion rates are low and the effect is small.

Compute a score for each clip in the evaluation corpus, and a weighted average for each content category. Do not average across content types for the headline score without weighting — a 94% average of three soft-skills clips and one technical clip hides the fact that the technical clip scored 81%. The per-clip and per-category scores are the diagnostic output; the overall weighted average is a secondary summary statistic.
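
The calculation and the per-category roll-up can be sketched as follows. Pooling words and errors within a category, which weights each clip by its length, is one reading of "weighted average" and is an assumption here, as are the function names and the clip figures.

```python
from collections import defaultdict


def dcmp_accuracy(total_audio_words, error_count):
    """Accuracy = ((total audio words - total error count) / total audio words) * 100."""
    if total_audio_words <= 0:
        raise ValueError("total_audio_words must be positive")
    return (total_audio_words - error_count) / total_audio_words * 100


def category_accuracy(clips):
    """Pool words and errors per category, so longer clips carry more weight."""
    words, errors = defaultdict(int), defaultdict(int)
    for category, clip_words, clip_errors in clips:
        words[category] += clip_words
        errors[category] += clip_errors
    return {category: dcmp_accuracy(words[category], errors[category]) for category in words}


# Hypothetical corpus: (category, reference word count, total S + I + D + F errors).
clips = [
    ("soft-skills", 1500, 30),
    ("soft-skills", 1500, 24),
    ("soft-skills", 1000, 14),
    ("technical", 1000, 190),
]

per_clip = [dcmp_accuracy(clip_words, clip_errors) for _, clip_words, clip_errors in clips]
unweighted = sum(per_clip) / len(per_clip)  # the headline number that hides the problem
by_category = category_accuracy(clips)      # the diagnostic output

print(f"unweighted mean across clips: {unweighted:.1f}%")
for category, score in by_category.items():
    print(f"{category}: {score:.1f}%")
```

With these illustrative figures the unweighted mean looks respectable while the technical category sits far below it, which is the masking effect the per-category scores are meant to expose.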

Step 5: Vocabulary term accuracy analysis

For each vendor, run a secondary analysis on the vocabulary list you compiled during reference transcript preparation. For each domain-specific term on the vocabulary list, determine whether it was correctly transcribed in the vendor's output across all clips where it appears. Compute: term accuracy = (correctly captioned term occurrences) / (total term occurrences in the corpus) × 100. This analysis typically reveals the sharpest vendor differentiation. On general vocabulary (the 10,000 most frequent English words), vendor accuracy differences narrow to 3–5 percentage points. On domain-specific vocabulary (the tail of the vocabulary distribution), vendor accuracy differences expand to 20–30 percentage points. The vendor with the best domain vocabulary accuracy is the vendor whose base model or glossary system is best adapted to your content, and that vendor's accuracy advantage will grow as your content library becomes more vocabulary-specific over time.
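
A quick approximation of the term accuracy figure can be computed by counting term occurrences in the reference transcript and in the vendor caption text. The sketch below credits a term with at most as many correct occurrences as the reference contains; it does not check that each occurrence lands at the right position, so treat it as a screening step before the manual count. All names and sample text are hypothetical.

```python
import re


def count_term(term, text):
    """Count whole-word (or whole-phrase) occurrences, ignoring case."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def term_accuracy(vocabulary, reference, caption):
    """Return (overall_percent, {term: (found, expected)}) for terms in the reference."""
    per_term = {}
    total = correct = 0
    for term in vocabulary:
        expected = count_term(term, reference)
        if expected == 0:
            continue  # the term does not occur in this clip
        found = min(expected, count_term(term, caption))
        per_term[term] = (found, expected)
        total += expected
        correct += found
    overall = correct / total * 100 if total else None
    return overall, per_term


# Hypothetical vocabulary list, reference transcript, and vendor caption text.
vocabulary = ["PHI", "Kubernetes", "minimum necessary"]
reference = (
    "PHI must follow the minimum necessary rule. "
    "Store PHI on Kubernetes, and audit Kubernetes weekly."
)
caption = (
    "File must follow the minimum necessary rule. "
    "Store PHI on Cubernetes, and audit Kubernetes weekly."
)
overall, per_term = term_accuracy(vocabulary, reference, caption)
```

The per-term breakdown is often more useful than the overall figure, because it shows which specific terms a glossary would need to cover.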

Submitting content to vendors: what to share, what to withhold, and how to avoid conditions that inflate scores

The submission protocol is the step in the evaluation process where the most common integrity failures occur. The goal of submission is to give every vendor identical content, submitted under conditions that represent real production behaviour, without providing information that would allow vendors to optimize their results in ways that are not reproducible in production. Several common practices violate these conditions and produce evaluation results that overestimate live performance.

What to provide

Provide: the audio or video files for each evaluation clip, exported at your standard production settings. Provide the content category label for each clip (engineering onboarding, HIPAA compliance training, soft-skills leadership development) — this is the information a vendor would have in production when the content management system submits files through the API. Do not provide your reference transcript or the vocabulary list in the unlabeled evaluation phase. In the labeled evaluation phase, provide the vocabulary list so vendors can configure glossaries — but note that the time available for glossary configuration in the labeled phase must be standardized across all vendors (give all vendors the same time window from vocabulary list delivery to output submission).

What not to provide that changes the evaluation validity

Do not provide sample caption files from previous caption work on the same content — vendors with glossary systems may use caption vocabulary from existing files to pre-configure their model for your specific content. Do not provide your reference transcript to vendors before their output is submitted — if a vendor can see the reference transcript, they know exactly what errors to look for and can correct them manually, producing output that will not be reproducible in production. Do not tell vendors which clips will be used for the DCMP scoring — if vendors know which clips are being scored and which are supplementary, they will prioritize human review on the scored clips. Production behaviour involves automated workflows with spot human review, not full human review on every file. Telling vendors which clips are scored produces human-reviewed output on scored clips and automated output on others — a condition that does not exist in production.

Standardizing turnaround time

Give every vendor the same amount of time to return results from the same submission date. Turnaround time varies significantly by vendor: fast automated pipelines return results in hours; services with human review queues may take 48–72 hours for the standard tier. Allow enough time for the slowest expected vendor to return results under their standard SLA. Do not allow vendors to request extensions for the evaluation corpus — if a vendor cannot meet their stated SLA on an evaluation corpus of this volume, that is a signal about how they will perform in production when your full video queue is submitted. The failure to meet SLA on the evaluation corpus is itself a red flag (see the red flags section below).

API vs. file upload submission consistency

If your production deployment will use the vendor's API (the file upload, caption download, LMS integration API), test via the API — not via a manual upload interface if a manual interface exists. Vendors occasionally maintain separate processing queues for API submissions and evaluation submissions, where evaluation submissions receive expedited processing or prioritized human review that is not applied to production API traffic. Submitting via the production API removes this risk. If the vendor does not have a self-service API available for evaluation (some require a signed agreement before API access is provisioned), document this as a friction signal: you are evaluating a vendor whose production interface requires a signed agreement before you can test what you are about to sign for.

What to ask vendors to provide alongside the output

Request from each vendor, alongside the caption output files: (1) the accuracy metric they compute on the evaluation corpus, using their own measurement methodology — this allows you to compare their self-reported accuracy to your DCMP score, and the gap between the two is diagnostic of how their measurement methodology compares to DCMP; (2) the confidence scores or per-word probability scores if their model produces them — some vendors can provide confidence metadata alongside the caption file that shows which words the model was uncertain about; (3) the glossary terms they applied in the labeled evaluation phase and how many terms were added from the vocabulary list you provided; (4) their production SLA for this content volume, stated in hours. These requests are also tests: vendors who refuse to share their accuracy measurement methodology, who cannot provide confidence metadata, or who cannot state a production SLA for your volume are revealing operational transparency gaps that will matter during the contract and in production QA.

Interpreting results and weighting by content type

When vendor results return, you have per-clip DCMP scores for each vendor across each content category. The interpretation task is to convert these scores into a vendor comparison that accounts for your organization's specific compliance risk profile — which means weighting the scores by content type rather than averaging uniformly. An organization whose training library is 70% compliance and technical content and 30% soft-skills content should weight compliance and technical clip scores at 70% and soft-skills scores at 30% in the weighted average. An organization with the inverse profile weights the inverse way. The weighted average aligns the evaluation result with the actual compliance risk distribution in the live content library.

Threshold analysis vs. average comparison

The most important interpretation question is not "which vendor has the highest average score?" but "which vendors clear the 99% DCMP threshold on the content types where we have the highest compliance obligation?" Threshold analysis is binary per content type: the vendor either meets 99% or does not. A vendor that scores 99.2% on soft-skills content and 88% on compliance training is below threshold on the content type where your compliance risk is highest. The fact that their average is 93.6% is irrelevant — you need 99% on compliance training, not 93.6% on average. The threshold analysis should be applied to your highest-obligation content types first. Any vendor that does not clear threshold on mandatory compliance or regulated content should be eliminated from vendor consideration regardless of their other scores.

Vendor comparison framework — score by content type with threshold indicator
Content type	Weight	Threshold	Vendor A	Vendor B	Vendor C
Technical / engineering onboarding (2 clips avg)	25%	99%	87.4% ✗	96.2% ✗	98.9% ✗
Compliance / regulatory training (2 clips avg)	35%	99%	83.1% ✗	91.7% ✗	97.3% ✗
Soft-skills / leadership (1 clip)	20%	99%	96.8% ✗	98.4% ✗	99.1% ✓
Executive / org comms (1 clip)	10%	99%	93.2% ✗	95.7% ✗	98.6% ✗
Product / sales enablement (1 clip)	10%	99%	84.6% ✗	89.3% ✗	96.1% ✗
Weighted average	100%	99%	88.1% ✗	94.3% ✗	98.1% ✗

This table illustrates the common outcome of a rigorous pre-contract evaluation: no vendor clears the 99% threshold on every content type in the unlabeled evaluation phase. This is the expected result for any vendor relying primarily on their base ASR model without per-customer glossary customization. The evaluation tells you not only which vendor scores highest but which vendor is closest to threshold and on which content types. The follow-up question, when no vendor clears threshold unlabeled, is: what can each vendor achieve with glossary configuration on the content types where they are furthest from threshold? The labeled evaluation phase answers that question. For the Rev vs. GlossCap comparison, the 3Play vs. GlossCap comparison, and the Verbit vs. GlossCap comparison — all three evaluations show this same pattern: unlabeled base model accuracy well below 99% on technical training content, and significant improvement in glossary-configured evaluations, with GlossCap's per-customer glossary producing the largest gain on domain-specific vocabulary.
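
The weighted average and the threshold indicators in a comparison like this can be reproduced with a short script. The sketch below copies the weights and per-category scores from the table rows; the dictionary keys and function names are hypothetical shorthand.

```python
THRESHOLD = 99.0

# Content-type weights from the table, as fractions of the library.
WEIGHTS = {
    "technical": 0.25,
    "compliance": 0.35,
    "soft_skills": 0.20,
    "exec_comms": 0.10,
    "product_sales": 0.10,
}

# Per-category scores from the table rows.
SCORES = {
    "Vendor A": {"technical": 87.4, "compliance": 83.1, "soft_skills": 96.8,
                 "exec_comms": 93.2, "product_sales": 84.6},
    "Vendor B": {"technical": 96.2, "compliance": 91.7, "soft_skills": 98.4,
                 "exec_comms": 95.7, "product_sales": 89.3},
    "Vendor C": {"technical": 98.9, "compliance": 97.3, "soft_skills": 99.1,
                 "exec_comms": 98.6, "product_sales": 96.1},
}


def weighted_average(scores, weights):
    """Sum of score * weight over content types; weights must total 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(scores[content_type] * weight for content_type, weight in weights.items())


def clears_threshold(scores, threshold=THRESHOLD):
    """Binary pass/fail per content type."""
    return {content_type: score >= threshold for content_type, score in scores.items()}


for vendor, scores in SCORES.items():
    passed = [c for c, ok in clears_threshold(scores).items() if ok]
    print(f"{vendor}: weighted average {weighted_average(scores, WEIGHTS):.1f}%, "
          f"clears threshold on: {passed or 'none'}")
```

Keeping the threshold check separate from the weighted average reflects the point that a vendor is judged per content type first, with the average as a secondary summary.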

What to do when no vendor clears threshold

The evaluation may produce results where no shortlisted vendor achieves 99% DCMP accuracy on your technical training content in the unlabeled phase. This is a signal about the state of the market, not a failure of the evaluation. In this case, the evaluation output shifts from "which vendor to select" to "which vendor is closest to threshold and what workflow addition would close the gap." Three common resolution paths: (1) The vendor with the best unlabeled technical training accuracy runs a labeled evaluation with glossary configuration to verify they can reach 99% with glossary on your content — the glossary onboarding process and timeline become part of the contract scope; (2) The evaluation reveals that glossary-based AI captioning is the only path to 99% on your technical content, and the evaluation is rerun with vendors who specialize in glossary-conditioned workflows; (3) The contract includes a post-deployment accuracy ramp provision — the vendor achieves 95%+ unlabeled immediately, and the contract specifies a 60-day glossary build period after which DCMP verification is run and the accuracy guarantee provisions become active. None of these paths requires abandoning the evaluation — they require using the evaluation output to structure the contract rather than simply to rank vendors.

The gap analysis: vendor accuracy vs. your compliance threshold

For each vendor, compute the gap between their highest-scored content type and their lowest-scored content type. A gap of more than 8 percentage points between easiest and hardest content types signals a vocabulary-driven accuracy cliff that will worsen as your library's technical content grows. A gap of 12 percentage points or more should be treated as a structural disqualifier unless the vendor can explain the gap in terms of a specific solvable problem (e.g., "our model was not pre-trained on healthcare vocabulary, but we can close this gap with the clinical glossary within 30 days of onboarding"). The gap analysis also tells you where the vendor's model risk concentrates — which content types in your library will require more QA attention and correction overhead in production. For ongoing QA cadence planning, the QA methodology post covers how to set spot-check frequency based on content type risk profile, which connects directly to the accuracy gaps revealed in the evaluation.
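
The gap calculation is simple enough to script alongside the scores. The sketch below applies the 8-point and 12-point cut-offs described above; the function name, the verdict wording, and the sample scores are illustrative assumptions.

```python
def gap_analysis(scores, warn=8.0, disqualify=12.0):
    """Return (easiest, hardest, gap, verdict) for one vendor's category scores."""
    easiest = max(scores, key=scores.get)
    hardest = min(scores, key=scores.get)
    gap = round(scores[easiest] - scores[hardest], 1)
    if gap >= disqualify:
        verdict = "disqualify unless the vendor explains a solvable cause"
    elif gap > warn:
        verdict = "accuracy cliff: expect extra QA on the hardest content"
    else:
        verdict = "within tolerance"
    return easiest, hardest, gap, verdict


# Illustrative per-category scores for three vendors.
vendor_a = {"technical": 87.4, "compliance": 83.1, "soft_skills": 96.8}
vendor_b = {"technical": 96.2, "product_sales": 89.3, "soft_skills": 98.4}
vendor_c = {"technical": 98.9, "product_sales": 96.1, "soft_skills": 99.1}

for name, scores in [("A", vendor_a), ("B", vendor_b), ("C", vendor_c)]:
    easiest, hardest, gap, verdict = gap_analysis(scores)
    print(f"Vendor {name}: {easiest} vs {hardest}, gap {gap} points, {verdict}")
```

The hardest content type returned for each vendor is also the natural starting point for planning where production QA attention should concentrate.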

Red flags that predict live-environment failure

Beyond the accuracy scores themselves, the evaluation process surfaces behavioural and process signals that predict how a vendor will perform in production. These red flags are often more predictive than the scores because they reveal vendor behaviour under constrained conditions — the evaluation is, among other things, a controlled observation of how the vendor operates when they know they are being evaluated. If these signals appear during the evaluation, they will be amplified in production when the vendor has less incentive to perform and more content volume to process.

Red flag 1: Accuracy cliff between easiest and hardest content

If a vendor scores 97% on your soft-skills control clip and 81% on your technical onboarding clip, they have demonstrated a 16-point accuracy cliff on domain vocabulary. This is not a marginal difference in a noisy measurement — it is evidence that the vendor's base model is performing well on conversational English and failing on the vocabulary that makes your technical content technically dense. This gap will not narrow in production; it will widen as your content library grows into more specialized content types and as product names, platform versions, and organizational vocabulary evolve. A vendor with this accuracy cliff pattern should not be selected without a documented glossary configuration plan that demonstrably closes the gap in the labeled evaluation phase, and the contract should include accuracy testing provisions for new content types added to the library.

Red flag 2: Accuracy improvement that exceeds expectation in the labeled phase

Counterintuitively, a very large accuracy improvement in the labeled evaluation phase is a red flag as well as a signal of capability. If a vendor improves from 83% to 99.4% on technical training content after receiving your vocabulary list, they have demonstrated that glossary configuration is doing all 16.4 percentage points of the work, which means the glossary must be comprehensive, current, and well-maintained in production. A vendor that depends this heavily on a pre-configured glossary requires a robust glossary onboarding workflow, a documented process for updating the glossary as product names and organizational vocabulary change, and operational accountability for glossary freshness. A vendor who cannot describe their glossary onboarding process, who cannot estimate how quickly they can update the glossary when a product is rebranded, or who treats the glossary as a one-time setup rather than an ongoing maintenance responsibility has revealed a gap that will produce accuracy drift in production within 6–12 months. The glossary architecture post covers the maintenance cadence requirements that make vocabulary-conditioned accuracy sustainable over time.

Red flag 3: Vendor asks to use their sample content instead of yours

A vendor who says "instead of your content, let us caption these sample training videos we use for evaluations" is attempting to control the test corpus in their favor. Their evaluation content has been selected to produce high scores — it uses vocabulary their model handles well, has recording quality at the high end of the spectrum, and has likely been through internal QA before being offered as evaluation content. This request should be declined. The evaluation corpus is your content precisely because your compliance risk is with your content, not with a vendor's curated examples. A vendor who cannot or will not process your content for evaluation purposes is not demonstrating the confidence in their model that their marketing materials assert. You may offer to provide clips that are not proprietary and do not require an NDA, but the content must come from your library.

Red flag 4: Synchronization failures despite good word accuracy

Some vendors, particularly those with speech recognition pipelines that are optimized for word accuracy rather than real-time synchronization, produce caption files that score 96%+ on word accuracy but show consistent synchronization drift of 3–5 seconds. A 96% word-accurate caption that appears 4 seconds after the spoken audio is not a compliant caption accommodation: the DCMP Captioning Key requires both word accuracy and synchronization within 2 seconds. A viewer with a hearing disability who relies on the caption track as their primary access to the audio cannot follow content where captions arrive several seconds after the speech they describe. Synchronization failures may not appear on vendor-provided demo content because demo content often uses pre-produced caption files rather than real-time pipeline output. They appear on your evaluation content when it is processed through the vendor's production pipeline. Any vendor with systematic synchronization drift greater than 2 seconds across multiple clips should be treated as failing the evaluation regardless of their word accuracy scores.

Red flag 5: Results delivered faster than the stated production SLA

If a vendor states a 24-hour SLA for your content volume and delivers evaluation results in 3 hours, that is worth investigating rather than celebrating. Possible explanations: the vendor routed your evaluation content to a priority queue not available to production customers; the vendor applied expedited human review to the evaluation content that is not part of the standard production workflow; the vendor's stated SLA is much more conservative than their actual typical turnaround. Any of these has implications. The first two produce evaluation results that overestimate production accuracy and underestimate production turnaround time. Ask the vendor explicitly: what queue was this content processed through, and is that the same queue that production API submissions enter? If they used a different queue, request that the evaluation be rerun through the production queue. Accurate turnaround time SLA data matters: caption correction labour cost is directly affected by how quickly caption files are available after video upload. A vendor whose production SLA is 48 hours rather than the 3-hour evaluation turnaround creates a different production workflow design than a vendor whose production SLA genuinely matches the evaluation turnaround.

Red flag 6: Speaker identification absent on multi-speaker content

If you included multi-speaker evaluation content (an interviewer and subject matter expert, a panel discussion, a scenario with multiple characters), check whether the vendor's output includes speaker identification in the caption file. The DCMP Captioning Key requires speaker identification in content with distinguishable speakers. Most auto-caption pipelines — including YouTube, Teams, and Zoom — do not add speaker labels to caption output even when their live transcription interface performs diarization. A vendor that produces speaker-free captions on multi-speaker training content is producing caption output that fails the DCMP completeness requirement, regardless of word accuracy. If multi-speaker training content is common in your library (recorded interviews, panel discussions, scenario-based compliance training with multiple actors), speaker identification capability is a required feature and its absence in evaluation output is a disqualifier.

Red flag 7: Vendor cannot explain their accuracy measurement methodology

When you ask vendors to share their accuracy computation on the evaluation corpus alongside the caption output, the response tells you about the vendor's measurement practice. A vendor who can say "we computed word-level accuracy on a 10% random sample using DCMP Captioning Key error taxonomy, sampling matched your reference transcript, and we scored 94.2% word accuracy on the technical clips and 98.1% on the soft-skills clips" is operating with a documented measurement methodology. A vendor who says "our accuracy is generally 99% for training content" and cannot produce a clip-specific measurement from your evaluation content is not measuring accuracy on your content; they are presenting a marketing number. A vendor who says "our AI achieves 99.3% accuracy by WER" (word error rate, where lower is better, so the claim amounts to a 0.7% error rate) but cannot explain what corpus the WER was measured on, or what measurement protocol was applied, has not measured what DCMP measures. The auto-caption compliance post covers why corpus-level WER benchmarks consistently overstate DCMP accuracy on technical training content, and a vendor who does not know the difference between their WER benchmark number and a DCMP measurement has not thought carefully about compliance-relevant accuracy.

Red flag 8: Significant accuracy variation between clips of the same content type

If you included two clips from the same content category (two compliance training clips, two engineering onboarding clips) and the vendor's scores differ by more than 5 percentage points between clips of the same type, the model is unstable across the vocabulary distribution of that content type. A vendor that scores 91% on your HIPAA training clip and 84% on your PHI minimum-necessary training clip — both regulatory content, both similar vocabulary domains — has revealed that their accuracy on this content type depends on which specific terms appear, not on any stable domain competence. Accuracy variability within a content type is operationally problematic because it prevents you from predicting which production files will require QA attention. A stable vendor should score within 3 percentage points on clips of the same content type (controlling for audio quality). Variability greater than 5 points indicates that the model is sensitive to specific vocabulary occurrences rather than domain-level patterns — which predicts unpredictable production accuracy.
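
The within-type comparison can be scripted once per-clip scores are available. The sketch below uses the 3-point and 5-point cut-offs from the paragraph above; the "borderline" label for spreads between the two is an assumption added for illustration, as are the function name and sample scores.

```python
def same_type_stability(scores_by_type, stable=3.0, unstable=5.0):
    """Return {content_type: (spread, verdict)} for types with two or more clips."""
    result = {}
    for content_type, scores in scores_by_type.items():
        if len(scores) < 2:
            continue  # a single clip cannot show within-type variation
        spread = round(max(scores) - min(scores), 1)
        if spread > unstable:
            verdict = "unstable"
        elif spread <= stable:
            verdict = "stable"
        else:
            verdict = "borderline"
        result[content_type] = (spread, verdict)
    return result


# Illustrative per-clip scores for one vendor.
vendor_scores = {
    "regulatory": [91.0, 84.0],
    "engineering onboarding": [96.0, 94.5],
    "soft-skills": [98.4],
}
stability = same_type_stability(vendor_scores)
for content_type, (spread, verdict) in stability.items():
    print(f"{content_type}: spread {spread} points, {verdict}")
```

The comparison is only meaningful when the clips being compared have similar audio quality, so clips recorded under very different conditions should be excluded before running it.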

Eight failure modes in vendor accuracy evaluation

1. Using vendor-provided sample content instead of your own

The foundational failure of vendor accuracy evaluation is accepting the vendor's curated sample content as the evaluation corpus. Vendors offer sample evaluation content for predictable reasons: their sample content has been selected and prepared to produce high scores, their model has often been optimized on the vocabulary of that content, and the recording quality of sample content is typically better than real organizational training content. Evaluating on vendor sample content is not measuring vendor accuracy on your content — it is measuring vendor accuracy on the content the vendor chose for you to measure. The result is an accuracy score that tells you nothing about what the vendor will produce on your HIPAA compliance training or your engineering onboarding module. Decline vendor sample content for primary evaluation purposes. If you want to understand vendor best-case performance, you can run vendor sample content as supplementary evaluation after running the primary evaluation on your own corpus — but the vendor sample result should carry no weight in the vendor selection decision.

2. Qualitative review in place of DCMP word-level scoring

The second most common failure: the evaluation team reads through caption output and assigns a qualitative accuracy estimate ("this looks like about 95%") rather than running DCMP word-level scoring. The 99% accuracy post documents the systematic gap between human qualitative review and DCMP measurement: human reviewers consistently overestimate caption quality by 5–12 percentage points for technical training content, because they read fluently past errors that would register in a word-level count. A file that a reviewer calls "95% accurate" is typically 84–87% on DCMP measurement. If the evaluation is conducted via qualitative review and vendor A is rated "about 95%" and vendor B is rated "about 92%," the only defensible conclusion is that both vendors failed, because neither is producing 99% DCMP accuracy. But qualitative review masks this conclusion, and both vendors advance in the procurement process with the impression that accuracy is "close enough." DCMP scoring is not optional for a compliance-relevant evaluation; it is the measurement that produces the number the compliance framework requires.

3. Sampling only the content type that is easiest for ASR models

Evaluations designed by teams who have not thought carefully about the diagnostic purpose of the corpus consistently over-index on soft-skills and leadership development content — the most common content type in many L&D libraries, and the easiest content type for ASR models. If the evaluation corpus is 70% soft-skills content because that represents 70% of your video volume, every vendor will score in the 95–98% range on that content, and the evaluation will produce vendor differentiation of 2–3 percentage points — not sufficient to identify a vendor who will fail on your compliance training. The diagnostic content is the hard content. Select it disproportionately for the evaluation even if it represents a minority of your current library volume. Your compliance obligation is concentrated on required content — compliance training, safety training, regulated industry training — not on the soft-skills training that represents volume but not risk.

4. Testing only one clip and extrapolating to the full library

Evaluations with a single clip in the corpus (one engineering onboarding video, or one compliance training module) extrapolate from a single data point to a production conclusion. A single clip cannot represent vocabulary variation, audio quality variation, speaker variation, or content difficulty variation across your training library. A vendor that scores 97% on that one clip may score 82% on a different clip from the same content category that happens to include product names and system identifiers absent from the evaluated clip. The minimum viable corpus is five to seven clips spanning multiple content categories. Evaluations below this minimum are statistically underpowered for production prediction.
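The effect of clip-to-clip variation is easier to see when each vendor's results are summarized per clip rather than as a single number. The following is a minimal sketch under stated assumptions: scores are fractions between 0 and 1 from your own word-level scoring, `MIN_CLIPS` encodes the lower bound of the five-to-seven clip minimum above, and the function name, clip names, and sample scores are hypothetical.

```python
from statistics import mean

MIN_CLIPS = 5  # lower bound of the five-to-seven clip minimum


def summarize_vendor(clip_scores: dict, threshold: float = 0.99) -> dict:
    """Summarize one vendor's per-clip accuracy scores."""
    if not clip_scores:
        raise ValueError("no clip scores provided")
    scores = list(clip_scores.values())
    worst_clip = min(clip_scores, key=clip_scores.get)
    return {
        "clips": len(scores),
        "mean": mean(scores),
        "worst_clip": worst_clip,
        "worst_score": clip_scores[worst_clip],
        "spread": max(scores) - min(scores),
        "underpowered": len(scores) < MIN_CLIPS,
        "all_clips_pass": all(s >= threshold for s in scores),
    }


# Hypothetical scores for one vendor across five clips.
summary = summarize_vendor({
    "onboarding_a": 0.97,
    "onboarding_b": 0.82,
    "hipaa_module": 0.95,
    "leadership_talk": 0.98,
    "plant_floor_safety": 0.91,
})
```

Reporting the worst clip and the spread alongside the mean keeps a strong result on one easy clip from hiding a failure on another clip in the same category.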

5. Not preparing reference transcripts before sending content to vendors

When reference transcript preparation is deferred until after vendor output is received, the team almost always conducts qualitative review rather than DCMP scoring. Preparing a reference transcript after seeing vendor output also introduces confirmation bias: the reviewer unconsciously checks the transcript against the vendor output they have already read rather than against the audio. The reference transcript must be prepared before the content is submitted, while the team has not yet seen any vendor output. If the reference transcript is not ready, delay the submission rather than starting the process without the measurement foundation. An evaluation conducted without a reference transcript is not an accuracy evaluation; it is a reading review that will produce the qualitative-overestimate failure mode described above.

6. Testing the demo interface but not the production API pipeline

Some captioning vendors maintain separate processing paths for demo and evaluation submissions versus production API submissions. The demo and evaluation path may route content to human reviewers, apply manual QA, or use a higher-compute processing configuration that is not applied to production API volume. Testing via the demo interface and then deploying via the production API can produce a significant accuracy decline at deployment. If your production deployment will use the API, run the evaluation via the API. If the vendor gates API access behind contract signature, make API accuracy evaluation an explicit contract provision: "vendor agrees to provide API access for accuracy evaluation prior to contract execution, and acknowledges that the evaluation accuracy data from API-processed content is the baseline for the accuracy guarantee in this agreement." A vendor who will not agree to this provision may be signaling that its evaluation accuracy and production accuracy differ.
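If you can submit the same clips through both paths, the comparison reduces to a per-clip difference. The sketch below assumes you already have word-level accuracy scores for each path. The function name `path_regressions` and the one-point tolerance are hypothetical choices, not a vendor or DCMP convention.

```python
def path_regressions(demo_scores: dict, api_scores: dict, tolerance: float = 0.01) -> dict:
    """Return clips whose API-path score falls below the demo-path score
    by more than `tolerance` (both scores are fractions between 0 and 1)."""
    shared = demo_scores.keys() & api_scores.keys()
    drops = {clip: demo_scores[clip] - api_scores[clip] for clip in shared}
    return {clip: drop for clip, drop in sorted(drops.items()) if drop > tolerance}


# Hypothetical scores for two clips processed through both paths.
flagged = path_regressions(
    {"hipaa_module": 0.98, "onboarding_b": 0.97},
    {"hipaa_module": 0.975, "onboarding_b": 0.90},
)
```

Any clip that appears in the result is a reason to ask the vendor which processing steps apply to demo submissions but not to API volume.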

7. Accepting accuracy claims without measurement methodology specifics

When a vendor says "our accuracy is 99% for training content," count the claim only if the vendor can restate it in this form: "99% word accuracy measured by the DCMP Captioning Key on a stratified sample of [specific content type, audio quality range, vocabulary density range] content from [specified corpus or reference dataset]." Any accuracy claim that does not specify the measurement methodology, the content the measurement was run on, the sampling protocol, and the error categories counted is not a DCMP-methodology claim. Corpus-level WER on general English content, subjective quality ratings from customer reviews, and industry benchmark scores on public ASR datasets are not DCMP measurements and do not predict what you will observe when you run DCMP scoring on your training content. Require measurement specificity and decline to count any accuracy claim that cannot provide it. This requirement also applies to your own evaluation: if you are not running DCMP word-level scoring with a reference transcript, you do not have a measurement; you have an impression.

8. Conducting the evaluation with an incomplete content inventory

The evaluation corpus should represent your current content library and your near-term content production plan. If you have not yet inventoried your library at the content category level — how many videos are compliance training, how many are technical onboarding, how many are soft-skills — the evaluation corpus cannot be weighted to reflect your actual compliance risk profile. Evaluations conducted before the content inventory is complete typically use convenience sampling: the L&D team sends videos they know well or have recently worked with, rather than videos selected to represent the full risk spectrum. The content inventory should be completed before the evaluation corpus is designed. The LMS audit methodology post covers how to run the content inventory process — which also serves as the foundation for remediation prioritization and QA sampling frequency decisions. The evaluation design question "what is in my library?" and the audit question "which content in my library needs remediation?" are answered by the same inventory process.
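A category-level inventory can start as a simple tally over an LMS export. The sketch below assumes each video record carries a `category` label and a duration in `minutes`. Those field names, the function name `inventory_by_category`, and the sample records are hypothetical and would need to match whatever your LMS actually exports.

```python
from collections import Counter


def inventory_by_category(videos: list) -> dict:
    """Count videos and total minutes per content category, largest category first."""
    counts = Counter(video["category"] for video in videos)
    minutes = Counter()
    for video in videos:
        minutes[video["category"]] += video["minutes"]
    total = sum(counts.values())
    return {
        category: {
            "videos": count,
            "minutes": minutes[category],
            "share": count / total,
        }
        for category, count in counts.most_common()
    }


# Hypothetical export rows.
library = [
    {"title": "Giving feedback", "category": "soft_skills", "minutes": 10},
    {"title": "Coaching basics", "category": "soft_skills", "minutes": 20},
    {"title": "HIPAA refresher", "category": "compliance", "minutes": 15},
    {"title": "Platform onboarding", "category": "technical_onboarding", "minutes": 30},
]
inventory = inventory_by_category(library)
```

The volume shares from a tally like this describe the library, not the risk. The evaluation corpus still needs to be weighted toward the categories that carry the compliance obligation.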

Seven questions L&D teams ask about pre-contract vendor accuracy evaluation

How many vendors should we include in the accuracy evaluation?: Include all vendors on your shortlist from the RFP process, typically three to five. Running the evaluation on the full shortlist rather than a pre-selected front-runner is worth the added effort because accuracy evaluation frequently changes the rank order. Vendors that score well on RFP criteria (glossary capability, BAA availability, LMS integrations, pricing) do not always score best on DCMP accuracy with your specific content. Running the evaluation only on your RFP front-runner produces a pass/fail against a threshold ("does this vendor meet 99%?") rather than a comparative ranking. The comparative ranking matters when no vendor clears threshold on your hardest content — you need to know which vendor is closest and on which content types, so you can determine whose gap is most closeable through glossary configuration. Running the evaluation on three to five vendors produces that comparative data. Running it on one vendor produces a binary answer that does not help you if the answer is "no."
What if our content contains confidential or proprietary information that we can't share with vendors we haven't signed with?: This is a real constraint for industries with strong confidentiality obligations: healthcare content with patient-adjacent scenarios, financial services content with trading procedures or client information, legal services content with privileged matter simulations. There are two paths forward. First, execute a mutual non-disclosure agreement with each shortlisted vendor before submitting evaluation content. An NDA does not create the compliance obligations of a production contract but establishes confidentiality protections for the evaluation materials. Most vendors will sign an NDA for evaluation purposes at the shortlist stage. Second, select evaluation content that represents your vocabulary distribution without exposing confidential specifics. A HIPAA training clip that refers to "protected health information," "covered entity obligations," and "minimum necessary standard" without including real patient data or proprietary client scenarios gives the vendor a representative vocabulary test without confidential content. In the captioning RFP template, the vendor qualification section includes a request for the vendor's NDA template and their standard data handling policy for evaluation content. Collecting these in the RFP phase means you have the NDA ready when you advance to evaluation.
How do we handle vendors who claim their model is continuously improving — won't the evaluation scores be outdated by contract time?: Model updating is a real dynamic: ASR models from the major vendors are retrained periodically, and accuracy on specific content types may change between evaluation and production deployment. However, continuous improvement claims should not be taken at face value as a reason to skip or discount the evaluation. First, "continuously improving" does not mean "improving on your specific vocabulary" — general model updates may improve accuracy on general speech without affecting domain-specific vocabulary that is not in the training distribution. Second, the evaluation establishes a baseline and a measurement methodology: even if the model improves between evaluation and deployment, the DCMP measurement process you ran for the evaluation is the same process you use for ongoing QA. Third, "our model will be better by the time you deploy" is a claim you cannot verify without running a post-deployment accuracy measurement, which is ongoing QA. The pre-contract evaluation measures what exists now; the contract includes an accuracy guarantee for what will exist in production. The guarantee is only meaningful if the pre-contract measurement established the baseline and the measurement methodology. Do not accept model improvement claims as a substitute for measurement.
Is it reasonable to expect any vendor to achieve 99% on our technical training content without a glossary?: For technical training content with domain-specific vocabulary (engineering platforms, clinical terminology, compliance regulation citations, product names), no, it is generally not reasonable to expect 99% DCMP accuracy from a base ASR model without glossary configuration. The Whisper accuracy benchmarks by vertical show that even the best general-purpose ASR models score 83–89% on technical training content without domain vocabulary conditioning. The 10–16 percentage point gap between base model accuracy and the 99% threshold on technical content is driven by low-frequency domain vocabulary that is outside the training distribution of any general-purpose model. This means that glossary configuration is not an optional enhancement for technical content; it is the mechanism that closes the accuracy gap. The practical evaluation question is not "does this vendor achieve 99% without a glossary?" (almost none do on technical content) but "does this vendor achieve 99% with a properly configured glossary, and how quickly and thoroughly can they configure it?" The labeled evaluation phase answers this question. The gap between unlabeled and labeled accuracy, together with the time and process required to close it through glossary configuration, is often the most useful vendor differentiation the evaluation produces.
What should we do if the evaluation reveals that no vendor meets the 99% threshold even with glossary configuration?: This outcome occurs when the evaluation content contains vocabulary categories that are difficult or impossible to address through standard glossary configuration: highly context-dependent meaning (the same word means different things depending on the preceding sentence), very fast speech pace with minimal pause structure, significant audio quality degradation, or extremely rare specialized terms that appear only once or twice in the corpus. If no shortlisted vendor meets 99% with glossary on your critical content, three escalation paths exist. First, revisit the glossary configuration with the highest-scoring vendor: was the glossary configured with the right terms (proper nouns, acronyms, domain vocabulary — not function words or common terms), and was the vendor given sufficient time and vocabulary scope to optimize? A 30-minute glossary configuration window is different from a 5-day vocabulary-enriched model build. Second, consider a hybrid workflow for the hardest content: AI captioning to 95%, followed by human review focused on the domain vocabulary terms, targeting 99% through focused correction. The caption feedback loop post covers how this hybrid approach compounds accuracy over time. Third, reconsider the shortlist: if the evaluation corpus reveals that the shortlisted vendors cannot meet threshold on your critical content type with any configuration, expand the shortlist to include vendors who specialize in your vertical or who use a different architecture for domain vocabulary handling.
How does the evaluation connect to the accuracy guarantee clause in the contract?: The evaluation and the contract accuracy guarantee should reference the same measurement methodology explicitly. The contract clause should specify: (a) the accuracy metric (DCMP Captioning Key word-level accuracy), (b) the threshold (99%+), (c) the content scope (all content categories included in the evaluation, or specified additional categories), (d) the sampling protocol (random 10% of monthly volume, or quarterly spot-check of defined content types), and (e) the remediation commitment (re-delivery of failing files within 24 hours, credit for files below threshold, or equivalent). The pre-contract evaluation is what gives you the empirical grounding to negotiate these terms with specificity rather than accepting the vendor's standard accuracy language. Without evaluation data, you are negotiating an accuracy guarantee on faith — the vendor sets the terms because you have no measurement basis for a counter-proposal. With evaluation data, you know which content types scored at what levels, which vocabulary categories drove gaps, and how quickly the vendor's glossary configuration closed those gaps. The vendor contract review checklist provides the specific clause language for each of these five components; the evaluation provides the evidence to put behind the clause.
We're a small L&D team with limited capacity for evaluation. What's the minimum viable evaluation we can run?: The minimum viable evaluation that produces compliance-relevant results requires: five clips (one from each of: your hardest technical content, your mandatory compliance training, your most common content type, your cleanest audio, and your worst-case audio), reference transcripts for all five clips, and DCMP scoring run by at least one reviewer per clip per vendor. This is roughly 40–70 minutes of video content, 20–35 hours of reference transcript preparation for a team unfamiliar with the vocabulary (less for subject matter experts), and 10–15 hours of DCMP scoring across all vendors and clips. For a team of two with 20% time allocation over two weeks, this is achievable only near the low end of those estimates: assuming a 40-hour week, that allocation is about 32 hours against 30–50 hours of combined transcript and scoring work, so involve subject matter experts in transcript preparation or extend the window. What is not achievable in this time window is the full 60–120 minute corpus and labeled evaluation phase, which means the minimum viable evaluation produces unlabeled base model results, without the glossary configuration data that reveals vendor ceiling with optimization. If you can only run the minimum viable evaluation, treat it as a disqualification round (eliminate vendors who fail on your easiest clip or show synchronization failures) and require a post-deployment accuracy measurement milestone in the contract with go/no-go criteria. A contract that includes "accuracy verification at 60 days post-deployment with remediation commitment if below 99% DCMP" partially compensates for a compressed pre-contract evaluation, but it shifts the discovery risk to the post-deployment phase, which is the higher-cost outcome the pre-contract evaluation is designed to avoid.
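The post-deployment checks mentioned in the last two answers depend on a sampling protocol that neither side can cherry-pick. Below is a minimal sketch of the "random 10% of monthly volume" option, assuming you have a list of delivered file identifiers for the month and your own word-level scores for the sampled files. The function names, the seed convention, and the rounding rule are hypothetical.

```python
import random


def sample_for_qa(file_ids: list, fraction: float = 0.10, seed=None) -> list:
    """Draw a random QA sample from one month's delivered caption files.

    Assumption: the sample size is the fraction rounded to the nearest
    whole file, with a minimum of one file.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not file_ids:
        return []
    size = max(1, round(len(file_ids) * fraction))
    rng = random.Random(seed)  # a recorded seed makes the draw reproducible
    return sorted(rng.sample(file_ids, size))


def files_below_threshold(scores: dict, threshold: float = 0.99) -> list:
    """List sampled files whose measured accuracy falls below the contract threshold."""
    return sorted(name for name, score in scores.items() if score < threshold)


# Hypothetical month of 50 delivered files.
delivered = [f"caption_{n:03d}.vtt" for n in range(50)]
qa_sample = sample_for_qa(delivered, fraction=0.10, seed=2026)
```

Recording the seed alongside the sample lets the vendor reproduce the same draw, which helps when a remediation claim is disputed.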

See GlossCap's accuracy on your training content before you sign

The accuracy evaluation methodology in this post is the framework GlossCap recommends for every vendor selection process — including evaluations that include GlossCap. If you are running a pre-contract evaluation of captioning vendors, the process works: design a diagnostic corpus with your hardest technical content, prepare reference transcripts, run DCMP-protocol scoring, and compare results. GlossCap's per-customer glossary architecture is designed specifically to close the accuracy gap on technical training content that unlabeled base model evaluations expose — the 83–89% range that general ASR models produce on engineering, clinical, and compliance content moves to 97–99%+ with a properly configured customer glossary. The GlossCap widget demo shows the accuracy difference on a technical training clip between default auto-caption output and GlossCap's glossary-conditioned output on the same audio. You can also walk through the RFP template to see how the accuracy evaluation framework integrates with the full procurement process, and read the captioning RFP playbook for the procurement workflow that leads to the evaluation step covered in this post.