Procurement Operations · Published 2026-07-02

How to evaluate a caption vendor’s RFP response: three failure modes in vendor accuracy claims and what genuine WCAG 2.1 AA evidence looks like

Caption vendor RFP responses almost always include an accuracy claim. “Our platform delivers 99% accuracy.” “We are fully WCAG 2.1 AA compliant.” “Our human-in-the-loop workflow guarantees 99%+ word accuracy on all content types.” These statements are nearly universal, and they are nearly meaningless as written. That is not because the vendors are lying (some of them can back the numbers up) but because the captioning industry has no standard format for presenting accuracy evidence, and the three most common patterns for presenting accuracy claims in RFP responses share a structural flaw: they are unverifiable without additional information that vendors rarely volunteer and procurement teams rarely know to request.

This post covers the anatomy of those three failure modes, explains what each one looks like in a real RFP response, and describes what genuine WCAG 2.1 AA compliance evidence requires. It also covers how to build a scoring rubric that differentiates vendors by the quality of their evidence rather than the confidence of their claims, and how to use the information-request template to get what you actually need before you commit to a pilot.

This post is the complement to the captioning RFP playbook (which covers how to structure the RFP and run the process) and the vendor accuracy evaluation methodology post (which covers how to run your own independent accuracy test during the pilot). This post covers the step in between: reading what the vendor sent back and determining whether their claims are worth testing at all.

TL;DR

  1. Most caption vendor RFP responses contain accuracy claims that cannot be verified as stated. The three structural failure modes — self-reported scores without methodology, representative-content sampling, and WCAG 2.1 AA claims without DCMP per-cue scoring — account for the majority of accuracy claims in vendor responses across the captioning market. Identifying which failure mode is present in a vendor’s response tells you more about their accuracy than the number they report. A vendor who says “99%” without specifying methodology may be measuring something completely different from a vendor who says “97%” with a DCMP Captioning Key reference and a diagnostic content sample. The higher number is not necessarily the better vendor.
  2. Failure mode 1: self-reported accuracy scores without methodology disclosure. A vendor reports an accuracy number (typically 98–99%) without disclosing what content type was measured, what measurement protocol was used, whether scoring was conducted independently or internally, and what the reference transcript process was. Without these four pieces of information, the number has no interpretive value. Any modern ASR system can produce a self-reported number in the high nineties by choosing the right content type and counting method. The first question for any vendor response that contains an accuracy claim is: “What is your methodology disclosure?”
  3. Failure mode 2: representative-content sampling using diagnostic-unsuitable content. Vendors demonstrate accuracy on content that scores well regardless of which captioning system is used: studio-recorded soft-skills and leadership training video, where all major ASR engines achieve 93–97% word accuracy. Technical training content (engineering vocabulary, medical terminology, financial regulatory language, OSHA safety procedure vocabulary) scores 73–88% on the same engines. A vendor who claims “99% accuracy” based on a soft-skills sample is not lying about that sample; they are withholding the performance data that matters to your organisation. The corrective is to require that any accuracy demonstration or sample be scored on content that matches your actual training library by vocabulary domain and audio quality.
  4. Failure mode 3: WCAG 2.1 AA compliance claims without DCMP per-cue scoring protocol. WCAG 2.1 AA Success Criterion 1.2.2 requires synchronised captions on pre-recorded video content. In enforcement contexts, the accuracy standard applied to this criterion is 99%+ word accuracy measured under the DCMP Captioning Key protocol — which counts errors per caption cue, not per transcript, and treats cue-level error density differently from overall document error rate. A vendor who claims “WCAG 2.1 AA compliant” or “99% accuracy” without specifying DCMP methodology may be using aggregate word-error-rate (WER) calculated across the entire transcript, which will produce a substantially higher number for the same caption file than per-cue DCMP scoring. See the WCAG 2.1 AA captions reference page for a full breakdown of what the standard technically requires.
  5. Genuine WCAG 2.1 AA accuracy evidence has four required components. (1) Content type specification: the sample was drawn from diagnostic technical content matching the organisation’s training library, not from soft-skills studio content. (2) Reference transcript: a verbatim human transcript of the sample audio was produced before captioning, which served as the ground truth for scoring. (3) DCMP Captioning Key scoring methodology: errors were counted per caption cue, not averaged across the transcript. (4) Third-party scoring: scoring was conducted by a party independent of the vendor, not internally. Any vendor who can produce documentation of all four has given you the foundation for a meaningful pilot. See the vendor accuracy evaluation methodology post for how to conduct independent scoring yourself.

The anatomy of an accuracy claim in a caption vendor RFP response

When a captioning RFP arrives in a vendor’s inbox, the accuracy section typically triggers one of three drafting patterns, each of which maps to one of the three failure modes described below. Before examining the failure modes, it helps to understand what the accuracy section of an RFP typically asks — and what vendors understand the question to mean.

A typical RFP accuracy question reads something like: “Describe your captioning accuracy performance. What accuracy rate do you achieve, and how do you measure it?” Some RFPs are more specific, asking for “percentage word accuracy on technical training video content” or “WCAG 2.1 AA compliance methodology.” Most are not. The question as typically written invites the vendor to define the terms of their own answer, which is why the responses are so difficult to compare.

Vendors who receive this question face a genuine challenge: they serve customers across many content types and audio quality levels, and their accuracy varies significantly across those contexts. A vendor who captions university lecture video will perform very differently on a manufacturing safety training video with background noise. Rather than disclose this variance (which would require explaining it in detail and inviting awkward follow-up questions about worst-case performance), vendors tend to report the number that represents their best-case or average-case performance on the most favourable content type in their library.

The result is a market where nearly every caption vendor’s RFP response reports accuracy in the 97–99% range, making it effectively impossible to differentiate vendors based on accuracy claims alone. The solution is not to demand a higher number — it is to demand a better-structured claim that discloses the conditions under which the number was produced.

The captioning RFP playbook covers how to structure the accuracy section of the RFP itself to elicit meaningful responses. This post assumes you have already received vendor responses and need to evaluate them. The goal is to move from “Vendor A says 99%, Vendor B says 98%, score accordingly” to “Vendor A’s 99% claim is unverifiable; Vendor B’s 98% claim has a methodology disclosure, diagnostic sample, and reference transcript — score accordingly.”

Failure mode 1: self-reported scores without methodology disclosure

What it looks like in a vendor response

The simplest and most common form of an unverifiable accuracy claim is a bare number with no supporting methodology. In RFP responses, this typically appears as:

“Our platform consistently delivers 99% accuracy across all content types. Our proprietary AI, combined with our human review team, ensures that every caption file meets or exceeds the 99% accuracy threshold required for WCAG compliance.”

Or in a more elaborate form:

“Accuracy rate: 99%+. Methodology: AI-first captioning with human quality review. Our QA team reviews all caption files before delivery. We have achieved 99%+ accuracy for over 95% of our enterprise clients in the past 12 months, as documented in our quarterly quality reports.”

These statements feel substantive because they include multiple sentences and reference internal processes. But they contain no verifiable information about accuracy. The four missing pieces are:

  1. Content type specification: What kind of content was used to derive the number? Studio-recorded soft-skills training? Medical procedure training? Engineering API tutorial? Frontline safety procedure video shot on a factory floor?
  2. Measurement protocol: How was accuracy measured? Word error rate (WER) computed over the entire transcript? Per-cue DCMP Captioning Key scoring? A different counting method? The same caption file will produce substantially different accuracy numbers depending on which counting method is applied.
  3. Reference transcript source: What was the ground truth against which accuracy was measured? A human-produced verbatim transcript? A vendor-internal transcript generated by the same system? A cleaned version of the AI output?
  4. Scoring independence: Who conducted the scoring? The vendor internally? A third-party QA firm? The client? Self-reported accuracy, even with a disclosed methodology, has lower evidentiary value than independently scored accuracy.

Without all four of these pieces, an accuracy claim tells you nothing meaningful about how the vendor will perform on your content.
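
As a rough illustration of the check described above, the four disclosure items can be treated as required fields on a claim record. This is a minimal sketch in Python: the `AccuracyClaim` class, its field names, and the two example vendors are hypothetical and are not taken from any vendor's format or any standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccuracyClaim:
    """One vendor accuracy claim plus the four disclosure items (None = not disclosed)."""
    stated_accuracy: float                      # the number as written in the response, e.g. 0.99
    content_type: Optional[str] = None          # what content the number was measured on
    measurement_protocol: Optional[str] = None  # e.g. "aggregate WER" or "DCMP per-cue"
    reference_transcript: Optional[str] = None  # how the ground truth was produced
    scoring_party: Optional[str] = None         # who conducted the scoring


# Hypothetical field list mirroring the four missing pieces named in the text.
DISCLOSURE_FIELDS = (
    "content_type",
    "measurement_protocol",
    "reference_transcript",
    "scoring_party",
)


def missing_disclosures(claim: AccuracyClaim) -> list[str]:
    """Return the names of disclosure items the vendor left empty."""
    return [name for name in DISCLOSURE_FIELDS if not getattr(claim, name)]


# Illustrative vendors only: a bare "99%" versus a documented "97%".
vendor_a = AccuracyClaim(stated_accuracy=0.99)
vendor_b = AccuracyClaim(
    stated_accuracy=0.97,
    content_type="manufacturing safety video, factory-floor audio",
    measurement_protocol="DCMP per-cue",
    reference_transcript="independent human transcript from raw audio",
    scoring_party="third-party QA firm",
)

print(missing_disclosures(vendor_a))  # all four items are missing
print(missing_disclosures(vendor_b))  # nothing is missing
```

Note that `stated_accuracy` plays no part in the check: under this sketch a claim is evaluable or not based only on what accompanies the number.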

Why procurement teams accept these claims

Procurement teams accept self-reported accuracy claims for several reasons. First, they are universal: every vendor makes them, so they feel like table stakes rather than a distinguishing claim. Second, the number is usually 99%, which matches the accuracy threshold applied in WCAG 2.1 AA enforcement practice, so the claim appears to directly answer the compliance question. Third, most procurement teams lack the technical background to know what questions to ask about measurement methodology, and the RFP response is not the place where they expect to have to become experts in WER calculation.

The practical effect is that Vendor A, who says “99%” with no further detail, receives the same score on the accuracy criterion as Vendor B, who says “97%” with a DCMP Captioning Key reference, a link to a diagnostic content sample, and an independent scorer. This is backwards. The vendor with a lower claim backed by verifiable methodology is almost certainly more capable of actually meeting the 99% standard on your content than the vendor with a higher claim backed by nothing.

What to do when you encounter this failure mode

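When a vendor response contains an accuracy claim without methodology disclosure, do not score it at face value. Instead, send a standardised information request (template below) before scoring the accuracy criterion. The information request should ask for the four missing pieces listed above: content type specification, measurement protocol, reference transcript source, and scoring independence.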

The response to this information request is itself a scoring signal. A vendor who responds with a full methodology disclosure in 48 hours is demonstrating that they have a quality programme sophisticated enough to track this data. A vendor who responds with “our QA team ensures 99% accuracy on all content” has confirmed that no methodology documentation exists. Score accordingly.

The caption quality error rate calculator post explains the specific formulas for WER and DCMP scoring so you can evaluate whether the protocol a vendor describes is actually what they are using.

Scoring implication

If you use a weighted accuracy scoring criterion in your RFP (for example, a 35-point accuracy section out of 100 total), treat methodology disclosure as a threshold requirement rather than a bonus. A vendor who cannot disclose methodology after an information request should receive zero points on the accuracy criterion regardless of the number they claim, because you have no basis to evaluate their claim. This may feel harsh, but it correctly reflects the epistemic situation: zero information is worth zero points. Under face-value scoring, a vendor with a disclosed, documented, independently verified 97% might score 8/35 while a vendor with a self-reported, methodology-free 99.5% scores 35/35. The first vendor is still the better selection, and the threshold rule is what lets the rubric say so.
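
The threshold rule can be sketched as a small scoring function. This is only an illustration under stated assumptions: the 35-point maximum comes from the example above, while the `evidence_quality` rating (a 0 to 1 judgement by the evaluator of how strong the disclosed evidence is) is a hypothetical input, not part of any standard rubric.

```python
ACCURACY_SECTION_MAX = 35  # points, taken from the 35-of-100 example in the text


def accuracy_section_score(methodology_disclosed: bool, evidence_quality: float) -> float:
    """Score the accuracy criterion with disclosure as a gate, not a bonus.

    evidence_quality is a hypothetical evaluator rating in [0, 1] of the
    disclosed evidence (content match, reference transcript, protocol,
    independence). The vendor's claimed percentage is deliberately not an input.
    """
    if not 0.0 <= evidence_quality <= 1.0:
        raise ValueError("evidence_quality must be between 0 and 1")
    if not methodology_disclosed:
        # Zero information is worth zero points, whatever number was claimed.
        return 0.0
    return round(ACCURACY_SECTION_MAX * evidence_quality, 1)


# Illustrative values only.
print(accuracy_section_score(methodology_disclosed=False, evidence_quality=1.0))
print(accuracy_section_score(methodology_disclosed=True, evidence_quality=0.8))
```

The design choice worth noticing is that the claimed number never enters the function, so a bare 99.5% cannot outscore a documented 97%.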

Failure mode 2: representative-content sampling

The content-type accuracy gap in AI captioning

The most consequential difference between a misleading vendor accuracy claim and an honest one is not the number itself but the content on which it was measured. Modern AI captioning systems — including all major ASR engines and the hybrid AI-plus-human-review workflows that most enterprise caption vendors use — perform at very different accuracy levels depending on the content type. This variance is large enough to make a vendor who “achieves 99% accuracy” on soft-skills content the same vendor who “achieves 78% accuracy” on your engineering onboarding library. The same vendor, the same product, the same workflow — different content, very different accuracy.

The accuracy spread by content type is approximately:

| Content type | Typical ASR accuracy range | Primary failure driver |
| --- | --- | --- |
| Studio-recorded soft skills (communication, leadership, DEI, customer service) | 93–97% | Minimal vocabulary failure; general speech corpus coverage high |
| Corporate onboarding (HR policies, benefits, culture) | 91–95% | Organisation-specific proper nouns (people, team names) not in corpus |
| Sales enablement (product demos, competitive positioning) | 82–90% | Product names, feature terminology, competitor names absent from training corpus |
| Engineering and developer training (APIs, SDKs, code syntax) | 79–87% | Camel-case identifiers, version strings, language keywords, method names |
| Medical and clinical training (drug names, procedures, anatomy) | 73–82% | Brand and generic drug names, Latin anatomy terms, clinical procedure vocabulary |
| Manufacturing safety and OSHA compliance | 75–84% | Lockout/tagout procedure terms, equipment names, PPE categories, OSHA citation numbers |
| Financial services regulatory training (FINRA, SEC, MiFID II) | 80–88% | Regulatory citation formats, instrument names, compliance programme vocabulary |
| Cybersecurity and IT training (MITRE, NIST, threat vocabulary) | 78–85% | CVE identifiers, threat actor names, framework nomenclature, technical acronym density |
| Legal and compliance training (CLE, in-house legal team) | 76–84% | Latin legal terms, citation formats (28 U.S.C. § 1331), case names (Twombly, Iqbal) |

These ranges reflect the underlying structural reason for the gap: ASR models are trained on large speech corpora that consist predominantly of general English speech — news broadcasts, conversational audio, podcast content, audiobooks. The vocabulary that appears in technical training video is severely underrepresented in these training corpora. A drug name that appears once per million words of general speech may appear once per hundred words of clinical training video. The ASR model has almost no acoustic-phonetic training on how the word is pronounced and no language model context to constrain its transcription.

The result is that the accuracy spread between content types is not a minor variance that averages out — it is a structural feature of how ASR systems work. The best ASR engine available today will still produce dramatically different accuracy results on studio-recorded soft-skills content versus manufacturing safety video with background noise and OSHA-specific vocabulary.

What representative-content sampling looks like in a vendor response

A vendor engaging in representative-content sampling does not usually disclose the content type in their RFP response. The claim simply reads “99% accuracy on enterprise training content” or “demonstrated 98.7% word accuracy across our customer base.” When the vendor has a sample file available, it is typically a professionally produced, studio-quality corporate communication video — a CEO message, a DEI training module from a major vendor, a professionally narrated product overview. These content types sit at the high end of the accuracy spectrum for any ASR system.

The pattern appears in three specific forms in vendor responses:

  1. Customer testimonial accuracy: “Our clients report 98–99% accuracy on their training libraries.” This reflects a self-selected group of customers who are satisfied enough to provide a testimonial, using a self-reported measurement, on content types that were not disclosed in the testimonial. The customers who had poor accuracy on technical content either churned (and are not providing testimonials) or had content types where the system performed adequately.
  2. Demo file accuracy: “Please see the attached sample [link to a leadership training video caption file].” The sample is hand-selected from the vendor’s portfolio to demonstrate their best-case performance. It does not represent how they will perform on your content unless your content is identical in type, vocabulary, and audio quality.
  3. Aggregate accuracy across all customers: “Our system averages 97.3% word accuracy across 12,000 monthly hours of content.” This aggregate includes the full range of content types, but the average is dominated by the most common content type in the portfolio — which, for most enterprise caption vendors, is corporate communications and soft-skills training rather than technical vertical content.
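
The third form is a weighted-average effect, which a short calculation makes concrete. The portfolio below is invented purely for illustration (the hours and per-category accuracies are assumptions, chosen so the total is 12,000 hours); it is not data from any vendor.

```python
def hours_weighted_accuracy(portfolio: dict[str, tuple[float, float]]) -> float:
    """Average accuracy across a portfolio, weighted by hours of content.

    portfolio maps a content category to (monthly_hours, word_accuracy).
    """
    total_hours = sum(hours for hours, _ in portfolio.values())
    weighted_sum = sum(hours * accuracy for hours, accuracy in portfolio.values())
    return weighted_sum / total_hours


# Hypothetical mix: mostly corporate and soft-skills content, some technical.
portfolio = {
    "corporate and soft skills": (10_000, 0.985),
    "technical vertical content": (2_000, 0.915),
}

aggregate = hours_weighted_accuracy(portfolio)
print(f"aggregate: {aggregate:.1%}")
print(f"technical only: {portfolio['technical vertical content'][1]:.1%}")
```

In this invented mix the aggregate lands at about 97.3% even though the technical slice sits near 91.5%, so the headline average says little about a buyer whose library is mostly technical.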

How to detect representative-content sampling

The detection question is simple but must be asked explicitly: “What content type was used to produce the accuracy figure or demonstration in your RFP response?” If the vendor cannot answer, or answers with “a range of enterprise training content,” treat the figure as a soft-skills-representative sample until proven otherwise.

The corrective for this failure mode is to require that any accuracy demonstration be scored on content that matches your actual training library. In the information request, ask the vendor to caption a 5–10 minute sample drawn from your own library — or, if you cannot share real content, from a synthetic sample that matches your vocabulary domain. The vendor accuracy evaluation methodology post describes how to construct a diagnostic content sample that reliably exposes vendor performance on your specific content type.

For the pre-pilot evaluation stage, even a simple content-type disclosure is useful. A vendor who says “our 99% figure is based on soft-skills and corporate communications content; we have separate benchmarks for technical content by vertical, and our medical vertical benchmark is 94% pre-glossary” is giving you actionable information. A vendor who says “99% across all content types” without further qualification has told you nothing about performance on your content.

The vocabulary accuracy layer

Separate from the content-type accuracy gap, there is a vocabulary-level accuracy consideration that most RFP responses do not address at all. Even within a content type, an ASR system that has never been exposed to your organisation’s specific product names, internal tool names, people’s names, and acronyms will fail on those terms even if it achieves adequate general accuracy on the surrounding text.

This is the problem that glossary-biased decoding addresses: by providing the ASR model with your organisation’s specific terminology ahead of transcription, accuracy on those terms can improve from 60–75% to 95–99%. But if a vendor’s accuracy claim was derived from content that does not include your organisation’s vocabulary, the claim does not reflect post-glossary accuracy — it reflects performance on content that your system will never see.
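
One way to see why term-level accuracy needs its own measurement is to score only the glossary terms rather than the whole transcript. The sketch below is a simplified assumption-laden illustration: it handles single-word terms only, matches by token counts rather than alignment, and the sentence, the drug name usage, and the tool name "Chartwell" are invented examples.

```python
import re
from collections import Counter
from typing import Optional


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def glossary_term_accuracy(reference: str, caption: str, glossary: set[str]) -> Optional[float]:
    """Share of glossary-term occurrences in the reference that survive in the caption.

    Simplification: single-word terms only, compared by counts, not by position.
    Returns None when the reference contains no glossary terms.
    """
    ref_counts = Counter(tokenize(reference))
    cap_counts = Counter(tokenize(caption))
    terms = {term.lower() for term in glossary}
    expected = sum(ref_counts[term] for term in terms)
    if expected == 0:
        return None
    found = sum(min(ref_counts[term], cap_counts[term]) for term in terms)
    return found / expected


# Invented example: one drug name (spoken twice) and one hypothetical internal tool name.
reference = "Administer metoprolol before the procedure and record the metoprolol dose in Chartwell"
caption = "Administer metro pole all before the procedure and record the metoprolol dose in chart well"

score = glossary_term_accuracy(reference, caption, {"metoprolol", "Chartwell"})
print(f"glossary term accuracy: {score:.0%}")
```

Here most ordinary words are transcribed correctly, yet only one of the three glossary-term occurrences survives, which is the gap a whole-transcript number can hide.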

Add a specific question to your accuracy information request: “Does your reported accuracy figure reflect pre-glossary or post-glossary performance? If post-glossary, what was the glossary size and term category composition for the sample?” The vendor’s ability to answer this question is a signal of how sophisticated their quality programme actually is.

Failure mode 3: WCAG 2.1 AA claims without DCMP per-cue scoring protocol

What WCAG 2.1 AA actually requires

WCAG 2.1 Success Criterion 1.2.2, Captions (Prerecorded), is a Level A criterion that Level AA conformance includes. It requires that “captions are provided for all prerecorded audio content in synchronised media.” The criterion does not specify an accuracy percentage in its text. The 99% figure enters the picture through enforcement practice: the DCMP (Described and Captioned Media Program) Captioning Key, which is the quality standard referenced in OCR complaint resolution agreements and DOJ enforcement guidance, specifies that captions must achieve at least 99% word accuracy to be considered accessible. See the WCAG captions prerecorded reference for technical detail on what SC 1.2.2 requires.

The critical point for evaluating vendor claims is that WCAG 2.1 AA compliance is not self-certifiable. A vendor who says “our captions are WCAG 2.1 AA compliant” or “we guarantee 99% WCAG accuracy” has not told you anything about how they measure that compliance or whether their measurement method produces numbers that are comparable to what an enforcement body would accept as evidence of compliance.

The gap between aggregate WER and DCMP per-cue scoring

The most important technical distinction in vendor accuracy claims is between aggregate word error rate (WER) and DCMP Captioning Key per-cue scoring. These are different measurement methods that produce substantially different numbers from the same caption file, and most vendors who claim WCAG compliance are using aggregate WER rather than per-cue DCMP scoring.

Aggregate WER is calculated by counting all word substitution, insertion, and deletion errors across the entire transcript and dividing by the total word count of the reference transcript; accuracy is 1 minus WER. A 10,000-word video with 100 word errors has a WER of 1%, or an accuracy of 99%. But those 100 errors might be distributed evenly (one error per 100 words) or concentrated (all 100 errors in a 500-word technical section, zero errors elsewhere). Aggregate WER cannot distinguish between these distributions, and it does not identify whether errors are concentrated in the caption cues where a deaf or hard-of-hearing viewer is most reliant on caption accuracy for comprehension.
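
For readers who want to check a vendor's arithmetic, aggregate WER can be computed with a word-level edit distance. The following is a minimal sketch, assuming whitespace tokenisation and case-insensitive matching; real scoring tools normalise punctuation and numbers in ways this example does not, and the sample sentences are invented.

```python
def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn the reference into the hypothesis (word-level edit distance)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from caption
                curr[j - 1] + 1,     # insertion: extra word in caption
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def aggregate_wer(reference: str, hypothesis: str) -> float:
    """Errors divided by the reference word count; accuracy is 1 minus this value."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return word_errors(ref_words, hyp_words) / len(ref_words)


# Invented example: one substitution ("breaker" -> "baker") and one dropped "the".
reference = "lock out the breaker before servicing the conveyor"
caption = "lock out the baker before servicing conveyor"

wer = aggregate_wer(reference, caption)
print(f"WER: {wer:.0%}, accuracy: {1 - wer:.0%}")
```

Two errors against an eight-word reference give a WER of 25% in this toy case; the same function applied to a full transcript yields the single aggregate figure discussed above, with no information about where the errors fall.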

DCMP Captioning Key per-cue scoring evaluates each caption cue (a discrete timed text segment, typically 2–6 seconds) independently. A cue with 2 errors in 8 words fails the cue-level quality check regardless of how accurate the surrounding cues are. The DCMP protocol tracks both the aggregate error rate and the per-cue error density, and classifies errors by severity (verbatim accuracy failures versus timing failures versus formatting failures). A caption file that achieves 99% aggregate accuracy but has a cluster of cue-level errors in a technical passage where critical vocabulary is being introduced may fail the DCMP quality check even though the aggregate number is compliant.

The practical implication is that, when errors cluster, aggregate WER typically produces a better (higher) accuracy number than DCMP per-cue scoring for the same caption file. A file that scores 98.5% aggregate accuracy might score 94% or 95% under DCMP per-cue analysis because the errors are concentrated in a few cues rather than distributed evenly. A vendor who claims “99% accuracy” using aggregate WER is almost certainly not achieving 99% under DCMP per-cue scoring unless they have specifically benchmarked against the DCMP protocol.
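
The difference between an aggregate figure and a cue-level check can be sketched with two caption files that have identical totals. This is an illustration only: the `Cue` record, the 0.2 error-density limit, and both sample files are assumptions made for the example, and the limit should not be read as the DCMP Captioning Key's own rule.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One caption cue with its word count and number of word errors."""
    words: int
    errors: int


def aggregate_accuracy(cues: list[Cue]) -> float:
    """Whole-file accuracy: 1 minus total errors over total words."""
    return 1 - sum(c.errors for c in cues) / sum(c.words for c in cues)


def failing_cues(cues: list[Cue], max_error_density: float = 0.2) -> list[int]:
    """Indices of cues whose own error density exceeds a hypothetical limit."""
    return [i for i, c in enumerate(cues) if c.errors / c.words > max_error_density]


# Two invented files: 100 cues of 10 words each, 10 errors in total.
spread_out = [Cue(10, 1) if i % 10 == 0 else Cue(10, 0) for i in range(100)]
clustered = [Cue(10, 5) if i < 2 else Cue(10, 0) for i in range(100)]

for name, cues in (("spread out", spread_out), ("clustered", clustered)):
    print(name, f"{aggregate_accuracy(cues):.1%}", "failing cues:", failing_cues(cues))
```

Both files report 99.0% in aggregate, but only the clustered file has cues that breach the per-cue limit, which is why asking a vendor which of the two kinds of number they report matters.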

What WCAG 2.1 AA compliance evidence looks like without DCMP reference

In an RFP response, the failure mode typically appears as follows:

“We are fully compliant with WCAG 2.1 AA Success Criterion 1.2.2 and provide captions that meet or exceed the 99% accuracy standard. Our quality control process ensures that all delivered caption files pass our internal quality review before delivery, guaranteeing WCAG 2.1 AA compliance.”

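This response claims WCAG 2.1 AA compliance and references a 99% accuracy threshold. It does not name the measurement protocol behind the 99% figure, say whether that figure is aggregate WER or DCMP per-cue scoring, or offer any scoring documentation that could be checked.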

A vendor with a genuine WCAG 2.1 AA quality programme would be able to say: “Our captions are scored against the DCMP Captioning Key protocol on a per-cue basis. Files that do not achieve 99% accuracy under DCMP scoring are returned for revision before delivery. We can provide our scoring template on request.” The difference between these two responses is the difference between a marketing statement and a verifiable quality claim.

Detecting this failure mode in practice

The detection question is direct: “When you claim WCAG 2.1 AA accuracy, which measurement protocol produces the 99% figure? Is this aggregate WER or DCMP Captioning Key per-cue scoring?”

A vendor with a genuine quality programme will answer this question without hesitation because they track it internally. A vendor who cannot answer or responds with “our human reviewers ensure 99% accuracy” has confirmed that no protocol-based measurement is occurring. Human review without a scoring protocol does not produce a number that can be compared to the DCMP standard.

The caption QA methodology post explains the DCMP Captioning Key scoring process in detail, including how to apply it yourself during a pilot. The error rate calculator post provides the formulas. These resources allow you to independently verify whether a vendor’s claim holds up when measured against the protocol they should be using.

What genuine WCAG 2.1 AA accuracy evidence looks like

After establishing what vendor RFP accuracy claims typically fail to demonstrate, it helps to describe what genuine compliance evidence requires. A vendor who can produce all four components described below has given you the foundation for a meaningful pilot evaluation rather than a leap of faith.

Component 1: content type specification

Genuine accuracy evidence specifies the content type in enough detail to allow you to assess whether the sample is diagnostic for your content library. This does not mean the vendor must have a sample from your exact industry — it means they must describe the content they used in terms specific enough to enable comparison.

Adequate content type specification: “The accuracy figure is derived from a 45-minute manufacturing safety training video featuring machine operation procedures, lockout/tagout safety steps, and PPE requirement explanations. Audio was recorded in a production facility environment with some background noise. All technical vocabulary was within the domain of OSHA 29 CFR 1910.147 compliance training.”

Inadequate content type specification: “Our accuracy figure is derived from a representative sample of enterprise training content.”

Component 2: reference transcript source and production method

Accuracy measurement requires a ground truth: a verbatim human transcript of the audio that serves as the reference against which the caption file is scored. The reference transcript must be produced by a human transcriptionist working from the audio alone — not by the captioning system being evaluated, not by cleaning up the AI output, and not by using a pre-existing script that may differ from what was actually spoken.

The reference transcript issue is one of the most common sources of inflated accuracy numbers. If a vendor produces their accuracy figure by comparing their AI output against a transcript produced by the same AI system (or a cleaned version of it), the “errors” that would be caught by an independent human reference transcript are invisible. The score is correct by construction, not by quality.

Adequate reference transcript disclosure: “Reference transcripts for accuracy scoring are produced by a human transcriptionist who has not seen the caption output, working from the raw audio file. The transcriptionist’s output is the ground truth for scoring. We use a separate transcription vendor for reference transcript production to avoid any feedback between the captioning system and the scoring input.”

Inadequate reference transcript disclosure: “Accuracy is measured against our quality baseline.” (Circular.) “Accuracy is measured against client-provided scripts.” (Scripts frequently differ from spoken audio, so the resulting score no longer measures fidelity to what was actually said.)

Component 3: DCMP Captioning Key methodology

The vendor should be able to name the DCMP Captioning Key as the scoring methodology, describe how they apply it (per-cue error counting), and explain how the error categories (verbatim accuracy, synchronisation, formatting) are weighted in their quality review.

The reason this matters beyond the aggregate number is that WCAG 2.1 AA enforcement — in OCR complaint investigations, in DOJ enforcement actions, and in the resolution agreements that have resulted from ADA Title II compliance proceedings — consistently references the DCMP Captioning Key as the applicable standard. A vendor who has never been through a compliance audit may be unaware that their internal accuracy metric, however carefully managed, does not map to the DCMP protocol. A vendor who has provided captions in a regulated or institutionally compliance-aware environment (healthcare, higher education, state government) will be familiar with the DCMP standard and will be able to explain their methodology in terms of it.

Component 4: third-party or client-verified scoring

Self-reported accuracy is the lowest-evidentiary-value form of accuracy claim. Third-party verified accuracy — scored by an independent party using a disclosed methodology on a content sample you can review — is the highest. Between these two poles, client-verified accuracy is intermediate: a client who has independently scored a vendor’s output using a disclosed methodology and can report those results carries more evidentiary weight than the vendor’s own internal QA number.

In an RFP response, the relevant question is not “what do you report for accuracy?” but “can you provide a reference customer who has independently measured your accuracy on content similar to ours and who is willing to describe their methodology and result?” A vendor who can connect you with such a reference has provided the strongest available pre-pilot accuracy signal.

The alternative is to require the vendor to caption a diagnostic content sample as part of the RFP process, which you then score independently. This converts the RFP response from a claims-evaluation exercise into a preliminary accuracy test. The caption vendor pilot programme design post covers how to structure a full controlled pilot; for the RFP evaluation stage, a short diagnostic sample (5–10 minutes of representative content) scored against a human reference transcript using DCMP methodology will produce more actionable differentiation than any amount of claim analysis.

The information-request template

When a vendor RFP response contains an accuracy claim that cannot be evaluated as stated — which, based on the three failure modes above, describes the majority of responses — send a standardised information request before scoring the accuracy criterion. The following template covers the required information for all three failure modes.

Information request: accuracy methodology disclosure

Send to: vendor primary contact
Deadline: [typically 5–7 business days before scoring completes]
Reference: Section [X] of your RFP response dated [date], which includes an accuracy claim of [percentage/statement]

To complete our evaluation of your RFP response’s accuracy claim, please provide the following information:

  1. Content type specification: What type of training content was used to produce the accuracy figure in your response? Please describe the content domain, vocabulary complexity, audio quality, and recording environment of the sample or population from which the figure was derived.
  2. Reference transcript method: How was the ground truth reference transcript for accuracy scoring produced? Was it produced by a human transcriptionist working from the raw audio? By a script that was edited to match audio? By a cleaned version of AI output? Please describe the reference transcript production process.
  3. Measurement protocol: What measurement protocol was used to produce the accuracy figure? Is this aggregate word error rate (WER) across the transcript, DCMP Captioning Key per-cue scoring, or a different protocol? Please name the protocol and, if applicable, the formula or counting method used.
  4. Scoring independence: Was accuracy scoring conducted by your organisation internally, by a third party, or by client verification? If by a third party, please name the firm. If by client verification, please provide contact information for a reference client who measured your accuracy independently and is willing to discuss their methodology and result.
  5. WCAG 2.1 AA compliance methodology: If your response includes a WCAG 2.1 AA compliance claim, please specify: (a) which success criterion (SC 1.2.2 for pre-recorded captions, SC 1.2.4 for live captions, or both) the claim applies to; (b) whether the 99% accuracy figure cited is aggregate WER or DCMP per-cue scoring; (c) whether your internal QA process uses the DCMP Captioning Key as its scoring standard.
  6. Content-type-specific performance data (optional but strongly preferred): If you serve clients in [describe your industry/content type], please provide accuracy figures specific to that content domain, including the methodology disclosure for those figures. If you have served clients with vocabulary similar to ours (please describe your content vocabulary domain), please describe the accuracy achieved on that content.
  7. Diagnostic sample: Please caption the attached 5-minute sample file [or: we will provide a sample file on request] and return the SRT/VTT output alongside your reference transcript for our independent scoring. This does not replace the pilot evaluation but will allow us to complete the RFP scoring stage with a verified accuracy data point.

Vendors who respond to this information request with complete answers have demonstrated that they have a quality programme sophisticated enough to track accuracy by content type, reference transcript, and scoring protocol. Vendors who cannot respond — or who respond only with reiterated marketing language — have confirmed the absence of a protocol-based quality programme. Both outcomes are useful information for the evaluation.

Scoring rubric for the accuracy criterion in RFP evaluation

The following rubric converts the three-failure-mode analysis into a scoring structure you can apply consistently across vendor responses. It is designed for an accuracy section that is weighted 35 points out of 100 in a standard RFP evaluation. Adjust the total weight to match your scoring framework; the relative allocation across sub-criteria should remain approximately constant.

| Sub-criterion | Maximum points | Full credit | Partial credit | Zero credit |
| --- | --- | --- | --- | --- |
| Methodology disclosure completeness | 10 | All four components disclosed (content type, reference transcript, protocol, independence) | 2–3 components disclosed; one or two missing | No methodology disclosure; bare accuracy number only |
| Content-type appropriateness | 8 | Accuracy figure derived from diagnostic content matching organisation’s vocabulary domain | Accuracy figure derived from content with partial vocabulary overlap; domain disclosed but not matching | Content type undisclosed; figure derives from soft-skills or general content |
| DCMP Captioning Key protocol | 9 | DCMP per-cue scoring explicitly named; methodology confirmed in information request response | 99% accuracy claimed; aggregate WER confirmed in response to information request | WCAG 2.1 AA claimed; measurement protocol not disclosed or confirmed as non-DCMP |
| Scoring independence | 5 | Third-party verified or reference client with independent score; contact information provided | Client-reported accuracy without independent methodology; or internal QA with documented process | Accuracy entirely self-reported; no reference available |
| Diagnostic sample (if provided) | 3 | Sample captioned and returned for independent scoring; result verified independently | Sample provided but reference transcript unavailable for independent scoring | No sample provided; request declined |
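The rubric can be encoded so that every evaluator sums and rescales points the same way. The sketch below is one possible encoding: the maximum points come from the rubric, but the dictionary keys, the function name, and the partial-credit values given to the hypothetical vendor are assumptions, since the rubric does not fix how many points partial credit is worth.

```python
# Maximum points per sub-criterion, taken from the rubric (total 35).
MAX_POINTS = {
    "methodology_disclosure": 10,
    "content_type_appropriateness": 8,
    "dcmp_protocol": 9,
    "scoring_independence": 5,
    "diagnostic_sample": 3,
}

def accuracy_score(awarded: dict[str, float], section_weight: float = 35.0) -> float:
    """Validate awarded points, sum them, and rescale to the section weight."""
    for name, points in awarded.items():
        if name not in MAX_POINTS:
            raise KeyError(f"unknown sub-criterion: {name}")
        if not 0 <= points <= MAX_POINTS[name]:
            raise ValueError(f"{name}: {points} is outside 0..{MAX_POINTS[name]}")
    raw = sum(awarded.get(name, 0) for name in MAX_POINTS)
    return raw * section_weight / sum(MAX_POINTS.values())

# Hypothetical vendor: partial disclosure, content type undisclosed,
# aggregate WER confirmed, self-reported only, sample declined.
# The partial-credit values (5 and 3) are illustrative assumptions.
vendor_a = {
    "methodology_disclosure": 5,
    "content_type_appropriateness": 0,
    "dcmp_protocol": 3,
    "scoring_independence": 0,
    "diagnostic_sample": 0,
}
print(accuracy_score(vendor_a))        # score on the 35-point scale
print(accuracy_score(vendor_a, 20.0))  # same profile if accuracy carries 20 points
```

Rescaling through `section_weight` keeps the relative allocation across sub-criteria constant when the accuracy section carries a different total weight.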

Applying this rubric to real vendor responses will quickly reveal that most vendors score between 3 and 15 points on a 35-point accuracy criterion — not because they are inaccurate captioning vendors, but because they lack the methodology documentation to support a higher score. This is useful information: it separates vendors who have invested in measurement infrastructure from those who have not. It also gives you a specific conversation to have with a vendor who scores poorly: “You scored 8/35 on accuracy in our RFP evaluation because your methodology disclosure was incomplete. Here is what we need to improve that score before we consider moving to pilot.”

The vendor’s response to that conversation is itself a signal. A vendor who immediately provides the missing documentation has it. A vendor who cannot produce it in 48 hours almost certainly does not have the underlying quality infrastructure the documentation would represent.

For a worked example of how accuracy scoring fits into the full RFP evaluation framework — including section weighting, vendor scoring tables, and how to handle the pilot stage transition — see the captioning RFP playbook.

Eight red-flag patterns in vendor RFP accuracy responses

The following patterns reliably indicate that an accuracy claim is unverifiable or overstated. Each red flag maps to one or more of the three failure modes described above.

Red flag 1: “99%+ accuracy on all content types”
No captioning system achieves 99%+ accuracy on all content types without content-type-specific tuning. This claim is either false or relies on a measurement method that inflates accuracy (aggregate WER on mixed content) rather than per-cue DCMP scoring. At minimum, ask for performance data broken out by content type. If the vendor cannot provide disaggregated data, treat the “all content types” figure as if it were derived from a soft-skills sample.
Red flag 2: “Human review guarantees 99% accuracy”
Human review is not a guarantee without a scoring protocol. Human reviewers vary in their correction thoroughness; reviewer A may correct 95% of errors and reviewer B 75% on the same file. Without a DCMP-referenced QA protocol that defines what must be corrected and how to verify it, “human review” is a process description, not a quality outcome. Ask: “What quality protocol does your human review follow? How is post-review accuracy measured? What is the re-review trigger for files that do not meet the threshold?”
Red flag 3: accuracy figure in the response is listed as a range (“98–99.5%”)
Accuracy ranges without content-type breakdown suggest that the range reflects content-type variance rather than measurement uncertainty. A vendor who understands their accuracy across content types can usually narrow the range for a specific content type. A range that spans the entire portfolio suggests the vendor is reporting aggregate performance without understanding which content drives the low end of the range.
Red flag 4: WCAG 2.1 AA compliance is listed as a checkbox rather than a described process
In an RFP compliance matrix where vendors check “Yes/No” to capability requirements, a simple “Yes” to “WCAG 2.1 AA compliant” has no evidentiary value. WCAG 2.1 AA compliance for captions requires a specific measurement methodology and ongoing QA process. A checkbox answer indicates that the vendor has not thought through what compliance requires operationally. Ask for the supporting documentation.
Red flag 5: accuracy demonstrated with a client testimonial that quotes a percentage
Client testimonials that include accuracy percentages (“Since switching to [Vendor], our caption accuracy has improved to 99%”) are appealing but carry low evidentiary value. The client is typically reporting a subjective perception, not a DCMP-scored measurement. The content type and measurement methodology behind the percentage are almost never disclosed in testimonials. Treat testimonial-embedded percentages as marketing evidence, not technical evidence.
Red flag 6: vendor declines to caption a diagnostic sample during the RFP process
A vendor who declines to provide a sample caption during the RFP evaluation stage is declining to produce evidence. There are legitimate operational reasons a vendor might defer the sample to the pilot stage (resource constraints, content sensitivity), but if the option is offered and the vendor declines without explanation, it is a signal that the sample would not support their claims.
Red flag 7: vendor references its own internal quality reports as evidence
Internal quality reports produced by the vendor are the weakest form of accuracy evidence. They are produced under the vendor’s own methodology, on the vendor’s choice of content, without independent verification. A vendor who offers to share their internal quality report as evidence of accuracy is offering the equivalent of a self-certification. More useful: the methodology the report uses, so you can assess whether it is DCMP-referenced, and whether you can replicate it independently.
Red flag 8: accuracy claim does not address organisation-specific vocabulary (glossary) performance
General accuracy figures do not reflect performance on organisation-specific vocabulary. A vendor who claims “99% accuracy” but does not address how their system handles proper nouns, product names, and internal acronyms has left out the most failure-prone category for L&D content. Ask specifically: “What accuracy do you achieve on customer-specific proper nouns and product terminology that do not appear in general ASR training corpora? What is your glossary or vocabulary customisation capability?” See the glossary architecture post for the technical framework this question is probing.

Five green-flag patterns that indicate a credible accuracy programme

Green flags are patterns that indicate a vendor has invested in measurement infrastructure and quality programme design beyond basic ASR deployment. They do not guarantee adequate performance on your content, but they distinguish vendors who understand their accuracy from those who are approximating it.

Green flag 1: accuracy figure is disaggregated by content type
A vendor who can say “our aggregate accuracy is 96%, broken down as follows: soft-skills training 97%, clinical training 88% pre-glossary, engineering technical training 83% pre-glossary” understands their system’s performance envelope. This disaggregation is only possible if the vendor has been measuring accuracy by content type — which means they have a measurement infrastructure, not just a marketing number. It also allows you to select the figure most relevant to your content type.
Green flag 2: vendor names DCMP Captioning Key as their scoring standard without prompting
A vendor who references the DCMP Captioning Key in their RFP response without being asked has demonstrated familiarity with the enforcement-relevant quality standard. This indicates either that they serve regulated markets where the DCMP standard has been invoked, or that their quality team has proactively aligned to the standard. Either indicates a more mature quality programme than a vendor who claims WCAG compliance without reference to any specific protocol.
Green flag 3: vendor offers a scored reference transcript with a sample file
A vendor who provides both a sample caption file and the reference transcript used to score it, so that you can independently verify the accuracy score, is providing one of the strongest forms of pre-pilot accuracy evidence. Being able to reproduce the vendor’s stated accuracy on their own sample using your own scoring is a strong indicator that the claim is real rather than marketing-derived.
Green flag 4: vendor has a structured process for content-type-specific accuracy estimation
Some vendors offer a pre-sales accuracy estimate: share a short sample from your content library, and they will caption it and provide a scored estimate before you commit to a pilot. This is the most direct form of diagnostic evidence. The key question is whether the estimate uses DCMP methodology or aggregate WER — but the existence of a structured estimation process is itself a positive signal.
Green flag 5: vendor can describe how accuracy is monitored post-deployment
Pre-pilot and pilot accuracy evidence addresses a point-in-time performance question. Post-deployment monitoring addresses the ongoing question: does performance degrade over time, and how does the vendor detect and correct it? A vendor who can describe their post-deployment accuracy monitoring process — periodic QA sampling, error category tracking, client feedback loops, glossary refresh cadence — has a quality programme designed for sustained compliance rather than a one-time demonstration. See the caption feedback loop post for the framework that describes how ongoing accuracy improvement should work.
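To see why the disaggregated figures in green flag 1 carry more information than the aggregate, consider how a volume-weighted average behaves. The sketch below reuses the per-type accuracy figures from that example; the minutes scored per content type are hypothetical values chosen only to show how a portfolio dominated by soft-skills content can report about 96% overall while technical content sits far lower.

```python
def aggregate_accuracy(by_type: dict[str, tuple[float, float]]) -> float:
    """Volume-weighted mean accuracy; values are (minutes_scored, accuracy)."""
    total_minutes = sum(minutes for minutes, _ in by_type.values())
    if total_minutes <= 0:
        raise ValueError("no scored minutes")
    return sum(minutes * acc for minutes, acc in by_type.values()) / total_minutes

# Accuracy per content type from the green flag 1 example;
# the minute counts are assumptions for illustration.
scored = {
    "soft_skills": (920, 0.97),
    "clinical_pre_glossary": (40, 0.88),
    "engineering_pre_glossary": (40, 0.83),
}
overall = aggregate_accuracy(scored)
print(f"aggregate: {overall:.1%}")
for name, (minutes, acc) in scored.items():
    print(f"  {name}: {acc:.0%} on {minutes} scored minutes")
```

If your library is mostly engineering content, the 83% line is the relevant figure, and the aggregate says little about what you would receive.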

Connecting RFP response evaluation to the pilot and contract stages

RFP response evaluation is a gate, not a destination. The goal of this stage is to advance vendors with credible accuracy programmes to the pilot stage and eliminate vendors whose accuracy claims are entirely unverifiable. It is not expected to produce a final vendor selection — that requires direct performance evidence from a controlled pilot on your content.

From RFP evaluation to pilot design

Vendors who pass the RFP accuracy evaluation stage — either by providing complete methodology disclosure, or by captioning a diagnostic sample that you independently scored at an acceptable level — advance to the pilot. The pilot programme design post covers how to structure the pilot to produce defensible accuracy data before signature. Key elements include:

The failure modes identified during RFP evaluation directly shape pilot design. If a vendor disclosed that their accuracy figure comes from soft-skills content and your library is medical training, require them to demonstrate performance on medical vocabulary content during the pilot, not on general training content. If a vendor could not disclose their scoring methodology during the RFP stage, evaluate them during the pilot using your own DCMP scoring, not their reported score.

What to include in the vendor contract regarding accuracy

Contract-stage accuracy provisions should reflect what you learned during the RFP evaluation and pilot. Key elements from the vendor SLA and contract review checklist:

The accuracy provisions in the contract should not repeat the RFP response claims verbatim — they should specify the conditions and measurement methodology that will govern the relationship. A vendor who claimed “99% accuracy” in the RFP response should be willing to commit to “99% accuracy measured against the DCMP Captioning Key protocol on content within the scope of the master service agreement” in the contract. If they are not willing to make that commitment, the RFP claim was not a genuine representation of their capability.

Handling accuracy failures post-deployment

Even with a well-designed pilot and a DCMP-referenced accuracy SLA, accuracy failures will occur post-deployment on content that was not represented in the pilot sample — particularly on new product vocabulary that postdates the glossary build, on audio quality variations outside the pilot scope, and on content types that were inadequately sampled in the test corpus. The caption feedback loop post describes the structured error correction and glossary update process that converts accuracy failures into accuracy improvements over time.

The key operational mechanism is the QA sampling process: periodic random sampling of delivered caption files, scored against DCMP criteria, with error categories tracked over time. This allows you to detect whether accuracy is degrading on a specific content category (which would indicate a vocabulary drift problem requiring a glossary update) or across the system as a whole (which would indicate a vendor-side quality issue requiring a remediation conversation).
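The sampling and tracking loop described above might be sketched as follows. The function names, the 3-percentage-point drop threshold, and the sample data are assumptions for illustration; the accuracy values are taken to come from whatever DCMP-referenced scoring you apply to each sampled file.

```python
import random
from collections import defaultdict

def sample_files(file_ids: list[str], k: int, seed: int) -> list[str]:
    """Draw a reproducible random sample of delivered caption files for QA."""
    return random.Random(seed).sample(file_ids, min(k, len(file_ids)))

def mean(values: list[float]) -> float:
    return sum(values) / len(values)

def accuracy_drops(scores: list[tuple[str, str, float]], threshold: float = 0.03) -> dict[str, float]:
    """scores holds (period, content_category, accuracy) for each scored file.
    Returns categories whose mean accuracy fell by more than `threshold`
    between the earliest and the latest period."""
    by_category = defaultdict(lambda: defaultdict(list))
    for period, category, accuracy in scores:
        by_category[category][period].append(accuracy)
    drops = {}
    for category, periods in by_category.items():
        ordered = sorted(periods)
        if len(ordered) < 2:
            continue
        drop = mean(periods[ordered[0]]) - mean(periods[ordered[-1]])
        if drop > threshold:
            drops[category] = round(drop, 4)
    return drops

def diagnose(scores: list[tuple[str, str, float]], threshold: float = 0.03) -> str:
    """Separate category-specific drift from a drop across every category.
    With only one category tracked, the two cases cannot be told apart."""
    categories = {category for _, category, _ in scores}
    drops = accuracy_drops(scores, threshold)
    if not drops:
        return "stable"
    return "general degradation" if set(drops) == categories else "category-specific degradation"

# Hypothetical quarterly QA results.
scores = [
    ("2026-01", "product_training", 0.97), ("2026-01", "soft_skills", 0.98),
    ("2026-04", "product_training", 0.91), ("2026-04", "soft_skills", 0.98),
]
print(sample_files([f"file_{n}" for n in range(50)], k=5, seed=2026))
print(accuracy_drops(scores), diagnose(scores))
```

Fixing the seed per review period makes the sample auditable: anyone can re-derive which files were selected for scoring.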

The caption programme annual review process provides the cadence and structure for reviewing vendor accuracy performance at the relationship level, including when to negotiate contract amendments and when to initiate a re-procurement process.

Glossary and vocabulary accuracy: the layer most RFPs skip

The three failure modes above focus on the accuracy claims that vendors typically include in their RFP responses. There is a fourth accuracy dimension that most RFPs do not ask about at all, and that most vendor responses do not address: vocabulary-specific accuracy on organisation-defined terminology.

Why general accuracy does not predict vocabulary accuracy

A vendor who achieves 95% accuracy on a general training content sample will achieve materially lower accuracy on your organisation’s specific product names, internal tool names, people’s names, and industry acronyms — unless those terms appear in ASR training corpora or in the vendor’s glossary customisation infrastructure. The reason is the same structural gap described in the content-type analysis: terms that appear rarely in general speech have little or no acoustic-phonetic model coverage in the ASR system, and the language model has minimal context from which to constrain their transcription.

The specific failure patterns by term category:

The practical effect is that a vendor who achieves 95% accuracy on a general training sample may achieve 70–80% accuracy on content dense with your specific product vocabulary. The gap narrows with glossary-biased decoding but only if the vendor has a functional glossary customisation capability that can be populated with your specific terminology before captioning begins. The glossary architecture post covers how to structure the glossary for maximum coverage; the glossary-biased captioning post explains the technical mechanism behind glossary accuracy improvement.
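One way to make the vocabulary gap visible is to score glossary terms separately from overall accuracy. The sketch below counts how many occurrences of each glossary term in the reference transcript also appear in the vendor output. It is a count-based approximation rather than an aligned per-cue score, and the function name, the glossary, and the sample sentences are hypothetical.

```python
import re

def term_recall(reference: str, hypothesis: str, glossary: list[str]) -> dict[str, float]:
    """For each glossary term found in the reference, return the share of its
    occurrences that also appear in the vendor output (case-insensitive)."""
    def count(term: str, text: str) -> int:
        # Match the term only as a whole word or phrase.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        return len(re.findall(pattern, text, flags=re.IGNORECASE))

    recall = {}
    for term in glossary:
        expected = count(term, reference)
        if expected:  # skip terms that never occur in this sample
            recall[term] = min(count(term, hypothesis), expected) / expected
    return recall

reference = "Upload to Kaltura, open the Docebo connector, then sync Docebo with the LMS."
vendor_output = "Upload to cultura, open the Docebo connector, then sync the cebo with the LMS."
result = term_recall(reference, vendor_output, ["Kaltura", "Docebo", "LMS"])
for term, share in result.items():
    print(f"{term}: {share:.0%} of occurrences transcribed correctly")
```

Running the same check before and after a glossary is loaded gives a direct measure of how much the vendor's customisation capability closes the gap on your terms.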

Adding vocabulary accuracy to the RFP accuracy section

Supplement your standard accuracy questions with a vocabulary-specific accuracy section. Minimum required questions:

  1. “Describe your glossary or vocabulary customisation capability. What types of terms can be added, how are they specified, and at what point in the captioning workflow are they applied?”
  2. “What accuracy improvement do you typically observe on organisation-specific proper nouns and product terminology after glossary customisation? Please provide a data point from a customer with a comparable vocabulary profile.”
  3. “What is your process for updating the glossary when our product vocabulary changes (new product launches, rebranding events, acquisitions)?”
  4. “Do you have a glossary update SLA? What is the turnaround time for adding new terms submitted after a product release?”

The answers to these questions reveal whether a vendor has a systematic approach to vocabulary accuracy or whether “glossary customisation” means a static word list with no update workflow. For an L&D team at a technology company with a rapidly evolving product vocabulary, a static glossary that is not updated after each product release will start losing accuracy within weeks of deployment. The glossary maintenance workflow post describes the structured process for keeping glossary accuracy current through product cycles.

Budget and cost implications of accuracy evaluation

Rigorous accuracy evaluation during the RFP stage has direct budget implications that are worth making explicit before you begin the process.

Cost of reference transcript production

If you require vendors to caption a diagnostic sample as part of the RFP evaluation (rather than deferring this entirely to the pilot), you will need a human-produced reference transcript of that sample to score the output. Reference transcript production typically costs $1.50–$4.00 per audio minute from a professional transcription service, depending on audio quality, speaker count, and turnaround time. For a 10-minute diagnostic sample, this is $15–$40 — a negligible cost relative to the value of having an independent accuracy data point before advancing any vendor to the pilot stage.
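The cost range is simple to reproduce for other sample lengths. This small sketch uses the per-minute rates quoted above as defaults; the function name is an assumption, and actual quotes from a transcription service may fall outside this range.

```python
def reference_transcript_cost(minutes: float,
                              low_rate: float = 1.50,
                              high_rate: float = 4.00) -> tuple[float, float]:
    """Return the (low, high) cost in dollars for a human reference transcript."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return (minutes * low_rate, minutes * high_rate)

low, high = reference_transcript_cost(10)
print(f"10-minute diagnostic sample: ${low:.2f} to ${high:.2f}")
```

Because every vendor captions the same sample, the reference transcript is produced once and reused across all vendor scores.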

This cost is worth treating as a standard line in the caption procurement budget. The caption programme budget planning guide covers how to structure the full procurement budget across vendor evaluation, pilot, and ongoing operations.

Cost of inaccurate vendor selection

The direct cost of selecting a vendor whose actual accuracy does not match their RFP claim is the ongoing labour cost of manual caption correction. The hidden half-FTE cost post documents the correction labour model: at 80–85% raw accuracy, a 30-minute training video requires approximately 45–60 minutes of manual correction before it meets the 99% WCAG threshold. That is 1.5–2 hours of correction per hour of video, so a team producing 10 hours of training video per month faces 15–20 hours of correction labour per month at that ratio, and the load grows linearly with volume until it absorbs the half-FTE that post describes, time taken from content creation.

The indirect cost is compliance exposure. A vendor whose captions do not meet WCAG 2.1 AA standards is not providing a compliance solution regardless of what they claimed in the RFP. The organisation bears the compliance risk while paying for a service that does not reduce it. Thorough RFP accuracy evaluation is the mechanism that reduces the probability of this outcome.

The cost comparison between rigorous evaluation ($40 in reference transcript costs, 5–10 hours of evaluation time) and the consequences of poor vendor selection (half-FTE correction labour, compliance exposure, vendor transition costs described in the vendor transition playbook) makes the evaluation investment straightforward to justify.

Using this framework with your existing vendor

The RFP accuracy evaluation framework is most obviously applicable when evaluating new vendors for an initial procurement. But it is equally useful as an audit framework for an existing vendor relationship where you have not previously measured accuracy using DCMP methodology or content-type-specific sampling.

If you are currently working with a caption vendor and have not verified their accuracy using the framework described above, consider requesting the following during your next contract renewal or annual review:

The annual review process post covers how to structure the full vendor performance review. The specific questions in the RFP accuracy evaluation framework translate directly into annual review questions: is the vendor still measuring accuracy using the protocol they described during procurement? Is the content-type accuracy profile still representative? Has vocabulary accuracy degraded as the product has evolved?

A vendor who welcomes this review and can produce the documentation it requires has the quality infrastructure to sustain long-term compliance. A vendor who cannot produce documentation that should exist under their own stated quality programme has revealed a gap worth addressing before a compliance audit makes it a more expensive problem. See the caption compliance programme build post for the governance framework that makes vendor oversight a systematic process rather than an ad hoc one.

GlossCap and RFP accuracy evaluation

GlossCap approaches accuracy claims using the framework described above. Our accuracy figures are derived from content that matches the vocabulary domain of the content being captioned — engineering API tutorials are not benchmarked against soft-skills video. We apply DCMP Captioning Key per-cue scoring as the quality gate for all delivered caption files, not aggregate WER. Our reference transcripts for quality scoring are produced by human transcriptionists working from raw audio, not from cleaned AI output.

For L&D teams evaluating GlossCap as part of an RFP process: we will provide a methodology disclosure, a diagnostic content sample caption with reference transcript, and a DCMP-scored accuracy report. We apply glossary-biased decoding using your organisation’s specific vocabulary before captioning begins, and our glossary update SLA is 48 hours for new terms submitted after a product release or rebranding event.

We designed the RFP accuracy evaluation framework described in this post partly because our own accuracy claims need to be verifiable against it. A vendor who cannot demonstrate their accuracy under independent scoring conditions should not be selling to organisations with WCAG 2.1 AA compliance obligations.

Get a verified accuracy disclosure for your captioning RFP

GlossCap provides WCAG 2.1 AA captions with DCMP per-cue scoring documentation, content-type-specific accuracy disclosure, and glossary-biased decoding for your organisation’s vocabulary. SRT and VTT output ready for Kaltura, TalentLMS, Canvas, Docebo, Cornerstone, and all major LMS platforms.


Frequently asked questions

Our RFP does not have a specific accuracy methodology requirement — can we still apply this evaluation framework?
Yes. The three-failure-mode evaluation framework is applicable to vendor responses regardless of how the RFP accuracy question was worded. Even if your RFP simply asked “what is your accuracy rate?”, you can supplement the evaluation with the information request template described above. The information request does not require revising the RFP: it is a post-response clarification process that most procurement frameworks permit. If your RFP scoring methodology does not weight methodology disclosure, you can still use the framework informally to distinguish vendors whose claims are verifiable from those whose claims are not, even if the distinction does not change the formal score. At minimum, use the framework to identify which vendors to advance to a diagnostic content demonstration before committing to a full pilot.
How do I know if a vendor’s 98% claim is actually better than a competitor’s 95% claim?
The number alone does not tell you. The comparison that matters is: 98% measured how, on what content, by whom versus 95% measured how, on what content, by whom. It is entirely possible — and reasonably common — that a vendor reporting 95% using DCMP per-cue scoring on diagnostic technical content that matches your vocabulary domain is a better-performing vendor for your use case than a vendor reporting 98% using aggregate WER on soft-skills studio content. Apply the methodology disclosure rubric to both claims before comparing the numbers. Once both claims are on the same methodological footing, the number comparison becomes meaningful.
Can I ask a vendor to caption a 5-minute sample as part of the RFP process without paying them?
This is a vendor-by-vendor determination. Most established caption vendors will caption a short sample as part of competitive procurement without charge, because the investment is proportionate to the contract opportunity. Some vendors have explicit policies against unpaid sampling and will defer the sample to a paid pilot. Both are reasonable positions. If a vendor is unwilling to provide a sample during the RFP stage under any circumstances, factor that into your overall assessment of their commitment to the procurement process. The alternative is to advance multiple vendors to a short paid pilot (“limited pilot” or “proof of concept” stage) at a nominal cost, and use that as the RFP accuracy verification step. The pilot programme design post covers how to structure a short proof-of-concept pilot that serves as the RFP accuracy gate before committing to a full pilot.
What should we do if we discover during the pilot that a vendor’s actual accuracy is materially lower than their RFP claim?
First, document the discrepancy: the vendor’s RFP claim, the pilot methodology you used, the content type, the reference transcript process, and the DCMP-scored result. Second, notify the vendor and give them a defined period to respond — typically 5–7 business days. Some vendors will immediately identify the cause of the discrepancy (vocabulary gap, audio quality issue, pilot content outside their accuracy envelope) and offer corrective action; others will dispute the scoring. Third, if the discrepancy is material (more than 5 percentage points between claimed and measured accuracy on content that matches your library) and the vendor cannot explain it with a plausible corrective path, eliminate them from consideration. A vendor who cannot achieve claimed accuracy under controlled pilot conditions will not achieve it in production. The decision to eliminate a vendor at this stage is less costly than the decision to continue with a vendor whose accuracy does not match what is needed for WCAG 2.1 AA compliance.
How does vendor accuracy change after a glossary is built and populated with our vocabulary?
Glossary accuracy improvement varies by term category and glossary size, but the typical pattern is: 5–15 percentage point improvement on organisation-specific proper nouns and product names after a targeted glossary build, with diminishing returns beyond 200–400 terms for most single-vertical deployments. A vendor who demonstrates 85% pre-glossary accuracy on your technical content may demonstrate 93–96% post-glossary accuracy on the same content if their glossary-biased decoding capability is functional. The important evaluation question is not only the pre-glossary accuracy figure but also the vendor’s glossary build process: how many terms can they load, how quickly can they apply glossary updates, and do they have a validation process to confirm that glossary terms are being recognised correctly after upload. See the glossary architecture post for the framework.
Should we require ISO 9001 or SOC 2 compliance as a proxy for caption accuracy quality?
ISO 9001 (quality management systems) and SOC 2 Type II (security and availability controls) are useful certifications for vendor due diligence, but they are poor proxies for caption accuracy specifically. ISO 9001 confirms that the vendor has documented quality management processes; it does not specify what those processes are or what quality level they target. SOC 2 Type II confirms controls around data security and system availability; it has no direct relevance to captioning accuracy. Neither certification tells you whether the vendor uses DCMP per-cue scoring, whether their reference transcript process is independent, or whether their accuracy figures are derived from diagnostic content. Use ISO 9001 and SOC 2 as part of vendor risk assessment, not as accuracy evidence. Caption accuracy specifically must be evaluated through the methodology disclosure and diagnostic content sample process described in this post.
How often should we re-evaluate vendor accuracy after initial selection?
At minimum: annually, as part of the caption programme annual review. More frequently if: (1) your product vocabulary is changing rapidly (product releases, acquisitions, rebranding), which can cause glossary accuracy to degrade within months; (2) the volume of content you are producing increases materially, which may change the distribution of content types and expose accuracy gaps in categories that were previously low-volume; (3) you expand into new content verticals that were not covered in the original pilot. The QA sampling process described in the caption QA methodology post provides an ongoing monitoring cadence that catches accuracy degradation before it becomes a compliance exposure, rather than waiting for the annual review cycle.
