Procurement template · v2026.1

Captioning RFP template: 14 questions to ask any video captioning vendor in 2026

A captioning RFP that asks "do you produce captions?" gets you fourteen "yes" responses and no useful signal. The questions worth asking are the ones a generic transcription vendor cannot answer credibly — the ones that distinguish a workflow built for training video from a workflow built for podcast audio. This template is the questionnaire we wish every L&D and training-operations lead would send us; it works equally well for evaluating us and for evaluating our competitors. Score 0–3 per question, weight by importance, and the cheapest vendor that crosses your accuracy and integration thresholds usually wins on operator-time terms.

TL;DR

Run the 14 questions below as written. Each carries a "what good looks like" in its second paragraph — that's the answer pattern a fit-for-purpose vendor gives without prompting. Score 0 (no credible answer or evasion), 1 (partial), 2 (good), 3 (best-in-class). Weight: accuracy and glossary questions (1–4) count 2×, format and integration questions (5–8) count 1.5×, and security, SLA, pricing, and contract questions (9–14) count 1×. A final score over 70% means short-list; below 50% means the vendor isn't fit-for-purpose for training video — they may still be fit for podcast or marketing-video work, but that's a different procurement.

Section 1 — Accuracy and the proper-noun question (questions 1–4)

1. What is your stated character-error-rate or word-error-rate floor on training video, and how is it measured?

The honest vendor will say something like "WER 5–8% on standard training audio, measured against held-out human reference transcripts on a sample we can describe." The evasive vendor will say "industry-leading accuracy" or quote a marketing number. WCAG 2.1 AA is generally read as ~99% character accuracy on the sampled spans an auditor inspects; ask whether the vendor's stated number is character-level or word-level — the two are not the same and the gap is large for content with technical terminology.
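
The character-level versus word-level gap is easy to demonstrate on your own sample. A minimal sketch, assuming plain Levenshtein distance over raw text (the drug name and its mangling are invented for illustration; real evaluations normalise casing and punctuation first):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (r != h))   # substitution
    return dp[-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

ref = "administer tirzepatide weekly"
hyp = "administer ter zepa tide weekly"
print(f"WER {wer(ref, hyp):.0%}, CER {cer(ref, hyp):.0%}")  # WER 100%, CER 10%
```

One mangled term costs every word of that phrase at the word level but only a few characters at the character level, which is exactly why you should ask which unit a "99% accuracy" claim is counting.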

2. How does your accuracy degrade on content with a high density of proper nouns, acronyms, and technical terms?

This is the question that separates fit-for-purpose vendors from generic transcription. The honest answer is "it degrades materially — generic STT was trained on news and podcast audio, not on proprietary product names or domain terminology — and our workflow handles that gap with [glossary biasing / domain fine-tuning / human reviewer step]." A vendor that answers "no degradation" hasn't measured it.
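
If the vendor won't quantify the degradation, you can. A sketch of a glossary-term recall check over a pilot transcript, assuming whole-word matching (the terms and transcripts are invented):

```python
import re

def term_recall(glossary, reference, hypothesis):
    """Fraction of glossary-term occurrences in the reference transcript
    that survive intact in the vendor's hypothesis transcript."""
    expected = hit = 0
    for term in glossary:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        n_ref = len(pattern.findall(reference))
        expected += n_ref
        hit += min(n_ref, len(pattern.findall(hypothesis)))
    return hit / expected if expected else 1.0

glossary = ["tirzepatide", "GLP-1", "titration"]
recall = term_recall(glossary,
                     reference="start tirzepatide titration per the GLP-1 protocol",
                     hypothesis="start ter zepa tide titration per the gee el pee one protocol")
print(f"glossary-term recall: {recall:.0%}")  # 33%
```

A recall number like that on a representative sample is the degradation the "no degradation" vendor claims not to have.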

3. Do you support per-customer glossaries that influence the model's decoding, or only post-hoc find-and-replace?

"Glossary-biased decoding" — feeding the customer's vocabulary directly into the language-model logits, so the model is more likely to choose the right surface form during transcription — is meaningfully better than post-hoc find-and-replace, because find-and-replace can only substitute when the misspelling is consistent. "tirzepatide" mangled into "ter zee paw tide" once and "tear zip a tide" the next time can't be cleanly replaced; it has to be decoded right the first time. This is GlossCap's core position; ask other vendors to describe their architecture in detail.

4. Show me a 60-second sample on our training audio with our glossary applied.

Not "with auto-captions" — with your glossary applied. A fit-for-purpose vendor will run a free 60-second pilot before quoting. The pilot should preserve every glossary term in the visible captions, with timestamps that line up with the audio. If a vendor wants money before the pilot, that's a process flag.

Section 2 — Format, integration, and workflow (questions 5–8)

5. Which caption file formats do you export, and which do you support natively versus convert from a master?

The full-spectrum vendor exports SRT, WebVTT, TTML, EBU STL, and SCC natively, with the option to round-trip through a master format without timing drift. A vendor that exports only SRT is selling a bare-minimum solution that will not survive contact with a Kaltura or Brightcove pipeline. See our SRT, WebVTT, TTML, and EBU STL references for what each format is actually used for.
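
A round-trip drift check is equally mechanical. A sketch comparing cue start times between the master SRT and the round-tripped file; the 50 ms tolerance in the usage note is an assumption, and broadcast workflows often demand frame accuracy:

```python
import re

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")  # SRT comma or WebVTT dot

def cue_starts_ms(caption_text):
    stamps = [int(h) * 3600000 + int(m) * 60000 + int(s) * 1000 + int(ms)
              for h, m, s, ms in TS.findall(caption_text)]
    return stamps[::2]  # timestamps come in start/end pairs per cue

def max_drift_ms(master, roundtrip):
    a, b = cue_starts_ms(master), cue_starts_ms(roundtrip)
    assert len(a) == len(b), "cue count changed during round-trip"
    return max(abs(x - y) for x, y in zip(a, b))

# usage: fail the round-trip if any cue drifts more than, say, 50 ms
# print(max_drift_ms(open("master.srt").read(), open("roundtrip.srt").read()) <= 50)
```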

6. Do you integrate directly with our LMS or video host (TalentLMS, Docebo, Absorb, Kaltura, Panopto, Brightcove, Vimeo, Wistia)?

The right answer depends on which LMS you actually run, but the underlying capability the vendor needs is "we can post a caption file to your platform's caption-upload API as part of our workflow" — not "we email you the SRT and you upload it manually." Ask for the integration architecture diagram. We document the integration model for TalentLMS, Docebo, Absorb, Kaltura, Panopto, Vimeo, and Wistia; comparable depth from a competitor is the bar.
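
The capability reduces to one authenticated HTTP call. A sketch of the shape only: the endpoint path, auth scheme, and field names are invented placeholders, since each platform (Kaltura, Panopto, Brightcove, and the rest) documents its own caption-upload API:

```python
import requests

# Invented endpoint and fields -- substitute your platform's documented
# caption-upload call.
API = "https://lms.example.com/api/v1/videos/{video_id}/captions"

def push_caption(video_id, srt_path, token):
    with open(srt_path, "rb") as f:
        resp = requests.post(
            API.format(video_id=video_id),
            headers={"Authorization": f"Bearer {token}"},
            files={"file": ("captions.srt", f, "application/x-subrip")},
            data={"language": "en", "label": "English (reviewed)"},
        )
    resp.raise_for_status()
    return resp.json()
```

If the vendor can produce the equivalent of this function for your platform on an integration call, question 6 is answered.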

7. What is the reviewer step, and what does the reviewer UI look like?

For training video specifically, the difference between "captions" and "audit-grade captions" is a reviewer pass that catches the proper-noun failures the model still made. The reviewer UI matters: it should highlight glossary-applied terms, show source-line provenance, and allow corrections that feed the workspace glossary so the term doesn't break next time. A vendor whose reviewer step is "you check the SRT file in a text editor" is selling you the cost-saving by transferring the work to your team.

8. Can your workflow handle a back-catalogue of N hours in M weeks?

For an organisation re-captioning a back-catalogue under ADA Title II, 504, or AODA remediation pressure, throughput is the constraint. Ask the vendor to describe a concrete plan for 200 hours of video in 8 weeks: API throughput cap, reviewer-pool capacity, and parallelism. The vendor that hand-waves on this can't deliver it.
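
This is back-of-the-call arithmetic the vendor should volunteer, sketched here with illustrative capacity figures (the machine-throughput and reviewer-speed numbers are assumptions, not benchmarks):

```python
backlog_hours, weeks = 200, 8
working_days = weeks * 5

# Assumed capacities: replace with the vendor's stated numbers.
machine_hours_per_day = 50   # parallel STT throughput, video-hours/day
reviewer_speed = 4.0         # video-hours reviewed per reviewer-day
reviewers = 2

daily_need = backlog_hours / working_days      # 5.0 video-hours/day
review_capacity = reviewers * reviewer_speed   # 8.0 video-hours/day

print(f"need {daily_need:.1f} video-hours/day")
print(f"machine throughput ok: {machine_hours_per_day >= daily_need}")
print(f"reviewer pool ok: {review_capacity >= daily_need}")
# The reviewer pool, not the model, is almost always the binding constraint.
```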

Section 3 — Security, privacy, and compliance posture (questions 9–11)

9. Where is the audio processed, and where is it stored?

The right answer names a region (US, EU, Canada) and a retention window (often 30, 60, or 90 days post-delivery, with a hard deletion option). For HR-related training video — particularly compensation, complaint-handling, or sensitive personnel material — this matters operationally. EU-based customers under GDPR ask the data-residency question first; US customers under HIPAA ask the BAA question first (see question 10).

10. Do you sign a BAA, a DPA, or both? Under what trigger?

For HIPAA-covered customers, training content is generally not PHI in the normal case (see our HIPAA training captions reference) — the BAA may not be operationally required for typical training. For EU customers, a GDPR-compliant Data Processing Addendum is essentially mandatory. The vendor that says "we don't do BAAs / DPAs" has self-selected out of regulated buyers.

11. What is your SOC 2, ISO 27001, or equivalent posture?

For procurement teams at 200+ employee buyers, this is a check-box question. SOC 2 Type II is the practical floor; ISO 27001 is more common in EU procurement. The honest small-vendor answer is "we are SOC 2 Type II in 2026 with [auditor] / not yet but on track for [date]"; the bluffing answer is "we follow industry-leading practices" with no certification path. We will be honest about where we are on this — see the launch essay — but the question is the right one to ask.

Section 4 — SLA, pricing, and contract (questions 12–14)

12. What is your turnaround SLA for a 60-minute training video, and what is the credit if missed?

"Same business day" with a financial credit is the strong answer; "best effort" with no credit is the bluff. For a remediation push under regulatory pressure, the SLA is on the critical path; for a steady-state workflow it matters less. Ask for the credit schedule in writing — the vendor that won't put it on paper doesn't actually have it.

13. What does pricing look like at our expected volume? Is it per-minute, per-hour, or seat-based?

The three pricing models suit different buyers: per-minute (Rev model) is good for sporadic use; per-hour blocks (3Play model) work for predictable monthly cadence; seat-based (GlossCap Team plan model) works when you want a flat operational expense and have steady internal usage. Ask the vendor to model your actual hour-volume — if their per-minute model is twice your seat-based model at your volume, that's a procurement signal. See our Rev vs GlossCap, 3Play vs GlossCap, and Verbit vs GlossCap pricing breakdowns.
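
Modelling your own volume takes minutes. All rates below are placeholders, not any vendor's published pricing:

```python
import math

hours_per_month = 20

# Placeholder rates: substitute each vendor's quoted numbers.
per_minute_rate = 1.50               # $/audio-minute
block_price, block_hours = 800, 10   # $ per prepaid 10-hour block
seats, seat_price = 3, 299           # seats x $/seat/month

costs = {
    "per-minute":      hours_per_month * 60 * per_minute_rate,
    "per-hour blocks": math.ceil(hours_per_month / block_hours) * block_price,
    "seat-based":      seats * seat_price,
}
for model, cost in costs.items():
    print(f"{model:>16}: ${cost:,.0f}/month")
```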

14. What are the contract termination terms, and how do we get our data out?

The vendor that answers "30 days notice, all caption files exportable as a zip in industry-standard formats, glossary export available on request" is the vendor that's confident in retention through value. The vendor with a 12-month minimum, an early-termination fee, and a proprietary export format is selling you lock-in. Procurement teams at 200+ employee buyers reject this as a matter of policy; smaller buyers should too.

The scoring sheet

For each question, score:

- 0 = no credible answer, or evasion
- 1 = partial answer
- 2 = good answer
- 3 = best-in-class answer

Apply the weights:

- Questions 1–4 (accuracy and glossary): 2×
- Questions 5–8 (format, integration, workflow): 1.5×
- Questions 9–14 (security, SLA, pricing, contract): 1×

Maximum possible weighted score = (4 × 3 × 2) + (4 × 3 × 1.5) + (6 × 3 × 1) = 24 + 18 + 18 = 60. A vendor scoring 42 (70%) or above is short-list-worthy. A vendor scoring 30 (50%) or below is not fit-for-purpose for training video.
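
If you want to run the sheet consistently across vendors, it fits in a few lines; a sketch hard-coding the weights above:

```python
WEIGHTS = {**{q: 2.0 for q in range(1, 5)},    # Q1-4: accuracy and glossary
           **{q: 1.5 for q in range(5, 9)},    # Q5-8: format, integration, workflow
           **{q: 1.0 for q in range(9, 15)}}   # Q9-14: security, SLA, pricing, contract

MAX_SCORE = sum(3 * w for w in WEIGHTS.values())   # 60

def verdict(scores):
    """scores: dict of question number -> 0..3 from the RFP responses."""
    total = sum(scores[q] * WEIGHTS[q] for q in WEIGHTS)
    pct = total / MAX_SCORE
    if pct >= 0.70:
        return f"{total:g}/60 ({pct:.0%}): short-list"
    if pct <= 0.50:
        return f"{total:g}/60 ({pct:.0%}): not fit-for-purpose"
    return f"{total:g}/60 ({pct:.0%}): borderline, let the pilot decide"

print(verdict({q: 2 for q in range(1, 15)}))  # 40/60 (67%): borderline
```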

How to run the RFP without burning a quarter

  1. Pick 4 vendors max. Three is cleaner. The L&D-captioning market has roughly six serious mid-market vendors; running an RFP against more than four wastes everyone's time. Our recommended short-list pattern: one human-reviewed enterprise vendor (Verbit or 3Play), one AI-first lower-cost vendor (Rev AI or GlossCap), and one or two specialists, depending on your vertical (a healthcare specialist if you're in healthcare; a higher-ed specialist if you're a university).
  2. Ship the RFP with a 60-second pilot audio file. Pick a real piece of your training content with the proper-noun density that gets mangled today. The pilot eliminates marketing-claim noise immediately — it's the cheapest possible vendor signal.
  3. Two-week response window. A vendor that needs four weeks to answer 14 questions is signalling they need four weeks to deliver each piece of work. Two weeks is enough for a real RFP response and short enough to filter for operational competence.
  4. Reference calls with two existing customers in your size band. Skip the marquee logos — those customers got white-glove treatment. Ask for two customers in your headcount band running comparable hours of video. The reference-call signal is whether they know the customer-success contact's name and have actually used the product in the last month.
  5. Pilot project before the contract. Even after the RFP picks a winner, run a 30-day pilot with a real 50-hour batch of video before signing a 12-month contract. The pilot catches integration-level surprises that the RFP can't.

Read why we built GlossCap

Related questions

Is an RFP overkill for a captioning vendor?

For an organisation captioning under 100 hours of training video annually, often yes — a structured pilot and a written quote from two vendors is sufficient diligence. Once you cross 250 hours/year or trigger a regulatory remediation push under ADA Title II, 504, or AODA, the procurement-grade RFP is the right cost. The 14-question template here scales down: even a small-team buyer can run questions 1–4 (accuracy) and 12–13 (SLA, pricing) as a pilot RFP without the full procurement machinery.

What if our LMS isn't on the vendor's integration list?

A fit-for-purpose vendor with a proper API will integrate with any LMS that has a caption-upload endpoint. The integration list on the vendor's marketing page is what they've productised; the real question is whether they can post a caption file to your LMS via your LMS's API. Ask for a 30-minute integration call with their engineering team; that conversation answers the question definitively.

Can I just use this template to evaluate GlossCap?

Yes — the template is vendor-neutral by design. Run it on us against any other vendor you're considering. We are confident in our answers on questions 1–8 and we'll be honest about where we are on questions 9, 11, and 14 (security certifications, SLA credits at our current stage). The fact that we wrote this template is a position, not a marketing trick: the questions we'd want our customers to ask are the ones we know we answer well.

How often should we re-run the RFP?

Captioning vendor markets move slowly. Re-evaluate every 2–3 years unless your hour-volume grows by 5× (different pricing tier, different vendor) or a regulatory regime changes the technical bar (the ADA Title II rule landing was such a change). The cost of switching mid-contract is usually higher than the savings unless the incumbent is materially failing the accuracy or integration question.

Further reading