Procurement · Published 2026-05-01

How we ran a captioning vendor RFP: scoring sheets, vendor responses, and what we'd do differently

This is a first-person walkthrough of a captioning-vendor RFP we ran for a 280-employee mid-market healthcare-adjacent SaaS company between November 2025 and January 2026. The trigger was the standard set: a 1,400-hour training-video back catalogue, two enterprise customers asking pointed questions about WCAG 2.1 AA conformance during their annual security review, the ADA Title II 2026-04-24 deadline on the horizon for the public-sector customers in the pipeline, and a board-level commitment to ship a clean SOC 2 Type II report by Q2 2026. We had a budget envelope of roughly $90K/year for captioning, a hard ban on hand-correcting auto-captions in-house (the half-FTE cost we walk through in the hidden-half-FTE post was the explicit reason), and a six-week target to pick a vendor and sign. Everything below is anonymised — the company is real, the catalogue is real, the vendor responses are real-shaped composites of the responses we got — but the names are mapped to vendor archetypes (V1–V6) so the post can be useful to other L&D leads running the same procurement. The 14-question RFP template we used is the public version of what we sent.

TL;DR

We ran a six-week RFP against six vendors, scored them on the 14-question template across four sections (accuracy, format/integration/workflow, security/privacy/compliance, SLA/pricing/contract), and ranked them on a 100-point scoring sheet. The final leaderboard put three vendors in contention: V6 (GlossCap-shaped) at 93/100, V2 (3Play-shaped) at 81/100, and V3 (Verbit-shaped) at 77.5/100. V5 (Otter-shaped) failed Section 3 (security/privacy/compliance) on the BAA question alone, and V1 (Rev-shaped) lost serious ground there when its BAA turned out to be gated behind an enterprise-tier minimum spend. V4 (AI-Media-shaped) priced out at 2.4× the budget envelope on Section 4 (SLA/pricing/contract) and walked themselves out. We picked V6 because the per-customer glossary architecture closed the proper-noun failure mode that ate the most reviewer time on the back catalogue, and because its year-1 quote landed at 17% of V2's at our volume. The four mistakes we made were: (1) we asked for samples too late, (2) we evaluated the first demos on vendor-supplied audio rather than our own audio, (3) we did not put the BAA question on the first call, and (4) we weighted LMS integration as a hygiene item rather than a differentiator. Below is the full narrative, the scoring sheet, the timeline, and the post-mortem.

The org and the back catalogue

The company is a 280-employee mid-market SaaS in the healthcare-adjacent space — practice-management software for ambulatory surgery centers and outpatient specialty clinics. The customer roster includes 47 hospital systems, 12 academic medical centers, and a long tail of independent ASC operators. The training-video catalogue had grown organically over four years and consisted of, in round numbers, 600 hours of customer-facing product training (used in customer-onboarding webinars and embedded in the in-app help center), 350 hours of internal sales-enablement video (used at sales kickoff and in ongoing rep certification), 280 hours of clinical-content training video (used by customers' clinical staff to understand the workflow implications of the software, dense with the proper-noun categories we audited in the drug-names post), 110 hours of compliance training video (HIPAA, SOC 2, GDPR awareness, anti-bribery, anti-harassment), and roughly 60 hours of recorded all-hands and engineering deep-dive content with intermittent training-archive value. Total: 1,400 hours.

The state of captions on this catalogue at the start of the RFP was the modal mid-market state: roughly 220 hours had captions of some quality, mostly produced by uploading the audio to YouTube auto-caption and downloading the SRT, with about 40 of those hours subsequently hand-corrected by an instructional designer who had quietly absorbed the work over an 18-month period. The remaining 1,180 hours had no captions at all. The VP of customer success had been told the captions were 95% covered (because the customer-facing product training was nearly fully covered); the actual catalogue-wide coverage was 16%. This kind of perception gap is universal in mid-market L&D — the head of L&D is usually flying with one functional dashboard and a back catalogue nobody has audited end-to-end.

The trigger event was a customer security review. A new hospital-system customer's accessibility office sent a 22-question vendor questionnaire that explicitly asked, on Q14, whether all training video supplied to their organisation was captioned to WCAG 2.1 AA, and on Q15 whether the captioning vendor had executed a Business Associate Agreement under HIPAA in the case where customer-account names or any PHI surface appeared in the training. The honest answer to Q14 was "for the customer-facing product training but not for the clinical-content training," which the accessibility office flagged as inadequate. The honest answer to Q15 was "no, because we don't currently use a captioning vendor — captions are produced ad-hoc by an instructional designer." Neither answer worked. The customer success VP escalated to the chief people officer, who escalated to the CFO, who released the captioning-procurement budget within ten days. Welcome to the RFP.

What we needed the RFP to settle

The procurement objectives mapped to the four sections of the RFP template, with weights we set up front before any vendor talked to us. We treated this weight-setting as the most important step in the whole RFP — getting the weights right is what stops a procurement from being driven by sales-deck quality. The weights we used:

  1. Section 1 (accuracy and the proper-noun question): 35 points.
  2. Section 2 (format, integration, workflow): 20 points.
  3. Section 3 (security, privacy, compliance): 25 points.
  4. Section 4 (SLA, pricing, contract): 20 points.

The weighting we explicitly rejected was the framework many mid-market RFPs default to: 25/25/25/25 across the four sections. A flat split sounds neutral, but it understates Section 1 by treating accuracy as a feature like pricing. For training video, accuracy is the product. The flat-split RFP is how mid-market companies end up signing two-year contracts with a vendor whose Section 1 score was 12/25 — adequate-looking on the scoring sheet but operationally useless on clinical content. We learnt this from talking to two peer L&D leads at companies in the same vertical who had run flat-split RFPs and were, six months later, in the process of running a second RFP to replace the vendor they had just signed.
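To make the flat-split trap concrete, here is a small sketch applying the two weightings to a pair of invented vendors: an accuracy-strong vendor A and a vendor B that is strong everywhere else. The vendor numbers are illustrative, not our scoring sheet.

```python
# Sketch: how section weights reshape an RFP leaderboard. Per-section
# scores are expressed as fractions of each section's maximum, then
# multiplied by the section weight. Vendor numbers are illustrative.

def total_score(fractions: dict, weights: dict) -> float:
    return sum(fractions[section] * weights[section] for section in weights)

WEIGHTS_OURS = {"accuracy": 35, "workflow": 20, "security": 25, "contract": 20}
WEIGHTS_FLAT = {"accuracy": 25, "workflow": 25, "security": 25, "contract": 25}

# Hypothetical vendors: A is accuracy-strong, B is strong everywhere else.
vendor_a = {"accuracy": 0.95, "workflow": 0.75, "security": 0.80, "contract": 0.70}
vendor_b = {"accuracy": 0.48, "workflow": 0.95, "security": 0.90, "contract": 0.95}

for name, weights in (("35/20/25/20", WEIGHTS_OURS), ("flat 25/25/25/25", WEIGHTS_FLAT)):
    a = total_score(vendor_a, weights)
    b = total_score(vendor_b, weights)
    print(f"{name}: A={a:.1f}, B={b:.1f}, winner={'A' if a > b else 'B'}")
```

Under the flat split, vendor B wins; under the accuracy-weighted split, vendor A does. That inversion is exactly how the flat-split RFPs our peer L&D leads ran ended up signing the Section 1 12/25 vendor.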

The longlist and the prequalification

The captioning-vendor market is wider than most procurement teams realise on day one. We started with a longlist of 14 vendors drawn from analyst reports (Capterra, G2, Training Industry's vendor directories), peer-recommendation cycles in the L&D Slack communities we participate in, and a search-driven sweep of "captioning vendor" + adjacent search terms. The longlist was: Rev, 3Play Media, Verbit, AI-Media, Otter, Captiongenerator, AmberScript, Sonix, Vidcaster, Cielo24, Dotsub, GoTranscript, GlossCap, and one regional vendor whose pitch was that they could supply a human captioner for the clinical-content track and the AI auto-caption-plus-edit workflow for everything else.

We applied a five-question prequalification screen by email to the full 14, with a one-week response window. The prequalification questions were extracted from the full RFP template:

  1. Do you support per-customer glossaries that influence the model's decoding, or only post-hoc find-and-replace? (Question 3 from the template.)
  2. Will you sign a BAA, and under what trigger? (Question 10 from the template.)
  3. What is your stated word-error-rate floor on training video, and how is it measured? (Question 1 from the template.)
  4. What does pricing look like at our expected volume of 1,400 hours of one-time back-catalogue retrofit plus roughly 30 hours/month ongoing? (Question 13 from the template.)
  5. Can you turn around a 60-second sample on our supplied audio, with our supplied glossary, within five business days of request? (Question 4 from the template.)

Six vendors made it through prequalification. Eight did not, for various reasons: three did not respond at all (the long tail of small captioning vendors is real and includes vendors with no funnel for inbound enterprise leads); two refused to sign a BAA at any volume (a deal-breaker we put in the prequalification specifically because the question takes 30 seconds to answer and saves both sides weeks); one quoted $9 per minute on the back catalogue, which put the catalogue retrofit alone at $756K and priced them out on their own quote; and two supported only find-and-replace post-processing rather than glossary-biased decoding, which we treated as a hard fail on the proper-noun question. The six that made it through were the canonical mid-market field: Rev-shaped, 3Play-shaped, Verbit-shaped, AI-Media-shaped, Otter-shaped (the shadow-IT contender, whose Section 3 answer we already suspected would not survive), and the GlossCap-shaped vendor that the head of L&D had been quietly piloting since October.
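For anyone templating this, the prequalification screen reduces to a hard-fail filter: a single deal-breaker answer removes a vendor before any scoring happens. The field names and the retrofit price ceiling in the sketch below are illustrative, not our actual template fields.

```python
# Prequalification as a hard-fail filter. The ceiling is an illustrative
# sanity bound on the one-time 1,400-hour retrofit, not our real budget line.

RETROFIT_CEILING_USD = 200_000

def prequalify(vendor: dict) -> bool:
    if not vendor["signs_baa"]:                      # Q10: refuses a BAA -> out
        return False
    if not vendor["glossary_biased_decoding"]:       # Q3: find-and-replace only -> out
        return False
    retrofit = vendor["per_minute_usd"] * 1400 * 60  # Q13: price the full catalogue
    return retrofit <= RETROFIT_CEILING_USD

vendors = [
    {"name": "plausible",  "signs_baa": True,  "glossary_biased_decoding": True,  "per_minute_usd": 1.50},
    {"name": "no-baa",     "signs_baa": False, "glossary_biased_decoding": True,  "per_minute_usd": 1.00},
    {"name": "priced-out", "signs_baa": True,  "glossary_biased_decoding": True,  "per_minute_usd": 9.00},  # $756K retrofit
]
print([v["name"] for v in vendors if prequalify(v)])  # -> ['plausible']
```

The point of encoding it this way is that the screen runs in email, in a week, with no scoring sheet: any "no" on a deal-breaker ends the conversation.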

Section 1 — Accuracy and the proper-noun question (35 points)

The four questions in Section 1 are the load-bearing questions of the RFP. We asked all six vendors the same four questions on the same Tuesday afternoon, with two-week response windows, and we pre-supplied the audio and the glossary the response had to use. The audio was a 60-second clinical-content excerpt drawn from the actual back catalogue, picked to maximise proper-noun density: a passage covering medication-reconciliation workflow that referenced apixaban, rivaroxaban, dabigatran, edoxaban (all four DOACs by name), heparin and the reversal agents idarucizumab and andexanet alfa, plus three procedure terms (transcatheter aortic valve replacement, catheter ablation, ECMO cannulation), two ICD-10 codes (I48.91 atrial fibrillation, I82.409 acute embolism and thrombosis), and three software product names that map to the company's own vocabulary. The glossary we supplied was 60 terms — every name above plus the company's own product line, the names of the seven internal product modules, the four customer-onboarding-stage labels, and the names of the three integration partners.

The supplied-audio approach is what made the Section 1 scoring meaningful. Vendors who run their own demo audio always look great on their own demo audio, because the demo audio is selected for vendor-strength territory — single-speaker, professional studio recording, low proper-noun density. Real training video is multi-speaker, often recorded over Zoom or Google Meet at uneven audio levels, and dense with the categories of vocabulary that production-grade auto-caption systems were not trained for. The mismatch between vendor demo and customer reality is the largest source of buyer's remorse in this market. We ran the comparison on our actual audio because the question we were buying an answer to was "what does this vendor produce on our content," not "what does this vendor produce on their content."

Q1 — Word-error-rate floor on training video

Every vendor responded with a stated WER floor. The honest range we got back was 1% to 6%, with the lower end claimed by the vendors who run a human-reviewer step on every minute and the upper end claimed by the auto-only vendors. The number is partly real and partly marketing: WER is sensitive to the corpus on which it was measured, and the published numbers are typically benchmarked on broadcast-grade or interview-style audio rather than on training video. We treated the stated number as a floor claim and trusted the sample-result number from Q4 as the actual answer. The Q1 scoring weight was 5 points out of 35 — treated as a sanity question rather than a load-bearing one.
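For spot-checking stated claims against sample output, WER is just word-level edit distance divided by reference length. Here is a minimal implementation of that standard definition (our own spot-check tool, not any vendor's pipeline):

```python
# Minimal word-error-rate (WER): Levenshtein distance over word tokens,
# i.e. (substitutions + deletions + insertions) / reference word count.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("start apixaban twice daily", "start apex a band twice daily"))  # -> 0.75
```

Note how a single mis-decoded proper noun on a short clip dominates the number: one drug name rendered as three wrong words pushes a 4-word reference to 75% WER. This is why floors measured on broadcast audio say very little about proper-noun-dense training content.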

The vendor-by-vendor stated WER (anonymised, mapped roughly to the public benchmark each vendor publishes):

Q2 — How does accuracy degrade on proper-noun-dense content?

Q2 was where the responses started to separate. The auto-only vendors (V1 AI tier, V5) told us, accurately, that proper-noun density degrades accuracy and offered no specific mitigation beyond "longer audio gives the model more context." The human-review vendors (V2, V3, V4) told us their reviewer step would catch and fix proper nouns and that the catch rate depends on reviewer training in the relevant domain. The glossary-biased vendor (V6) told us that proper-noun degradation is exactly what their architecture optimises for and that the per-customer glossary was the artifact that closed it — a structurally different answer.

We scored Q2 on a 10-point scale based on three sub-criteria: does the vendor acknowledge the proper-noun failure mode at all, does the vendor have an architectural mitigation rather than a workflow patch, and can the vendor produce a numerical accuracy claim on proper-noun-dense content rather than on broadcast audio. The scoring:

Q3 — Per-customer glossary architecture vs find-and-replace

Q3 was the question that caught two of the original 14 vendors in prequalification (find-and-replace only) and was a deal-shaper among the six who made it through. The architectural distinction matters because find-and-replace post-processing fixes only the surface of the caption file — it changes "apex band" to "apixaban" after the fact — but does not change the model's decoding, which means the surrounding tokens are still produced under the wrong proper-noun assumption and the timing is still wrong. Glossary-biased decoding changes what the model produces in the first place, by injecting the glossary terms into the model's prior probability distribution. The technical-strategy companion post (prompting vs glossary models vs fine-tuning) walks the architectural difference in detail; the practical RFP-grading consequence is that a vendor whose only glossary support is find-and-replace is materially weaker on proper-noun-dense content than a vendor whose glossary is integrated at the decoding step.

The Q3 scoring (10 points):

Q4 — Show me a 60-second sample on our training audio with our glossary applied

This was the question that did the most work in the entire RFP. Five business days, our supplied audio, our supplied glossary. We scored on three sub-criteria: did the vendor turn the sample around within the SLA, what was the proper-noun substitution-error count on the 60-second sample, and what was the WCAG SC 1.2.4 caption-synchronisation accuracy as visually inspected by playing the sample alongside the audio. We scored Q4 on 10 points.

The Q4 result is what most decisively separated the vendors. V6 produced 0 errors on a 17-named-entity sample because the glossary was applied at decoding rather than at post-processing; the proper nouns were never produced incorrectly in the first place. V2 produced 2 errors on the same audio with a human reviewer involved, because the reviewer caught most of them but missed two; this is the structural ceiling of human-reviewer-only workflows on dense clinical content. V5 produced 13 errors because nothing in their architecture was directed at the failure mode. The Q4 score is the single most predictive number on the entire RFP.
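The substitution-error count we scored by hand can be sketched as a naive glossary-presence check. The terms and captions below are illustrative; our real scoring was a manual review of each named entity against the audio.

```python
# For each glossary term present in the reference transcript, check
# whether the vendor caption reproduces it verbatim. Naive string
# matching; the real review was done by hand against the audio.

def substitution_errors(reference: str, caption: str, glossary: list) -> list:
    ref, cap = reference.lower(), caption.lower()
    return [t for t in glossary if t.lower() in ref and t.lower() not in cap]

glossary = ["apixaban", "rivaroxaban", "idarucizumab"]
reference = "For apixaban or rivaroxaban, idarucizumab is not the reversal agent."
caption_bad = "For apex a band or river rock saban, ida roo sizz you mab is not the reversal agent."
caption_good = "For apixaban or rivaroxaban, idarucizumab is not the reversal agent."

print(substitution_errors(reference, caption_bad, glossary))   # all three missed
print(substitution_errors(reference, caption_good, glossary))  # -> []
```

The per-entity error list, rather than a single WER number, is what made the Q4 comparison legible to the non-technical evaluators on the team.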

Section 1 totals

Section 1 was 35 points. Final scores (Q1 + Q2 + Q3 + Q4):

  V6 33.5, V2 25, V3 25, V4 21, V1 15, V5 9 (out of 35).

Section 2 — Format, integration, and workflow (20 points)

Section 2 turned out to be where vendor pitches were strongest on paper and weakest in practice. Every vendor will say they support every common format and integrate with every common LMS; the actual integration depth varies enormously, and the vendor-portal-as-required-step pattern is the single most expensive workflow tax in the market. The four questions in Section 2:

Q5 — Caption file format support

We needed SRT, VTT, TTML (for the SCORM packages we ship to enterprise customers' internal LMSes), and STL for the one EU broadcast-licensed clip embedded in the compliance-training catalogue. All six vendors supported all four formats natively or via export — no scoring differentiation here. Q5 scored 2/2 across the board.
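For readers who end up bridging formats themselves, SRT to WebVTT is nearly mechanical: add the WEBVTT header and swap the comma decimal separator in timestamps for a period. A minimal converter (it handles the header and timestamps only, not styling or positioning cues):

```python
# Minimal SRT -> WebVTT conversion: prepend the WEBVTT header and change
# timestamp millisecond separators from comma (SRT) to period (VTT).
import re

def srt_to_vtt(srt: str) -> str:
    # 00:01:02,500 --> 00:01:04,000  becomes  00:01:02.500 --> 00:01:04.000
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt)
    return "WEBVTT\n\n" + body

srt = """1
00:00:01,000 --> 00:00:03,200
Start apixaban twice daily.
"""
print(srt_to_vtt(srt))
```

Going the other direction, or to TTML/STL, is where vendor-native export support earns its keep; those formats carry styling and region metadata that a regex cannot synthesise.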

Q6 — Direct LMS / video-host integration

The catalogue lived across TalentLMS, Kaltura, Wistia, and Vimeo. The vendor-side integration depth varied:

Q7 — Reviewer step and reviewer UI

For the human-reviewed tiers, the reviewer UI determines how much work the L&D team has to do post-receipt. We scored on whether the reviewer UI was vendor-side or customer-side, whether the L&D team had a final-approval workflow, and whether changes flowed back to update the per-customer glossary or model state.

Q8 — Back-catalogue throughput

1,400 hours of one-time back-catalogue retrofit, ideally completed within 16 weeks (the SOC 2 audit window we were preparing for). All six vendors said they could meet this; the credible answers came with capacity caveats.
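The capacity-claim sanity check is one line of arithmetic. The sketch below adds a 10% rework allowance, which is our assumption rather than any vendor's number:

```python
# Back-of-envelope throughput check: can a stated weekly capacity clear
# 1,400 hours inside a 16-week window? rework_fraction (our assumption)
# is the share of hours expected to bounce back for fixes.
import math

def weeks_needed(catalogue_hours: float, weekly_capacity_hours: float,
                 rework_fraction: float = 0.10) -> int:
    effective = catalogue_hours * (1 + rework_fraction)
    return math.ceil(effective / weekly_capacity_hours)

for capacity in (80, 100, 150):
    print(f"{capacity} h/week -> {weeks_needed(1400, capacity)} weeks")
```

At 80 hours/week the retrofit misses a 16-week window; at 100 it just fits with no slack; at 150 there is real headroom. "We can meet this" answers that came with a stated hours-per-week number were the credible ones.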

Section 2 totals

Section 2 was 20 points (Q5 2 + Q6 8 + Q7 6 + Q8 4). Final scores:

  V6 19, V2 17, V3 16, V4 12.5, V1 11, V5 5 (out of 20).

Section 3 — Security, privacy, compliance posture (25 points)

Section 3 was where the RFP shape mattered as much as the question content. We had three load-bearing items: HIPAA BAA execution (because of the customer-questionnaire trigger and the clinical-content track), GDPR Article 28 data-processor terms (because two of the EU customers were running the Article 28 chain through to us), and SOC 2 / ISO 27001 posture (because our own SOC 2 Type II audit was three months out, and our SOC 2 sub-processor list had to include the captioning vendor as a reviewable entry).

Q9 — Where is audio processed and stored?

We needed the answer to include three things: the geographic processing region, the storage retention, and the deletion behaviour on contract end. The full RFP question and the underlying compliance reasoning are documented on our public RFP template.

Q10 — Will you sign a BAA, and a DPA?

This is the question that sorted the field most decisively in Section 3. The trigger was the HIPAA BAA — without one, every minute of clinical-content training that referenced specific patient archetypes (a routine pattern in clinical-content training, where instructors illustrate workflows with anonymised case examples) was off-limits to the vendor and would have to be handled in-house.

Q11 — SOC 2, ISO 27001, FedRAMP, HITRUST posture

Our own SOC 2 audit was the immediate driver. The ISO 27001 question was forward-looking (the EU customers had asked about it). FedRAMP was not in scope for us in this RFP but we tracked the answer because the public-sector pipeline was real. HITRUST was tracked for the same reason as ISO 27001 — forward-looking.

Section 3 totals

Section 3 was 25 points (Q9 8 + Q10 9 + Q11 8). Final scores:

  V2 25, V3 22, V4 22, V6 21.5, V1 19, V5 7 (out of 25).

Section 4 — SLA, pricing, contract (20 points)

Section 4 is where the RFP becomes a procurement document rather than a technical evaluation. The three questions cover the turnaround SLA on a 60-minute video, pricing structure at our volume, and contract termination and data-portability terms. We scored the section on 20 points.

Q12 — Turnaround SLA and credit

For ongoing volume of 30 hours/month, the SLA matters because L&D content has natural deadlines (course launch dates, customer-onboarding-cohort start dates, sales-kickoff dates). For back-catalogue retrofit, the SLA is less acute — the work is bulk and the customer-facing deadline is the SOC 2 audit, which we control.

Q13 — Pricing at our volume

This was the question with the largest absolute spread among the six. Our volume profile was 1,400 hours one-time (back-catalogue retrofit) plus 30 hours/month ongoing (12-month commit, with right to extend). The vendor pricing models varied: per-minute (V1, V2, V3, V4), per-hour-tiered (V6), and seat-based (V5). The vendor-pricing-breakdown post walks the volume math at three reference tiers; the numbers below are our actual quotes, normalised to a per-minute equivalent for comparability:

The price spread is what made the eventual decision feel less close than it looked on the leaderboard. V6's year-1 quote was 17% of V2's year-1 quote on equivalent or better Section 1 + Section 2 scoring. V6 was 24% of V3's quote. The structural reason is that V6's pricing model does not charge per-minute on ongoing volume — the unlimited-usage tier means the marginal cost of a new training video is zero. Per-minute models with a human reviewer in the loop cannot match this because the reviewer is a labour cost that scales linearly with minutes.
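The normalisation itself is simple: year-1 total cost divided by year-1 minutes processed. The two year-1 totals below are the V2 and V6 figures from this post; the minute arithmetic is ours.

```python
# Normalising dissimilar pricing models to a per-minute equivalent:
# year-1 total cost / year-1 minutes processed.

CATALOGUE_MIN = 1400 * 60    # 84,000 min, one-time back-catalogue retrofit
ONGOING_MIN = 30 * 60 * 12   # 21,600 min, 30 h/month over the 12-month commit
TOTAL_MIN = CATALOGUE_MIN + ONGOING_MIN

def per_minute_equiv(year1_total_usd: float) -> float:
    return year1_total_usd / TOTAL_MIN

quotes = {"V2 (per-minute, human-reviewed)": 225_000,
          "V6 (tiered retrofit + unlimited ongoing)": 38_600}
for label, year1 in quotes.items():
    print(f"{label}: ${year1:,} -> ${per_minute_equiv(year1):.2f}/min-equivalent")
print(f"V6 as a share of V2: {38_600 / 225_000:.0%}")  # -> 17%
```

The per-minute-equivalent framing is what let us compare a per-minute quote, a tiered quote, and a seat-based quote on one axis; without it, the seat-based model in particular looks artificially cheap.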

Q14 — Termination and data portability

The two specific items: (a) what notice period applies on customer-initiated termination, and (b) what is the data-export format and SLA on contract end. We needed the captions, the glossary state, the per-asset metadata (including which staff member completed which training, where the LMS integration carried that across), and the audit-trail data to be exportable in standard formats within 30 days.

Section 4 totals

Section 4 was 20 points. The full sub-question math is in the scoring sheet; the headline totals:

  V6 19, V3 14.5, V2 14, V1 13.5, V5 12, V4 11 (out of 20).

The full scoring sheet

The total per vendor across the four sections (max 100):

Vendor                  S1 (35)   S2 (20)   S3 (25)   S4 (20)   Total (100)
V1 (Rev-shaped)         15        11        19        13.5      58.5
V2 (3Play-shaped)       25        17        25        14        81
V3 (Verbit-shaped)      25        16        22        14.5      77.5
V4 (AI-Media-shaped)    21        12.5      22        11        66.5
V5 (Otter-shaped)       9         5         7         12        33
V6 (GlossCap-shaped)    33.5      19        21.5      19        93

The leaderboard read clean: V6 at 93, V2 at 81, V3 at 77.5, V4 at 66.5, V1 at 58.5, V5 at 33. The decision discussion centred on three things: the V6 SOC 2 Type II gap (real, time-bound), the V2 price (real, sustained), and the V6 architecture-is-the-product question (real, novel). The post-mortem captures the four mistakes we made along the way and the discussion that closed the SOC 2 gap.

The four mistakes we made

Mistake 1 — We asked for samples too late

Our original timeline put the Q4 sample request in week 3, after the Section 1 stated-claims responses had come back in week 2. We sequenced it that way to filter the Q4 list to vendors who had passed the stated-claims gate. The cost was a lost week: the sample was the most predictive input on the entire RFP, and we should have requested it in week 1 alongside the prequalification screen. Two vendors took 5–7 days to turn the sample around; had we asked in week 1, that lag would not have pushed the evaluation past week 4. Lesson: ask for the sample as part of prequalification. A vendor who refuses the sample at prequalification stage either does not believe in their own product or does not have real prospect-onboarding capacity, and you want to learn that on day one rather than day fifteen.

Mistake 2 — We tried to evaluate the demo with vendor-supplied audio

The first round of demo calls — three of the six vendors — used the vendor's own demo audio. We accepted this for the first three calls because the vendor had set up the call with their own infrastructure and we were the meeting attendees rather than the meeting hosts. The result was that the demo audio looked great on every vendor's pitch and was not predictive of any vendor's performance on our content. We re-ran the demo in week 4 with our own audio, and the rankings flipped on three of the six. The lesson is that a vendor's choice of demo audio is a marketing decision, and the L&D buyer's only safe move is to insist on supplying the audio. The companion lesson is that the audio you supply must be representative — not the cleanest 60 seconds in your catalogue, the worst 60 seconds. A vendor who passes a hard-case sample passes the easy cases automatically; the reverse is not true.

Mistake 3 — We did not put the BAA on the first call

The BAA question was on the prequalification email, but we did not raise it explicitly on the first call with each vendor. The result was that V5's "we may consider on enterprise terms" pre-call answer turned into "we don't sign BAAs" on the first call, and we lost the rest of the first call to that thread. V1's "BAA available on the enterprise tier" answer turned out to be gated at a $250K-minimum-spend tier — a fact we discovered only on call three. Both of these would have been visible on call one if we'd structured the agenda to put compliance posture first. The lesson is that compliance posture is not a Section 3 dialogue — it is a deal-breaker filter. Put it on the first call's agenda explicitly, in the first 15 minutes, and let the rest of the call run accordingly.

Mistake 4 — We weighted LMS integration as a hygiene item

Section 2 was 20 points in our weighting, and Q6 (LMS integration) was 8 of those 20. In retrospect, that was not enough. Two of the vendors we initially assumed were comparable on integration turned out to mean very different things by "we support TalentLMS" — V1 meant "we support the export format that TalentLMS imports" and V2 meant "we have a TalentLMS plugin that uploads captions directly to the LMS course resource." The labour difference between those two interpretations, on our 1,400-hour catalogue, is several hundred hours of L&D-team time over the retrofit window. For us that's not a hygiene item — it's the difference between hitting the SOC 2 audit window and missing it. The lesson: if your catalogue lives across more than two systems, weight integration depth as a Section 1-adjacent question rather than a Section 2 hygiene one. We'd rerun this with Section 2 at 25 points and Q6 at 12 of those.

The decision and the SOC 2 risk

We picked V6 (the GlossCap-shaped vendor). The leaderboard delta — 12 points clear at 93 vs 81 — would normally be decisive on its own. The argument we had to make to procurement and to the CFO was about the SOC 2 Type II gap. V6 was Type I rather than Type II at the time of decision, with the Type II audit window running through Q1 and Q2 2026. The customer questionnaire that triggered the RFP had asked about the captioning vendor's SOC 2 posture, and "Type I in Type II audit window" is a different answer than "Type II report current."

The argument that landed was three-part. First, V6 was on a documented Type II audit timeline that would close before our own SOC 2 Type II audit closed — the customer would receive a Type II report from us with a Type II report from V6 attached as the captioning sub-processor evidence, and the timing would be clean by the time the customer reviewed it. Second, the SOC 2 control coverage on the V6 Type I report (which we read end to end) was substantively the same as the Type II reports of V2 and V3 — the difference was the audit-period coverage rather than the controls themselves. Third, the V6 architectural advantage on Section 1 was not a 5%-better answer; it was a structurally different product. Section 1 was 35 points in the scoring sheet because that's what we said up-front; V6 won Section 1 by 8.5 points because the architecture is different. Trading off 3.5 points of Section 3 for 8.5 points of Section 1 was a defensible trade given that Section 1 was where the original procurement objective lived.

We documented the trade as a sub-processor risk acceptance signed by the chief people officer and the head of legal, with a quarterly review trigger, and we put the SOC 2 Type II expected-completion date in the contract as a milestone with a credit if missed. The CFO signed because the year-1 dollar delta — $38.6K vs $225K — was a number that made the procurement-risk question separable from the captioning-quality question. If the SOC 2 Type II milestone had slipped past Q3 2026, we'd have had room in the budget to switch to V2 with a one-time migration cost; with V2's quote we would have had no such optionality.

The 6-week timeline we actually ran

The procurement-team norm in this market is 12 weeks from RFP draft to signed contract. We ran six weeks. The timeline:

What made the 6-week timeline feasible was the prework in week 0 — the internal weight-setting, the audio-and-glossary preparation, the stakeholder pre-alignment. RFPs slip when the buyer is making decisions in real time during the procurement; our procurement was effectively a decision tree we walked through with vendor responses providing the inputs. The decision tree was set up before the first vendor email went out.

What we'd do differently the next time

If we ran this RFP again, here is the diff against what we did the first time:

  1. Sample request in prequalification, not in week 3. Saves a week. Lets vendor-screening turn on the most predictive number from day one.
  2. Compliance posture filter on call 1, not call 3. Saves vendor and buyer time. The BAA and SOC 2 questions are five minutes; if a vendor cannot answer "yes" or "yes, gated at $X," the call is over.
  3. Section 2 weighted at 25, not 20, with Q6 at 12 of those. Reflects the operational reality of running captioning across multiple LMS / video-host systems, which is most mid-market companies' state.
  4. Always supply the worst-case 60-second sample, not the average-case one. A vendor who passes the worst case passes the average case for free; the reverse never holds.
  5. Reference-call question scripted up front. "What would you do differently if you were running this RFP again?" is the highest-yield reference-call question we asked. Use it.
  6. Documented sub-processor risk-acceptance template. The V6 SOC 2 Type II gap turned into a 90-minute legal-team conversation that should have been a 15-minute sign-off. We've since written a template that makes the trade-off explicit: which Type-II-coverage gap, which audit-window-completion-date milestone, which credit on miss, who signs.
  7. Run the sample twice — once with the supplied glossary and once without. The without-glossary number tells you the floor. The with-glossary number tells you the ceiling. Both numbers matter; we only collected the with-glossary number, and the without-glossary delta is what differentiates a glossary-aware vendor from a vendor who is selling a feature called "glossary."
  8. Post-decision tracking metric. We forgot to set this. The metric we wish we'd set on day one of the contract: weekly rolling proper-noun-substitution-error count on a 100-asset rolling sample, charted by month. The lagging indicator of caption-quality degradation is the only way to know when to re-run an RFP, and we are now setting it up retroactively.
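The tracking metric from item 8 can be sketched as a small rolling-window counter. The window length, alarm threshold, and sample numbers below are illustrative, not our production values:

```python
# Post-decision caption-quality tracker: weekly proper-noun
# substitution-error counts over a rolling window of sampled assets,
# with a simple errors-per-asset alarm threshold (illustrative values).
from collections import deque

class CaptionQualityTracker:
    def __init__(self, window_weeks: int = 4, alarm_errors_per_asset: float = 0.25):
        self.window = deque(maxlen=window_weeks)
        self.alarm = alarm_errors_per_asset

    def record_week(self, errors: int, assets_sampled: int) -> None:
        self.window.append((errors, assets_sampled))

    @property
    def rolling_rate(self) -> float:
        errors = sum(e for e, _ in self.window)
        assets = sum(a for _, a in self.window)
        return errors / assets if assets else 0.0

    def degraded(self) -> bool:
        return self.rolling_rate > self.alarm

tracker = CaptionQualityTracker()
for week_errors in (10, 10, 50, 70):  # errors on a 100-asset weekly sample
    tracker.record_week(week_errors, 100)
print(round(tracker.rolling_rate, 3), tracker.degraded())  # -> 0.35 True
```

The shape matters more than the constants: a rolling rate that trends up is the lagging indicator that it is time to re-open the vendor question, which is exactly the signal we failed to instrument on day one.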

Where the per-vendor walk-throughs live

The vendor archetypes V1–V6 in this post are anonymised composites, but the public posts cover the named vendors directly:

Related questions

How long does this kind of RFP usually take in our market?

The procurement-team-driven norm in mid-market SaaS is 10–14 weeks from internal-kickoff to signature. Higher-ed procurement is 4–8 months because of the public-procurement-rules overhead (state RFP regulations, board-of-trustees-approval steps, state-IT-procurement-portal posting requirements). Hospital-system procurement is similar to higher-ed for similar reasons. We ran ours in 6 weeks because the trigger was a customer-facing audit deadline and we had the procurement playbook ready before we started; the speed is unusual but not unique. The factor that compresses the timeline is the prework in week 0 — once that's done, the rest is execution.

Does the per-customer glossary architecture really make this much difference?

On clinical-content training video at our volume, yes — the Section 1 score difference between V6 and the next-best vendor was 8.5 points out of 35, which is a structurally different product rather than a 5%-better one. The architectural reason is that glossary-biased decoding changes the model's prior over the proper-noun vocabulary at decode time, which means the model produces correct proper nouns in the first pass rather than producing wrong proper nouns and relying on a reviewer to catch them. The catch-rate ceiling on human reviewers is the issue. Our 17-named-entity sample captured this directly: V6 produced 0 errors at decoding because the proper nouns were never produced wrong; V2 produced 2 errors at reviewer-catch because the reviewer caught most of them but not all. On a 1,400-hour catalogue with the proper-noun density we have, that delta is the difference between needing a hand-correction step at all and not needing one. The technical-strategy post walks the architectural detail.

What if our content is not clinical / proper-noun-heavy? Does Section 1 still dominate?

Less. If your training-video catalogue is mostly soft-skills training, leadership development, sales-coaching role-plays, or general compliance modules with low proper-noun density, the proper-noun failure mode is less load-bearing and the Section 1 weight can come down, to 30 or even 25. We'd still push back on the flat 25-25-25-25 default: captions are the product, not a feature, and accuracy should stay the highest-weighted section. The lever the buyer should pull instead in those cases is Section 2 (workflow), because soft-skills content is usually higher-volume and the workflow tax dominates the cost equation.

How did you handle the engineer-on-the-evaluation-team being a load-bearing input?

We put the engineer on the evaluation team specifically because Section 1 has architectural questions that an L&D-only evaluation team cannot reliably answer. (Ours was a senior backend engineer who, by accident of career history, had spent two years doing speech-recognition-adjacent work; engineering-team RFP volunteers are a lottery.) The architectural distinction between glossary-biased decoding and find-and-replace post-processing is not a marketing claim; it is a technical claim that the L&D team is the wrong team to grade. The lesson: for any RFP where the product is technical, you need one technical evaluator and one functional evaluator, scoring the same dimensions independently and reconciling on a weekly cadence. If your team does not have an engineering volunteer, an external consultant for the duration of the RFP is the alternative; we did not need one, but I'd budget for it next time.

How did the EU customers' GDPR Article 28 chain work in practice?

Two of the EU customers had asked us, as part of their data-processing-agreement onboarding, to identify our captioning sub-processor and to provide DPA-cascade evidence (their DPA signed with us, our DPA signed with the captioning vendor, geographic-processing-region documentation, retention-and-deletion documentation). All three of the leading vendors (V2, V3, V6) supplied this without friction. The Article 28 cascade is load-bearing paperwork more than a technical hurdle, but it is the kind of paperwork that surfaces during EU customer onboarding and silently gates the deal close. The EAA-captions-EU-SMBs post walks through the broader EU compliance picture; Article 28 itself sits in the GDPR statute rather than in the EAA. In our experience, vendors who treat the DPA cascade as a "we'll get back to you" item are vendors who cannot service EU customers; vendors who supply the cascade within 48 hours can. This is a hidden Section 3 sub-question that we did not weight separately but should have.

What did you do with the in-house instructional designer who had been hand-correcting captions for 18 months?

This is a real question and the honest answer is: we promoted them. Hand-correcting captions for 18 months had given the instructional designer a granular knowledge of the catalogue that nobody else in L&D had — they knew which courses had which proper-noun failure patterns, which vendor-supplied transcripts dropped which timing markers, and which courses had been quietly tagged as "do not re-record" because of speaker-availability issues. We promoted them to L&D operations lead, with the captioning-vendor relationship and the per-customer glossary curation as their primary remit. The argument we made internally was that the half-FTE hand-correction labour was being spent on the wrong layer of the stack — correcting the model's output rather than improving the model's input — and that the same person could spend the same hours on higher-leverage work: curating the glossary, evaluating the vendor's reviewer-step quality, and escalating the quality issues that only someone with internal L&D context could see. That catalogue knowledge was exactly the load-bearing input we wanted on our side of the vendor relationship.

How is V6 working out six months in?

The catalogue retrofit shipped in 8 weeks against the 8-week SLA. The SOC 2 Type II milestone closed in week 14, ahead of the contract milestone date. The per-customer glossary has grown from the original 60 terms to 340 over the 24 weeks since contract signature, with the L&D operations lead curating new terms as they emerge. The proper-noun substitution-error count on the rolling 100-asset sample (which we did set up retroactively, per the post-mortem) is averaging 0.4 errors per asset, down from the unmeasured baseline that triggered the procurement. The customer questionnaire that triggered the RFP is now answerable with "yes, all training video supplied to your organisation is captioned to WCAG 2.1 AA, our captioning sub-processor is V6, here is their SOC 2 Type II report and their executed BAA." The customer security-review process now closes in 4 business days rather than the previous 14. None of this is the captioning vendor's product alone — the customer-success team's process improvements account for at least half of the close-time improvement — but the captioning vendor's posture is the unblocking input.

The scoring template you can use

The scoring-sheet structure that worked for us, which you can copy verbatim:

Two evaluators per row. Reconcile weekly. Reference calls in week 5. Decision in week 5 or 6. Contract in week 6.

If your weighting differs (and it might: if your catalogue is not proper-noun-dense, drop S1 to 30 and give the 5 points to S2; if you are higher-ed or hospital, raise S3 to 30 and take the 5 points from S4), document the difference up front in the procurement brief and don't change weights mid-RFP. The single largest source of buyer's remorse in this market is mid-RFP weight changes that track vendor-pitch quality rather than internal procurement objectives. Set the weights once, in week 0, and let the responses fall where they fall.
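For concreteness, the week-0 weight-setting and the two-evaluator weekly reconciliation can be sketched like this. The section weights and scores below are placeholders rather than our actual sheet, and the 0.15 disagreement tolerance is an invented knob:

```python
# Week-0 weights, fixed before any vendor response is read. Placeholder values.
WEIGHTS = {"S1 accuracy": 35, "S2 workflow": 25, "S3 security": 25, "S4 pricing": 15}


def weighted_total(scores: dict[str, float]) -> float:
    """Scores are 0-1 per section; returns the 100-point total."""
    return sum(WEIGHTS[s] * scores[s] for s in WEIGHTS)


def reconcile(a: dict[str, float], b: dict[str, float], tol: float = 0.15) -> list[str]:
    """Sections where the two evaluators disagree by more than tol
    and must re-score together at the weekly reconciliation."""
    return [s for s in WEIGHTS if abs(a[s] - b[s]) > tol]


# Hypothetical scores for one vendor from the two evaluators.
functional = {"S1 accuracy": 0.9, "S2 workflow": 0.8, "S3 security": 0.7, "S4 pricing": 0.6}
technical = {"S1 accuracy": 0.6, "S2 workflow": 0.8, "S3 security": 0.75, "S4 pricing": 0.6}

print(reconcile(functional, technical))  # sections to re-score together
print(weighted_total(functional))
```

The design point is that disagreement is surfaced per section, not averaged away: a functional evaluator at 0.9 and a technical evaluator at 0.6 on accuracy is exactly the kind of gap the weekly reconciliation exists to argue out before the totals mean anything.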

Further reading