Procurement · Published 2026-05-01

How we ran a captioning vendor RFP: scoring sheets, vendor responses, and what we'd do differently

This is a first-person walkthrough of a captioning-vendor RFP we ran for a 280-employee mid-market healthcare-adjacent SaaS company between November 2025 and January 2026. The trigger was the standard set: a 1,400-hour training-video back catalogue, two enterprise customers asking pointed questions about WCAG 2.1 AA conformance during their annual security review, the ADA Title II 2026-04-24 deadline on the horizon for the public-sector customers in the pipeline, and a board-level commitment to ship a clean SOC 2 Type II report by Q2 2026. We had a budget envelope of roughly $90K/year for captioning, a hard ban on hand-correcting auto-captions in-house (the half-FTE cost we walk through in the hidden-half-FTE post was the explicit reason), and a six-week target to pick a vendor and sign. Everything below is anonymised — the company is real, the catalogue is real, the vendor responses are real-shaped composites of the responses we got — but the names are mapped to vendor archetypes (V1–V6) so the post can be useful to other L&D leads running the same procurement. The 14-question RFP template we used is the public version of what we sent.

TL;DR

We ran a six-week RFP against six vendors, scored them on the 14-question template across four sections (accuracy, format/integration/workflow, security/privacy/compliance, SLA/pricing/contract), and ranked them on a 100-point scoring sheet. The final leaderboard put three vendors in contention: V6 (GlossCap-shaped) at 93/100, V2 (3Play-shaped) at 81/100, and V3 (Verbit-shaped) at 77.5/100. V5 (Otter-shaped) failed Section 3 (security/privacy/compliance) on the BAA question alone, and V1 (Rev-shaped) lost serious ground there when its BAA turned out to be gated behind an enterprise-tier minimum spend. V4 (AI-Media-shaped) priced out at 2.4× the budget envelope on Section 4 (SLA/pricing/contract) and walked themselves out. We picked V6 because the per-customer glossary architecture closed the proper-noun failure mode that ate the most reviewer time on the back catalogue, and because its year-1 quote landed at 17% of V2's at our volume. The four mistakes we made were: (1) we asked for samples too late, (2) we evaluated the first demos on vendor-supplied audio rather than our own audio, (3) we did not put the BAA question on the first call, and (4) we weighted LMS integration as a hygiene item rather than a differentiator. Below is the full narrative, the scoring sheet, the timeline, and the post-mortem.

The org and the back catalogue

The company is a 280-employee mid-market SaaS in the healthcare-adjacent space — practice-management software for ambulatory surgery centers and outpatient specialty clinics. The customer roster includes 47 hospital systems, 12 academic medical centers, and a long tail of independent ASC operators. The training-video catalogue had grown organically over four years and consisted of, in round numbers, 600 hours of customer-facing product training (used in customer-onboarding webinars and embedded in the in-app help center), 350 hours of internal sales-enablement video (used at sales kickoff and in ongoing rep certification), 280 hours of clinical-content training video (used by customers' clinical staff to understand the workflow implications of the software, dense with the proper-noun categories we audited in the drug-names post), 110 hours of compliance training video (HIPAA, SOC 2, GDPR awareness, anti-bribery, anti-harassment), and roughly 60 hours of recorded all-hands and engineering deep-dive content with intermittent training-archive value. Total: 1,400 hours.

The state of captions on this catalogue at the start of the RFP was the modal mid-market state: roughly 220 hours had captions of some quality, mostly produced by uploading the audio to YouTube auto-caption and downloading the SRT, with about 40 of those hours subsequently hand-corrected by an instructional designer who had quietly absorbed the work over an 18-month period. The remaining 1,180 hours had no captions at all. The VP of customer success had been told the captions were 95% covered (because the customer-facing product training was nearly fully covered); the actual catalogue-wide coverage was 16%. This kind of perception gap is universal in mid-market L&D — the head of L&D is usually flying with one functional dashboard and a back catalogue nobody has audited end-to-end.

The trigger event was a customer security review. A new hospital-system customer's accessibility office sent a 22-question vendor questionnaire that explicitly asked, on Q14, whether all training video supplied to their organisation was captioned to WCAG 2.1 AA, and on Q15 whether the captioning vendor had executed a Business Associate Agreement under HIPAA in the case where customer-account names or any PHI surface appeared in the training. The honest answer to Q14 was "for the customer-facing product training but not for the clinical-content training," which the accessibility office flagged as inadequate. The honest answer to Q15 was "no, because we don't currently use a captioning vendor — captions are produced ad-hoc by an instructional designer." Neither answer worked. The customer success VP escalated to the chief people officer, who escalated to the CFO, who released the captioning-procurement budget within ten days. Welcome to the RFP.

What we needed the RFP to settle

The procurement objectives mapped to the four sections of the RFP template, with weights we set up front before any vendor talked to us. We treated this weight-setting as the most important step in the whole RFP — getting the weights right is what stops a procurement from being driven by sales-deck quality. The weights we used:

  1. Section 1 (accuracy and the proper-noun question): 35 points.
  2. Section 2 (format, integration, workflow): 20 points.
  3. Section 3 (security, privacy, compliance): 25 points.
  4. Section 4 (SLA, pricing, contract): 20 points.

The weighting we explicitly rejected was the framework many mid-market RFPs default to: 25/25/25/25 across the four sections. A flat split sounds neutral, but it understates Section 1 by treating accuracy as a feature like pricing. For training video, accuracy is the product. The flat-split RFP is how mid-market companies end up signing two-year contracts with a vendor whose Section 1 score was 12/25 — adequate-looking on the scoring sheet but operationally useless on clinical content. We learnt this from talking to two peer L&D leads at companies in the same vertical who had run flat-split RFPs and were, six months later, in the process of running a second RFP to replace the vendor they had just signed.
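To make the flat-split trap concrete, here is a small sketch applying the two weightings to a pair of invented vendors: an accuracy-strong vendor A and a vendor B that is strong everywhere else. The vendor numbers are illustrative, not our scoring sheet.

```python
# Sketch: how section weights reshape an RFP leaderboard. Per-section
# scores are expressed as fractions of each section's maximum, then
# multiplied by the section weight. Vendor numbers are illustrative.

def total_score(fractions: dict, weights: dict) -> float:
    return sum(fractions[section] * weights[section] for section in weights)

WEIGHTS_OURS = {"accuracy": 35, "workflow": 20, "security": 25, "contract": 20}
WEIGHTS_FLAT = {"accuracy": 25, "workflow": 25, "security": 25, "contract": 25}

# Hypothetical vendors: A is accuracy-strong, B is strong everywhere else.
vendor_a = {"accuracy": 0.95, "workflow": 0.75, "security": 0.80, "contract": 0.70}
vendor_b = {"accuracy": 0.48, "workflow": 0.95, "security": 0.90, "contract": 0.95}

for name, weights in (("35/20/25/20", WEIGHTS_OURS), ("flat 25/25/25/25", WEIGHTS_FLAT)):
    a = total_score(vendor_a, weights)
    b = total_score(vendor_b, weights)
    print(f"{name}: A={a:.1f}, B={b:.1f}, winner={'A' if a > b else 'B'}")
```

Under the flat split, vendor B wins; under the accuracy-weighted split, vendor A does. That inversion is exactly how the flat-split RFPs our peer L&D leads ran ended up signing the Section 1 12/25 vendor.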

The longlist and the prequalification

The captioning-vendor market is wider than most procurement teams realise on day one. We started with a longlist of 14 vendors drawn from analyst reports (Capterra, G2, Training Industry's vendor directories), peer-recommendation cycles in the L&D Slack communities we participate in, and a search-driven sweep of "captioning vendor" + adjacent search terms. The longlist was: Rev, 3Play Media, Verbit, AI-Media, Otter, Captiongenerator, AmberScript, Sonix, Vidcaster, Cielo24, Dotsub, GoTranscript, GlossCap, and one regional vendor whose pitch was that they could supply a human captioner for the clinical-content track and the AI auto-caption-plus-edit workflow for everything else.

We applied a five-question prequalification screen by email to the full 14, with a one-week response window. The prequalification questions were extracted from the full RFP template:

  1. Do you support per-customer glossaries that influence the model's decoding, or only post-hoc find-and-replace? (Question 3 from the template.)
  2. Will you sign a BAA, and under what trigger? (Question 10 from the template.)
  3. What is your stated word-error-rate floor on training video, and how is it measured? (Question 1 from the template.)
  4. What does pricing look like at our expected volume of 1,400 hours of one-time back-catalogue retrofit plus roughly 30 hours/month ongoing? (Question 13 from the template.)
  5. Can you turn around a 60-second sample on our supplied audio, with our supplied glossary, within five business days of request? (Question 4 from the template.)

Six vendors made it through prequalification. Eight did not, for various reasons: three did not respond at all (the long tail of small captioning vendors is real and includes vendors with no funnel for inbound enterprise leads); two refused to sign a BAA at any volume (a deal-breaker we put in the prequalification specifically because the question takes 30 seconds to answer and saves both sides weeks); one quoted $9 per minute on the back catalogue, which put the catalogue retrofit alone at $756K and priced them out on their own quote; and two supported only find-and-replace post-processing rather than glossary-biased decoding, which we treated as a hard fail on the proper-noun question. The six that made it through were the canonical mid-market field: Rev-shaped, 3Play-shaped, Verbit-shaped, AI-Media-shaped, Otter-shaped (the shadow-IT contender, whose Section 3 answer we already suspected would not survive), and the GlossCap-shaped vendor that the head of L&D had been quietly piloting since October.
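For anyone templating this, the prequalification screen reduces to a hard-fail filter: a single deal-breaker answer removes a vendor before any scoring happens. The field names and the retrofit price ceiling in the sketch below are illustrative, not our actual template fields.

```python
# Prequalification as a hard-fail filter. The ceiling is an illustrative
# sanity bound on the one-time 1,400-hour retrofit, not our real budget line.

RETROFIT_CEILING_USD = 200_000

def prequalify(vendor: dict) -> bool:
    if not vendor["signs_baa"]:                      # Q10: refuses a BAA -> out
        return False
    if not vendor["glossary_biased_decoding"]:       # Q3: find-and-replace only -> out
        return False
    retrofit = vendor["per_minute_usd"] * 1400 * 60  # Q13: price the full catalogue
    return retrofit <= RETROFIT_CEILING_USD

vendors = [
    {"name": "plausible",  "signs_baa": True,  "glossary_biased_decoding": True,  "per_minute_usd": 1.50},
    {"name": "no-baa",     "signs_baa": False, "glossary_biased_decoding": True,  "per_minute_usd": 1.00},
    {"name": "priced-out", "signs_baa": True,  "glossary_biased_decoding": True,  "per_minute_usd": 9.00},  # $756K retrofit
]
print([v["name"] for v in vendors if prequalify(v)])  # -> ['plausible']
```

The point of encoding it this way is that the screen runs in email, in a week, with no scoring sheet: any "no" on a deal-breaker ends the conversation.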

Section 1 — Accuracy and the proper-noun question (35 points)

The four questions in Section 1 are the load-bearing questions of the RFP. We asked all six vendors the same four questions on the same Tuesday afternoon, with two-week response windows, and we pre-supplied the audio and the glossary the response had to use. The audio was a 60-second clinical-content excerpt drawn from the actual back catalogue, picked to maximise proper-noun density: a passage covering medication-reconciliation workflow that referenced apixaban, rivaroxaban, dabigatran, edoxaban (all four DOACs by name), heparin and the reversal agents idarucizumab and andexanet alfa, plus three procedure terms (transcatheter aortic valve replacement, catheter ablation, ECMO cannulation), two ICD-10 codes (I48.91 atrial fibrillation, I82.409 acute embolism and thrombosis), and three software product names that map to the company's own vocabulary. The glossary we supplied was 60 terms — every name above plus the company's own product line, the names of the seven internal product modules, the four customer-onboarding-stage labels, and the names of the three integration partners.

The supplied-audio approach is what made the Section 1 scoring meaningful. Vendors who run their own demo audio always look great on their own demo audio, because the demo audio is selected for vendor-strength territory — single-speaker, professional studio recording, low proper-noun density. Real training video is multi-speaker, often recorded over Zoom or Google Meet at uneven audio levels, and dense with the categories of vocabulary that production-grade auto-caption systems were not trained for. The mismatch between vendor demo and customer reality is the largest source of buyer's remorse in this market. We ran the comparison on our actual audio because the question we were buying an answer to was "what does this vendor produce on our content," not "what does this vendor produce on their content."

Q1 — Word-error-rate floor on training video

Every vendor responded with a stated WER floor. The honest range we got back was 1% to 6%, with the lower end claimed by the vendors who run a human-reviewer step on every minute and the upper end claimed by the auto-only vendors. The number is partly real and partly marketing: WER is sensitive to the corpus on which it was measured, and the published numbers are typically benchmarked on broadcast-grade or interview-style audio rather than on training video. We treated the stated number as a floor claim and trusted the sample-result number from Q4 as the actual answer. The Q1 scoring weight was 5 points out of 35 — treated as a sanity question rather than a load-bearing one.
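For spot-checking stated claims against sample output, WER is just word-level edit distance divided by reference length. Here is a minimal implementation of that standard definition (our own spot-check tool, not any vendor's pipeline):

```python
# Minimal word-error-rate (WER): Levenshtein distance over word tokens,
# i.e. (substitutions + deletions + insertions) / reference word count.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("start apixaban twice daily", "start apex a band twice daily"))  # -> 0.75
```

Note how a single mis-decoded proper noun on a short clip dominates the number: one drug name rendered as three wrong words pushes a 4-word reference to 75% WER. This is why floors measured on broadcast audio say very little about proper-noun-dense training content.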

The vendor-by-vendor stated WER (anonymised, mapped roughly to the public benchmark each vendor publishes):

Q2 — How does accuracy degrade on proper-noun-dense content?

Q2 was where the responses started to separate. The auto-only vendors (V1 AI tier, V5) told us, accurately, that proper-noun density degrades accuracy and offered no specific mitigation beyond "longer audio gives the model more context." The human-review vendors (V2, V3, V4) told us their reviewer step would catch and fix proper nouns and that the catch rate depends on reviewer training in the relevant domain. The glossary-biased vendor (V6) told us that proper-noun degradation is exactly what their architecture optimises for and that the per-customer glossary was the artifact that closed it — a structurally different answer.

We scored Q2 on a 10-point scale based on three sub-criteria: does the vendor acknowledge the proper-noun failure mode at all, does the vendor have an architectural mitigation rather than a workflow patch, and can the vendor produce a numerical accuracy claim on proper-noun-dense content rather than on broadcast audio. The scoring:

Q3 — Per-customer glossary architecture vs find-and-replace

Q3 was the question that caught two of the original 14 vendors in prequalification (find-and-replace only) and was a deal-shaper among the six who made it through. The architectural distinction matters because find-and-replace post-processing fixes only the surface of the caption file — it changes "apex band" to "apixaban" after the fact — but does not change the model's decoding, which means the surrounding tokens are still produced under the wrong proper-noun assumption and the timing is still wrong. Glossary-biased decoding changes what the model produces in the first place, by injecting the glossary terms into the model's prior probability distribution. The technical-strategy companion post (prompting vs glossary models vs fine-tuning) walks the architectural difference in detail; the practical RFP-grading consequence is that a vendor whose only glossary support is find-and-replace is materially weaker on proper-noun-dense content than a vendor whose glossary is integrated at the decoding step.

The Q3 scoring (10 points):

Q4 — Show me a 60-second sample on our training audio with our glossary applied

This was the question that did the most work in the entire RFP. Five business days, our supplied audio, our supplied glossary. We scored on three sub-criteria: did the vendor turn the sample around within the SLA, what was the proper-noun substitution-error count on the 60-second sample, and what was the WCAG SC 1.2.4 caption-synchronisation accuracy as visually inspected by playing the sample alongside the audio. We scored Q4 on 10 points.

The Q4 result is what most decisively separated the vendors. V6 produced 0 errors on a 17-named-entity sample because the glossary was applied at decoding rather than at post-processing; the proper nouns were never produced incorrectly in the first place. V2 produced 2 errors on the same audio with a human reviewer involved, because the reviewer caught most of them but missed two; this is the structural ceiling of human-reviewer-only workflows on dense clinical content. V5 produced 13 errors because nothing in their architecture was directed at the failure mode. The Q4 score is the single most predictive number on the entire RFP.
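The substitution-error count we scored by hand can be sketched as a naive glossary-presence check. The terms and captions below are illustrative; our real scoring was a manual review of each named entity against the audio.

```python
# For each glossary term present in the reference transcript, check
# whether the vendor caption reproduces it verbatim. Naive string
# matching; the real review was done by hand against the audio.

def substitution_errors(reference: str, caption: str, glossary: list) -> list:
    ref, cap = reference.lower(), caption.lower()
    return [t for t in glossary if t.lower() in ref and t.lower() not in cap]

glossary = ["apixaban", "rivaroxaban", "idarucizumab"]
reference = "For apixaban or rivaroxaban, idarucizumab is not the reversal agent."
caption_bad = "For apex a band or river rock saban, ida roo sizz you mab is not the reversal agent."
caption_good = "For apixaban or rivaroxaban, idarucizumab is not the reversal agent."

print(substitution_errors(reference, caption_bad, glossary))   # all three missed
print(substitution_errors(reference, caption_good, glossary))  # -> []
```

The per-entity error list, rather than a single WER number, is what made the Q4 comparison legible to the non-technical evaluators on the team.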

Section 1 totals

Section 1 was 35 points. Final scores (Q1 + Q2 + Q3 + Q4):

  V6 33.5, V2 25, V3 25, V4 21, V1 15, V5 9 (out of 35).

Section 2 — Format, integration, and workflow (20 points)

Section 2 turned out to be where vendor pitches were strongest on paper and weakest in practice. Every vendor will say they support every common format and integrate with every common LMS; the actual integration depth varies enormously, and the vendor-portal-as-required-step pattern is the single most expensive workflow tax in the market. The four questions in Section 2:

Q5 — Caption file format support

We needed SRT, VTT, TTML (for the SCORM packages we ship to enterprise customers' internal LMSes), and STL for the one EU broadcast-licensed clip embedded in the compliance-training catalogue. All six vendors supported all four formats natively or via export — no scoring differentiation here. Q5 scored 2/2 across the board.
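For readers who end up bridging formats themselves, SRT to WebVTT is nearly mechanical: add the WEBVTT header and swap the comma decimal separator in timestamps for a period. A minimal converter (it handles the header and timestamps only, not styling or positioning cues):

```python
# Minimal SRT -> WebVTT conversion: prepend the WEBVTT header and change
# timestamp millisecond separators from comma (SRT) to period (VTT).
import re

def srt_to_vtt(srt: str) -> str:
    # 00:01:02,500 --> 00:01:04,000  becomes  00:01:02.500 --> 00:01:04.000
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt)
    return "WEBVTT\n\n" + body

srt = """1
00:00:01,000 --> 00:00:03,200
Start apixaban twice daily.
"""
print(srt_to_vtt(srt))
```

Going the other direction, or to TTML/STL, is where vendor-native export support earns its keep; those formats carry styling and region metadata that a regex cannot synthesise.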

Q6 — Direct LMS / video-host integration

The catalogue lived across TalentLMS, Kaltura, Wistia, and Vimeo. The vendor-side integration depth varied:

Q7 — Reviewer step and reviewer UI

For the human-reviewed tiers, the reviewer UI determines how much work the L&D team has to do post-receipt. We scored on whether the reviewer UI was vendor-side or customer-side, whether the L&D team had a final-approval workflow, and whether changes flowed back to update the per-customer glossary or model state.

Q8 — Back-catalogue throughput

1,400 hours of one-time back-catalogue retrofit, ideally completed within 16 weeks (the SOC 2 audit window we were preparing for). All six vendors said they could meet this; the credible answers came with capacity caveats.
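The capacity-claim sanity check is one line of arithmetic. The sketch below adds a 10% rework allowance, which is our assumption rather than any vendor's number:

```python
# Back-of-envelope throughput check: can a stated weekly capacity clear
# 1,400 hours inside a 16-week window? rework_fraction (our assumption)
# is the share of hours expected to bounce back for fixes.
import math

def weeks_needed(catalogue_hours: float, weekly_capacity_hours: float,
                 rework_fraction: float = 0.10) -> int:
    effective = catalogue_hours * (1 + rework_fraction)
    return math.ceil(effective / weekly_capacity_hours)

for capacity in (80, 100, 150):
    print(f"{capacity} h/week -> {weeks_needed(1400, capacity)} weeks")
```

At 80 hours/week the retrofit misses a 16-week window; at 100 it just fits with no slack; at 150 there is real headroom. "We can meet this" answers that came with a stated hours-per-week number were the credible ones.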

Section 2 totals

Section 2 was 20 points (Q5 2 + Q6 8 + Q7 6 + Q8 4). Final scores:

  V6 19, V2 17, V3 16, V4 12.5, V1 11, V5 5 (out of 20).

Section 3 — Security, privacy, compliance posture (25 points)

Section 3 was where the RFP shape mattered as much as the question content. We had three load-bearing items: HIPAA BAA execution (because of the customer-questionnaire trigger and the clinical-content track), GDPR Article 28 data-processor terms (because two of the EU customers were running the Article 28 chain through to us), and SOC 2 / ISO 27001 posture (because our own SOC 2 Type II audit was three months out, and our SOC 2 sub-processor list had to include the captioning vendor as a reviewable entry).

Q9 — Where is audio processed and stored?

We needed the answer to include three things: the geographic processing region, the storage retention, and the deletion behaviour on contract end. The full RFP question and the underlying compliance reasoning are documented on our public RFP template.

Q10 — Will you sign a BAA, and a DPA?

This is the question that sorted the field most decisively in Section 3. The trigger was the HIPAA BAA — without one, every minute of clinical-content training that referenced specific patient archetypes (a routine pattern in clinical-content training, where instructors illustrate workflows with anonymised case examples) was off-limits to the vendor and would have to be handled in-house.

Q11 — SOC 2, ISO 27001, FedRAMP, HITRUST posture

Our own SOC 2 audit was the immediate driver. The ISO 27001 question was forward-looking (the EU customers had asked about it). FedRAMP was not in scope for us in this RFP but we tracked the answer because the public-sector pipeline was real. HITRUST was tracked for the same reason as ISO 27001 — forward-looking.

Section 3 totals

Section 3 was 25 points (Q9 8 + Q10 9 + Q11 8). Final scores:

  V2 25, V3 22, V4 22, V6 21.5, V1 19, V5 7 (out of 25).

Section 4 — SLA, pricing, contract (20 points)

Section 4 is where the RFP becomes a procurement document rather than a technical evaluation. The three questions cover the turnaround SLA on a 60-minute video, pricing structure at our volume, and contract termination and data-portability terms. We scored the section on 20 points.

Q12 — Turnaround SLA and credit

For ongoing volume of 30 hours/month, the SLA matters because L&D content has natural deadlines (course launch dates, customer-onboarding-cohort start dates, sales-kickoff dates). For back-catalogue retrofit, the SLA is less acute — the work is bulk and the customer-facing deadline is the SOC 2 audit, which we control.

Q13 — Pricing at our volume

This was the question with the largest absolute spread among the six. Our volume profile was 1,400 hours one-time (back-catalogue retrofit) plus 30 hours/month ongoing (12-month commit, with right to extend). The vendor pricing models varied: per-minute (V1, V2, V3, V4), per-hour-tiered (V6), and seat-based (V5). The vendor-pricing-breakdown post walks the volume math at three reference tiers; the numbers below are our actual quotes, normalised to a per-minute equivalent for comparability:

The price spread is what made the eventual decision feel less close than it looked on the leaderboard. V6's year-1 quote was 17% of V2's year-1 quote on equivalent or better Section 1 + Section 2 scoring. V6 was 24% of V3's quote. The structural reason is that V6's pricing model does not charge per-minute on ongoing volume — the unlimited-usage tier means the marginal cost of a new training video is zero. Per-minute models with a human reviewer in the loop cannot match this because the reviewer is a labour cost that scales linearly with minutes.
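The normalisation itself is simple: year-1 total cost divided by year-1 minutes processed. The two year-1 totals below are the V2 and V6 figures from this post; the minute arithmetic is ours.

```python
# Normalising dissimilar pricing models to a per-minute equivalent:
# year-1 total cost / year-1 minutes processed.

CATALOGUE_MIN = 1400 * 60    # 84,000 min, one-time back-catalogue retrofit
ONGOING_MIN = 30 * 60 * 12   # 21,600 min, 30 h/month over the 12-month commit
TOTAL_MIN = CATALOGUE_MIN + ONGOING_MIN

def per_minute_equiv(year1_total_usd: float) -> float:
    return year1_total_usd / TOTAL_MIN

quotes = {"V2 (per-minute, human-reviewed)": 225_000,
          "V6 (tiered retrofit + unlimited ongoing)": 38_600}
for label, year1 in quotes.items():
    print(f"{label}: ${year1:,} -> ${per_minute_equiv(year1):.2f}/min-equivalent")
print(f"V6 as a share of V2: {38_600 / 225_000:.0%}")  # -> 17%
```

The per-minute-equivalent framing is what let us compare a per-minute quote, a tiered quote, and a seat-based quote on one axis; without it, the seat-based model in particular looks artificially cheap.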

Q14 — Termination and data portability

The two specific items: (a) what notice period applies on customer-initiated termination, and (b) what is the data-export format and SLA on contract end. We needed the captions, the glossary state, the per-asset metadata (including which staff member completed which training, where the LMS integration carried that across), and the audit-trail data to be exportable in standard formats within 30 days.

Section 4 totals

Section 4 was 20 points. The full sub-question math is in the scoring sheet; the headline totals:

  V6 19, V3 14.5, V2 14, V1 13.5, V5 12, V4 11 (out of 20).

The full scoring sheet

The total per vendor across the four sections (max 100):

Vendor                  S1 (35)   S2 (20)   S3 (25)   S4 (20)   Total (100)
V1 (Rev-shaped)         15        11        19        13.5      58.5
V2 (3Play-shaped)       25        17        25        14        81
V3 (Verbit-shaped)      25        16        22        14.5      77.5
V4 (AI-Media-shaped)    21        12.5      22        11        66.5
V5 (Otter-shaped)       9         5         7         12        33
V6 (GlossCap-shaped)    33.5      19        21.5      19        93

The leaderboard read clean: V6 at 93, V2 at 81, V3 at 77.5, V4 at 66.5, V1 at 58.5, V5 at 33. The decision discussion centred on three things: the V6 SOC 2 Type II gap (real, time-bound), the V2 price (real, sustained), and the V6 architecture-is-the-product question (real, novel). The post-mortem captures the four mistakes we made along the way and the discussion that closed the SOC 2 gap.

The four mistakes we made

Mistake 1 — We asked for samples too late

Our original timeline put the Q4 sample request in week 3, after the Section 1 stated-claims responses had come back in week 2. We sequenced it that way to filter the Q4 list to vendors who had passed the stated-claims gate. The cost was a lost week: the sample was the most predictive input on the entire RFP, and we should have requested it in week 1 alongside the prequalification screen. Two vendors took 5–7 days to turn the sample around; had we asked in week 1, that lag would not have pushed the evaluation past week 4. Lesson: ask for the sample as part of prequalification. A vendor who refuses the sample at prequalification stage either does not believe in their own product or does not have real prospect-onboarding capacity, and you want to learn that on day one rather than day fifteen.

Mistake 2 — We tried to evaluate the demo with vendor-supplied audio

The first round of demo calls — three of the six vendors — used the vendor's own demo audio. We accepted this for the first three calls because the vendor had set up the call with their own infrastructure and we were the meeting attendees rather than the meeting hosts. The result was that the demo audio looked great on every vendor's pitch and was not predictive of any vendor's performance on our content. We re-ran the demo in week 4 with our own audio, and the rankings flipped on three of the six. The lesson is that a vendor's choice of demo audio is a marketing decision, and the L&D buyer's only safe move is to insist on supplying the audio. The companion lesson is that the audio you supply must be representative — not the cleanest 60 seconds in your catalogue, the worst 60 seconds. A vendor who passes a hard-case sample passes the easy cases automatically; the reverse is not true.

Mistake 3 — We did not put the BAA on the first call

The BAA question was on the prequalification email, but we did not raise it explicitly on the first call with each vendor. The result was that V5's "we may consider on enterprise terms" pre-call answer turned into "we don't sign BAAs" on the first call, and we lost the rest of the first call to that thread. V1's "BAA available on the enterprise tier" answer turned out to be gated at a $250K-minimum-spend tier — a fact we discovered only on call three. Both of these would have been visible on call one if we'd structured the agenda to put compliance posture first. The lesson is that compliance posture is not a Section 3 dialogue — it is a deal-breaker filter. Put it on the first call's agenda explicitly, in the first 15 minutes, and let the rest of the call run accordingly.

Mistake 4 — We weighted LMS integration as a hygiene item

Section 2 was 20 points in our weighting, and Q6 (LMS integration) was 8 of those 20. In retrospect, that was not enough. Two of the vendors we initially assumed were comparable on integration turned out to mean very different things by "we support TalentLMS" — V1 meant "we support the export format that TalentLMS imports" and V2 meant "we have a TalentLMS plugin that uploads captions directly to the LMS course resource." The labour difference between those two interpretations, on our 1,400-hour catalogue, is several hundred hours of L&D-team time over the retrofit window. For us that's not a hygiene item — it's the difference between hitting the SOC 2 audit window and missing it. The lesson: if your catalogue lives across more than two systems, weight integration depth as a Section 1-adjacent question rather than a Section 2 hygiene one. We'd rerun this with Section 2 at 25 points and Q6 at 12 of those.

The decision and the SOC 2 risk

We picked V6 (the GlossCap-shaped vendor). The leaderboard delta — 12 points clear at 93 vs 81 — would normally be decisive on its own. The argument we had to make to procurement and to the CFO was about the SOC 2 Type II gap. V6 was Type I rather than Type II at the time of decision, with the Type II audit window running through Q1 and Q2 2026. The customer questionnaire that triggered the RFP had asked about the captioning vendor's SOC 2 posture, and "Type I in Type II audit window" is a different answer than "Type II report current."

The argument that landed was three-part. First, V6 was on a documented Type II audit timeline that would close before our own SOC 2 Type II audit closed — the customer would receive a Type II report from us with a Type II report from V6 attached as the captioning sub-processor evidence, and the timing would be clean by the time the customer reviewed it. Second, the SOC 2 control coverage on the V6 Type I report (which we read end to end) was substantively the same as the Type II reports of V2 and V3 — the difference was the audit-period coverage rather than the controls themselves. Third, the V6 architectural advantage on Section 1 was not a 5%-better answer; it was a structurally different product. Section 1 was 35 points in the scoring sheet because that's what we said up-front; V6 won Section 1 by 8.5 points because the architecture is different. Trading off 3.5 points of Section 3 for 8.5 points of Section 1 was a defensible trade given that Section 1 was where the original procurement objective lived.

We documented the trade as a sub-processor risk acceptance signed by the chief people officer and the head of legal, with a quarterly review trigger, and we put the SOC 2 Type II expected-completion date in the contract as a milestone with a credit if missed. The CFO signed because the year-1 dollar delta — $38.6K vs $225K — was a number that made the procurement-risk question separable from the captioning-quality question. If the SOC 2 Type II milestone had slipped past Q3 2026, we'd have had room in the budget to switch to V2 with a one-time migration cost; with V2's quote we would have had no such optionality.

The 6-week timeline we actually ran

The procurement-team norm in this market is 12 weeks from RFP draft to signed contract. We ran six weeks. The timeline:

What made the 6-week timeline feasible was the prework in week 0 — the internal weight-setting, the audio-and-glossary preparation, the stakeholder pre-alignment. RFPs slip when the buyer is making decisions in real time during the procurement; our procurement was effectively a decision tree we walked through with vendor responses providing the inputs. The decision tree was set up before the first vendor email went out.

What we'd do differently the next time

If we ran this RFP again, here is the diff against what we did the first time:

  1. Sample request in prequalification, not in week 3. Saves a week. Lets vendor-screening turn on the most predictive number from day one.
  2. Compliance posture filter on call 1, not call 3. Saves vendor and buyer time. The BAA and SOC 2 questions are five minutes; if a vendor cannot answer "yes" or "yes, gated at $X," the call is over.
  3. Section 2 weighted at 25, not 20, with Q6 at 12 of those. Reflects the operational reality of running captioning across multiple LMS / video-host systems, which is most mid-market companies' state.
  4. Always supply the worst-case 60-second sample, not the average-case one. A vendor who passes the worst case passes the average case for free; the reverse never holds.
  5. Reference-call question scripted up front. "What would you do differently if you were running this RFP again?" is the highest-yield reference-call question we asked. Use it.
  6. Documented sub-processor risk-acceptance template. The V6 SOC 2 Type II gap turned into a 90-minute legal-team conversation that should have been a 15-minute sign-off. We've since written a template that makes the trade-off explicit: which Type-II-coverage gap, which audit-window-completion-date milestone, which credit on miss, who signs.
  7. Run the sample twice — once with the supplied glossary and once without. The without-glossary number tells you the floor. The with-glossary number tells you the ceiling. Both numbers matter; we only collected the with-glossary number, and the without-glossary delta is what differentiates a glossary-aware vendor from a vendor who is selling a feature called "glossary."
  8. Post-decision tracking metric. We forgot to set this. The metric we wish we'd set on day one of the contract: weekly rolling proper-noun-substitution-error count on a 100-asset rolling sample, charted by month. The lagging indicator of caption-quality degradation is the only way to know when to re-run an RFP, and we are now setting it up retroactively.
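The tracking metric from item 8 can be sketched as a small rolling-window counter. The window length, alarm threshold, and sample numbers below are illustrative, not our production values:

```python
# Post-decision caption-quality tracker: weekly proper-noun
# substitution-error counts over a rolling window of sampled assets,
# with a simple errors-per-asset alarm threshold (illustrative values).
from collections import deque

class CaptionQualityTracker:
    def __init__(self, window_weeks: int = 4, alarm_errors_per_asset: float = 0.25):
        self.window = deque(maxlen=window_weeks)
        self.alarm = alarm_errors_per_asset

    def record_week(self, errors: int, assets_sampled: int) -> None:
        self.window.append((errors, assets_sampled))

    @property
    def rolling_rate(self) -> float:
        errors = sum(e for e, _ in self.window)
        assets = sum(a for _, a in self.window)
        return errors / assets if assets else 0.0

    def degraded(self) -> bool:
        return self.rolling_rate > self.alarm

tracker = CaptionQualityTracker()
for week_errors in (10, 10, 50, 70):  # errors on a 100-asset weekly sample
    tracker.record_week(week_errors, 100)
print(round(tracker.rolling_rate, 3), tracker.degraded())  # -> 0.35 True
```

The shape matters more than the constants: a rolling rate that trends up is the lagging indicator that it is time to re-open the vendor question, which is exactly the signal we failed to instrument on day one.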

Where the per-vendor walk-throughs live

The vendor archetypes V1–V6 in this post are anonymised composites, but the public posts cover the named vendors directly:

Related questions

How long does this kind of RFP usually take in our market?

The procurement-team-driven norm in mid-market SaaS is 10–14 weeks from internal-kickoff to signature. Higher-ed procurement is 4–8 months because of the public-procurement-rules overhead (state RFP regulations, board-of-trustees-approval steps, state-IT-procurement-portal posting requirements). Hospital-system procurement is similar to higher-ed for similar reasons. We ran ours in 6 weeks because the trigger was a customer-facing audit deadline and we had the procurement playbook ready before we started; the speed is unusual but not unique. The factor that compresses the timeline is the prework in week 0 — once that's done, the rest is execution.

Does the per-customer glossary architecture really make this much difference?

On clinical-content training video at our volume, yes — the Section 1 score difference between V6 and the next-best vendor was 8.5 points out of 35, which is a structurally different product rather than a 5%-better one. The architectural reason is that glossary-biased decoding changes the model's prior over the proper-noun vocabulary at decode time, which means the model produces correct proper nouns in the first pass rather than producing wrong proper nouns and relying on a reviewer to catch them. The catch-rate ceiling on human reviewers is the issue. Our 17-named-entity sample captured this directly: V6 produced 0 errors at decoding because the proper nouns were never produced wrong; V2 produced 2 errors at reviewer-catch because the reviewer caught most of them but not all. On a 1,400-hour catalogue with the proper-noun density we have, that delta is the difference between needing a hand-correction step at all and not needing one. The technical-strategy post walks the architectural detail.

What if our content is not clinical / proper-noun-heavy? Does Section 1 still dominate?

Less. If your training-video catalogue is mostly soft-skills training, leadership development, sales-coaching role-plays, or general compliance modules with low proper-noun density, the proper-noun failure mode is less load-bearing and the Section 1 weight can come down, to 30 or even 25. We'd still push back on the flat 25-25-25-25 default: captions are the product, not a feature, and accuracy should stay the highest-weighted section. The lever the buyer should pull instead in those cases is Section 2 (workflow), because soft-skills content is usually higher-volume and the workflow tax dominates the cost equation.

How did you handle the engineer-on-the-evaluation-team being a load-bearing input?

We put the engineer on the evaluation team specifically because Section 1 has architectural questions that an L&D-only evaluation team cannot reliably answer. (Ours was a senior backend engineer who, by accident of career history, had spent two years doing speech-recognition-adjacent work; engineering-team RFP volunteers are a lottery.) The architectural distinction between glossary-biased decoding and find-and-replace post-processing is not a marketing claim; it is a technical claim that the L&D team is the wrong team to grade. The lesson: for any RFP where the product is technical, you need one technical evaluator and one functional evaluator, scoring the same dimensions independently and reconciling on a weekly cadence. If your team does not have an engineering volunteer, an external consultant for the duration of the RFP is the alternative; we did not need one, but I'd budget for it next time.

How did the EU customers' GDPR Article 28 chain work in practice?

Two of the EU customers had asked us, as part of their data-processing-agreement onboarding, to identify our captioning sub-processor and to provide DPA-cascade evidence (their DPA signed with us, our DPA signed with the captioning vendor, geographic-processing-region documentation, retention-and-deletion documentation). All three of the leading vendors (V2, V3, V6) supplied this without friction. The Article 28 cascade is load-bearing paperwork more than a technical hurdle, but it is the kind of paperwork that surfaces during EU customer onboarding and silently gates the deal close. The EAA-captions-EU-SMBs post walks through the broader EU compliance picture; Article 28 itself sits in the GDPR statute rather than in the EAA. In our experience, vendors who treat the DPA cascade as a "we'll get back to you" item are vendors who cannot service EU customers; vendors who supply the cascade within 48 hours can. This is a hidden Section 3 sub-question that we did not weight separately but should have.

What did you do with the in-house instructional designer who had been hand-correcting captions for 18 months?

This is a real question and the honest answer is: we promoted them. Hand-correcting captions for 18 months had given the instructional designer a granular knowledge of the catalogue that nobody else in L&D had — they knew which courses had which proper-noun failure patterns, which vendor-supplied transcripts dropped which timing markers, and which courses had been quietly tagged as "do not re-record" because of speaker-availability issues. We promoted them to L&D operations lead, with the captioning-vendor relationship and the per-customer glossary curation as their primary remit. The argument we made internally was that the half-FTE hand-correction labour was being spent on the wrong layer of the stack — correcting the model's output rather than improving the model's input — and that the same person could spend the same hours on higher-leverage work: curating the glossary, evaluating the vendor's reviewer-step quality, and escalating the quality issues that only someone with internal L&D context could see. That catalogue knowledge was exactly the load-bearing input we wanted on our side of the vendor relationship.

How is V6 working out six months in?

The catalogue retrofit shipped in 8 weeks against the 8-week SLA. The SOC 2 Type II milestone closed in week 14, ahead of the contract milestone date. The per-customer glossary has grown from the original 60 terms to 340 over the 24 weeks since contract signature, with the L&D operations lead curating new terms as they emerge. The proper-noun substitution-error count on the rolling 100-asset sample (which we did set up retroactively, per the post-mortem) is averaging 0.4 errors per asset, down from the unmeasured baseline that triggered the procurement. The customer questionnaire that triggered the RFP is now answerable with "yes, all training video supplied to your organisation is captioned to WCAG 2.1 AA, our captioning sub-processor is V6, here is their SOC 2 Type II report and their executed BAA." The customer security-review process now closes in 4 business days rather than the previous 14. None of this is the captioning vendor's product alone — the customer-success team's process improvements account for at least half of the close-time improvement — but the captioning vendor's posture is the unblocking input.

The scoring template you can use

The scoring-sheet structure that worked for us, which you can copy verbatim:

Two evaluators per row. Reconcile weekly. Reference calls in week 5. Decision in week 5 or 6. Contract in week 6.

If your weighting differs (and it might: if your catalogue is not proper-noun-dense, drop S1 to 30 and give the 5 points to S2; if you are higher-ed or hospital, raise S3 to 30 and take the 5 points from S4), document the difference up front in the procurement brief and don't change weights mid-RFP. The single largest source of buyer's remorse in this market is mid-RFP weight changes that track vendor-pitch quality rather than internal procurement objectives. Set the weights once, in week 0, and let the responses fall where they fall.
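For concreteness, the week-0 weight-setting and the two-evaluator weekly reconciliation can be sketched like this. The section weights and scores below are placeholders rather than our actual sheet, and the 0.15 disagreement tolerance is an invented knob:

```python
# Week-0 weights, fixed before any vendor response is read. Placeholder values.
WEIGHTS = {"S1 accuracy": 35, "S2 workflow": 25, "S3 security": 25, "S4 pricing": 15}


def weighted_total(scores: dict[str, float]) -> float:
    """Scores are 0-1 per section; returns the 100-point total."""
    return sum(WEIGHTS[s] * scores[s] for s in WEIGHTS)


def reconcile(a: dict[str, float], b: dict[str, float], tol: float = 0.15) -> list[str]:
    """Sections where the two evaluators disagree by more than tol
    and must re-score together at the weekly reconciliation."""
    return [s for s in WEIGHTS if abs(a[s] - b[s]) > tol]


# Hypothetical scores for one vendor from the two evaluators.
functional = {"S1 accuracy": 0.9, "S2 workflow": 0.8, "S3 security": 0.7, "S4 pricing": 0.6}
technical = {"S1 accuracy": 0.6, "S2 workflow": 0.8, "S3 security": 0.75, "S4 pricing": 0.6}

print(reconcile(functional, technical))  # sections to re-score together
print(weighted_total(functional))
```

The design point is that disagreement is surfaced per section, not averaged away: a functional evaluator at 0.9 and a technical evaluator at 0.6 on accuracy is exactly the kind of gap the weekly reconciliation exists to argue out before the totals mean anything.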

Further reading