Operations · Published 2026-04-25
The hidden half-FTE in your L&D budget: video caption correction costs
Most L&D teams do not have a line item called "caption correction." They have an instructional designer who spends Monday mornings cleaning up YouTube auto-captions, a contractor who bills two hours for every onboarding video, and a coordinator who re-uploads the corrected SRT to the LMS on Friday. None of those hours rolls up into a single number, so the cost is invisible. This post pulls them into one column, runs the math at three real org sizes, and shows the point at which absorbing the labour stops being free and starts being more expensive than buying software. We are obviously not a neutral source — we sell captioning software — so we have used numbers you can recreate from public salary data and our own audit of how an hour of correction is actually spent. The break-even is more aggressive than most people expect; the second-order costs are bigger than the labour line.
TL;DR
Internal caption correction takes roughly 4× the runtime of the source video for thorough work — 40 hours of correction for every 10 hours of training video produced. At a typical mid-market loaded labour rate of $50/hour for an L&D ops specialist, an org producing 15 hours of training video a month is absorbing about $36,000/year in correction labour spread invisibly across calendars. That is the labour line; it is not the largest cost. Time-to-publish stretches by 5–10 business days, accessibility-statement risk accumulates per uncaptioned hour live in the back-catalogue, and burnout-driven turnover among the people doing the correcting compounds the operational hit. Captioning software at $99/month closes the labour line by ~95% and the second-order costs by close to 100% — the back-of-envelope ROI is between 20× and 60× depending on org size, which is why "we'll just clean them up internally" is the most expensive default decision in the L&D operations stack.
Where the half-FTE comes from: the math
The single number that drives everything else is correction time per minute of source video. Two well-known reference points anchor the range. The Described and Captioned Media Program (DCMP) — see why their protocol is the one WCAG audits actually use — observes that hand-correction of auto-captions to a 99% accuracy bar runs 3–6× real-time when starting from a competent automatic transcript and 6–12× real-time when starting from raw audio. Industry surveys of in-house captioning teams cluster around 4–5× real-time as the working median for "we are starting from the YouTube auto-caption file and bringing it up to a usable bar." We use 4× as the conservative anchor in everything below; the labour numbers scale linearly if your team is faster or slower.
4× real-time means a 15-minute training video takes one hour to correct. A 60-minute lecture-capture takes four hours. The cost of a single asset is small enough to stay invisible on any individual calendar, which is exactly why this work survives in the back-catalogue for so long without anybody flagging it.
Now scale it. The ICP we built GlossCap for — 50–500-employee SaaS, engineering, healthcare, and university orgs — typically produces 10–30 hours of internal training video per month once you count onboarding, product enablement, compliance, and lecture or all-hands recording. Three sample configurations:
| Org profile | Hours of video / month | Hours of correction / month | % of one FTE (160-hour month) | Annual labour cost @ $50/hr loaded |
|---|---|---|---|---|
| Small SaaS, light cadence | 10 | 40 | 25% | $24,000 |
| Mid-market, normal cadence | 15 | 60 | 38% | $36,000 |
| Healthcare or university, heavy | 30 | 120 | 75% | $72,000 |
The "hidden half-FTE" headline is the middle row rounded up — it is the median configuration of the orgs we have talked to during pre-launch outreach. The small profile is closer to a quarter-FTE; the heavy profile is closer to three-quarters. Note how the percentages stay sub-FTE in every case. That is the point. The labour never collects on one calendar; it lives as a tax across an instructional designer's mornings, a video coordinator's afternoons, and an enablement contractor's monthly invoice. Nobody owns it; nobody names it; nobody costs it out.
The $50/hour loaded figure is conservative for 2026. A US-based L&D operations specialist at a mid-market SaaS company averages $85,000 in cash compensation; loaded for benefits, payroll taxes, equipment, and overhead at the standard 1.3× multiplier, the all-in cost lands around $110,000 against ~2,080 working hours, or $53/hour. We rounded down. Public-sector orgs and universities run lower (~$38/hour loaded for a comparable role); enterprise tech orgs run higher ($65–$80/hour for an instructional designer with five years' experience). Adjust the table by ±25% and the conclusion does not change: the labour cost dwarfs the software cost at every plausible salary anchor.
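The whole labour-line calculation is small enough to script against your own anchors. A minimal sketch in Python; every default below is one of this post's stated assumptions (the 1.3× load, 2,080 working hours, the 4× multiplier, a 160-hour working month), not data from your org:

```python
# Reproduces the hidden-half-FTE table from three inputs: salary anchor,
# correction multiplier, and monthly video cadence.
def loaded_rate(cash_comp: float, load: float = 1.3, hours_per_year: int = 2080) -> float:
    """Loaded $/hour from cash compensation, using the post's 1.3x multiplier."""
    return cash_comp * load / hours_per_year

def labour_line(video_hrs_per_month: float, rate: float, multiplier: float = 4.0):
    """Correction hours/month, share of one FTE, and annual labour cost."""
    correction_hrs = video_hrs_per_month * multiplier
    pct_fte = correction_hrs / 160          # 160-hour working month
    annual_cost = correction_hrs * rate * 12
    return correction_hrs, pct_fte, annual_cost

print(f"loaded rate: ${loaded_rate(85_000):.0f}/hr")   # ~$53; the table rounds to $50
for profile, hrs in [("small SaaS", 10), ("mid-market", 15), ("heavy", 30)]:
    corr, fte, cost = labour_line(hrs, rate=50)
    print(f"{profile:>10}: {corr:.0f} hrs/mo correction, {fte:.0%} FTE, ${cost:,.0f}/yr")
```

Swap in your own cadence and rate and the table above becomes your table.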
Where the hour goes: a minute-by-minute audit
"It takes about an hour" is what people say when you ask them how long caption correction takes for a 15-minute video. We sat with three L&D operations specialists across three orgs (engineering enablement, public-university lecture capture, healthcare compliance training) and clocked exactly where the hour goes on a single 15-minute asset. The breakdown was remarkably consistent:
- 4 minutes — load the source. Open the LMS or video host (TalentLMS, Docebo, Kaltura, Panopto, internal SharePoint), find the asset, locate the auto-caption file, download it, open it in a caption editor (Subtitle Edit, Aegisub, the LMS's native editor, or — most commonly — a text editor and a video player side-by-side). The tooling friction here is real and not optional.
- 6 minutes — first read-through with playback. Watch the video at 1.5× speed, eye on the caption track. Flag every error: substitutions (most common), insertions, deletions, casing, punctuation. Do not fix anything yet — flagging is a different cognitive mode from fixing.
- 32 minutes — fix the flagged errors. This is where the time actually goes. Per the DCMP audit we documented in the 99%-accuracy post, 47% of substitution errors on a typical engineering training video are technical proper nouns and 24% are mangled acronyms. Each fix requires playing back the relevant audio segment to confirm the intended word, typing the correction, and re-checking the timing offset (caption files store timestamps; replacing "cooper Netty's" with "Kubernetes" sometimes shortens the line enough to need a timing nudge so the caption does not orphan on screen for half a second).
- 8 minutes — second read-through and timing pass. Catch anything missed on the first read-through, plus the errors accidentally introduced during the fix pass. Adjust line wrapping for readability (WCAG-friendly captions cap at ~32 characters per line and 2 lines per caption block). Verify reading speed (target 160–180 words per minute display rate). These mechanical checks are scriptable; see the sketch after this list.
- 5 minutes — re-export and re-upload. Save the corrected file in the right format (SRT for most LMSes; VTT for HTML5 players; TTML for Kaltura advanced features); the SRT-to-VTT hop is mechanical enough to script, as the second sketch below shows. Re-upload to the LMS, replacing the old caption track. Verify in the LMS preview that the new captions are the ones serving.
- 5 minutes — context switching, interruption recovery, and queue management. The above 55 minutes assume zero interruptions, which is not how anyone's calendar works. Five minutes for the mid-task Slack reply, the half-finished cup of coffee, the "wait, was this the asset Sarah wanted by Friday" check.
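As promised in the fourth item, the second read-through's formatting checks are mechanical enough to script, and scripted checks do not queue behind urgent work. A minimal sketch that flags line-length, lines-per-block, and display-rate violations in an SRT file; the thresholds are the guidance figures cited above, `corrected.srt` is a placeholder filename, and well-formed SRT input is assumed:

```python
# Flags caption blocks that break the readability guidance cited above:
# ~32 chars/line, 2 lines/block, and a display rate under ~180 wpm.
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def to_seconds(ts: str) -> float:
    h, m, s, ms = map(int, TIMESTAMP.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000

def check_srt(text: str, max_chars=32, max_lines=2, max_wpm=180):
    """Yield (block_number, problem) for every block that breaks a rule."""
    for i, block in enumerate(text.strip().split("\n\n"), start=1):
        lines = block.splitlines()
        start, end = lines[1].split(" --> ")   # lines[0] is the block index
        caption = lines[2:]
        if len(caption) > max_lines:
            yield i, f"{len(caption)} lines (max {max_lines})"
        for line in caption:
            if len(line) > max_chars:
                yield i, f"{len(line)}-char line (max {max_chars})"
        duration = to_seconds(end) - to_seconds(start)
        words = sum(len(l.split()) for l in caption)
        wpm = words / duration * 60 if duration > 0 else float("inf")
        if wpm > max_wpm:
            yield i, f"{wpm:.0f} wpm display rate (max {max_wpm})"

with open("corrected.srt", encoding="utf-8") as f:
    for block, problem in check_srt(f.read()):
        print(f"block {block}: {problem}")
```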
That is one hour for a 15-minute video. Scale by your org's monthly video output, multiply by the 4× ratio, and the hidden-half-FTE table above is what falls out.
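The re-export step is similarly mechanical when the target player wants VTT rather than SRT: the format difference amounts to a `WEBVTT` header and a period instead of a comma in the timestamps. A minimal sketch, again assuming well-formed SRT input (the TTML case is XML-shaped and not covered here):

```python
# Converts well-formed SRT text to minimal WebVTT: add the header and
# swap the comma decimal separator in timestamps for a period.
import re

def srt_to_vtt(srt_text: str) -> str:
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + body

with open("corrected.srt", encoding="utf-8") as f:
    print(srt_to_vtt(f.read()))
```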
Notice what is not in the audit: any thinking work. Caption correction is mechanical, repetitive, and burnout-shaped. The instructional designer doing it is not designing instruction during that hour. The opportunity cost — what they would have shipped instead — is the second-order cost we get to next.
The three second-order costs that are bigger than the labour line
The labour cost is real, but it is the cost most L&D leaders are most willing to absorb because it lives inside an existing salaried headcount. The harder argument — and the more honest one — is that the second-order costs of in-house caption correction are usually larger than the labour cost itself. Three of them, in roughly the order they hit the org:
1. Time-to-publish drag (5–10 business days)
An asset cannot ship to the LMS until its captions are corrected. In every team we talked to, the corrected-caption gate was the longest single delay in the post-production pipeline — longer than editorial review, longer than legal sign-off on regulated content, longer than the SCORM packaging step. The reason is queue: caption correction is "the boring task" that gets bumped behind anything urgent, so a 60-minute correction job sits in the queue for a week before someone gets a clear hour to do it. For a public university trying to publish a recorded lecture by the next class meeting, the gate is the difference between "students see it Tuesday" and "students see it the following Monday." For a SaaS enablement team trying to ship product-launch training in lockstep with the launch, the gate is the difference between "the day-1 sales team is enabled" and "the day-1 sales team is winging it." Neither cost is on the labour line.
2. Accessibility-statement risk (per uncaptioned hour live)
Every uncaptioned or under-captioned training video sitting in the back-catalogue is a potential complaint surface. Since ADA Title II became enforceable for state and local government on 2026-04-24 and the EAA has been live since June 2025, the calculus on uncaptioned content shifted from "good practice" to "documented risk." A single OCR complaint or DOJ inquiry costs a public-university comms office more in legal review than the lifetime cost of a captioning subscription; the equivalent for a private SaaS or healthcare org runs through a different lawyer's office but lands in the same ballpark. Most L&D leaders we have talked to know this in the abstract; very few have it on a budget line. The reason is that the cost is non-linear and binary — zero for years, then a six-figure event when it lands — which budget systems are bad at pricing.
3. Burnout and turnover among the people doing the correcting
The third cost is the most uncomfortable one. Caption correction is one of the most universally disliked tasks in L&D operations work. It is mechanical, repetitive, low-creativity, low-recognition, and externally invisible. People hired to design instruction are spending a quarter to three-quarters of their week on it instead. Across the exit-interview patterns buyers have described to us in pre-launch conversations, "I was hired to do X but I am actually doing Y" is the top reason instructional designers leave for adjacent roles. The replacement cost — recruiting, ramping, the productivity dip during onboarding — is typically modeled at 50–200% of annual salary depending on role. Push 30% of an instructional designer's hours onto caption correction for long enough and the implicit cost is not 30% of their salary; it is the cost of replacing them entirely, 18 months earlier than you otherwise would have.
Stack these three with the labour line and the picture flips. A "small SaaS, light cadence" org's actual annual exposure is not $24,000 in labour — it is $24,000 plus a 5–10 day publish-cycle drag plus a slow-burning compliance-statement risk plus a real chance of losing the instructional designer to a vendor role inside two years. None of those numbers will be precise; all of them are bigger than zero, and the labour-line $24,000 is usually the smallest of the four.
The break-even: when does buying stop being optional?
The cleanest argument for buying captioning software is the labour-line ROI, because it is the easiest number to defend in a budget conversation. Compared against GlossCap's published pricing, the math is short:
| Org profile | Annual labour at 4× real-time | Plan that fits | Annual software cost | Net annual savings | ROI multiple |
|---|---|---|---|---|---|
| Small SaaS, 10 hrs/mo video | $24,000 | Solo ($29/mo, 5 hrs) + occasional Team-month overflow | ~$700 | $23,300 | 33× |
| Mid-market, 15 hrs/mo video | $36,000 | Team ($99/mo, 30 hrs) | $1,188 | $34,812 | 30× |
| Healthcare/uni, 30 hrs/mo video | $72,000 | Team ($99/mo) at the cap, or Org ($299/mo) for headroom + SSO | $1,188 – $3,588 | $68,412 – $70,812 | 20× – 60× |
The labour-line ROI is between 20× and 60× across every plausible org configuration. That is not the kind of ROI multiple that needs a build-versus-buy spreadsheet; it is the kind that needs an apology email to whoever was doing the correcting all year.
The break-even — the org size at which the labour cost first exceeds the software cost — is far smaller than most procurement conversations assume. At a $99/month subscription and the $50/hour loaded rate, break-even is about 24 hours of correction labour per year, which is roughly two hours per month, which at the 4× multiplier is roughly 30 minutes of training video per month. Translation: any org producing more than about 6 hours of training video per year is paying more in correction labour than the software costs. Every org we know in the ICP is past that line by more than an order of magnitude.
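The chain above in a few lines, so you can substitute your own subscription price, loaded rate, and multiplier; the defaults are this post's anchors:

```python
# Break-even video volume for a flat-monthly captioning subscription.
software_per_year = 99 * 12     # $1,188/yr at $99/month
loaded_rate = 50                # $/hour, this post's labour anchor
multiplier = 4                  # hours of correction per hour of video

breakeven_hours_per_year = software_per_year / loaded_rate      # ~23.8 hrs
hours_per_month = breakeven_hours_per_year / 12                 # ~2.0 hrs
video_minutes_per_month = hours_per_month * 60 / multiplier     # ~30 min

print(f"break-even: {breakeven_hours_per_year:.1f} correction hrs/yr "
      f"= {hours_per_month:.1f} hrs/mo = {video_minutes_per_month:.0f} "
      f"min of video/mo")
```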
Why does this argument not get made more often? Two reasons. First, the labour is invisible — until somebody does the audit (this post is the template for that audit), the cost lives spread across calendars, no one owns it, no one quotes it. Second, the conventional captioning-vendor pricing model is per-minute (Rev, 3Play) or enterprise-annual (Verbit), which makes the comparison feel apples-to-oranges with internal labour. The flat-monthly subscription model — see the Rev comparison, 3Play comparison, and Verbit comparison — was specifically designed to fit the budget conversation L&D teams actually have.
The procurement objection chain (and the honest counters)
Even with a 30× labour ROI, captioning software gets stuck in procurement. The objections cluster into a small set, each with a defensible counter and one place where the counter does not work. Skipping past these in the budget conversation usually leads to a "let's revisit in Q3" outcome that costs another quarter of the labour line.
- "We already have someone doing this." Counter: that someone's headcount is paid for either way; the question is whether you would rather pay for them to do caption correction or for them to do the work they were hired to do. The labour line does not disappear when you buy software — it gets reallocated to the work that creates more value than $50/hour. Where the counter does not work: if the someone is a contractor billed per-hour rather than salaried, the savings are direct cash rather than reallocated time, which is a stronger argument; if they are a salaried generalist with no obvious higher-value work in the pipeline, you have a deeper org-design problem that captioning software does not fix.
- "We do not have budget for new tools." Counter: the question is not "is there budget for a new tool" — it is "is there budget for the labour we are already spending." If your org's L&D ops headcount is $400k and 15% of it is going to caption correction, you have $60k of existing budget tied up in the work; reallocating $1,200 of it to software while freeing $58,800 of capacity is a budget-neutral move, not a new spend. Where the counter does not work: in orgs where the L&D ops headcount is fully consumed by correction labour and is itself the bottleneck on shipping, you cannot reallocate the freed time to itself; the win lands as accelerated time-to-publish rather than as cash.
- "We are not sure we have an accuracy problem." Counter: run the audit in the 99%-accuracy post on five minutes of your three most-watched videos. If your auto-captions hit the WCAG 2.1 AA bar (99% word-level accuracy under DCMP scoring), you do not need GlossCap or any other vendor — keep doing what you are doing. If they do not, the audit is the procurement document. Where the counter does not work: if your team has already invested in a custom in-house caption-correction workflow with bespoke tooling and training, ripping it out costs more than the labour savings recover in year one; the right move there is to use the software for the back-catalogue and the workflow for the new-asset edge cases.
- "We will wait until we are bigger to buy." Counter: the labour-cost ratio gets worse as you grow, not better. At 30 hours/month of video the absorbed cost is $72,000/year; at 60 hours it is $144,000. The "we'll buy when we have budget" argument inverts at scale because the absorbed cost grows linearly with content production but discretionary budget for L&D tools grows sub-linearly. Where the counter does not work: pre-revenue orgs producing < 2 hours of training video per month genuinely do not need captioning software yet; manual correction with auto-captions is fine until the asset volume crosses the break-even.
- "What about Rev / 3Play / Verbit?" Counter: per-minute pricing pencils out at low video volumes (Rev at $1.50/audio-minute = $22.50 for a 15-minute video, under half the cost of the hour of internal labour it replaces), but the price scales linearly with usage, so at any meaningful video volume it costs many multiples of a flat subscription; the crossover math is sketched after this list. Enterprise contracts (Verbit) require a minimum annual commitment that is often larger than the in-house labour line itself. The flat-monthly model, with glossary-aware accuracy at the WCAG bar, is the option specifically designed for the 50–500-employee org that does not want to pick between "pay per minute and watch it scale" or "sign a $30k annual minimum." See the Rev, 3Play, and Verbit comparisons for the per-dimension breakdown.
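The crossover named in the last item is a one-line division; the rates are the ones quoted above:

```python
# Monthly video volume at which per-minute pricing overtakes a flat plan.
per_minute_rate = 1.50   # $/audio-minute (Rev's quoted rate above)
flat_monthly = 99.0      # $/month (the Team plan anchor in this post)

crossover_minutes = flat_monthly / per_minute_rate   # 66 audio-minutes
print(f"per-minute overtakes flat at {crossover_minutes:.0f} min "
      f"(~{crossover_minutes / 60:.1f} hrs) of video per month")
```

At 66 minutes of video a month the two models cost the same; the 10–30 hours a month this post's ICP produces sits roughly an order of magnitude past that line.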
What to actually do this week
If the post above is something you would forward to your finance partner before opening a software conversation, here is the one-week sequence we suggest:
- Monday — instrument the labour. Pick one asset on the captioning queue this week. Time the correction work end-to-end (load + flag + fix + verify + re-upload). Note the runtime of the source video. Compute the real-time multiplier for your team. This number replaces the "4×" in our table with your number; everything downstream gets sharper.
- Tuesday — instrument the queue. Pull the calendar of the person or people doing caption correction. Count the hours spent on correction in the previous month. Multiply by 12. That is the labour line your org has been absorbing.
- Wednesday — instrument the back-catalogue. Count the videos in your LMS that are uncaptioned, partially captioned, or have known auto-caption errors. That is the accessibility-statement risk surface; even an order-of-magnitude estimate is enough to put it on a slide.
- Thursday — instrument the second-order costs. Open your last six exit interviews from L&D operations roles. Note how many mention "tedious" or "not what I was hired for" or "manual" tasks. That is your turnover signal.
- Friday — write the half-page. One slide: labour line, second-order risks, software cost, ROI multiple, decision. The audit is the document; the post above is the template. The sketch below turns the week's measurements into the slide numbers. Send to your VP and your finance partner.
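A sketch of the Friday arithmetic, wiring Monday's and Tuesday's measurements into the slide numbers. Every input below is a placeholder for the figures you collected during the week, not data from our audit:

```python
# Friday's half-page math from the week's measurements (placeholder values).
measured_correction_minutes = 62    # Monday: one asset, end to end
source_runtime_minutes = 15         # Monday: that asset's runtime
last_month_correction_hours = 55    # Tuesday: summed from the calendars
loaded_rate = 50                    # $/hour; use your org's loaded rate
software_per_year = 99 * 12         # the flat-monthly plan under evaluation

multiplier = measured_correction_minutes / source_runtime_minutes
labour_line = last_month_correction_hours * 12 * loaded_rate
roi = (labour_line - software_per_year) / software_per_year

print(f"your multiplier: {multiplier:.1f}x real-time")
print(f"annual labour line: ${labour_line:,.0f}")
print(f"labour-line ROI on the software: {roi:.0f}x")
```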
If at the end of the week the math comes back below the break-even, do not buy. We have a strong opinion about who GlossCap is for (mid-market L&D producing real volume); we have a strong opinion about who it is not for (a one-person enablement team producing two videos a quarter for which YouTube auto-captions are already 99% on conversational English). The point of this post is not to sell software — it is to make the existing absorbed cost legible. Once it is legible, the buy decision usually makes itself.
If you want a one-screen demo of the accuracy difference to drop into the half-page slide, the caption-mangle scanner renders the side-by-side on your terms; you can paste any product or technical term and watch what auto-captioning does to it. If you would rather see a live run on your own video, the Solo plan at $29/month covers 5 hours and a paste-in glossary — small enough to put on a credit card, big enough to validate the math on real assets.
FAQ
Where does the 4× real-time correction multiplier actually come from?
It is the working median across three sources. (1) DCMP's own training material for hand-correcting auto-captions to a 99% word-level bar quotes 3–6× real-time when starting from a competent automatic transcript. (2) The captioning-industry rule of thumb, repeated in vendor materials from Rev, 3Play, and Verbit when they explain the cost of their human-review tier, is roughly 4–5× real-time for a single-pass correction by a trained captioner. (3) Our own minute-by-minute audit with three L&D operations specialists clocked 60 minutes of correction on a 15-minute video, or exactly 4×. We use 4× as the conservative anchor; teams using non-specialist tooling (text editor + video player) typically run 5–6×; teams with bespoke caption-editing software (Aegisub, Subtitle Edit, or LMS-native editors) run closer to 3–3.5×.
Why is your loaded labour rate $50/hour and not the vendor-default $30/hour?
Vendor-marketing labour-cost calculators routinely use $30/hour because that is the unloaded median wage for "video editor" or "transcriptionist" in BLS data. That number is wrong for the role that actually does this work in mid-market L&D. Caption correction in those orgs is done by an instructional designer, an L&D operations specialist, or a video coordinator, and the loaded rate (cash + benefits + payroll taxes + equipment + overhead at 1.3×) for those roles in 2026 is $48–$58/hour at the median; we rounded to $50. If your team genuinely has a dedicated transcriptionist at $30/hour loaded, halve our table; the conclusion still holds because the labour line is still 10× the software line.
What about the cost of training the team on new captioning software?
Real but small for the GlossCap shape specifically. The user-facing workflow is: drag a video file, paste a glossary or link a Notion folder, click run, review the output in a side-by-side editor, click export. A new user becomes proficient within their first two corrections. The bigger transition cost is in the back-catalogue cleanup, which is a one-time labour spike of roughly the same magnitude as a normal month's correction work; we typically see it amortise inside the first 60 days because the corrected back-catalogue stops generating fresh inbound complaints.
Does buying captioning software let me reduce L&D operations headcount?
Almost never, and we would push back on framing it that way. The honest framing is: the headcount you have is producing $X of value today; reallocating the 25–75% of their time currently spent on correction to higher-leverage work shifts the same headcount up to producing $1.25X to $1.75X. The L&D ops job market is structurally short-staffed for the workload most orgs have; you almost certainly have more high-value work than your current headcount can do, and the freed capacity gets spent on it. Orgs that try to use the savings to cut headcount usually find the freed work re-accumulating in 6–9 months as the production cadence grows to fill the new capacity.
How does this math change for public-university or public-sector orgs after ADA Title II?
The labour-line cost stays the same in dollars but the second-order costs jump dramatically. Public-sector orgs are now under a documented federal compliance obligation since 2026-04-24; uncaptioned content in the back-catalogue is no longer a "good practice" gap, it is a documented exposure. The accessibility-statement risk line on the half-page slide deserves more weight in those conversations; the labour line is the same. See the ADA Title II sprint plan post for a triage approach to the back-catalogue specifically.
What if our team is already happy with our caption-correction workflow?
Then this post is not for you, and we mean that. There is a real subset of orgs — usually with a dedicated captioning specialist on staff, often a public university with a long-standing accessibility office — for whom the in-house workflow is faster than any current vendor option and produces output that is both genuinely WCAG-compliant and culturally tuned to their content. In those orgs, the audit at the top of this post will return numbers that do not justify a switch, and we would tell you not to switch. GlossCap was built for the much larger group of mid-market orgs that do not have that workflow and are absorbing the labour invisibly.
Further reading
- Why 99% caption accuracy matters: the WCAG 2.1 AA threshold, with real examples
- Glossary-biased captioning: how a Whisper prompt beats YouTube auto-captions on engineering terms
- ADA Title II just became enforceable — what training teams need to fix this week
- WCAG 2.1 AA captions — the exact spec
- SC 1.2.2 Captions (Prerecorded) explained
- TalentLMS captions workflow
- Docebo captions integration
- Kaltura captions workflow
- Rev vs GlossCap — per-minute vs flat-monthly cost dynamics
- 3Play vs GlossCap — accuracy-tier pricing breakdown
- Verbit vs GlossCap — when enterprise pricing is and is not justified
- Live demo: caption-mangle scanner
- Why we exist