Engineering · Published 2026-06-01
The LMS caption ingestion workflow: bulk retrofit across TalentLMS, Docebo, Absorb, Kaltura
Most L&D teams arrive at caption retrofit via compliance pressure: a deadline, an Office for Civil Rights (OCR) complaint, a Joint Commission survey finding, or an accessibility audit that surfaces a catalogue of uncaptioned training video. The work that follows is not just a captioning problem. It is a data pipeline problem. Getting 400 SRT files onto 400 video assets across a mixed LMS and video-host estate requires understanding the upload surface of each platform, the format expectations of each player, the language-code conventions of each API, and the normalisation failures that corrupt caption timing so quietly that the file appears to load while the text runs half a sentence behind the audio. This post walks through the full engineering workflow for the four most common mid-market LMS and learning-video platforms (TalentLMS, Docebo, Absorb, and Kaltura), plus the three video hosts most commonly found inside or alongside them: Panopto, Vimeo, and Wistia. The goal is a repeatable, scriptable pipeline that an L&D operations engineer can run against a back catalogue without manually touching each asset.
TL;DR
Each of the four platforms has a different caption ingestion path. TalentLMS accepts SRT sidecars uploaded through its unit-level API at POST /api/v1/courses/{id}/units/{unitId}/subtitle. The file must be UTF-8 without BOM, and the timing separator must be a comma (SRT format), not a period (VTT format). TalentLMS rejects VTT timing syntax in files uploaded to its subtitle endpoint even if the file extension is .srt. Docebo manages subtitle tracks through its Learning Management API at the video-asset level and accepts VTT natively. It requires region-qualified BCP-47 language tags: use en-US, not bare en, because Docebo's API rejects two-letter ISO 639-1 codes at the subtitle track creation endpoint. Absorb LMS has no public bulk-upload API for captions as of its current documentation. Bulk retrofit instead requires one of two routes. The first is the administrative import CSV workflow for lesson creation, followed by caption file upload per lesson via the admin UI. The second is a partner-channel API call that the Absorb professional-services team enables on request for enterprise accounts. Kaltura has the deepest caption API of the four. Its caption_captionasset service provides full lifecycle management (add, set content, get, list, delete) and is the right integration point for any volume above 20 assets. Kaltura also offers the REACH caption-ordering service as a parallel path that short-circuits local ASR if the account has a REACH subscription. For video hosts: Panopto accepts SRT upload via its API and auto-captions via ASR that you can override; Vimeo accepts VTT only via its text tracks API (POST /videos/{video_id}/texttracks); Wistia accepts SRT and VTT via its captions API. Format normalisation failures account for the majority of silent ingestion failures across all seven platforms: a UTF-8 BOM, Windows line endings, SRT-style commas in VTT files, and bare ISO 639-1 language codes.
Why bulk caption retrofit is harder than it looks
A team that has never done a bulk caption retrofit will typically estimate the effort by multiplying video count by a per-video cost and stopping there. The estimate is wrong in two structural ways. First, it assumes that a finished caption file can be attached to a video asset in a single uniform operation — but each platform has a different upload surface, and that surface changes depending on how the video was added to the platform in the first place. A video uploaded directly to TalentLMS behaves differently from a YouTube-embedded video in a TalentLMS course, and both behave differently from a SCORM package that contains an MP4 with a WebVTT sidecar. Second, the estimate assumes that generated captions are ready to upload without normalisation — but the format requirements of each platform are strict in ways that are not always documented, and the failures are often silent: the file loads, a caption track appears in the player, but the timing is off by a consistent delta, or the special characters in chemical names are garbled, or the first two cues are displayed and the rest are missing.
The compliance context amplifies the stakes. WCAG 2.1 Success Criterion 1.2.2 (Captions (Prerecorded)) is a Level A criterion, and therefore part of any AA conformance claim. It requires captions for all prerecorded audio content in synchronised media. A caption track that merely exists does not meet it: the captions must be accurate and synchronised with the audio. The DCMP Captioning Key accuracy standard (≥99% accuracy on 600-second samples) likewise requires that captions be legible and correctly timed. Consider an SRT file whose timing drifts by 800ms because of a line-ending normalisation failure. It may pass a platform's file-validation check while failing a human accuracy audit. That distinction matters when the audit is conducted by an OCR investigator or a Joint Commission surveyor. Such an auditor samples caption quality against the actual audio rather than just verifying that a caption track is present.
Both the number of organisations facing this problem and the size of their catalogues have grown since 2026-04-24. That is the ADA Title II compliance date for state and local government entities serving populations of 50,000 or more. A public university with a 3,000-hour back catalogue of lecture captures and compliance training modules cannot approach this problem manually. Neither can a 200-employee SaaS company with 18 months of onboarding video and product-training content hosted across Docebo, Panopto, and Wistia. The engineering workflow described in this post is designed for that scale. It is also the precondition for GlossCap's glossary-biased captioning to deliver accurate output across a whole catalogue rather than one upload at a time.
See also: "the hidden half-FTE in your L&D budget" for the labour-cost framing, and "why 99% caption accuracy matters" for the compliance accuracy standard that the pipeline needs to produce.
The caption format landscape
SRT (SubRip Text)
SRT is the oldest and most universally accepted caption format. Its structure is simple: a sequence number, a timing line, one or more lines of caption text, and a blank line as a record separator. The timing line uses the format HH:MM:SS,mmm --> HH:MM:SS,mmm — the fractional-second separator is a comma, not a period. This is the most common source of SRT-to-platform rejection failures: files generated by tools that use the VTT convention (period as decimal separator in timing lines) will upload without error on some platforms but display incorrectly on others, because the player interprets the period as a format error and falls back to a nearest-valid-cue heuristic.
SRT does not support inline style markup beyond a limited subset of HTML tags (<b>, <i>, <u>, and <font color=…>), and most LMS players strip all markup from SRT files during ingestion anyway. SRT files must be UTF-8 encoded without a byte-order mark. Windows Notepad wrote a BOM whenever a file was saved as UTF-8 until Windows 10 version 1903, and other Windows tools can still emit one, so check rather than assume. Several platform ingestors fail silently on BOM-prefixed files, including TalentLMS's subtitle endpoint and Absorb LMS's admin importer. The file is accepted and a caption track appears in the player, but the first cue is corrupted or absent because the BOM prefix is read as part of the first cue. See the SRT captions reference for a full format specification.
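The SRT problems described in this section (VTT-style periods in timing lines, a leading BOM), together with Windows line endings, can be fixed in one pass before anything is uploaded. This is a minimal sketch in Python. It assumes captions arrive as raw bytes and that timing lines carry two-digit hours. Any cue settings after the end timestamp are dropped, because SRT has no equivalent.

```python
import re

# SRT timing line that may use a VTT-style period before the milliseconds.
# Two-digit hours are assumed; anything after the end time is captured as group 5.
TIMING = re.compile(
    r"^(\d{2}:\d{2}:\d{2})[.,](\d{3}) --> (\d{2}:\d{2}:\d{2})[.,](\d{3})(.*)$"
)


def normalise_srt(raw: bytes) -> str:
    """Return SRT text with no BOM, LF line endings, and comma timing separators."""
    # utf-8-sig strips a leading BOM if present and is a no-op otherwise.
    text = raw.decode("utf-8-sig")
    # Windows CRLF (and any stray CR) become Unix LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = []
    for line in text.split("\n"):
        match = TIMING.match(line.strip())
        if match:
            start, start_ms, end, end_ms, _settings = match.groups()
            # SRT has no cue settings, so VTT-style settings are dropped.
            line = f"{start},{start_ms} --> {end},{end_ms}"
        lines.append(line.rstrip())
    # Drop leading/trailing blank lines; end with exactly one newline.
    return "\n".join(lines).strip("\n") + "\n"
```

Run it on every file, not just the ones that fail. Already-clean SRT passes through unchanged apart from trailing whitespace, so the step is safe to repeat.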
WebVTT (Web Video Text Tracks)
WebVTT is the W3C standard format for timed text in HTML5. It uses a period as the fractional-second separator in timing lines (HH:MM:SS.mmm --> HH:MM:SS.mmm), must begin with the string WEBVTT on the first line (optionally followed by a header block), and supports a richer cue-setting syntax than SRT: positioning (align:start, position:20%, line:5), region-based layout, and voice spans (<v Speaker>). Most LMS players ignore the extended cue-setting syntax and render all cues at the default bottom-centre position regardless of VTT positioning directives — which matters when a script has multiple speakers that the caption producer has positioned distinctly for readability. For bulk retrofit, assume the positioning metadata will be stripped and write captions accordingly.
VTT is the native format for Vimeo's text track API, Docebo's subtitle track API, and Kaltura's internal storage. Platforms that claim to accept both SRT and VTT typically convert SRT to VTT internally at ingestion. The normalisation problem (ensuring the SRT file is correctly formatted before upload) is therefore better solved at the source than left to the platform's converter. See the VTT captions reference for format specification details.
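Docebo and Vimeo want VTT, so converting from normalised SRT at the source takes only a few lines. This sketch assumes the input has already been through an SRT normaliser (LF endings, no BOM, comma separators). SRT sequence numbers are kept as VTT cue identifiers, which WebVTT permits. SRT <font> tags are not translated.

```python
import re

# Timing line of already-normalised SRT (comma before the milliseconds).
SRT_TIMING = re.compile(r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})$")


def srt_to_vtt(srt_text: str) -> str:
    """Convert normalised SRT text to a minimal WebVTT document."""
    lines = ["WEBVTT", ""]  # mandatory header, then a blank line before the first cue
    for line in srt_text.split("\n"):
        match = SRT_TIMING.match(line)
        if match:
            start, start_ms, end, end_ms = match.groups()
            # WebVTT uses a period, not a comma, before the milliseconds.
            line = f"{start}.{start_ms} --> {end}.{end_ms}"
        # SRT sequence numbers pass through and become VTT cue identifiers.
        lines.append(line)
    return "\n".join(lines).rstrip("\n") + "\n"
```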
TTML and other formats
Timed Text Markup Language (TTML) is used primarily in broadcast and SCORM packaging contexts — it is the underlying format for DFXP (Distribution Format Exchange Profile) and is supported by several SCORM authoring tools including Articulate Storyline and Lectora. For LMS ingestion via API, TTML is rarely the right choice: none of the four primary platforms described in this post accept TTML at their subtitle/caption API endpoints. TTML belongs in the SCORM package layer (where the authoring tool generates it) rather than the LMS caption API layer. See the TTML captions reference for the SCORM-context details.
BCP-47 language codes
Every LMS caption API requires a language tag to identify the subtitle track. The standard is BCP-47 (IETF Best Current Practice 47, RFC 5646). The most important rule for LMS integration is to use region-qualified tags (en-US, en-GB, fr-FR, de-DE) rather than bare two-letter ISO 639-1 codes (en, fr, de). A bare en is itself a valid BCP-47 tag, but several platform APIs expect the region subtag. Docebo's subtitle track creation endpoint will return a 422 validation error for bare en. TalentLMS accepts bare en at the API level, but the player may not match it correctly against the browser's preferred language. Captions may then fail to autoselect for users who have set a browser language preference. The safe default for US English is en-US. For Canadian French content (common in AODA-scoped Brightspace and Canvas deployments), use fr-CA rather than fr-FR. Players whose language-matching logic distinguishes the two will not autoselect a fr-FR track for users with a fr-CA browser preference.
For multi-language catalogues, maintain a normalised language-code table keyed to your organisation's content inventory before beginning bulk upload — correcting language codes after the fact requires a delete-and-recreate cycle on most platforms because the language tag is set at caption-track creation and is immutable without deletion.
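One way to hold that table is a single mapping from the inventory's canonical tag to each platform's expected form. The per-platform rules below follow this post:

- region-qualified tags for Docebo, Vimeo, and TalentLMS;
- bare two-letter codes for Panopto;
- Kaltura's language enumeration names.

The rows and the `tag_for` helper are illustrative. Wistia is omitted because its convention is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackLanguage:
    bcp47: str         # region-qualified tag: Docebo, Vimeo, TalentLMS
    iso639_1: str      # bare two-letter code: Panopto
    kaltura_name: str  # value from Kaltura's own language enumeration


# Illustrative rows; one row per language in the content inventory.
LANGUAGES = {
    "en-US": TrackLanguage("en-US", "en", "English"),
    "fr-CA": TrackLanguage("fr-CA", "fr", "French"),
    "de-DE": TrackLanguage("de-DE", "de", "German"),
}


def tag_for(platform: str, inventory_tag: str) -> str:
    """Translate a canonical inventory tag into the form a platform expects."""
    lang = LANGUAGES[inventory_tag]  # KeyError surfaces unmapped languages before upload
    if platform in ("docebo", "vimeo", "talentlms"):
        return lang.bcp47
    if platform == "panopto":
        return lang.iso639_1
    if platform == "kaltura":
        return lang.kaltura_name
    raise ValueError(f"no language rule for platform: {platform}")
```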
TalentLMS caption ingestion
Caption surfaces in TalentLMS
TalentLMS hosts video in three distinct ways, each with a different caption surface. The first is direct upload — a video file uploaded to a TalentLMS course as a Video unit type. Captions for directly uploaded video are uploaded as a sidecar SRT file through the unit's subtitle settings. The second is SCORM packaging — a SCORM 1.2 or SCORM 2004 package uploaded as a SCORM unit type, where the package itself may contain a WebVTT or TTML sidecar embedded alongside the video asset in the SCORM ZIP. In this case, TalentLMS does not process the caption track at the LMS level — it is the SCORM player (the JavaScript runtime inside the SCORM package) that renders the captions, and the sidecar must be placed correctly within the SCORM package structure, typically at a path like content/subtitles/en-US.vtt that the authoring tool's manifest references. The third is external embed — a YouTube or Vimeo video embedded in a TalentLMS course as an iFrame unit type. For embedded video, TalentLMS has no caption control surface at all: captions are managed entirely on the source platform (YouTube Studio for YouTube embeds, Vimeo's text track system for Vimeo embeds). The LMS-side caption audit needs to identify which of the three surfaces applies to each video asset before beginning the bulk upload plan.
The TalentLMS subtitle API
For directly uploaded video units, TalentLMS exposes a subtitle management endpoint through its REST API (v1, Basic Authentication with the account's API key). The relevant endpoint is:
POST /api/v1/courses/{courseId}/units/{unitId}/subtitle
Content-Type: multipart/form-data
file=@en-US.srt
language=en-US
The language parameter accepts BCP-47 tags. The file field must be an SRT file: TalentLMS's subtitle endpoint does not accept VTT, and renaming a VTT file to .srt does not get it through. The platform checks the file content, not just the extension. VTT files with the leading WEBVTT header line are rejected with a 400 response. The accompanying "invalid format" error is generic and does not say what was wrong. The safe path is to generate SRT, then verify the SRT format (comma timing separator, UTF-8 without BOM, Unix line endings), then upload as SRT.
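Here is a sketch of the upload request using only the Python standard library. It builds the multipart body by hand and refuses input that still carries a BOM, a WEBVTT header, or carriage returns. The endpoint path is the one given above. Two details are assumptions to check against your account's API documentation:

- the Basic-auth shape (API key as username, empty password);
- the application/x-subrip part type.

The function returns an unsent `urllib.request.Request`, so it can be inspected or passed to `urllib.request.urlopen`.

```python
import base64
import urllib.request
import uuid


def build_subtitle_upload(base_url, api_key, course_id, unit_id, srt_text, language="en-US"):
    """Build, but do not send, the multipart POST to the subtitle endpoint."""
    # Refuse input the endpoint is described as rejecting or mis-parsing.
    if srt_text.startswith(("\ufeff", "WEBVTT")) or "\r" in srt_text:
        raise ValueError("normalise to BOM-free, LF-only SRT before uploading")
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="language"\r\n\r\n'
        f"{language}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{language}.srt"\r\n'
        "Content-Type: application/x-subrip\r\n\r\n"
    )
    body = head.encode() + srt_text.encode("utf-8") + f"\r\n--{boundary}--\r\n".encode()
    # Assumption: API key as the Basic-auth username with an empty password.
    credentials = base64.b64encode(f"{api_key}:".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/api/v1/courses/{course_id}/units/{unit_id}/subtitle",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```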
To list existing units for a course (needed for the inventory phase):
GET /api/v1/courses/{courseId}/units
This returns a JSON array of unit objects. Each object includes a type field — look for type: "video" to identify video units. The unit_id field is the identifier for the subtitle upload endpoint. For SCORM units (type: "scorm"), the LMS-side caption API is not relevant — the SCORM package must be repacked.
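The three-surface split from the previous section can be applied mechanically to the unit list. This minimal routing sketch assumes unit objects carry the `type` and `unit_id` fields described above. The `"iframe"` type string for embedded YouTube or Vimeo units is a hypothetical placeholder. Replace it with whatever your account actually returns.

```python
def route_units(units):
    """Group TalentLMS unit IDs by the caption surface that applies to them."""
    routes = {"subtitle_api": [], "scorm_repack": [], "source_platform": [], "other": []}
    for unit in units:
        kind = unit.get("type")
        if kind == "video":
            routes["subtitle_api"].append(unit["unit_id"])     # sidecar SRT via the API
        elif kind == "scorm":
            routes["scorm_repack"].append(unit["unit_id"])     # captions live inside the package
        elif kind == "iframe":  # hypothetical type string for embedded video
            routes["source_platform"].append(unit["unit_id"])  # caption on YouTube/Vimeo instead
        else:
            routes["other"].append(unit["unit_id"])            # not video: review manually
    return routes
```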
TalentLMS bulk retrofit pattern
The bulk retrofit for a TalentLMS account follows this sequence: (1) list all courses via GET /api/v1/courses; (2) for each course, list units and filter to type: "video"; (3) for each video unit, check whether a subtitle already exists by inspecting the unit detail response for a subtitles array; (4) for units without subtitles, or units where the existing subtitle needs replacement, download the video asset URL from the unit detail, run it through the caption generation pipeline, normalise the output SRT, and POST to the subtitle endpoint. Step (3) is critical for re-run safety: the TalentLMS subtitle endpoint on some account configurations creates a duplicate track rather than replacing an existing one, and duplicate tracks present as two caption options in the player UI — confusing to learners and a caption-quality audit finding. Check for an existing subtitle record before uploading, and if one exists, delete it first via DELETE /api/v1/courses/{courseId}/units/{unitId}/subtitle/{subtitleId} before creating the replacement.
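The re-run-safe check in step (3) and the delete-before-create rule reduce to a small planning function. This sketch assumes each entry in the unit's `subtitles` array has `id` and `language` fields, which are hypothetical names. It returns the calls to make rather than making them, which keeps a dry run cheap.

```python
def plan_subtitle_actions(course_id, unit_id, subtitles, language="en-US", replace=False):
    """Return (HTTP verb, path) pairs needed to leave exactly one track for `language`."""
    base = f"/api/v1/courses/{course_id}/units/{unit_id}/subtitle"
    existing = [s for s in subtitles if s.get("language") == language]
    if existing and not replace:
        return []  # already captioned: re-running the batch is a no-op
    # Delete every matching track first so the player never shows duplicates.
    actions = [("DELETE", f"{base}/{s['id']}") for s in existing]
    actions.append(("POST", base))
    return actions
```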
Rate limiting on the TalentLMS API is account-tier dependent but generally allows 1,000 requests per hour on standard accounts. Consider a catalogue of 500 video units on a typical 20-course account. The inventory phase makes one GET per course to list its units plus one GET per unit for its detail, which comes to roughly 520 requests. That is more than half the hourly budget before any deletes or retries. A single-threaded client needs at least 3.6 seconds between requests (3,600 seconds ÷ 1,000 requests) to stay under the limit. A 4-second spacing (900 requests per hour) leaves a safety margin and puts the inventory pass at about 35 minutes. The subtitle upload itself is outside the standard rate count on most TalentLMS plans, because it is a multipart upload rather than a REST API call in the quota-tracked sense. Even so, keep uploads to no more than 50 per hour to avoid triggering the platform's abuse-detection heuristics.
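A fixed sleep after every request adds each request's own latency on top of the sleep. A pacer that measures the time since the previous call instead spaces requests at the target interval. This sketch assumes a single-threaded client. At 1,000 requests per hour with 90% headroom, it spaces calls 4 seconds apart (3,600 ÷ 900). The clock and sleep functions are injectable so the pacing can be tested without waiting.

```python
import time


class RequestPacer:
    """Keeps a single-threaded client under an hourly request quota."""

    def __init__(self, requests_per_hour=1000, headroom=0.9,
                 clock=time.monotonic, sleep=time.sleep):
        # 1,000/hour at 90% headroom -> 900/hour -> one request every 4.0 s.
        self.interval = 3600 / (requests_per_hour * headroom)
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Call immediately before each API request."""
        if self._last is not None:
            remaining = self.interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```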
Docebo caption ingestion
Caption surfaces in Docebo
Docebo manages video content across two distinct product areas with different caption paths. The first is formal courseware: ILT (Instructor-Led Training) sessions and eLearning courses that contain video assets uploaded directly to the Docebo asset library. The second is Docebo Coach & Share, the social-learning module where users can upload video directly to channels. The caption API described below applies to the formal courseware path. Coach & Share video does not expose a subtitle track API on most Docebo plans. Caption management for Coach & Share content requires one of two routes. The first is a Docebo Content Partner integration, such as a Kaltura integration that handles captions at the Kaltura layer. The second is manual per-video caption upload through the Coach & Share admin UI.
For SCORM packages uploaded to Docebo, the same rule applies as for TalentLMS: captions embedded in the SCORM package are handled by the SCORM player, not by Docebo's subtitle track system. The LMS cannot inject a caption track into a SCORM package post-upload without re-authoring the package. Identify SCORM-wrapped video early in the inventory phase and route it to the package-reauthoring track rather than the API-upload track.
The Docebo subtitle track API
Docebo's Learning Management API (v1.0) exposes subtitle track management at the video-asset level, not the course level. A video asset in Docebo has a unique asset ID, and subtitle tracks are attached to the asset rather than to any particular course that references the asset. This is architecturally different from TalentLMS, where captions are per-unit (per-course-appearance). The implication for bulk retrofit: a video asset that is reused across multiple Docebo courses needs its caption added only once at the asset level, and all courses that reference it will immediately reflect the captioned version. This makes the Docebo bulk retrofit more efficient for organisations that reuse assets across compliance curriculum.
The subtitle track creation endpoint:
POST /learn/v1/manage/videos/{video_id}/subtitles
Content-Type: application/json
Authorization: Bearer {oauth_token}
{
"language": "en-US",
"label": "English (US)",
"default": true
}
This creates the subtitle track record and returns a subtitle_id. The actual VTT file content is uploaded in a separate step:
PUT /learn/v1/manage/videos/{video_id}/subtitles/{subtitle_id}/content
Content-Type: multipart/form-data
file=@en-US.vtt
Docebo requires VTT format for the content upload; SRT files are not accepted at this endpoint. Files must begin with the WEBVTT header line. The language field must be a region-qualified BCP-47 tag; as noted in the format section, bare two-letter codes such as en are rejected with a 422 error. The default boolean controls whether this track is selected by default in the player for users who have not set a caption preference. Set it to true for the primary-language track and false for any additional language tracks.
To list existing subtitle tracks for a video:
GET /learn/v1/manage/videos/{video_id}/subtitles
To delete a subtitle track before replacing it:
DELETE /learn/v1/manage/videos/{video_id}/subtitles/{subtitle_id}
Finding video asset IDs in Docebo
The Docebo API does not provide a single endpoint to list all video assets in the account. The inventory approach is: (1) list all courses via GET /learn/v1/courses with pagination; (2) for each course, list course objects (the learning objects inside the course) via GET /learn/v1/courses/{course_id}/objects; (3) for each course object of type video, extract the asset_id. De-duplicate asset IDs across courses (the same asset ID may appear in multiple courses) before proceeding to the caption step. A repeated POST to the same asset_id is harmless, because Docebo returns a 409 Conflict for a duplicate language on an asset. It still adds unnecessary API calls and slows the pipeline.
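The de-duplication step can also record which courses share each asset, which is useful for the audit trail. This sketch assumes the course objects for every course have already been fetched into a dict keyed by course ID, with `type` and `asset_id` fields as described above.

```python
def unique_video_assets(objects_by_course):
    """Map each video asset_id to the list of course IDs that reference it."""
    assets = {}
    for course_id, objects in objects_by_course.items():
        for obj in objects:
            if obj.get("type") == "video":
                # setdefault keeps one entry per asset, however many courses reuse it.
                assets.setdefault(obj["asset_id"], []).append(course_id)
    return assets
```

The caption loop then runs once per key, and the course lists show which curricula each caption fix reaches.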
Docebo OAuth token management
Docebo's API uses OAuth 2.0 client credentials. Token expiry is typically 3,600 seconds (1 hour), so a bulk retrofit of 500+ assets will outlive one token. Implement automatic token refresh: check the token expiry timestamp before each API request and re-authenticate if within 60 seconds of expiry. Do not store the Docebo API credentials in the pipeline script; use an environment variable or a secrets manager. The Docebo client_id and client_secret are account-level credentials that can be used to issue tokens for any user's context. Treat them with the same care as a production API key.
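Here is a sketch of the refresh-before-expiry rule. `fetch_token` is a hypothetical callable that performs Docebo's client-credentials token request, reading the client_id and client_secret from environment variables. It returns the access token with its lifetime in seconds.

```python
import time


class TokenCache:
    """Reuses an OAuth token until it is within `skew` seconds of expiry."""

    def __init__(self, fetch_token, clock=time.time, skew=60):
        self._fetch = fetch_token  # hypothetical: () -> (access_token, expires_in_seconds)
        self._clock = clock
        self._skew = skew
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, re-authenticating if it is missing or about to expire."""
        if self._token is None or self._clock() >= self._expires_at - self._skew:
            token, expires_in = self._fetch()
            self._token = token
            self._expires_at = self._clock() + expires_in
        return self._token
```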
Absorb LMS caption ingestion
Caption surfaces in Absorb
Absorb LMS hosts video primarily through its online course module, where each lesson (the atomic content unit) can contain a video asset. Absorb has two video paths: hosted video (uploaded directly to Absorb's CDN) and embedded video (an external URL or embed code). For hosted video, Absorb provides a Captions tab in the lesson admin interface where a single SRT file can be uploaded per lesson. For embedded video from Vimeo, YouTube, or a direct MP4 URL, Absorb does not provide a caption surface — captions must be managed at the source platform.
Absorb Infuse, the embeddable player product, also accepts SRT sidecars for captions on video content embedded in non-LMS contexts (intranet pages, custom portals). The Infuse caption path is separate from the LMS lesson caption path and uses a different configuration mechanism (a data attribute on the embed code rather than an API call).
Absorb and the bulk API gap
As of mid-2026, Absorb does not publish a bulk caption upload API in its standard developer documentation. The public Absorb REST API covers course management, user enrollment, and reporting, but the caption endpoint for individual lessons is not documented in the publicly available API reference. This is the most significant operational constraint of the four platforms: Absorb requires a different approach for bulk retrofit than the other three.
There are three practical paths for bulk Absorb caption retrofit. The first is the administrative import CSV approach: Absorb supports a lesson import format (accessible through Admin Panel → Course Management → Import) that allows batch creation of lessons with media assets. If the retrofit involves creating new lesson versions (replacing the lesson entirely with a properly captioned version rather than attaching a caption to the existing lesson), this path works for volume but requires care with enrollment preservation — replacing a lesson severs completion records for learners who completed the original lesson, which breaks the compliance documentation trail. This path is appropriate for new courses that have not yet been assigned to learners, not for back-catalogue retrofit of active courses.
The second path is the partner API path: Absorb does expose a caption management endpoint to enterprise accounts and integration partners, accessible through the Absorb Integration Services team. If your organisation has an enterprise Absorb contract, request API access to the caption endpoint through your account manager. The endpoint structure when available follows a pattern similar to the Docebo API — a lesson-level subtitle asset with a separate content-upload step — but the exact schema requires documentation from the Absorb integration team and should not be assumed from the public API reference.
The third path is automated UI interaction — using a browser automation tool (Playwright, Puppeteer, or Selenium) to navigate the Absorb admin interface programmatically and upload SRT files through the Captions tab UI. This is the path of last resort for organisations without enterprise API access and with a large back catalogue. It is slower (typically 2–4 minutes per lesson due to page-load time and file-upload latency), more brittle (any Absorb UI update can break the automation), and does not provide structured error reporting, but it works and it does not require API access. For catalogues below 200 lessons, the automated UI path is often faster to build and run than negotiating partner API access.
Absorb SRT format requirements
When uploading via the Absorb admin UI (manually or via automation), the SRT file must meet four requirements. First, UTF-8 without BOM: the BOM causes the first cue to be rendered as text with a visible  prefix in the player. Second, Unix line endings: Windows CRLF line endings cause Absorb's player to misparse cue boundaries on approximately 15% of cues in practice. Lines then run together without the expected display-and-clear cycle. Third, sequence numbers starting at 1: Absorb's parser does not accept non-sequential or zero-indexed cue numbering. Fourth, a maximum cue duration of 7 seconds. Longer cues are not rejected by the uploader but are silently truncated at display time: the player displays the first 7 seconds of text and then clears without completing the cue. The 7-second limit is particularly relevant for compliance training video that includes dense regulatory text read at a slow pace by the narrator. These cues naturally run long and must be split at the normalisation step.
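The 7-second limit and the numbering rule can be enforced in the same pass. This sketch parses normalised SRT and splits any cue longer than the limit into equal time slices, sharing the words out evenly. It then renumbers all cues from 1. Equal slicing is a crude heuristic that will not line up exactly with speech. If the ASR output carries word-level timestamps, split on those instead. Multi-line cue text is reflowed onto one line.

```python
import math
import re

# Normalised SRT timing line (comma separator), captured as h, m, s, ms twice.
TIMING = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})")


def _to_ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def _to_ts(total_ms):
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def split_long_cues(srt_text, max_ms=7000):
    """Split cues longer than max_ms into equal time slices; renumber from 1."""
    out, seq = [], 0
    for block in re.split(r"\n{2,}", srt_text.strip()):
        lines = block.split("\n")
        m = TIMING.match(lines[1]) if len(lines) > 1 else None
        if m is None:
            continue  # malformed block: skip rather than guess (log it in practice)
        start, end = _to_ms(*m.groups()[:4]), _to_ms(*m.groups()[4:])
        words = " ".join(lines[2:]).split()
        # Enough slices to get under max_ms, but never more slices than words.
        parts = min(max(1, math.ceil((end - start) / max_ms)), max(1, len(words)))
        for i in range(parts):
            s = start + (end - start) * i // parts
            e = start + (end - start) * (i + 1) // parts
            chunk = words[len(words) * i // parts : len(words) * (i + 1) // parts]
            seq += 1
            out.append(f"{seq}\n{_to_ts(s)} --> {_to_ts(e)}\n{' '.join(chunk)}")
    return "\n\n".join(out) + "\n"
```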
Kaltura caption ingestion
Caption surfaces in Kaltura
Kaltura is the most API-complete platform in the set. It serves as both a standalone video platform and as the media management layer embedded inside other LMS platforms — Kaltura MediaSpace is a common companion deployment alongside Canvas LMS, Moodle, and Brightspace. When Kaltura is embedded inside an LMS via LTI (Learning Tools Interoperability), captions managed through the Kaltura API are reflected immediately in the LTI-embedded player within the LMS without any additional synchronisation step — which makes the Kaltura API the right primary integration point for organisations using a Kaltura-backed LMS, rather than the LMS's own caption interface (which may only expose a subset of what the Kaltura API can do).
Kaltura's caption architecture has three distinct layers. The first is the caption asset — the actual SRT or VTT file stored as an asset attached to a Kaltura media entry. The second is the caption profile — a configuration set that defines the default caption behaviour for new media entries (which languages to auto-generate, whether human-captioning is ordered automatically, what accuracy tier is required for REACH captions). The third is the REACH integration — Kaltura's caption ordering service that submits media to a human or AI captioning vendor and returns the caption asset asynchronously. For bulk retrofit, the relevant layer is caption assets — the caption profile is a global configuration concern, and REACH is an optional outsourcing path that runs in parallel to direct SRT/VTT upload.
The Kaltura caption asset API
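Kaltura's API client library is available for PHP, Python, Java, Node.js, Ruby, and C#. The API authenticates with a Kaltura Session (KS) token obtained via session.start. For bulk operations, use an admin session (KalturaSessionType.ADMIN, value 2) rather than a user session (KalturaSessionType.USER, value 0). An admin session lets the batch act on every entry in the partner account rather than only those a user owns. A KS expires after the fixed lifetime set when it is created (the expiry parameter, 24 hours by default), not after a period of inactivity. Set an expiry that covers the whole batch or regenerate the KS on a schedule. The session is tied to the partner ID and admin secret of the Kaltura account.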
The caption asset lifecycle for uploading a new SRT file to a Kaltura media entry:
Step 1 — Create the caption asset record:
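caption_asset = KalturaCaptionAsset(
language = "English",
label = "English (US)",
languageCode = "en",
isDefault = KalturaNullableBoolean.TRUE_VALUE,
format = KalturaCaptionType.SRT
)
created_asset = client.caption.captionAsset.add(entry_id, caption_asset)  # created_asset.id is the caption asset ID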
Note: Kaltura uses its own language enumeration (the string "English" rather than the BCP-47 code en-US) in the language field of the caption asset object. The languageCode field is the two-letter ISO 639-1 code. For VTT files, set format: KalturaCaptionType.WEBVTT. This step returns a caption asset ID.
Step 2 — Upload the caption file content:
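upload_token = client.uploadToken.add(KalturaUploadToken())
client.uploadToken.upload(upload_token.id, file_data)  # file_data = open("en-US.srt", "rb")
content_resource = KalturaUploadedFileTokenResource(
token = upload_token.id
)
client.caption.captionAsset.setContent(caption_asset_id, content_resource)  # caption asset ID from Step 1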
Kaltura's file upload is a three-step process. First create an upload token, then upload the file content to the token, then associate the uploaded content with the caption asset. This is different from the direct multipart-form upload used by TalentLMS and Docebo. Upload tokens are single-use, so every file needs its own uploadToken.add call. To cut HTTP round-trips on large batches, use the client's multi-request mode, which bundles several calls into one request.
Step 3 — Verify the caption asset is ready:
captionAsset.get(captionAssetId)
Check that the returned status field equals KalturaCaptionAssetStatus.READY (value 2). A non-READY status such as QUEUED means the asset is still being processed, because Kaltura parses and stores the uploaded file asynchronously. Files under 1MB cover virtually all SRT and VTT caption files for videos under 3 hours, and for these the processing time is typically under 30 seconds. Poll at a 5-second interval with a 120-second timeout. If the asset is not READY after 120 seconds, log it as a deferred item and continue the batch.
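Here is the poll-and-defer rule as a sketch. `get_status` is a hypothetical callable wrapping captionAsset.get that returns the integer status. Only READY (2) counts as success, so any other value keeps polling until the timeout.

```python
import time

READY = 2  # KalturaCaptionAssetStatus.READY


def wait_until_ready(get_status, interval=5.0, timeout=120.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until the caption asset is READY; False means log it as deferred."""
    deadline = clock() + timeout
    while True:
        if get_status() == READY:
            return True
        if clock() + interval > deadline:
            return False  # another wait would overrun the timeout
        sleep(interval)
```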
Listing media entries for bulk retrofit
To enumerate all media entries in a Kaltura account:
media.list(filter, pager)
filter = KalturaMediaEntryFilter(
mediaTypeEqual = KalturaMediaType.VIDEO,
statusIn = "2" # READY status only
)
pager = KalturaFilterPager(
pageSize = 100,
pageIndex = 1
)
Paginate through all results; the objects array in the response plus the totalCount tell you how many pages to fetch. Kaltura list calls do not page past the first 10,000 matching results. For accounts with more than 10,000 media entries, chunk the enumeration by date range with the createdAtGreaterThanOrEqual filter, paired with createdAtLessThanOrEqual. Chunking also avoids timeout issues on long media.list calls.
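One way to apply date-range chunking without guessing window sizes is to bisect any window whose totalCount exceeds the limit. This sketch assumes two hypothetical wrappers around media.list, both filtering on createdAtGreaterThanOrEqual and createdAtLessThanOrEqual in Unix seconds. `count` returns totalCount for a creation-time window. `fetch_page` returns one page of entries for that window.

```python
MAX_RESULTS = 10_000  # list calls will not page past this many results per filter


def enumerate_entries(count, fetch_page, start, end, page_size=500, max_results=MAX_RESULTS):
    """Yield every entry created in [start, end], bisecting windows that are too large."""
    total = count(start, end)
    if total > max_results and end > start:
        # Too many results to page through: split the window by creation time.
        mid = (start + end) // 2
        yield from enumerate_entries(count, fetch_page, start, mid, page_size, max_results)
        yield from enumerate_entries(count, fetch_page, mid + 1, end, page_size, max_results)
        return
    pages = -(-total // page_size)  # ceiling division
    for page_index in range(1, pages + 1):  # Kaltura's pageIndex is 1-based
        yield from fetch_page(start, end, page_index, page_size)
```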
For each media entry, check existing caption assets:
captionAsset.list(filter)
filter = KalturaCaptionAssetFilter(
entryIdEqual = entryId
)
If the result contains a READY caption asset with the correct language and format, skip the entry (or compare timestamps if you want to replace a caption generated before a glossary was available). If no caption asset exists or only a MACHINE (auto-generated, lower-quality) caption asset exists, queue the entry for new caption upload.
Kaltura REACH
Kaltura REACH is a caption ordering integration built into the Kaltura platform. If your Kaltura account has a REACH subscription, you can order captions for a media entry programmatically:
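# reach_profile_id: an existing REACH profile (set up once, e.g. via reach_reachProfile.add)
# catalog_item_id: the REACH catalog item that selects the service (human or machine captions)
task = KalturaEntryVendorTask(entryId = entry_id, reachProfileId = reach_profile_id, catalogItemId = catalog_item_id)
client.reach.entryVendorTask.add(task)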
REACH supports both human captioning (higher accuracy, higher cost, longer turnaround) and AI captioning (lower cost, faster, lower accuracy on domain-specific vocabulary). For compliance training content with heavy proper-noun density — drug names, chemical names, product terminology, regulatory citations — REACH AI captioning will exhibit the same accuracy failures as any other generic ASR service. The correct path for this content is: generate captions with GlossCap's glossary-biased pipeline, then upload the corrected SRT/VTT via the caption asset API rather than ordering through REACH. REACH is the right choice for general-vocabulary content (meeting recordings, presentation captures) where glossary biasing provides minimal lift over generic ASR.
Video host ingestion: Panopto, Vimeo, Wistia
Many L&D estates use a video management platform alongside or instead of LMS-native video hosting. The three most common mid-market choices are Panopto (strong in higher education and enterprise, with deep LMS integration via LTI), Vimeo (strong in creative and mid-market corporate, with a clean API), and Wistia (strong in marketing and sales-enablement L&D, with advanced engagement analytics). Each has a distinct caption API.
Panopto caption ingestion
Panopto provides caption management through two mechanisms: automatic speech recognition (ASR) captions generated by Panopto's built-in transcription service, and manual caption upload (sidecar SRT). For bulk retrofit, the relevant path is sidecar SRT upload via the Panopto REST API.
Panopto uses OAuth 2.0 for API authentication (client credentials flow). The base URL is https://{your-panopto-domain}/Panopto/api/v1. The caption upload workflow:
Step 1 — List sessions (video content):
GET /folders/{folderId}/sessions?maxResults=100&pageNumber=0
Panopto organises content in folders. Enumerate folders recursively from the root and collect session IDs. Each session has an Id (UUID), a Name, and a Duration.
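The recursive folder walk can be written breadth-first with a visited set, so a folder reachable by two paths is only listed once. `list_child_folders` and `list_sessions` are hypothetical wrappers around the Panopto folder endpoints. The latter is assumed to handle the pageNumber pagination shown above.

```python
from collections import deque


def collect_session_ids(root_folder_id, list_child_folders, list_sessions):
    """Breadth-first walk of the folder tree, returning every session Id."""
    seen = {root_folder_id}
    queue = deque([root_folder_id])
    session_ids = []
    while queue:
        folder_id = queue.popleft()
        session_ids.extend(session["Id"] for session in list_sessions(folder_id))
        for child_id in list_child_folders(folder_id):
            if child_id not in seen:  # guard against revisiting a folder
                seen.add(child_id)
                queue.append(child_id)
    return session_ids
```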
Step 2 — Check existing captions:
GET /sessions/{sessionId}/captions
Returns an array of caption records. Each record has three relevant fields. The Language field is an ISO 639-1 two-letter code: despite the BCP-47 recommendation for other platforms, Panopto uses "en", not "en-US". The IsDefault field is a boolean. The Type field indicates "SRT" or "Machine" (ASR), and Panopto-generated ASR captions have type "Machine". If the session has only a "Machine" caption record and no "SRT" record, it is a candidate for replacement.
Step 3 — Upload SRT:
POST /sessions/{sessionId}/captions
Content-Type: multipart/form-data
file=@en.srt
language=en
Panopto accepts SRT only at this endpoint. The uploaded SRT overrides the display of any existing ASR captions — both tracks exist in the session record, but the manually uploaded SRT is displayed by default and the ASR track is demoted to an alternative.
Panopto's LTI integration with Canvas, Moodle, and Brightspace means that captions uploaded to Panopto sessions are immediately visible in the LMS embed without any action on the LMS side. This makes Panopto the natural single caption-management layer for organisations where the LMS hosts Panopto-sourced video via LTI — manage captions in Panopto, not in the LMS.
Vimeo caption ingestion
Vimeo exposes caption management through its text tracks API. Vimeo accepts VTT only — SRT files are not accepted at the text tracks endpoint and will be rejected with a 400 response. The Vimeo API uses OAuth 2.0 (bearer token from the developer dashboard).
The text track workflow:
Step 1 — List videos:
GET /me/videos?fields=uri,name,duration&per_page=100&page=1
Authorization: Bearer {token}
The uri field is in the form /videos/{video_id}. Extract the numeric video_id from the URI string.
Step 2 — Check existing text tracks:
GET /videos/{video_id}/texttracks
Authorization: Bearer {token}
Returns an array of text track objects, each with language (BCP-47 format, e.g. en-US), type ("captions", "subtitles", or "chapters"), and active boolean. Vimeo distinguishes between "captions" (text equivalents including non-speech audio description) and "subtitles" (translation tracks without non-speech description). For WCAG compliance, use type "captions".
Step 3 — Create text track and upload VTT:
POST /videos/{video_id}/texttracks
Content-Type: application/json
Authorization: Bearer {token}
{
"type": "captions",
"language": "en-US",
"name": "English (US)"
}
This creates the text track record and returns a link field containing a pre-signed upload URL. Use a PUT request to upload the VTT file content directly to that URL:
PUT {pre_signed_link}
Content-Type: text/vtt
[VTT file content]
The VTT must be a plain-text body (not multipart form data) for the Vimeo pre-signed upload. This is different from all other platforms in the set and is the source of the most common Vimeo caption upload failure in bulk pipelines: sending a multipart-encoded body to the pre-signed URL returns a 400 error without a clear diagnostic message.
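A minimal stdlib sketch of the Vimeo steps follows, written as request builders. It assumes the endpoint paths and JSON fields described above. The function names are illustrative. Sending is left to urllib.request.urlopen, so the requests can be inspected before they go out.

```python
import json
import re
import urllib.request

API = "https://api.vimeo.com"


def video_id_from_uri(uri: str) -> str:
    """Extract the numeric ID from a Vimeo URI such as '/videos/123456789'."""
    match = re.match(r"/videos/(\d+)", uri)
    if not match:
        raise ValueError(f"unexpected Vimeo URI: {uri!r}")
    return match.group(1)


def create_track_request(video_id: str, token: str, language: str = "en-US",
                         name: str = "English (US)") -> urllib.request.Request:
    """Create the text track record; the JSON response carries the upload link."""
    body = json.dumps({"type": "captions", "language": language, "name": name}).encode()
    return urllib.request.Request(
        f"{API}/videos/{video_id}/texttracks", data=body, method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"})


def upload_vtt_request(link: str, vtt_text: str) -> urllib.request.Request:
    """PUT the raw VTT text as the request body (not multipart form data)."""
    return urllib.request.Request(link, data=vtt_text.encode("utf-8"), method="PUT",
                                  headers={"Content-Type": "text/vtt"})
```

In a pipeline, you would:
1. Open the create request with urllib.request.urlopen.
2. Read link from the JSON response.
3. Open the PUT request built from that link.

The raw-bytes body with Content-Type text/vtt is what avoids the multipart 400 described above.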
Wistia caption ingestion
Wistia exposes caption management through its Data API. Wistia accepts both SRT and VTT. The API uses HTTP Basic authentication with the account's API password.
List media:
GET https://api.wistia.com/v1/medias.json?type=Video&page=1&per_page=100
Authorization: Basic {base64(api:password)}
Check existing captions:
GET https://api.wistia.com/v1/medias/{media_hashed_id}/captions.json
Authorization: Basic {base64(api:password)}
Wistia uses a hashed ID (an alphanumeric string like abc1def23g) to identify media assets rather than a sequential integer ID. The hashed ID is in the hashed_id field of the media list response.
Upload captions:
POST https://api.wistia.com/v1/medias/{media_hashed_id}/captions.json
Content-Type: multipart/form-data
Authorization: Basic {base64(api:password)}
caption_file=@en-US.vtt
language=eng
Note: Wistia uses ISO 639-2/T three-letter codes at this endpoint (e.g. eng for English, fra for French). It does not accept two-letter ISO 639-1 codes or BCP-47 tags. This is the most unusual language-code convention in the set and the most common cause of Wistia caption upload failures: sending en or en-US to this endpoint returns a validation error.
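Each platform wants a different code for the same language. One option is to keep a single canonical BCP-47 tag per caption file and derive the platform code at upload time. The sketch below follows the conventions stated in this post:
- Wistia: ISO 639-2/T.
- Panopto: two-letter codes.
- Docebo: region-qualified BCP-47.
- Vimeo: BCP-47.

The lookup table is a hypothetical subset, not a complete ISO mapping.

```python
# ISO 639-1 -> ISO 639-2/T, for Wistia. Illustrative subset: extend for your catalogue.
ISO_639_2T = {"en": "eng", "fr": "fra", "de": "deu", "es": "spa", "it": "ita",
              "pt": "por", "nl": "nld", "ja": "jpn", "zh": "zho"}


def language_for(platform: str, tag: str) -> str:
    """Convert one canonical BCP-47 tag (e.g. 'en-US') to the code a platform expects."""
    tag = tag.replace("_", "-")  # tolerate 'en_US' style input
    primary = tag.split("-")[0].lower()
    if platform == "wistia":
        return ISO_639_2T[primary]  # three-letter code, e.g. "eng" (KeyError if unmapped)
    if platform == "panopto":
        return primary              # two-letter code, e.g. "en"
    if platform == "docebo":
        if "-" not in tag:
            raise ValueError(f"Docebo needs a region-qualified tag, got {tag!r}")
        return tag                  # e.g. "en-US"
    if platform == "vimeo":
        return tag                  # BCP-47 passed through as-is
    raise ValueError(f"no language-code rule for platform {platform!r}")
```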
Wistia also supports auto-generated captions (Wistia Transcripts, using their built-in ASR) that are displayed as an editable transcript in the player. These are separate from the manually uploaded caption track and can coexist in the same media asset. For WCAG compliance purposes, the manually uploaded caption track takes priority — confirm that the player is configured to display the uploaded caption file rather than the auto-generated transcript when both are present. This is controlled in the Wistia player settings under "Captions" for each media asset.
The bulk retrofit pipeline: six phases
The complete bulk caption retrofit pipeline for a mixed LMS / video-host estate has six phases. Each phase has a distinct set of failure modes and a distinct verification step.
Phase 1: Inventory
The inventory phase enumerates every video asset in scope — LMS-hosted video, SCORM-embedded video, externally embedded video — and classifies each asset by its caption surface. The output is a structured manifest: one row per video asset with columns for platform, asset ID, course/folder/channel, current caption status (none / machine / human / unknown), duration, and creation date. The duration column is important for triage prioritisation (compliance risk is proportional to duration — a 45-minute module without captions is a higher priority than a 2-minute intro clip) and for estimating caption generation cost.
For mixed estates, run the inventory phase API calls in parallel across platforms, not serially.

Consider a 500-asset estate spread across TalentLMS (200), Docebo (150), Panopto (100), and Vimeo (50), inventoried at roughly 100 API calls per platform and 3.5 seconds per call:
- Serially, the four platforms take 400 calls × 3.5 seconds, about 23 minutes.
- Concurrently, wall-clock time is bounded by the slowest platform: 100 calls × 3.5 seconds, about 6 minutes.

That saving is modest at 500 assets. At 5,000 assets the difference is roughly 3.9 hours serial vs 58 minutes parallel.
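Here is a sketch of the concurrent inventory using a thread pool, which suits I/O-bound HTTP calls. The per-platform fetcher callables are hypothetical placeholders for whatever wraps each platform's list endpoint. With concurrency, wall-clock time tracks the slowest platform rather than the sum of all of them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical per-platform inventory function: returns one manifest row (dict) per asset.
Fetcher = Callable[[], list[dict]]


def run_inventory(fetchers: dict[str, Fetcher]) -> list[dict]:
    """Run every platform's inventory at once and merge rows into one manifest."""
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        futures = {platform: pool.submit(fetch) for platform, fetch in fetchers.items()}
        rows: list[dict] = []
        for platform, future in futures.items():
            # result() waits for that platform and re-raises any exception it hit.
            rows.extend({"platform": platform, **row} for row in future.result())
    return rows
```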
SCORM-wrapped video requires a separate inventory step: download and unzip each SCORM package to inspect its manifest (imsmanifest.xml) for embedded media assets and check whether a VTT or TTML sidecar is present in the package. This step cannot be done via API and requires local file access to the SCORM packages or a download step that retrieves them from the LMS. SCORM packages with embedded video and no caption sidecar are routed to a separate "SCORM repack" track in the pipeline, not the API upload track.
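Here is a sketch of the SCORM inspection step using only the standard library. It makes three assumptions:
- imsmanifest.xml sits at the package root.
- Any .vtt, .ttml, or .dfxp file counts as a caption sidecar.
- Classification is per package, not per video.

The extension lists and route labels are illustrative.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

VIDEO_EXT = (".mp4", ".m4v", ".mov", ".webm")
CAPTION_EXT = (".vtt", ".ttml", ".dfxp")


def inspect_scorm(package: bytes) -> dict:
    """Classify one SCORM ZIP: embedded video files, caption sidecars, pipeline route."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        names = set(zf.namelist())
        if "imsmanifest.xml" not in names:
            raise ValueError("no imsmanifest.xml at package root")
        root = ET.fromstring(zf.read("imsmanifest.xml"))
    # <file href="..."> entries, matched on local tag name so any IMS CP namespace works.
    listed = {el.get("href") for el in root.iter()
              if el.tag.rsplit("}", 1)[-1] == "file" and el.get("href")}
    paths = listed | names  # also catch media present in the ZIP but not listed
    video = sorted(p for p in paths if p.lower().endswith(VIDEO_EXT))
    captions = sorted(p for p in paths if p.lower().endswith(CAPTION_EXT))
    route = "scorm-repack" if video and not captions else "no-repack-needed"
    return {"video": video, "captions": captions, "route": route}
```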
Phase 2: Triage
The triage phase assigns each asset a priority tier based on compliance risk and remediation effort. A useful triage framework:
- Tier 1 (urgent): Active course, duration > 10 minutes, no captions at all, in an ADA/EAA/Section 508 scope org unit (public-facing, customer-facing, or assigned to employees with documented accommodation requests). Remediate before the next compliance audit window.
- Tier 2 (high): Active course, duration > 5 minutes, machine-generated captions only (ASR without glossary correction), in a compliance-heavy content category (regulated industry training, OSHA/HIPAA/FERPA scope). Remediate within 30 days.
- Tier 3 (medium): Active course, any duration, machine-generated captions, general business content. Remediate within 90 days.
- Tier 4 (low): Archived or inactive courses, short clips (< 3 minutes), content with existing human-quality captions. Document and monitor; no immediate action required.
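The tiers can be encoded as an ordered rule list in which the first match wins. This is a sketch, and the field names are hypothetical. The tiers as written also overlap and leave gaps:
- Overlap: a short machine-captioned clip fits both Tier 3 ("any duration") and Tier 4 ("short clip").
- Gap: an uncaptioned 8-minute video outside an ADA-scope unit matches no tier.

The rule ordering and the Tier 3 fallback are therefore assumptions to review with compliance.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    active: bool
    duration_min: float
    caption_status: str     # "none" | "machine" | "human" | "unknown"
    in_scope_unit: bool     # ADA/EAA/Section 508 scope org unit
    compliance_heavy: bool  # regulated-industry / OSHA / HIPAA / FERPA content


def triage(asset: Asset) -> int:
    """Assign a tier 1-4; rules are checked in priority order, first match wins."""
    if not asset.active or asset.caption_status == "human":
        return 4
    if asset.caption_status == "none" and asset.duration_min > 10 and asset.in_scope_unit:
        return 1
    if asset.caption_status == "machine" and asset.duration_min > 5 and asset.compliance_heavy:
        return 2
    if asset.duration_min < 3:
        return 4
    return 3  # assumption: everything else defaults to Tier 3 pending manual review
```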
The triage output is a prioritised queue that feeds Phase 3. It also produces the compliance-documentation artifact: the back-catalogue assessment that most WCAG auditors and OCR investigators will request in a findings-letter response. Having a structured triage output showing that you identified all uncaptioned assets, assessed their risk, and have a remediation schedule with tier assignments demonstrates good-faith compliance effort — which is the legal standard under ADA Title II's "program access" framework, not perfection at a single point in time.
Phase 3: Generate
The generation phase runs each Tier 1 and Tier 2 asset through the caption pipeline. For a glossary-biased pipeline, this means: (1) retrieve the audio or video asset from its source platform; (2) extract the audio track if the source is video; (3) chunk the audio into segments ≤ 25MB for Whisper API compatibility; (4) run Whisper-large with the domain-specific glossary in the initial prompt; (5) post-process the transcript to convert ASR output tokens to the correctly formatted terms (CAS numbers, H-codes, product version strings, regulatory citation formats); (6) generate timed SRT or VTT from the transcript with appropriate cue-duration splitting.
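For step (3), chunk length follows from the encoded bitrate. The sketch below makes three assumptions:
- The audio is constant-bitrate (for example, mono MP3 at 64 kbps).
- The 25MB limit is read conservatively as 25,000,000 bytes.
- 5% headroom is reserved for container overhead.

The returned (start, end) second offsets could then be passed to an extraction tool such as ffmpeg.

```python
def plan_chunks(duration_s: int, bitrate_kbps: int, limit_bytes: int = 25_000_000,
                headroom: float = 0.95) -> list[tuple[int, int]]:
    """Split [0, duration_s) into (start, end) second ranges that fit under limit_bytes."""
    bytes_per_s = bitrate_kbps * 1000 / 8  # constant-bitrate assumption
    max_chunk_s = int(limit_bytes * headroom // bytes_per_s)
    if max_chunk_s < 1:
        raise ValueError("bitrate too high for the size limit")
    return [(start, min(start + max_chunk_s, duration_s))
            for start in range(0, duration_s, max_chunk_s)]
```

At 64 kbps this gives chunks of about 49 minutes, so a 60-minute module becomes two uploads.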
The glossary for the generation step is the single largest accuracy lever for proper-noun-dense content. See glossary-biased captioning for the implementation details, and Why 99% caption accuracy matters for the accuracy standard the output must meet. For drug-name and chemical-name content, see the vertical-specific accuracy benchmarks in the medical training post and the HazCom training post.
Generation throughput planning: Whisper-large via the OpenAI API processes roughly in real time. A 60-minute video takes approximately 60 minutes of API processing time across chunked segments, so single-threaded generation for a 500-hour Tier 1+2 back catalogue takes 500 hours of wall-clock time.

Parallelise with 10–20 concurrent Whisper API calls. At 10 concurrent calls, 500 hours of audio takes approximately 50 hours of wall-clock time (2 days). OpenAI Whisper API rate limits (audio seconds per minute) apply, so check your current account limits before choosing a concurrency level.

Where cost is a constraint, route short clips (< 5 minutes) through a smaller, less expensive Whisper model (hosted or open-source). Reserve Whisper-large for long-form content, where accuracy differences are most visible.
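The throughput arithmetic above fits in a small helper. It assumes the roughly real-time processing rate stated here (realtime_factor = 1.0). It also adds an optional floor for the longest single asset, on the assumption that one asset's chunks may be processed sequentially. Both are planning approximations, not measured figures.

```python
def generation_wall_clock_hours(audio_hours: float, concurrency: int,
                                realtime_factor: float = 1.0,
                                longest_asset_hours: float = 0.0) -> float:
    """Rough wall-clock estimate for transcription spread over parallel API calls.

    realtime_factor: processing hours per audio hour (the text above assumes ~1.0).
    longest_asset_hours: lower bound if one asset's chunks run one after another.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return max(audio_hours * realtime_factor / concurrency,
               longest_asset_hours * realtime_factor)
```

generation_wall_clock_hours(500, 10) gives the 50-hour figure above. generation_wall_clock_hours(29, 20) gives 1.45 hours, about 87 minutes.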
Phase 4: Normalise
The normalisation phase is the most operationally critical and the most commonly skipped. Generated caption files need to be normalised before upload because the output format of the ASR pipeline may not match the exact format requirements of the target platform. Common normalisation failures:
- UTF-8 BOM: Strip the BOM from SRT files generated by Windows-based tools. Python: content = content.lstrip('\ufeff')
- Windows line endings: Convert CRLF to LF. Python: content = content.replace('\r\n', '\n')
- SRT timing decimal separator: Ensure commas, not periods. Python: in timing lines only, replace "." with "," using a regex that matches the HH:MM:SS pattern.
- VTT header: Ensure the file begins with WEBVTT\n\n. The WebVTT specification requires the header line to be followed by a blank line (two newlines). A single newline produces a parse error in some VTT parsers.
- Cue duration: Split cues longer than the platform's maximum (7 seconds for Absorb, effectively unlimited for Kaltura and Vimeo). Split at a word boundary. Do not split mid-word, or at a punctuation mark that would leave the first half of the split as a sentence fragment.
- Empty cue text: Remove cues with empty or whitespace-only text. Some ASR pipelines produce these for silent audio segments. They are harmless in playback but can trigger format validators on some platforms.
- Sequence number gaps: Re-number sequences to be contiguous starting at 1 after any cue-splitting or cue-removal operations. Some TalentLMS player builds reject SRT files with non-contiguous sequence numbers.
- Special characters: In SRT files, HTML-encode ampersands (&amp;), angle brackets (&lt;, &gt;), and smart quotes. Do not apply general HTML encoding to VTT cue text. Write smart quotes and other special characters as literal UTF-8, because some VTT players render HTML entities literally rather than decoding them. The exception is a literal "<": the WebVTT parser reads it as the start of a cue tag, so it must be written as &lt;.
Build the normalisation step as a function that takes a raw caption string and returns a normalised string, not as a file-to-file transform, so it can be composed into the upload pipeline without intermediate file I/O. Validate the normalised output against a format-specific schema (regex-based for SRT, W3C WebVTT parser for VTT) before passing it to the upload step. Log any cues that were modified during normalisation with their original and modified forms — this log is the traceability artifact for caption quality review.
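Here is a minimal sketch of such a string-to-string normaliser. It covers the BOM, line-ending, timing-separator, header, empty-cue, and renumbering fixes from the list above. Its limits:
- It assumes timing lines include hours. WebVTT also allows MM:SS.mmm, which this sketch does not parse.
- It drops VTT STYLE, REGION, and NOTE blocks.
- It leaves cue splitting and character encoding to separate steps.

```python
import re

# Timing line with hours present, e.g. "00:01:02,500 --> 00:01:05.250 align:start".
# Accepts "," or "." before the milliseconds so SRT/VTT separator mix-ups are repaired.
TIMING = re.compile(
    r"^(\d{2,}:\d{2}:\d{2})[.,](\d{3})\s*-->\s*(\d{2,}:\d{2}:\d{2})[.,](\d{3})(.*)$"
)


def _clean(raw: str) -> str:
    """Strip a leading BOM and convert CRLF (and lone CR) line endings to LF."""
    return raw.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")


def _cues(text: str) -> list[tuple[str, str, str, str]]:
    """Parse (start, end, settings, body) cues; drops headers, cue numbers and empty cues."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if not match:
                continue  # header, NOTE text, SRT sequence number or VTT cue identifier
            body = "\n".join(ln.rstrip() for ln in lines[i + 1:]).strip()
            if body:  # whitespace-only cues are removed
                cues.append((f"{match.group(1)}.{match.group(2)}",
                             f"{match.group(3)}.{match.group(4)}",
                             match.group(5).rstrip(), body))
            break
    return cues


def to_srt(raw: str) -> str:
    """Normalised SRT: comma separators, contiguous numbering from 1, LF endings."""
    blocks = []
    for n, (start, end, _settings, body) in enumerate(_cues(_clean(raw)), start=1):
        blocks.append(f"{n}\n{start.replace('.', ',')} --> {end.replace('.', ',')}\n{body}\n")
    return "\n".join(blocks)


def to_vtt(raw: str) -> str:
    """Normalised VTT: WEBVTT header plus blank line, period separators, LF endings."""
    blocks = [f"{start} --> {end}{settings}\n{body}\n"
              for start, end, settings, body in _cues(_clean(raw))]
    return "WEBVTT\n\n" + "\n".join(blocks)
```

Because both renderers share one parser, the same source text can feed SRT endpoints through to_srt and VTT endpoints through to_vtt.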
Phase 5: Upload
The upload phase executes the platform-specific API calls described in the platform sections above. Structure the upload loop as a retry-with-backoff pattern: on a 5xx error, wait 30 seconds and retry up to 3 times before logging the asset as failed and continuing. On a 4xx error (client error), log the asset as failed with the error response body and do not retry — a 4xx error usually indicates a format problem (which needs human inspection) or an authentication problem (which blocks the entire batch). On a successful upload, write the platform, asset ID, caption asset ID (returned by the platform), upload timestamp, language code, and file hash to a structured upload log.
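Here is a sketch of that retry policy. The send callable is a hypothetical stand-in for any platform upload call that returns an HTTP status and body. The delay is fixed at 30 seconds as described, and sleep is injectable so the loop can be tested without waiting.

```python
import time
from typing import Callable, Tuple

# Hypothetical transport: performs one upload attempt, returns (HTTP status, body).
Send = Callable[[], Tuple[int, str]]


def upload_with_retry(send: Send, retries: int = 3, wait_s: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry 5xx responses up to `retries` times; never retry 4xx (needs a human)."""
    attempts = 0
    while True:
        attempts += 1
        status, body = send()
        if 200 <= status < 300:
            return {"ok": True, "status": status, "attempts": attempts, "body": body}
        if status >= 500 and attempts <= retries:
            sleep(wait_s)  # transient server error: wait, then try again
            continue
        # 4xx, any other unexpected status, or retries exhausted: log as failed, move on.
        return {"ok": False, "status": status, "attempts": attempts, "body": body}
```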
The file hash (SHA-256 of the normalised caption file content) is important for idempotency verification: if the upload pipeline is re-run (due to a crash, a rate limit hit, or a reprocessed batch), the hash comparison tells you which assets were already uploaded successfully and which need to be retried. Do not rely on the platform's "has captions" flag alone for re-run idempotency — some platforms set this flag optimistically (before the caption file has finished processing) and then clear it if the processing fails, which can lead to assets being treated as "done" when the caption file is actually corrupt or missing.
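Here is a sketch of the hash check, keyed by (platform, asset ID). The dictionaries stand in for whatever structured log store the pipeline uses. An asset is re-queued if it has no logged hash, or if its normalised caption content has changed since the logged upload.

```python
import hashlib


def caption_sha256(normalised: str) -> str:
    """SHA-256 of the normalised caption text, as recorded in the upload log."""
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def pending_uploads(batch: dict[tuple[str, str], str],
                    upload_log: dict[tuple[str, str], str]) -> list[tuple[str, str]]:
    """(platform, asset_id) keys whose current caption hash is not logged as uploaded."""
    return sorted(key for key, text in batch.items()
                  if upload_log.get(key) != caption_sha256(text))
```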
Phase 6: Verify
The verification phase confirms that uploaded captions are accessible and correctly rendered. Verification has two levels: programmatic verification (the API reports a caption track as present and READY) and human spot-check (a person plays a sample of videos in the LMS player and confirms that captions appear on screen, are correctly synchronised, and contain the expected vocabulary).
For programmatic verification, re-run the inventory API calls post-upload and confirm that every Tier 1 and Tier 2 asset now has a human-quality caption record in the platform's metadata. Cross-reference the upload log against the new inventory to identify any assets where the upload succeeded (log shows HTTP 200 or equivalent) but the platform does not report a caption asset (which indicates the caption was accepted but processing failed — the silent failure mode described above).
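The cross-reference is a set comparison. This sketch assumes two hypothetical inputs:
- Upload-log rows with platform, asset_id, and status fields.
- A post-upload inventory that maps (platform, asset_id) to the caption status each platform reports, normalised to "READY" when processing is complete.

```python
def silent_failures(upload_log: list[dict],
                    inventory: dict[tuple[str, str], str]) -> list[tuple[str, str]]:
    """Assets logged as uploaded (2xx) that the re-run inventory does not report as READY."""
    uploaded = {(row["platform"], row["asset_id"])
                for row in upload_log if 200 <= row["status"] < 300}
    return sorted(key for key in uploaded if inventory.get(key) != "READY")
```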
For human spot-check, sample at a 10% rate across the Tier 1 assets (or 20 assets, whichever is larger). For each sampled asset, confirm: caption track appears in the player UI; captions are synchronised to within 1 second of the audio; proper nouns in the first 5 minutes are correctly rendered (check against the glossary); and special characters (regulatory citation formats, chemical names, version strings) are displayed correctly rather than as HTML entities or encoding artifacts. Document the spot-check results — this is the compliance evidence artifact that demonstrates the retrofit pipeline produced accessible captions, not just caption files.
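Here is a sketch of the sample-size rule: 10% of Tier 1 or 20 assets, whichever is larger, capped at the tier size. It uses a seeded generator so the sample can be reproduced in the compliance record. Integer arithmetic rounds the 10% up.

```python
import random
from typing import Optional, Sequence


def spot_check_sample(asset_ids: Sequence[str], rate_pct: int = 10, minimum: int = 20,
                      seed: Optional[int] = None) -> list[str]:
    """Pick max(rate_pct% rounded up, minimum) assets, capped at the number available."""
    total = len(asset_ids)
    n = min(total, max((total * rate_pct + 99) // 100, minimum))
    return random.Random(seed).sample(list(asset_ids), n)
```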
xAPI and SCORM: caption-interaction reporting
For organisations where compliance documentation of caption use matters — regulated-industry training under HIPAA, OSHA, FDA 21 CFR Part 11, or Section 508 federal contracts — the question of whether a learner actually received the captioned version of the training (not just that a caption track exists on the asset) is increasingly relevant in audit contexts. xAPI (Experience API, formerly Tin Can) and SCORM provide two different answers to this question.
SCORM tracks completion and score at the course level. It does not track caption-interaction events natively.

A SCORM course can embed custom JavaScript that writes caption-status metadata through the SCORM runtime API and commits it (LMSCommit in SCORM 1.2, Commit in SCORM 2004). That metadata could record whether captions were enabled during playback and what percentage of playtime had captions enabled. However, this requires custom authoring-tool development and is not available in standard SCORM packages.

For SCORM-based compliance training, the standard documentation artifact is caption availability (the file exists and was uploaded), not caption interaction.
xAPI has a richer event vocabulary. The Kaltura xAPI Integration can emit caption-viewed events when a learner enables captions during Kaltura playback. The statement structure for a Kaltura caption interaction event is:
{
"actor": { "objectType": "Agent", "account": { "homePage": "https://{your-lms-domain}", "name": "learner@org.com" } },
"verb": { "id": "http://adlnet.gov/expapi/verbs/interacted" },
"object": { "id": "https://kaltura.org/entries/{entry_id}" },
"context": {
"extensions": {
"https://w3id.org/xapi/video/extensions/cc-enabled": true,
"https://w3id.org/xapi/video/extensions/cc-subtitle-lang": "en-US"
}
}
}
Capturing and storing these events in an LRS (Learning Record Store) provides a learner-level record that captions were enabled during a specific viewing session. For regulated-industry clients whose compliance documentation requirements extend to evidence of accessible delivery (not just accessible content), configuring Kaltura's xAPI integration with an LRS endpoint is worth the implementation effort. Docebo's learning record store integration (Docebo LRS) can serve as the LRS endpoint for Kaltura xAPI events if the organisation uses Docebo as its primary LMS — the integration is configured through the Docebo Connect marketplace.
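Here is a sketch, using only the standard library, of building such a statement and the request that posts it to an LRS statements resource. It assumes Basic authentication with LRS credentials, and it uses the verb and extension IRIs published in the xAPI Video Profile. Check the statements your Kaltura integration actually emits, since they may differ. The homePage and entry URL values are placeholders.

```python
import base64
import json
import urllib.request


def caption_statement(name: str, home_page: str, entry_url: str, lang: str) -> dict:
    """xAPI statement recording that captions were enabled during playback."""
    return {
        "actor": {"objectType": "Agent", "account": {"homePage": home_page, "name": name}},
        "verb": {"id": "http://adlnet.gov/expapi/verbs/interacted",
                 "display": {"en-US": "interacted"}},
        "object": {"objectType": "Activity", "id": entry_url,
                   "definition": {"type": "https://w3id.org/xapi/video/activity-type/video"}},
        "context": {"extensions": {
            "https://w3id.org/xapi/video/extensions/cc-enabled": True,
            "https://w3id.org/xapi/video/extensions/cc-subtitle-lang": lang,
        }},
    }


def lrs_post_request(endpoint: str, key: str, secret: str,
                     statement: dict) -> urllib.request.Request:
    """POST to {endpoint}/statements; xAPI requires the version header on every request."""
    auth = base64.b64encode(f"{key}:{secret}".encode()).decode()
    return urllib.request.Request(
        endpoint.rstrip("/") + "/statements", data=json.dumps(statement).encode(),
        method="POST",
        headers={"Authorization": f"Basic {auth}", "Content-Type": "application/json",
                 "X-Experience-API-Version": "1.0.3"})
```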
For Panopto, Vimeo, and Wistia, caption-interaction xAPI events are not natively available — these platforms do not emit xAPI statements for caption enable/disable events in their standard player implementations. For these platforms, caption availability is the documentation standard.
The engineering playbook in practice: a worked example
Consider a 280-employee mid-market healthcare-adjacent SaaS organisation (similar to the RFP scenario in the captioning RFP playbook). The organisation uses Docebo as its primary LMS, Panopto for all video production and lecture capture, and Wistia for the customer-facing product-training portal. The compliance trigger is an ADA Title II assessment that identified 340 videos across all three platforms. Of these, 127 have no captions and 89 have machine-generated captions from each platform's built-in ASR (accuracy 84–88% on pharmacology and product-terminology content).
The inventory phase takes 3 hours: one hour for Docebo (310 video assets across 28 active courses, 12 archived courses), one hour for Panopto (178 sessions across 22 folders), one hour for Wistia (94 media assets in 6 channels). Cross-reference shows 340 unique assets total after de-duplication (some Panopto sessions are embedded in Docebo courses via LTI — these are managed at the Panopto layer, not the Docebo API layer). Triage assigns 62 assets to Tier 1 (urgent), 104 to Tier 2, 118 to Tier 3, 56 to Tier 4.
The generation phase runs 62 Tier 1 assets through GlossCap's glossary-biased pipeline with a 340-term glossary (drug names, product terminology, regulatory citations). At 20 concurrent Whisper API calls, the 62 assets (average duration 28 minutes, total 29 hours of audio) complete in approximately 87 minutes of wall-clock time. The post-processing step resolves 11 assets with embedded-numeral format issues (NDC codes, ICD-10 codes) that required the post-processing rule layer beyond glossary biasing.
The normalisation phase runs in 4 minutes for the 62 assets. It finds and fixes:
- A UTF-8 BOM in 7 assets, introduced by the audio extraction tool run on Windows machines.
- CRLF line endings in 12 assets.
- Cues over 7 seconds in 3 assets, which are split.

Output: a clean VTT and a clean SRT file for each of the 62 assets. VTT is used for Docebo. SRT is used for Panopto, which accepts SRT only at its caption endpoint, and for Wistia, which accepts both formats (SRT is used there for the convenience of the pipeline's SRT output).
The upload phase takes 45 minutes: 38 assets to Panopto (via the session caption API), 16 assets to Docebo (via the video asset subtitle API — some Docebo assets are already on Panopto), 8 assets to Wistia. Two Panopto uploads fail with a 500 error on first attempt, succeed on retry. One Wistia upload fails with a 422 error (language code en was passed incorrectly; corrected to eng and re-run). All 62 assets uploaded successfully after retries. Upload log written.
The verification phase takes 2 hours: programmatic inventory re-run confirms 62 READY caption assets across all three platforms. Human spot-check of 10 assets (16%) across clinical pharmacology, product onboarding, and regulatory compliance content confirms synchronisation within 1 second, all proper nouns correctly rendered, no HTML entity rendering errors. Compliance documentation package assembled: triage manifest, upload log, spot-check results.
The entire Tier 1 retrofit, from first API call to completed verification documentation, takes approximately 7 hours of elapsed time (3 hours inventory, 1.5 hours generation, 0.8 hours normalise + upload, 2 hours verify). The Tier 2 batch (104 assets) follows the same pattern over the next 2 weeks. Total compliance documentation time: under 30 hours for the full 166 Tier 1+2 assets.
Common failure modes and their diagnostics
| Symptom | Platform | Root cause | Fix |
|---|---|---|---|
| First cue shows as "1" in player | TalentLMS, Absorb | UTF-8 BOM at start of SRT file | Strip BOM before upload |
| File accepted, no captions appear in player | TalentLMS | VTT file uploaded to SRT-only endpoint (WEBVTT header line) | Convert to SRT before upload |
| API returns 422 on subtitle creation | Docebo | Bare ISO 639-1 language code (en instead of en-US) | Use BCP-47 region-qualified tag |
| API returns 422 on caption upload | Wistia | BCP-47 tag passed (en or en-US) instead of ISO 639-2/T three-letter code | Use eng, fra, deu etc. |
| Pre-signed URL upload returns 400 | Vimeo | Multipart form data sent to pre-signed URL instead of raw VTT body | Send plain VTT content as request body, not multipart |
| Captions off by constant delta (e.g. 200ms fast) | Any | Audio extraction added a header silence that shifts all timing; or pipeline used a different sample rate reference than the platform's player | Add a timing offset correction pass; verify with a known-good reference segment |
| Lines run together (no clear/display cycle) | Absorb | CRLF line endings — cue boundaries parsed incorrectly | Normalise to LF before upload |
| Kaltura caption status stuck at QUEUED after 5 minutes | Kaltura | File exceeds 10MB (very long video), or file encoding issue prevents processing | Check file size; re-encode as UTF-8 LF; re-upload |
| Caption track appears in LMS but not in embedded Panopto player | Panopto via LTI | Caption uploaded to LMS (e.g. Docebo subtitle API) instead of Panopto session caption API — LTI embed ignores LMS-layer captions | Upload caption to Panopto session, not to Docebo video asset |
| Special characters show as HTML entities (e.g. &amp;) in player | Docebo (VTT) | HTML-encoded entities in VTT cue text that the player renders literally | Decode HTML entities in the VTT normalisation step; encode only for SRT |
Seven-day engineering sprint plan
For an L&D operations engineer or a technical L&D lead starting a bulk caption retrofit from scratch, a realistic 7-day sprint plan:
Day 1: Set up API authentication for all platforms in scope. Run the inventory API calls manually (not yet automated) for each platform and export to a CSV. Classify each asset by platform, caption surface, and current caption status. Identify SCORM-wrapped video assets that need the repack track. Total output: inventory manifest with ~200–500 rows depending on catalogue size.
Day 2: Triage the inventory into Tier 1–4. Prioritise by compliance risk (accommodation requests, active regulatory audit, upcoming Joint Commission survey, ADA/EAA enforcement exposure). Build the compliance-documentation package: triage rationale, remediation schedule by tier, responsible parties. Share with legal/compliance for sign-off on the triage criteria.
Day 3: Build the normalisation function. Write unit tests for each failure mode in the table above. Test against known-bad files: BOM-prefixed SRT, CRLF SRT, VTT without WEBVTT header, SRT with period timing separator, VTT with HTML entities. Confirm the normalised output passes format validation for each target platform.
Day 4: Build the upload wrappers for each platform (TalentLMS, Docebo, Panopto, Vimeo, Wistia — whichever are in scope). Test with a single Tier 1 asset per platform. Verify in the platform's player UI that the caption appears, is synchronised, and handles the vocabulary correctly. Fix any format issues found during this single-asset test before scaling to the full batch.
Day 5: Run generation for all Tier 1 assets. Build the glossary from the organisation's product glossary, the compliance-regulation vocabulary for the relevant vertical (HIPAA/OSHA/FDA/FERPA), and any domain-specific proper-noun lists from the L&D team. Review 10% of generated captions manually before proceeding to upload.
Day 6: Run normalise + upload for all Tier 1 assets. Monitor the upload log in real time. Handle retry logic for transient API failures. Confirm the upload log shows 100% of Tier 1 assets with a READY status post-upload.
Day 7: Run the verification phase: programmatic re-inventory + human spot-check. Assemble the compliance documentation package: triage manifest, upload log, spot-check results. Brief the compliance team. Begin planning the Tier 2 batch for the following sprint.
Frequently asked questions
Can I use the same caption file across multiple platforms if the same video is hosted on both Docebo and Wistia?
Yes — the normalisation step should produce format-appropriate versions from the same source transcript. Keep the source transcript (timing + text, format-neutral) as the canonical artifact, then derive the platform-specific SRT (for Panopto, TalentLMS, Absorb) and VTT (for Docebo, Vimeo) versions from it. Do not try to maintain separate source files per platform — transcript drift across platform copies is a quality-maintenance problem that compounds as the course content evolves.
Our Docebo courses embed Vimeo videos via LTI. Should I upload captions to Docebo or Vimeo?
Upload to Vimeo. When a Vimeo video is embedded in a Docebo course (or any other LMS) via the iframe embed code, caption management is at the Vimeo layer — the Docebo subtitle API has no effect on an iframe-embedded Vimeo player. The Vimeo text track API is the correct integration point. The same rule applies to Panopto-via-LTI: manage captions at Panopto, not at the LMS that embeds the Panopto player.
How do I handle back-catalogue videos in SCORM packages? The authoring tool is Articulate Storyline 360.
For Storyline 360, the caption track is embedded in the published SCORM package as a WebVTT file at a path like story_content/subtitles/subtitles_en.vtt (for single-language) or story_content/subtitles/subtitles_en_US.vtt (for multi-language published content). To retrofit captions: (1) download the published SCORM package ZIP from the LMS; (2) unzip; (3) locate the existing VTT sidecar or the absence of one; (4) generate corrected VTT; (5) replace the sidecar file or add it at the correct path per the Storyline manifest structure; (6) re-zip and re-upload to the LMS. This is a manual operation per SCORM package. Storyline 360 Source Files (.story extension) can alternatively be re-exported from the source file with the corrected VTT — if source files are available in the authoring team's repository, this is the cleaner path. If source files are not available (common for vendor-developed compliance content), the ZIP-and-replace approach works but requires maintaining the patched SCORM package separately from the vendor's distribution.
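Here is a sketch of steps (3) to (6) for a single package, using Python's zipfile module. It copies every other entry unchanged and writes the new VTT at the given path. The path in the example is the one quoted above. The sketch does not update any Storyline player data or manifest entry that may need to reference a newly added file.

```python
import io
import zipfile


def replace_sidecar(package: bytes, sidecar_path: str, vtt_text: str) -> bytes:
    """Return a new SCORM ZIP with sidecar_path replaced (or added) with vtt_text."""
    src = zipfile.ZipFile(io.BytesIO(package))
    out = io.BytesIO()
    with src, zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename != sidecar_path:
                dst.writestr(info, src.read(info.filename))  # copy other entries as-is
        dst.writestr(sidecar_path, vtt_text.encode("utf-8"))
    return out.getvalue()
```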
We have 1,200 videos on Kaltura. Is there a way to batch-submit them all to REACH rather than running the GlossCap pipeline locally?
Yes — the Kaltura REACH bulk submission API (reach_entryVendorTask.addBulkUploadJob) accepts a list of entry IDs and creates REACH tasks for all of them in a single API call. However: REACH AI captioning will produce the same accuracy failures as any generic ASR service on domain-specific vocabulary. For a 1,200-video catalogue that includes technical, medical, or regulatory content, the batch REACH submission will give you caption files in 24–48 hours but the accuracy on proper nouns will require a second-pass correction run. The hybrid approach — batch REACH for general-vocabulary content (meetings, general presentations, soft-skills training), GlossCap pipeline for domain-specific content (clinical, compliance, engineering, product training) — is usually the right economic choice. Use the triage classification to route each asset to the correct generation path.
How does caption ingestion interact with Section 508 and VPAT documentation?
The VPAT (Voluntary Product Accessibility Template) for a learning management system documents the LMS platform's accessibility conformance — not the content hosted in it. The LMS vendor provides the VPAT. Your organisation's Section 508 obligation as a federal contractor or federal agency is to ensure that the content you host in the LMS meets the 508 criteria (specifically WCAG 2.0 Level A and AA, which includes SC 1.2.2 for captions). Uploading captions via the ingestion workflow described in this post satisfies the content-layer obligation. The LMS's VPAT addresses the platform layer (accessible player controls, keyboard navigation, screen-reader compatibility for the LMS UI). Both layers need to be in order for a full 508 attestation — a VPAT from the LMS vendor does not cover caption quality on the content you load into the platform.
We are switching from Docebo to Cornerstone OnDemand. How should we handle caption migration?
The caption migration from any LMS to Cornerstone OnDemand (or any other platform) follows the same inventory → normalise → upload pattern. The source platform (Docebo) provides a caption download API (GET on the subtitle content endpoint) that retrieves the VTT file. Normalise from VTT to the target platform's format (Cornerstone OnDemand accepts SRT). Upload via the target platform's lesson/content API. The transition is also an opportunity to audit caption quality: check generation timestamps against when the glossary was established, and re-run any captions generated before the glossary was available through the corrected pipeline rather than carrying forward low-accuracy captions into the new platform. The captioning RFP playbook includes a vendor-transition section that covers the LMS migration context in more detail.
Is it worth building and maintaining this pipeline internally, or should we outsource the bulk retrofit to a captioning vendor?
The answer depends on two things: catalogue velocity (how many new videos are added per month) and how domain-specific the vocabulary is.

Building the pipeline and maintaining the glossary is economical for a catalogue that grows by more than 20 hours of video per month and has a stable domain vocabulary (the product glossary, the regulatory citation set, the medical terminology). At that volume, the per-hour cost of glossary-biased captioning runs 40–70% lower than human-review-plus-delivery vendor pricing. The pipeline is also the compliance documentation infrastructure: the upload log, the normalisation audit trail, and the spot-check results are artifacts you own and maintain.

For a small, slow-growth catalogue (under 5 hours/month) with general-vocabulary content, outsourcing is often more cost-effective than building an internal pipeline. Use a vendor with a bulk-upload UI (Verbit, 3Play, Rev at the managed-service tier) and request delivery in SRT or VTT for LMS upload. See the GlossCap demo for the glossary-biased approach in action.
Further reading
Platform reference pages
- TalentLMS captions — subtitle surfaces, API, and compliance workflow
- Docebo captions — subtitle track API and Coach & Share caption path
- Absorb LMS captions — Captions tab, Infuse player, bulk retrofit options
- Kaltura captions — caption asset API, REACH integration, xAPI events
- Panopto captions — session caption API, ASR override, LTI embedding
- Vimeo captions — text tracks API, VTT format, training-video use cases
- Wistia captions — captions API, ISO 639-2 codes, engagement analytics
Format reference pages
- SRT captions — format specification, platform compatibility, common errors
- VTT (WebVTT) captions — W3C specification, cue settings, LMS compatibility
- TTML captions — SCORM packaging, DFXP, authoring tool output
Compliance and workflow context
- WCAG 2.1 AA captions — SC 1.2.2 and the 99% accuracy standard
- WCAG SC 1.2.2 Captions (Prerecorded) — the exact requirement
- Section 508 captions — federal contractor and agency obligations
- Compliance training video captions — OSHA, HIPAA, FERPA, ADA
- Why 99% caption accuracy matters — the DCMP standard with real training-video examples
- The hidden half-FTE in your L&D budget — the labour-cost case for pipeline investment
- Glossary-biased captioning — the Whisper implementation that drives the generation step
- The captioning RFP playbook — how to evaluate and select a captioning vendor for bulk work