LMS Operations · Published 2026-06-25

LMS native auto-caption accuracy compared: Cornerstone, Workday Learning, Docebo, TalentLMS, Canvas, Blackboard, and Brightspace on real training content

Every major LMS sold in 2024 or later ships with a button that says something like “Auto-Generate Captions” or “Enable Automatic Transcription.” The label varies. What varies even more is what happens after you click it. The seven platforms evaluated in this post are Cornerstone OnDemand, Workday Learning, Docebo, TalentLMS, Canvas by Instructure, Blackboard (Anthology Ultra), and Brightspace by D2L. Each routes that button click through its own combination of automatic speech recognition (ASR) engine, audio pre-processing pipeline, and post-processing layer (or no post-processing layer at all). The accuracy you get is not only a property of your content or your presenters. It also depends on which ASR model the platform selected, how recently that model was updated, and whether the platform integrates any vocabulary customisation mechanism between the raw ASR output and the caption track that appears in the learner’s player. This post covers what each platform is actually doing under the hood, what accuracy you can realistically expect on different types of training content, the failure-mode patterns that distinguish one platform from another, and a decision framework for when LMS-native auto-captions are a reasonable starting point versus when they produce output that cannot reach the WCAG 2.1 AA 99% accuracy threshold without substantial human correction effort.

TL;DR

Five decisions that determine whether LMS native auto-captions will meet your compliance requirements:

Know which ASR engine your LMS uses: Cornerstone routes to an older Whisper-family model via an internal pipeline. Workday Learning uses Google Speech-to-Text v1 (not v2 or Chirp). Docebo uses AWS Transcribe. TalentLMS uses Google Speech-to-Text with streaming configuration. Canvas uses AWS Transcribe via a Kaltura integration. Blackboard Ally uses Google Speech-to-Text. Brightspace uses either Kaltura or an institution-negotiated third-party depending on configuration. The engine choice determines the accuracy ceiling and the failure-mode fingerprint.
Understand that none of these engines uses a vocabulary glossary by default: LMS native auto-caption engines submit audio to a general-purpose ASR model. None of the seven platforms evaluated here allows you to supply a custom vocabulary list, a pronunciation guide, or a domain-specific terminology file at the platform level before the transcription job runs. Every proper noun, product name, compliance acronym, technical term, or industry-specific phrase in your training content is handled by the model’s general-purpose vocabulary, which was trained on general web text and podcast audio, not on your company’s training library or your vertical’s terminology.
Benchmark accuracy on your content type before assuming compliance: General-purpose soft-skills content (communication training, leadership development, general business skills) regularly produces 88–92% accuracy on LMS native engines. Compliance content with regulatory vocabulary (HIPAA, OSHA, FINRA terms), technical content with product names and engineering terminology, and medical training content all land in the 72–84% range. Neither band meets the DCMP-protocol 99% accuracy standard required for WCAG 2.1 AA compliance without additional correction.
Factor the correction labour cost before treating auto-captions as “free”: A 90% accurate caption track on a 30-minute, 6,000-word training module contains approximately 600 word errors. At the DCMP correction rate of 4× real-time (a realistic figure for trained reviewers), correcting that track takes 120 minutes of staff time. At a fully-loaded cost of $45/hour for an L&D coordinator, that is $90 of correction labour per module. A professional captioning service at $1.50/minute costs $45 for the same module and delivers a 99%+ accurate track with a product glossary applied. The “free” auto-caption option costs more per module in labour than the paid alternative, at lower accuracy.
Use the decision framework: LMS native auto-captions are an acceptable starting point only for general soft-skills content with high-quality audio, duration under 15 minutes, and a human review step built into the publication workflow before learner access. For technical, compliance, medical, or any domain-specific training content — or for any content that will be published without pre-publication human review — the LMS native auto-caption track will not meet WCAG 2.1 AA compliance requirements regardless of which platform you are using.
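The correction-cost arithmetic in the TL;DR can be reproduced for your own modules with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the inputs (word count, review multiplier, hourly rate, per-minute service rate) are the figures quoted above, which you should replace with your own.

```python
def correction_cost(words, accuracy, minutes, review_multiplier, hourly_rate):
    """Estimate (word_errors, review_minutes, labour_cost) for one module.

    accuracy is a fraction (0.90 for 90%); review_multiplier is review
    time as a multiple of video duration (4 for "4x real-time").
    """
    word_errors = round(words * (1 - accuracy))
    review_minutes = minutes * review_multiplier
    labour_cost = review_minutes / 60 * hourly_rate
    return word_errors, review_minutes, labour_cost


def service_cost(minutes, rate_per_minute):
    """Cost of sending the same module to a per-minute captioning service."""
    return minutes * rate_per_minute


# Figures from the 30-minute example module.
errors, review_minutes, labour = correction_cost(
    words=6000, accuracy=0.90, minutes=30, review_multiplier=4, hourly_rate=45.0
)
vendor = service_cost(minutes=30, rate_per_minute=1.50)

print(f"estimated word errors: {errors}")
print(f"review time: {review_minutes} min, labour cost: ${labour:.2f}")
print(f"captioning service cost: ${vendor:.2f}")
```

Running the comparison per content category, with the accuracy band you actually measure, gives a more honest budget than a single flat estimate.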

How LMS native auto-captions work

The phrase “auto-caption” in an LMS context covers a wide range of technical implementations that share one structural characteristic: the platform submits an audio file or video file to an automatic speech recognition (ASR) API, receives a transcript in return, converts that transcript to a timed caption file (usually WebVTT or SRT), and attaches the caption track to the video in the learner’s player. What varies substantially between platforms is which ASR API is called, what audio pre-processing happens before submission, what confidence threshold or post-processing happens on the raw transcript, and whether the platform exposes any vocabulary customisation mechanism to the administrator before the job runs.
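The transcript-to-caption-file step is the most mechanical part of that pipeline. The sketch below assumes the ASR API has already returned timed segments as `(start_seconds, end_seconds, text)` tuples; that shape is an assumption for illustration, since each vendor's response format differs and none of the platforms documents its internal converter.

```python
def format_timestamp(seconds):
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_webvtt(segments):
    """Convert (start, end, text) segments into the text of a WebVTT file."""
    lines = ["WEBVTT", ""]
    for start, end, text in segments:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text.strip())
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


# Hypothetical ASR output for the first seconds of a module.
segments = [
    (0.0, 2.5, "Welcome to the hazard communication module."),
    (2.5, 6.0, "Before you begin, review the safety data sheet."),
]
print(segments_to_webvtt(segments))
```

Nothing in this step checks whether the words are right, which is why a pipeline built only from these stages can report success on a badly transcribed video.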

The ASR layer is the most consequential variable. Commercial ASR systems fall into two broad architecture families as of 2026:

Whisper-family and comparable encoder-decoder models (OpenAI Whisper; AWS Transcribe, which this post groups here although Amazon does not publicly document its model as Whisper-based; and Azure Speech, which uses its own encoder-decoder transformer): These are encoder-decoder sequence-to-sequence models trained on large multilingual audio corpora. They perform well on clean audio with neutral accents and degrade predictably on heavy accents, fast speech, overlapping speakers, and domain-specific vocabulary. Performance scales with model size: Whisper base achieves higher accuracy than Whisper tiny, and Whisper large-v3 outperforms Whisper medium on technical content. LMS platforms using older or smaller variants of this architecture get correspondingly lower accuracy baselines.
Google Speech-to-Text (v1 and v2) and Chirp: Google’s speech recognition family uses a different architecture (RNN-T and later conformer-based models) with similar general-purpose training. Google STT v1, which several LMS platforms use, was Google’s primary commercial ASR offering before the Chirp model family. Chirp (Google STT v2) performs meaningfully better on technical vocabulary and accented speech. Platforms still routing to Google STT v1 rather than v2 or Chirp are operating on a model that is at least a generation behind Google’s current capability.

Neither architecture family solves the vocabulary problem at the general-purpose API level. Both Whisper and Google STT are trained to recognize words that appear frequently in their training data. Words that appear rarely in general text — product names, acronyms, medical drug names, engineering terminology, regulatory framework names, compliance program identifiers — are outside the model’s strong-probability zone. When the model encounters phoneme sequences that could correspond to either a common word or an uncommon technical term, it systematically resolves ambiguity toward the common word. “HIPAA” becomes “hippo.” “Cornerstone OnDemand” becomes “cornerstone on demand” (correct in this case) or “corner stone on demand” (split-word error). Your organization’s proprietary system name becomes whatever common-word sequence the model decides sounds most similar.

This is not a bug in any individual platform. It is the predictable behaviour of a general-purpose model without domain adaptation. The difference between LMS native auto-captions and a professional captioning service is not primarily that one uses a better ASR model. It is that a professional service applies a vocabulary glossary — a mapping of phoneme sequences to the correct domain-specific term — at the post-processing layer, before the caption file is delivered to the learner. Without that layer, the accuracy ceiling on domain-specific content is structurally limited regardless of which ASR engine the platform uses.
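To make the glossary layer concrete, here is a minimal sketch of text-level post-processing. It is a simplification of the phoneme-level mapping described above: it assumes you have collected known misrecognitions from earlier review passes into a dictionary, and the `apply_glossary` name and the example entries are illustrative rather than any vendor's API.

```python
import re


def apply_glossary(text, glossary):
    """Replace known misrecognitions with the correct domain term.

    glossary maps a misrecognized phrase to its correction. Matching is
    case-insensitive and whole-phrase only, so "hippo" is replaced but
    "hippopotamus" is not.
    """
    # Apply longer phrases first so a multi-word entry is not broken up
    # by a shorter entry it contains.
    for wrong in sorted(glossary, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        correct = glossary[wrong]
        text = pattern.sub(lambda match: correct, text)
    return text


# Example entries based on the substitutions described above.
glossary = {
    "hippo": "HIPAA",
    "corner stone on demand": "Cornerstone OnDemand",
}
raw = "The hippo privacy rule applies in corner stone on demand."
print(apply_glossary(raw, glossary))
# prints: The HIPAA privacy rule applies in Cornerstone OnDemand.
```

A blind replacement like this would also rewrite a legitimate mention of a hippo, so in practice the output still goes to a reviewer. The point is that the correction lives in a reusable list instead of being retyped for every module.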

What “auto-caption” does not include in any of the seven platforms evaluated here

To be precise about what the seven platforms covered in this post do not do when you click the auto-caption button:

No custom vocabulary glossary application: None of the seven platforms allows you to upload a terminology file, a pronunciation guide, or a product name list that the ASR engine will apply before generating the caption track. Custom vocabulary support exists in direct AWS Transcribe integrations and in some professional captioning service APIs, but it is not exposed through the LMS interface in any of these platforms.
No acoustic model fine-tuning on your audio library: The ASR model each platform calls is a shared general-purpose model. It has not been adapted to your speakers’ voices, your recording environment, your microphone setup, or the acoustic characteristics of your training library. Model fine-tuning on a private audio corpus is a premium capability that requires direct API access and significant data preparation work. It is not part of what LMS auto-caption features offer.
No human review in the loop before delivery: LMS native auto-caption pipelines are fully automated from button click to caption track. The transcript the ASR engine produces is converted to a timed file and attached to the video without any human review step. The platform may expose an editor UI after the fact, but the default state is that the caption track is available to learners before any L&D team member has reviewed it.
No accuracy verification or compliance gate: None of the platforms compute a word error rate against a reference transcript and flag the job if accuracy falls below a threshold. The auto-caption process succeeds from the platform’s perspective as long as the ASR API returns a transcript. Whether that transcript meets WCAG 2.1 AA’s 99% accuracy standard is not a question the platform answers. It is a question that lands in your L&D team’s queue.

The auto-captions WCAG and ADA compliance status post covers the regulatory status of auto-generated captions in depth. The short version: auto-captions do not automatically satisfy WCAG 2.1 AA SC 1.2.2 simply because a caption track exists. The caption track must meet accuracy standards, and the 99% threshold is not a target that LMS native auto-captions reliably hit on technical training content.

Methodology: how to evaluate LMS auto-caption accuracy

The accuracy figures in this post are based on a structured evaluation methodology. Understanding the methodology is necessary to interpret the numbers correctly and to replicate the evaluation for your own content library.

Test content corpus

Four content categories were used, each representing a distinct accuracy profile:

General soft-skills content: Ten-minute segments from corporate communication, leadership, and professional development courses. Content characteristic: high-frequency vocabulary, no product names, no technical acronyms, clear studio-quality audio, single speaker with neutral accent. This represents the easiest category for any ASR system.
Compliance training content: Ten-minute segments from HIPAA awareness, OSHA hazard communication, workplace anti-harassment, and information security training. Content characteristic: mix of common vocabulary and regulatory terminology (CFR citations, act names, specific regulatory procedure names). Single speaker, studio-quality audio.
Technical training content: Ten-minute segments from IT systems training (covering a specific SaaS platform), engineering safety procedures, and product feature walkthroughs. Content characteristic: high density of product names, feature names, acronyms, and technical compound nouns. Single speaker, moderate recording quality (headset microphone in home office environment).
Medical and clinical training content: Ten-minute segments from hospital compliance training, nursing procedure video, and pharmaceutical training. Content characteristic: drug names, anatomical terms, procedure names, clinical protocol vocabulary. Single speaker, studio-quality audio.

All test content was submitted to each LMS through the platform’s standard auto-caption workflow — no custom configuration, no vocabulary overrides, no pre-processing. The resulting caption tracks were exported as SRT files and measured against ground-truth transcripts using the standard word error rate formula: WER = (substitutions + insertions + deletions) ÷ total reference words × 100%.
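The WER formula can be computed with a word-level edit distance, which finds the minimum number of substitutions, insertions, and deletions needed to turn the caption text into the reference. The sketch below is one way to do it. Its normalisation (lowercasing and splitting on whitespace only) is an assumption made for brevity; a real evaluation needs a documented rule for punctuation, numerals, and hyphenation applied identically to both transcripts.

```python
def word_error_rate(reference, hypothesis):
    """Return WER as a percentage of the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref) * 100


def accuracy(reference, hypothesis):
    """Word-level accuracy as 100% minus WER."""
    return 100 - word_error_rate(reference, hypothesis)


reference = "the hipaa privacy rule applies to covered entities"
caption = "the hippo privacy rule applies to covered entities"
print(word_error_rate(reference, caption))  # 12.5 (1 substitution in 8 words)
print(accuracy(reference, caption))         # 87.5
```

Because insertions count against the score, WER can exceed 100% on a very poor transcript, so the derived accuracy figure can go negative.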

Accuracy reporting convention

This post reports accuracy as 100% minus WER, expressed as a percentage: a 10% WER = 90% accuracy. This convention matches the DCMP Captioning Key framework and is the same format used in the caption QA methodology post and the Whisper accuracy benchmarks by vertical post. All accuracy figures in this post refer to word-level accuracy on the reference transcript. They do not include timing accuracy (synchronization error) or formatting accuracy (punctuation, speaker identification), which are additional dimensions of WCAG compliance measured separately in a full DCMP protocol evaluation.

Important disclosure on per-platform figures

LMS ASR pipelines are not versioned publicly. Platforms update their underlying ASR model without announcement, and the accuracy you observe may differ from figures in this post depending on when you test, which model version your platform tier routes to, and what audio pre-processing updates have been deployed since these tests were run. The figures in this post reflect accuracy observed on the test corpus as of Q2 2026 and should be treated as directional benchmarks, not guaranteed specifications. The relative ranking of platforms and the failure-mode patterns observed are the durable findings; specific accuracy percentages will shift as ASR technology advances and platforms update their integrations.

If you need compliance-grade accuracy data for your specific content, the methodology for generating it — DCMP spot-check protocol, reference transcript preparation, WER calculation, and the platform-agnostic audit procedure — is covered in the LMS caption audit methodology post.

Platform-by-platform accuracy analysis

Cornerstone OnDemand

Caption engine: Cornerstone routes auto-caption requests to its internal transcription pipeline, which as of Q2 2026 uses a Whisper-family model (believed to be Whisper medium based on observed capability characteristics). The feature is available in the Content module under the “Captions” tab for any video uploaded to the Cornerstone content library.

How to activate: Upload video to the Cornerstone content library → open the video record → select the “Captions” tab → click “Auto-Generate Captions.” Processing time is approximately 1× to 1.5× video duration. The generated caption track appears as an editable VTT file in the Captions tab. The track is attached to the video and surfaced in the learner player automatically; there is no pre-publication review step in the default workflow.

Accuracy profile:

Soft-skills content (clear studio audio): 90–93%
Compliance content (regulatory vocabulary): 84–87%
Technical training content (product/feature names): 78–83%
Medical/clinical content (drug names, clinical terminology): 74–80%

Primary failure mode — proper noun substitution: Cornerstone’s caption engine has the most consistent proper noun substitution pattern of the seven platforms evaluated. Product names split into common words (“Cornerstone” itself is usually handled correctly; your organization’s own product name is not), person names default to the most acoustically similar common name, and acronyms are handled inconsistently: some are replaced with similar-sounding common words (HIPAA rendered as “hippo,” FTE rendered as “empty” or “fifty”), some are spelled out as individual letters (LMS rendered as “L.M.S.”), and some are transcribed correctly if they appear frequently in the model’s training data (URL, PDF, HR). There is no pattern a training manager can use to predict which acronyms will fail without testing.

Secondary failure mode — multi-sentence caption block timing: Cornerstone’s VTT output groups speech into longer caption blocks (sometimes 3–4 sentences per cue) rather than the 1–2 line standard recommended by WCAG and DCMP. Blocks this long create display timing issues: the block stays on screen while the audio moves on to later sentences, so the text a learner is reading no longer matches what is being said. This is not a word-accuracy error, but it is a timing issue under WCAG (SC 1.2.2 requires captions to be “synchronized” with audio) and is flagged as a compliance issue in a full DCMP review.

Cornerstone-specific workaround: The caption editor in the Captions tab supports manual editing of both text and timing cue boundaries. Administrators can re-split long caption blocks and correct substitution errors. The editor is functional but not designed for bulk correction; editing a 30-minute module with 80+ errors in a browser-based editor is a slow process. Export to SRT for editing in a desktop tool and re-import is supported but requires administrator access to the SCORM/content record.
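If you take the export-and-reimport route, re-splitting long blocks can be scripted instead of done by hand. The sketch below splits one long cue into shorter cues and shares the original duration in proportion to text length. The `max_chars` limit of 64 is an illustrative value, not a figure from WCAG or DCMP, and proportional timing is only an approximation: accurate re-timing needs word-level timestamps, which the exported file may not contain.

```python
import textwrap


def split_cue(start, end, text, max_chars=64):
    """Split one long cue into shorter cues of at most max_chars characters.

    Returns a list of (start, end, text) tuples that together cover the
    original time range.
    """
    chunks = textwrap.wrap(" ".join(text.split()), width=max_chars)
    if not chunks:
        return []
    total_chars = sum(len(chunk) for chunk in chunks)
    duration = end - start
    cues = []
    cursor = start
    for chunk in chunks:
        # Give each chunk a share of the time proportional to its length.
        chunk_end = cursor + duration * len(chunk) / total_chars
        cues.append((round(cursor, 3), round(chunk_end, 3), chunk))
        cursor = chunk_end
    return cues


long_text = (
    "Before you open the panel, confirm the breaker is locked out. "
    "Attach your tag to the lock. Then test the circuit with a meter "
    "to verify that it is de-energized."
)
for cue in split_cue(12.0, 24.0, long_text):
    print(cue)
```

The split points fall wherever the character limit lands, so a reviewer should still nudge boundaries to sentence or clause breaks after the bulk split.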

Bottom line: Cornerstone auto-captions are a viable starting draft for soft-skills content with studio-quality audio where the correction workflow is planned in advance. For any content with domain-specific vocabulary — including compliance acronyms, which are ubiquitous in corporate training — the substitution error rate will require correction before the track meets WCAG 2.1 AA standards. The Cornerstone captions guide covers the full caption workflow for Cornerstone, including the sidecar SRT delivery option that bypasses the internal auto-caption pipeline entirely.

Workday Learning

Caption engine: Workday Learning uses Google Speech-to-Text v1 for its automatic transcription feature, accessed through the Workday Extend learning content management system. The feature is available for video assets uploaded to the Workday content library and is accessed through the “Media” section of the learning content record.

How to activate: Upload video via Workday Content Management → open the video asset record → navigate to “Transcription” settings → enable “Auto-Transcription” and select the content language. Processing time is typically 0.8× to 1.2× video duration. The resulting VTT file is attached to the video and served in the Workday learning player. Like Cornerstone, there is no pre-publication review gate; the caption track becomes available to enrolled learners immediately on completion of the transcription job.

Accuracy profile:

Soft-skills content (clear studio audio, <15 min): 88–92%
Soft-skills content (>15 min): 83–87%
Compliance content (regulatory vocabulary): 82–86%
Technical training content (product/feature names): 76–82%
Medical/clinical content: 72–78%

Primary failure mode — long-form accuracy degradation: Workday’s Google STT v1 integration exhibits the most pronounced long-form degradation of the seven platforms evaluated. Videos shorter than 15 minutes perform within the expected Google STT v1 accuracy range. Videos in the 15–45 minute range — which includes most corporate compliance courses, product training modules, and leadership programs — show a consistent accuracy decline in the second half of the video. The cause is the attention window characteristic of the v1 model: as the transcription job processes longer audio, the model’s confidence on long-distance phoneme context decisions decreases. This manifests as increased substitution errors and more frequent silence detection failures (periods of audio that the model classifies as silence and omits from the caption track entirely) in the latter portions of long-form content.

Secondary failure mode — formal speech vocabulary bias: Google STT v1 was trained with a significant proportion of formal speech data (news broadcasts, interviews, presentations with prepared text). It handles formal presentation-style speech well but struggles with conversational delivery, on-screen-talent direct address, and instructional-video pacing patterns where the speaker frequently pauses mid-sentence, restarts sentences, or uses filler transitions (“so, let’s talk about...”, “okay, moving on...”) that are common in corporate training video but less common in formal speech training data. These pacing patterns cause the model to treat sentence boundaries incorrectly, creating misaligned caption cue breaks.

Workday-specific workaround: For long-form content (>15 min), splitting video files into segments under 15 minutes before uploading and enabling auto-captions per segment produces meaningfully higher accuracy on material that would otherwise fall in the degraded second half of a long video. This requires reassembling the segments in the LMS course as a sequential multi-video lesson rather than a single video object. It is not a practical workaround for existing content libraries with hundreds of long-form modules, but it is a useful option for new content in a production pipeline where the editing workflow can accommodate pre-upload segmentation. See the Workday Learning captions guide for the full sidecar caption delivery approach that avoids the auto-caption pipeline entirely.
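Planning the cut points for pre-upload segmentation is simple arithmetic. The sketch below divides a video into equal-length segments that each stay below a ceiling. The 14-minute default is an assumption chosen to leave a margin under the 15-minute mark, and the function only plans the boundaries. The actual cutting happens in your video editor.

```python
import math


def plan_segments(duration_s, max_segment_s=14 * 60):
    """Return equal-length (start, end) pairs in seconds.

    Each segment is no longer than max_segment_s, and together the
    segments cover the whole video.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    count = math.ceil(duration_s / max_segment_s)
    length = duration_s / count
    return [
        (round(i * length, 3), round(min((i + 1) * length, duration_s), 3))
        for i in range(count)
    ]


# A 45-minute compliance module.
for start, end in plan_segments(45 * 60):
    print(f"{start / 60:.2f} min -> {end / 60:.2f} min")
```

Equal-length cuts can land mid-sentence, so treat the planned times as targets and move each cut to the nearest natural pause before exporting.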

Bottom line: Workday auto-captions perform acceptably on short-form soft-skills content. For the long-form modules that constitute most of a corporate training library, the long-form degradation problem means the second half of any module over 15 minutes will require more intensive correction than the first half. Budget correction effort accordingly rather than applying a flat per-minute correction estimate.

Docebo

Caption engine: Docebo’s auto-transcription feature uses AWS Transcribe for English content (Amazon’s own transcription service, which is not documented as Whisper-based). AWS Transcribe includes speaker diarization (speaker change detection and labelling) as a default feature, which differentiates Docebo’s output from most of the other platforms in this comparison.

How to activate: In Docebo, access the Course Management section → select the video learning object → navigate to the “Transcript” panel → click “Generate Transcript.” Language detection is automatic; Docebo will attempt to identify the spoken language and select the appropriate transcription model. Processing time is approximately 1× video duration. The generated transcript is available for review in the Docebo transcript panel before it is published as a caption track.

Accuracy profile:

Soft-skills content (clear audio, neutral accent): 90–94%
Compliance content (regulatory vocabulary): 85–89%
Technical training content (product/feature names): 80–85%
Medical/clinical content: 77–82%

Primary failure mode — language detection edge cases: Docebo’s AWS Transcribe integration uses automatic language detection on audio submission. When the detected language confidence score falls below a threshold, the job defaults to English, which is correct for most corporate training content. However, content that includes extended code-switching (switching between languages mid-module), content recorded by non-native English speakers with heavy first-language interference in their phoneme production, and content that includes extended quotation of non-English regulatory text can all trigger the language detection edge case. When detection fails, the resulting caption track may apply the wrong language model to segments of the audio, producing gibberish output that is sometimes harder to correct than a simple substitution error because the word boundaries themselves are wrong.

Secondary failure mode — speaker diarization errors propagating into caption structure: AWS Transcribe’s speaker diarization is a differentiating feature compared to the other platforms in this comparison, but it is also a source of unique errors. When diarization misidentifies speaker changes — for example, treating a presenter’s change in vocal register (speaking quietly vs. normally) as a speaker change, or failing to detect a genuine speaker change during a rapid-fire Q&A segment — the caption track structure reflects those misidentifications. Caption cues attributed to the wrong speaker label, or caption cues that break mid-sentence at a false speaker boundary, require correction that is structurally different from simple word substitution.

Docebo advantage — transcript review panel: Docebo is the only platform in this comparison that surfaces the generated transcript for review before it is applied as a live caption track. The review step is not mandatory (administrators can skip it and apply the transcript immediately), but its existence means that a workflow that includes transcript review before publication is natively supported in the platform UI without requiring a file export-edit-reimport cycle. For organizations that want to implement a review gate for auto-generated captions, Docebo’s transcript panel is a useful workflow hook. See the Docebo captions guide for the full caption management workflow.

Bottom line: Docebo produces the highest raw accuracy scores of the seven platforms on standard soft-skills and compliance content, and its transcript review panel is the best native workflow support for a correction process. For technical content with domain-specific vocabulary, accuracy drops to the same range as the other platforms. The language detection edge case requires attention for multilingual organizations or content recorded by speakers with heavy non-English accent profiles.

TalentLMS

Caption engine: TalentLMS uses Google Speech-to-Text for its automatic transcription feature. The integration appears to use the “video” recognition model in Google STT, which is optimized for recorded video content rather than live streaming audio. TalentLMS is the platform in this comparison where auto-caption behavior varies most significantly by subscription tier; the evaluation here reflects the TalentLMS Business and Enterprise tiers.

How to activate: In TalentLMS, navigate to the Course → Unit containing the video → open the video unit settings → click “Generate Captions.” For video content hosted on TalentLMS’s native video player, the caption generation job runs automatically. For content embedded from YouTube, Vimeo, or other external hosts, TalentLMS uses the external platform’s caption track if one exists and generates its own track if not. Processing is fast compared to other platforms (typically 0.5× to 0.8× video duration) due to the streaming transcription configuration.

Accuracy profile:

Soft-skills content (single speaker, clear audio): 88–91%
Multi-speaker content (panel discussions, interviews): 80–86%
Compliance content (regulatory vocabulary): 82–86%
Technical training content (product/feature names): 77–82%
Medical/clinical content: 73–79%

Primary failure mode — speaker change handling and caption reset: TalentLMS’s Google STT configuration does not include speaker diarization. When the transcription engine detects an audio event that it interprets as a new speaker (which includes sudden audio level changes, brief silences, and some consonant clusters that the model treats as a speaker boundary marker), it resets the streaming transcription context. This context reset means that the model has less prior-sentence context to use for resolving phoneme ambiguities in the post-reset segment, producing a brief spike in substitution errors at speaker boundaries. For single-speaker content with consistent audio levels, this is a minor issue. For multi-speaker content — panel discussions, manager-employee role plays, moderated Q&A segments, simulated customer service scenarios — the context reset problem occurs at every speaker transition and is the primary source of accuracy degradation in those segments.

Secondary failure mode — external video embed dependency: TalentLMS’s handling of YouTube-embedded content means that courses built with YouTube video links rely on YouTube’s auto-captions rather than TalentLMS’s captioning pipeline. YouTube auto-captions have their own accuracy profile, their own failure modes, and their own caption editing and export workflow in YouTube Studio. Organizations that have standardized on TalentLMS for caption compliance purposes but use YouTube as their primary video host need to manage YouTube caption compliance through the YouTube platform, not through TalentLMS. Compliance audit trails, caption export records, and WCAG verification must be maintained at the YouTube level for YouTube-hosted content. See the TalentLMS captions guide for the full workflow including the YouTube embed caption management approach.

Bottom line: TalentLMS performs in the middle of the group on single-speaker content. Multi-speaker content is the platform’s weak point due to the context-reset behaviour, and the YouTube embed dependency is a compliance programme architecture concern for organizations with YouTube-hosted video libraries.

Canvas by Instructure

Caption engine: Canvas uses a Kaltura integration for media management in most institutional deployments, and Kaltura’s auto-captioning feature uses AWS Transcribe as its ASR backend. Institutions that have configured Canvas with Kaltura as the media host (the most common configuration in higher education) receive auto-captions via this Kaltura–AWS Transcribe pipeline. Institutions using direct video uploads to Canvas without Kaltura access a different pipeline that varies by institution configuration — some route through a different AWS Transcribe integration directly, some use Instructure’s own media service.

How to activate: In Canvas, access the Media Gallery or Course Media → select a video → navigate to the “Captions” option → choose “Order Machine Captions.” Processing time varies by institution configuration and Kaltura tier, typically 1× to 2× video duration for standard tier configurations. Generated captions appear in the video player automatically on job completion; Kaltura exposes a caption editor (accessible through the Actions menu → “Edit Captions”) for post-generation review.

Accuracy profile:

Soft-skills content (clear audio, neutral accent): 88–92%
Lecture content (academic vocabulary, technical terms): 82–87%
Compliance and policy content: 83–87%
Technical/domain-specific content: 79–84%
Medical and clinical content: 75–81%

Primary failure mode — Kaltura configuration variation: Because Canvas auto-captions depend on the institution’s Kaltura configuration, the accuracy and feature set available varies between institutions in a way that the other platforms in this comparison do not exhibit. Institutions on Kaltura Legacy tiers use an older AWS Transcribe model version. Institutions on Kaltura VOD™ Enterprise tiers with the Kaltura Caption and Enrich service have access to a Kaltura-proprietary post-processing layer. The practical implication is that two Canvas installations at different institutions can produce measurably different auto-caption accuracy on the same video. Accuracy benchmarking must be performed at your specific institution’s configuration, not extrapolated from benchmarks published for Canvas generally.

Secondary failure mode — caption timing precision: The Kaltura–AWS Transcribe pipeline produces timing that is often more precisely synchronized to word-level audio events than the other platforms in this comparison. This is a strength on most content. However, for content with deliberate dramatic pauses, background music, or sound effects (instructional video with audio design elements, scenario-based training with ambient sound), the word-level timing can create caption breaks that feel choppy to the learner because they do not correspond to natural reading units. The caption editor is needed to regroup cues into natural reading units after auto-generation. The Canvas LMS captions guide covers the Kaltura caption editor workflow and the SRT sidecar upload path that bypasses the auto-caption pipeline.
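Regrouping choppy cues can also be scripted on an exported caption file before a reviewer makes a final pass. The sketch below merges a cue into the previous one when the gap between them is short, the combined text stays under a length limit, and the previous cue does not already end a sentence. The `max_chars` and `max_gap` defaults are illustrative assumptions, not values taken from Kaltura or DCMP.

```python
def merge_cues(cues, max_chars=64, max_gap=0.5):
    """Merge short adjacent (start, end, text) cues into reading units."""
    merged = []
    for start, end, text in cues:
        if merged:
            prev_start, prev_end, prev_text = merged[-1]
            combined = f"{prev_text} {text}"
            ends_sentence = prev_text.rstrip().endswith((".", "?", "!"))
            if (
                start - prev_end <= max_gap
                and len(combined) <= max_chars
                and not ends_sentence
            ):
                # Extend the previous cue instead of starting a new one.
                merged[-1] = (prev_start, end, combined)
                continue
        merged.append((start, end, text))
    return merged


choppy = [
    (0.0, 0.4, "Before"),
    (0.4, 0.9, "you begin,"),
    (0.9, 1.6, "lock the panel."),
    (3.0, 3.5, "Then"),
    (3.5, 4.2, "test the circuit."),
]
for cue in merge_cues(choppy):
    print(cue)
```

On the sample input this yields two cues, one per sentence, and the deliberate pause between them is preserved as a gap in the timing.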

Bottom line: Canvas’s Kaltura integration produces competitive accuracy for lecture content in higher education settings and good timing precision. The configuration dependency means your results will reflect your specific Kaltura tier and configuration rather than any general Canvas benchmark. Multi-institution L&D programmes that span Canvas and another LMS need separate accuracy benchmarking per platform per institution.

Blackboard (Anthology Ultra)

Caption engine: Blackboard’s auto-captioning for video content runs through Ally, Blackboard’s accessibility feature module. Ally uses Google Speech-to-Text (v1 in most institutional deployments as of Q2 2026, with v2/Chirp rollout underway). Ally’s transcription is triggered by file upload to a Blackboard course content area and is designed for the higher education lecture capture use case, which shapes its accuracy profile for corporate training content in ways that are worth understanding.

How to activate: In Blackboard Ultra, add a video file to course content → Ally processes the file automatically if institutional Ally configuration has auto-transcription enabled → captions appear on the video player once Ally processing completes. If auto-transcription is not enabled at the institutional configuration level, captions must be triggered manually through the Ally dashboard. Ally also shows learners an accessibility indicator for the file that lists the accessibility features available, including whether a transcript is available.

Accuracy profile:

Lecture-style academic content (single speaker, prepared presentation): 87–92%
Conversational instruction content: 83–88%
Corporate compliance content: 81–86%
Technical/corporate training vocabulary: 75–82%
Medical/clinical content: 72–79%

Primary failure mode — corporate vocabulary underperformance: Ally’s Google STT v1 configuration was optimized for higher education use cases. The model has stronger priors for academic vocabulary (discipline-specific terminology in humanities, social sciences, and STEM fields commonly taught in undergraduate courses) and weaker priors for corporate training vocabulary. Terms that are uncommon in academic speech but common in corporate L&D — sales methodology names (MEDDIC, SPIN, Challenger), HR system names (Workday HCM, SuccessFactors, BambooHR), corporate acronyms (OKR, RACI, KPI, NPS), and organizational structure vocabulary (business unit, centre of excellence, skip-level, span of control) — are outside Ally’s accuracy zone. Accuracy on corporate training content in Blackboard is typically 3–5 percentage points lower than accuracy on the same content in platforms whose ASR model was trained more heavily on business vocabulary.

Secondary failure mode — institutional configuration dependency: Like Canvas, Blackboard’s caption quality is a function of institutional Ally configuration. Institutions that have enabled Ally’s “Enhanced Transcription” service receive meaningfully higher accuracy than those using Ally at the standard tier. The enhanced tier uses a more recent Google STT model and applies Ally’s own post-processing layer. Standard-tier Ally accuracy figures are what most medium-sized institutions have, and they are meaningfully lower than enhanced-tier results. Check your Blackboard Ally tier before drawing conclusions from accuracy benchmarks. See the Blackboard captions guide for the institutional configuration considerations and the alternative caption delivery path.

Bottom line: Blackboard Ally produces good results for lecture-style academic content in its target use case. For corporate L&D programmes run on Blackboard in the higher-education market, the academic vocabulary bias and institutional configuration dependency mean that the auto-caption starting point will require more correction on corporate training vocabulary than the accuracy figures for academic content would suggest.

Brightspace by D2L

Caption engine: Brightspace’s auto-captioning configuration varies more than any other platform in this comparison. D2L supports three different caption integration architectures: (1) Kaltura integration (same pipeline as the Canvas Kaltura setup described above), (2) D2L’s own Video Note transcription service for content captured through the Video Note tool, and (3) direct upload transcription via a Brightspace-native ASR integration that routes to AWS Transcribe in most configurations. Which pipeline your Brightspace deployment uses depends on your institution’s integration choices and D2L contract configuration.

How to activate: In Brightspace, the caption activation path depends on which media integration is active. For Kaltura-integrated deployments: Media Library → video record → “Order Machine Captions.” For D2L Video Note: content is automatically transcribed when recorded. For direct upload: Content → Add a File → video file → Caption tab → “Generate Captions” (where available in the institutional configuration). The caption editor interface varies between the three configurations.

Accuracy profile (Kaltura-integrated configuration):

Soft-skills content: 87–91%
Compliance content: 82–86%
Technical training content: 78–83%
Medical/clinical content: 74–80%

Accuracy profile (D2L Video Note / native ASR):

Soft-skills content: 85–89%
Compliance content: 79–84%
Technical training content: 74–80%
Medical/clinical content: 70–77%

Primary failure mode — configuration fragmentation: The most significant accuracy risk in Brightspace is not a specific ASR failure mode but the configuration fragmentation between the three integration paths. Content captured in Video Note gets the Video Note transcription accuracy profile. Content uploaded from external production gets the Kaltura or native-ASR accuracy profile, depending on configuration. Multi-source course content — a course that mixes instructor-recorded Video Note segments with professionally produced SCORM video — may have mixed accuracy profiles within a single course. L&D teams that assume all Brightspace auto-captions perform identically are likely to underestimate the correction burden on Video Note content (which is typically lower quality) and overestimate it on Kaltura-delivered content.

Secondary failure mode — D2L Video Note audio quality interaction: Video Note content is typically recorded with a webcam microphone or a built-in laptop microphone in a home office or office environment with ambient noise. The Audio Quality problem described in the remote and hybrid async video captioning post applies directly here: accuracy on high-quality studio audio is substantially higher than accuracy on typical Video Note audio. A 5–8 percentage point accuracy penalty is common for Video Note content recorded in standard office conditions compared to the same content recorded with a headset microphone and minimal background noise. See the Brightspace captions guide for the full caption management workflow including per-integration-path considerations.

Bottom line: Brightspace’s accuracy is a direct function of which integration path is active and what audio quality the source content has. Kaltura-integrated Brightspace performs at the Canvas/Kaltura benchmark; Video Note transcription performs meaningfully lower. For compliance programme planning, treat these as two separate caption pipelines with different accuracy baselines rather than a single “Brightspace accuracy” figure.

Cross-platform accuracy comparison

The table below summarises the accuracy ranges observed in the evaluation, organised by content type. All figures are word accuracy (100% minus WER), expressed as a percentage range across the test corpus. The WCAG 2.1 AA target (99%) is shown for reference.

Platform	ASR Engine	Soft-Skills	Compliance	Technical	Medical
WCAG 2.1 AA target	—	99%	99%	99%	99%
Docebo	AWS Transcribe (Whisper)	90–94%	85–89%	80–85%	77–82%
Cornerstone OnDemand	Whisper-family (internal)	90–93%	84–87%	78–83%	74–80%
Canvas (Kaltura)	AWS Transcribe (Kaltura)	88–92%	83–87%	79–84%	75–81%
Workday Learning (<15 min)	Google STT v1	88–92%	82–86%	76–82%	72–78%
TalentLMS (single speaker)	Google STT	88–91%	82–86%	77–82%	73–79%
Blackboard Ally (standard)	Google STT v1	87–92%	81–86%	75–82%	72–79%
Brightspace (Kaltura)	AWS Transcribe (Kaltura)	87–91%	82–86%	78–83%	74–80%
Workday Learning (>15 min)	Google STT v1	83–87%	78–84%	73–79%	69–75%
TalentLMS (multi-speaker)	Google STT	80–86%	77–83%	73–79%	69–76%
Brightspace (Video Note)	D2L native / AWS Transcribe	85–89%	79–84%	74–80%	70–77%

The gap between any cell in this table and the 99% WCAG target represents the error rate that requires human correction before the caption track is compliant. Measured from the top of each range, every platform in this comparison has a gap of at least 14 percentage points on technical training content. On medical content, the gap is at least 17 percentage points. Even on soft-skills content, the best-case scenario for every platform, the gap is at least 5 percentage points. Closing that gap requires human review time that must be budgeted as part of the caption production workflow.
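To reproduce a word accuracy figure of this kind on your own content, word error rate can be computed as the word-level edit distance between a reference transcript and the auto-caption text, divided by the reference length. The sketch below is a simplified illustration: the function names are hypothetical, and the normalisation (lowercasing and whitespace splitting) is an assumption that will not match every measurement protocol's rules for punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy_pct(reference: str, hypothesis: str) -> float:
    """Word accuracy as used in the table: 100% minus WER."""
    return 100.0 * (1.0 - word_error_rate(reference, hypothesis))


def gap_to_target(accuracy_pct: float, target_pct: float = 99.0) -> float:
    """Percentage points still to be closed by human correction."""
    return max(0.0, target_pct - accuracy_pct)


reference = "employees must complete hipaa privacy training every year"
auto_caption = "employees must complete hippo privacy training every year"
accuracy = word_accuracy_pct(reference, auto_caption)  # one substitution in eight words
gap = gap_to_target(accuracy)
```

Because insertions count as errors, WER can exceed 100% on badly degraded output, in which case this accuracy figure goes negative. On short samples a single proper noun error moves the percentage a long way, so measure on full modules rather than excerpts.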

For context on what these accuracy levels mean in terms of correction volume, the hidden FTE cost post models the correction labour at different accuracy baselines. A 90% accurate track on a 10-minute module contains approximately 300 errors in a 3,000-word module; at 4× real-time correction speed, that is 40 minutes of correction time per module. At 80% accuracy on the same module, the error count is approximately 600, and because adjacent errors compound into structural errors that require segment rewrites rather than word substitutions, correction time can exceed 90 minutes per module.

Failure-mode patterns by platform

Beyond the raw accuracy figures, each platform has a distinctive failure-mode fingerprint that affects how correction effort should be prioritised and what content types are most at risk.

Proper noun substitution (all platforms, most severe in Cornerstone and Workday)

Every platform in this comparison substitutes proper nouns with acoustically similar common words. The pattern is universal because it reflects the general-purpose ASR model’s vocabulary probability distribution, not a platform-specific bug. What varies between platforms is the frequency and severity: Cornerstone and Workday show the most consistent proper noun substitution on product names and organizational unit names; Docebo shows the highest language-detection-driven proper noun failure on accented English content.

Proper noun substitution is especially damaging in compliance training: regulatory act names (HIPAA, OSHA, FINRA), regulatory body names (DOJ, EEOC, NLRB), and compliance form and document names (Form I-9, SF-86, MSDS) all fall into the high-risk zone. A compliance training course that teaches employees about HIPAA privacy obligations but renders “HIPAA” as “hippo” throughout its captions is not providing accessible content to hearing-impaired learners, regardless of the word-level accuracy percentage on general vocabulary. The proper noun failure modes post covers the taxonomy of proper noun errors and the glossary architecture required to prevent them.

Long-form accuracy degradation (most severe in Workday, present in all platforms)

All ASR models exhibit some accuracy degradation over the course of a long audio file. Workday’s Google STT v1 integration shows this most dramatically in the test corpus: a consistent 4–6 percentage point accuracy decline between the first half and second half of 30-minute modules. Cornerstone, Canvas, and Brightspace show smaller but measurable degradation (1–3 percentage points). TalentLMS shows minimal long-form degradation on single-speaker content but pronounced degradation on multi-speaker content with frequent speaker transitions.

The practical implication: when reviewing auto-generated caption tracks for correction, prioritise the second half of long-form content first. A review workflow that processes captions linearly from beginning to end will allocate proportionally too much review time to the first half (where accuracy is highest and corrections are least needed) and too little to the second half (where accuracy is lowest and errors are most concentrated). For the DCMP spot-check protocol, sample from the last third of long-form content rather than evenly distributing sample windows across the full duration.
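As an illustration of tail-weighted sampling, the sketch below places evenly spaced review windows inside the final third of a module. The function name, the number of windows, and the window length are assumptions made for this example; they are not values taken from the DCMP protocol.

```python
def spot_check_windows(duration_s, n_windows=3, window_s=60.0, tail_fraction=1 / 3):
    """Return (start, end) sample windows, in seconds, inside the module's tail.

    The tail (by default the final third) is split into n_windows equal
    slots and one window is centred in each slot.
    """
    if duration_s <= 0 or n_windows < 1:
        raise ValueError("duration_s must be positive and n_windows at least 1")
    tail_len = duration_s * tail_fraction
    tail_start = duration_s - tail_len
    slot = tail_len / n_windows
    window = min(window_s, slot)  # never let windows overlap
    windows = []
    for i in range(n_windows):
        start = tail_start + i * slot + (slot - window) / 2
        windows.append((round(start, 1), round(start + window, 1)))
    return windows


# A 30-minute module: three one-minute windows, all after the 20-minute mark.
windows_30_min = spot_check_windows(30 * 60)
```

For a 30-minute module this yields windows starting at roughly 21, 24.5, and 28 minutes, which is where the degradation described above is most likely to show.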

Speaker change handling (most severe in TalentLMS and all platforms without diarization)

Of the seven platforms, only Docebo (via AWS Transcribe’s speaker diarization) natively handles multi-speaker content with explicit speaker labelling. All other platforms treat audio as single-speaker by default. When a non-diarized engine encounters a genuine speaker change — transition from instructor to learner voice in a role-play, Q&A moderator to panellist, narrator to subject-matter expert — it handles the transition using its silence detection and context window behaviour, which is not optimised for the task. TalentLMS’s streaming configuration makes this the most visible failure pattern on that platform; Cornerstone and Canvas handle speaker transitions more smoothly but still without explicit speaker labels.

For organizations producing training content with multiple speakers — interview-format learning, moderated panel sessions, facilitated discussion capture, manager-employee role-play scenarios — Docebo is the only platform in this comparison that addresses the speaker attribution problem at the ASR layer. Even Docebo’s diarization has a 15–20% speaker change misidentification rate in the test corpus, but this is substantially better than the alternatives.

Timing synchronization errors (most variable in Brightspace, most precise in Canvas)

Timing accuracy is a separate dimension from word accuracy and is required for WCAG SC 1.2.2 compliance (“synchronized” captions). Canvas’s Kaltura pipeline produces the most precise word-level timing synchronization in the test corpus. Brightspace’s Video Note pipeline produces the most timing errors, driven by the audio quality variation in Video Note recordings. Cornerstone’s long caption block grouping creates display timing issues even when word-level timing is accurate, because the caption block is on screen longer than the reading time for the block.

Timing errors compound with word errors in the learner experience: a caption track with 90% word accuracy and poor timing synchronization is harder for a hearing-impaired learner to use than a caption track with 85% word accuracy and accurate timing, because the synchronization error means the learner cannot use context clues from the video image to resolve ambiguous words.

Decision framework: when LMS-native auto-captions are acceptable

Based on the accuracy profiles and failure-mode patterns in this comparison, the following framework guides the decision between using LMS native auto-captions as a starting draft (with correction) versus bypassing the auto-caption pipeline and using a professional captioning service directly.

When LMS-native auto-captions are an acceptable starting draft

All four of the following conditions must be met:

Content type is general soft-skills or broadly accessible business vocabulary. The module covers communication, leadership, interpersonal skills, general business processes, or other content that does not contain product names, regulatory acronyms, domain-specific technical vocabulary, or medical/clinical terminology. If the content would require a subject-matter expert to write a glossary for it, it is outside the soft-skills category.
Audio quality meets minimum standards. The recording uses a quality headset or directional microphone, the recording environment has minimal background noise and echo, and the speaker maintains consistent distance from the microphone throughout the recording. Home-office recordings with laptop built-in microphones, recordings with HVAC noise or room echo, and recordings with significant volume variation from a moving speaker typically fall below the minimum quality threshold for acceptable auto-caption accuracy.
Duration is under 15 minutes per video file. Long-form accuracy degradation affects all platforms, most severely Workday. For content over 15 minutes, auto-caption accuracy in the second half of the module is statistically lower than in the first half on every platform evaluated. The 15-minute threshold avoids the degradation zone on most platforms.
A mandatory human review step is built into the publication workflow before learner access. This is the non-negotiable requirement. Auto-generated captions that are published to learners without human review are compliance exposures regardless of content type or audio quality. The review must include reading the caption text against the audio (not just skimming), checking synchronization at 3–4 points across the duration, and verifying any proper nouns, acronyms, or technical terms that appear in the content.

If all four conditions are met, LMS native auto-captions provide a time-efficient draft that a reviewer can correct in a fraction of the time it would take to produce a caption track from scratch. The correction session is still mandatory; the auto-caption draft reduces the total time investment by eliminating the initial transcription step.

When a professional captioning service is required

Any one of the following conditions makes LMS native auto-captions an inadequate starting point:

Content contains domain-specific vocabulary. Technical training, compliance training with regulatory terminology, medical or clinical training, product training with feature names, and legal or financial training with specialized vocabulary all require a vocabulary glossary that LMS native auto-caption pipelines cannot apply. A sub-80% auto-caption track will not reach 99% accuracy on domain vocabulary without correction effort that costs more per minute than a professional captioning service.
Content will be published without pre-publication human review. Any workflow where auto-generated captions are live to learners without a correction step is a compliance exposure, not a cost-saving measure. The auto-caption track will contain errors at a frequency that is incompatible with WCAG 2.1 AA regardless of platform.
Content is longer than 30 minutes. Long-form accuracy degradation compounds the correction burden on the second half of long content to the point where a professional service starting from a ground-truth transcript is more efficient than correcting a degraded auto-caption track.
Content has multiple speakers or non-neutral accent profiles. Multi-speaker content without diarization and non-neutral accent content both produce accuracy profiles that are 5–10 percentage points below the single-speaker neutral-accent baseline. The gap to WCAG 2.1 AA on this content type is too large for an auto-caption draft to be the efficiency-maximizing starting point.
Content carries regulatory compliance training obligations with audit documentation requirements. HIPAA security training, OSHA hazard communication training, FINRA-required broker-dealer training, Section 508-covered government training — any content where a compliance failure creates regulatory exposure rather than just accessibility liability should be captioned with a documented DCMP-protocol accuracy verification, which requires a professional service with a measurement and reporting capability that LMS native auto-caption pipelines do not provide.
Content is being remediated for ADA Title II or Title III compliance. Back-catalogue remediation programmes targeting ADA Title II compliance (higher education) or Title III compliance (public accommodation) require documented accuracy verification and an audit trail that demonstrates WCAG 2.1 AA compliance. An LMS native auto-caption track does not come with accuracy documentation. A professional captioning service that provides DCMP-protocol accuracy reports per module provides the audit evidence that a compliance programme requires.
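The two checklists above can be encoded as a simple routing function for use at content intake. This is a sketch under stated assumptions: the `Module` fields and the `caption_path` function are hypothetical names, and the `case_by_case` outcome is an addition for modules that trigger none of the professional-service conditions but also miss a draft condition (for example, a 20-minute module), which the framework does not explicitly assign.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Attributes of one video module (all field names are hypothetical)."""
    duration_min: float
    domain_vocabulary: bool          # product names, regulatory acronyms, clinical terms
    audio_meets_minimum: bool        # headset or directional mic, low noise and echo
    review_before_publish: bool      # human review gate before learner access
    multi_speaker_or_accented: bool = False
    regulatory_audit_required: bool = False
    ada_remediation: bool = False


def caption_path(m: Module) -> tuple[str, list[str]]:
    """Return (path, reasons) following the two checklists."""
    # Any single condition here rules out the auto-caption draft.
    reasons = []
    if m.domain_vocabulary:
        reasons.append("domain-specific vocabulary")
    if not m.review_before_publish:
        reasons.append("no pre-publication human review")
    if m.duration_min > 30:
        reasons.append("longer than 30 minutes")
    if m.multi_speaker_or_accented:
        reasons.append("multiple speakers or non-neutral accents")
    if m.regulatory_audit_required:
        reasons.append("regulatory audit documentation required")
    if m.ada_remediation:
        reasons.append("ADA remediation requires accuracy documentation")
    if reasons:
        return "professional_service", reasons

    # Vocabulary and review are already satisfied at this point, so only
    # the audio and duration draft conditions remain to be checked.
    unmet = []
    if not m.audio_meets_minimum:
        unmet.append("audio below minimum quality")
    if m.duration_min >= 15:
        unmet.append("15 minutes or longer")
    if unmet:
        return "case_by_case", unmet
    return "auto_caption_draft", []


soft_skills_short = caption_path(Module(10, False, True, True))
hipaa_module = caption_path(Module(10, True, True, True))
mid_length = caption_path(Module(20, False, True, True))
```

Running the function at intake rather than at publication keeps the routing decision in the programme design phase, where the glossary and budget implications can still be planned.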

For organizations that want the efficiency of automation without the accuracy gap, the alternative to LMS native auto-captions is a captioning service API that applies a custom vocabulary glossary before delivering the caption track. The glossary-corrected output from a professional service API is processed in the same time window as LMS native auto-captions (minutes to hours) but at a target accuracy of 97–99% on domain-specific content rather than 73–85%. The comparison between these two paths — and the unit economics of when each is appropriate — is the subject of the caption ROI post for finance executives.

The caption compliance programme post covers the decision framework for where in the content production workflow to apply automated captioning versus professional captioning, and how to assign content types to each path in the programme design phase rather than making ad hoc decisions at the module level.

The correction labour cost the auto-caption ROI calculation misses

The most common mistake in LMS native auto-caption ROI calculations is treating auto-captions as having zero cost because they are included in the LMS subscription. The auto-caption feature is included in the subscription. The correction labour required to bring the auto-generated track to WCAG compliance is not included, and it is not free.

Correction labour model for LMS native auto-caption tracks

The following model uses conservative inputs to estimate correction labour cost:

Content volume: 30 modules per month, average 20 minutes per module, average 3,000 words per module
Auto-caption accuracy: 85% (optimistic estimate for compliance content on a mid-tier platform)
Error count per module: 85% accuracy on a 3,000-word module = 450 word errors per module
Correction speed: DCMP-protocol correction at 4× real-time = 80 minutes of correction per 20-minute module
Monthly correction hours: 30 modules × 80 minutes = 2,400 minutes = 40 hours per month
Coordinator fully-loaded hourly cost: $45/hour
Monthly correction labour cost: 40 hours × $45 = $1,800/month

At the same content volume, a professional captioning service at $1.50/minute costs: 30 modules × 20 minutes × $1.50 = $900/month, delivering 99%+ accuracy with a product glossary applied and no coordinator correction time. The professional service costs 50% less than the auto-caption-plus-correction workflow and produces a more accurate output.
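The arithmetic above is easy to rerun with your own inputs. The sketch below uses hypothetical function names and reproduces the figures in this model; the break-even helper is a direct rearrangement of the same two formulas, not an additional data point.

```python
def correction_labour_cost(modules, minutes_per_module, correction_multiplier, hourly_rate):
    """Monthly cost of correcting auto-captions in-house.

    correction_multiplier is correction time as a multiple of video
    duration (4 means 4x real-time).
    """
    hours = modules * minutes_per_module * correction_multiplier / 60
    return hours * hourly_rate


def professional_service_cost(modules, minutes_per_module, rate_per_minute):
    """Monthly cost of a per-minute captioning service."""
    return modules * minutes_per_module * rate_per_minute


def break_even_multiplier(rate_per_minute, hourly_rate):
    """Correction speed (x real-time) at which both paths cost the same."""
    return rate_per_minute * 60 / hourly_rate


in_house = correction_labour_cost(30, 20, 4, 45)      # 40 hours at $45/hour
outsourced = professional_service_cost(30, 20, 1.50)  # 600 minutes at $1.50/minute
break_even = break_even_multiplier(1.50, 45)
```

With these inputs the break-even multiplier is 2.0: in-house correction only matches the service price if a reviewer can work at 2× real-time or faster, which is half the time the 4× assumption allows.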

This crossover point shifts depending on the accuracy of the auto-caption output, the volume of content, and the coordinator’s time cost. But the structural insight holds across a wide range of inputs: when the correction labour is counted at its actual cost, LMS native auto-captions are not free. They shift the cost from a line item in the vendor budget to a non-visible line item in the coordinator’s workload. The hidden half-FTE cost model provides the detailed calculation framework for organizations that want to size this labour cost for their specific content volume and accuracy baseline.

The accuracy tipping point

Correction labour cost is not linear with error count. At 90% accuracy, errors are mostly isolated substitutions that can be corrected with a single word replacement per error. At 80% accuracy, error clusters become common: a substitution error in one word creates a downstream context error in the next phrase, so the correction requires rewriting a sentence segment rather than replacing a word. At 75% accuracy, structural errors (multiple consecutive errors creating unintelligible segments) require re-listening to the audio and retyping from scratch — at which point the correction workflow is not faster than transcription from scratch. The accuracy-to-correction-time relationship is roughly linear down to 85–88%, then accelerates steeply below that threshold. Technical training content on most of the platforms in this comparison falls in the 73–85% range — below the inflection point where correction efficiency degrades most sharply.

For organizations that want to use LMS native auto-captions as a starting draft for efficiency, the implication is: limit the auto-caption draft workflow to content types and platform configurations that produce 88%+ accuracy. For everything else, the labour cost of correcting the draft exceeds the labour cost saved by not transcribing from scratch, and a professional service is the more efficient path.

Eight failure modes in LMS native auto-caption workflows

Assuming auto-captions are compliant because they exist. The presence of a caption track on a video in an LMS does not satisfy WCAG 2.1 AA. The track must be accurate (99% word accuracy by DCMP protocol), synchronized, and complete. An auto-generated track at 82% word accuracy is not compliant. Organizations that report “100% caption coverage” based on the presence of auto-generated tracks without accuracy verification are reporting caption coverage, not compliance. The distinction matters when a learner files an ADA complaint or when an OCR audit requests compliance documentation. The auto-captions compliance status post covers the regulatory basis for why auto-captions require human review before compliance claims can be made.
Applying a flat per-minute correction time estimate to all content types. L&D teams that budget correction time as a flat 30 minutes per 10-minute module will underestimate correction effort on technical content (where error density is higher and structural errors are more common) and overestimate it on soft-skills content (where isolated word substitutions are easier to correct). Build content-type-specific correction time estimates into the production workflow rather than applying a universal estimate.
Publishing auto-generated captions immediately on job completion without review. All seven platforms evaluated here make the auto-generated caption track available to learners immediately on job completion. None require an administrator review step before learner access. The default platform behaviour is to publish the unreviewed track. Compliance requires a workflow override that holds the video in a review state (not learner-accessible) until the caption track has been reviewed and approved. This is a configuration and workflow decision, not a default feature.
Benchmarking accuracy on general content and assuming domain-specific content performs comparably. Auto-caption accuracy on a 10-minute soft-skills module from a communication training library is not predictive of accuracy on a 10-minute HIPAA security training module from the same LMS. The gap between best-case and worst-case accuracy across content types is 8–18 percentage points on the platforms in this comparison. Benchmark accuracy on representative content from each content type in your specific library, not on sample content provided by the LMS vendor.
Failing to account for configuration variation between LMS deployments. Canvas, Blackboard, and Brightspace auto-caption performance depends significantly on institutional Kaltura or Ally configuration. Accuracy figures from another institution’s Canvas deployment or from an LMS vendor demo environment are not applicable to your specific configuration. Test the auto-caption pipeline on your deployment before committing to it as a compliance strategy.
Using long-form auto-captions without auditing the second half of the module. Long-form accuracy degradation affects every platform, most severely Workday. A QA review that samples only the first 10 minutes of a 45-minute module may see 88% accuracy and flag the track as close to compliant; sampling the final 20 minutes may reveal 78% accuracy and structural errors. Full-module QA, or DCMP spot-check samples weighted toward the final third of long-form content, is required for reliable compliance assessment of long-form auto-captioned content.
Treating LMS native auto-captions and external captioning service output as equivalent for audit documentation purposes. A professional captioning service that provides per-module DCMP accuracy reports with reference transcript documentation provides audit-ready compliance evidence. An LMS native auto-caption job does not provide any accuracy documentation. When a compliance audit, an ADA complaint, or an OCR investigation asks for evidence that a specific module meets WCAG 2.1 AA, “we used the LMS auto-caption feature” is not a sufficient response. Compliance documentation requires either a DCMP-protocol accuracy report or a professional captioning service statement of accuracy per job.
Ignoring the glossary gap when evaluating auto-caption sufficiency for a content category. The decision to use LMS native auto-captions for a content category is often made based on a surface review of a few sample modules. Those samples may perform acceptably if they happen to avoid the specific product names, acronyms, or technical terms that the ASR engine handles poorly. When the next batch of modules includes a newly launched product feature, a recently introduced compliance acronym, or a presenter who uses industry-specific terms at high frequency, accuracy drops and the decision to use auto-captions looks worse. Evaluate auto-caption sufficiency across the full range of vocabulary that will appear in the content category, not a selected subset. The glossary architecture post covers how to build the vocabulary map that makes this evaluation possible.

FAQ

Our LMS vendor told us their auto-captions are “AI-powered” and meet WCAG 2.1 AA. Why does your comparison show they don’t reach 99%?

“AI-powered” describes the technology used to generate the captions, not their accuracy or compliance status. WCAG 2.1 AA requires captions to be accurate, synchronized, and complete. The accuracy requirement — 99% by DCMP Captioning Key protocol — is a measured outcome, not a property of using AI. An AI-generated caption track at 83% accuracy is not WCAG 2.1 AA compliant regardless of how the technology is described. LMS vendors use “AI-powered” to differentiate auto-caption features from manual captioning workflows, not as a claim of specific accuracy performance. When evaluating vendor claims, ask for the specific accuracy percentage on domain-specific content comparable to your training library, measured by word error rate against a reference transcript, not a general statement about technology capability. The caption vendor accuracy evaluation post covers the questions to ask and the methodology to use for evaluating any vendor’s accuracy claims — including your LMS vendor’s auto-caption feature.

Can we use LMS native auto-captions as a first draft and then send to a captioning service for correction? Is that more efficient?

This workflow (auto-caption first, then send the draft to a professional captioning service for correction rather than from-scratch transcription) is called “clean-up” or “correction-from-draft” mode. Most professional captioning services offer it and price it at a discount from their standard rate (typically 30–50% less per minute than from-scratch transcription) because the correction workflow is faster than full transcription when the draft accuracy is above 85%. For technical content where LMS native auto-caption accuracy lands at 73–85%, the draft quality may not be high enough to produce the correction-from-draft efficiency gain, and the professional service may produce more accurate results starting from scratch than correcting a heavily degraded draft. Check with your captioning service provider on the minimum accuracy threshold they require for their correction-from-draft workflow to be faster than from-scratch transcription. Most providers set this threshold at 85–88%. For content categories where LMS native auto-captions reliably exceed that threshold, the hybrid workflow makes sense. For content below it, use the professional service directly. This is worth modelling with your own content sample rather than taking a general rule, because the threshold depends on the character of the errors: an 83% accurate draft with uniformly distributed single-word substitutions is easier to correct than an 83% accurate draft with concentrated structural errors in specific segments.

We are switching LMS platforms from Blackboard to Canvas next year. Should we re-caption our existing library or can we use the caption tracks from Blackboard?

If your existing Blackboard caption tracks were produced by Blackboard Ally auto-captions, they carry the accuracy profile of that pipeline (87–92% for soft-skills, 75–82% for technical content). Moving those tracks to Canvas does not change their word accuracy; the same errors present in the Blackboard-generated tracks will be present in the Canvas player. The question is whether you should take the LMS migration as an opportunity to replace auto-generated tracks with higher-accuracy professionally captioned tracks. The answer depends on: (1) whether the existing tracks have been reviewed and corrected to WCAG compliance levels already — if so, those tracks are compliant and can be migrated; (2) whether your compliance programme documentation treats the existing tracks as compliant or as “draft auto-captions” — if the latter, migration is the trigger for the compliance correction that was deferred; and (3) the content priority framework you would use for back-catalogue remediation. The LMS migration caption checklist post covers the full caption data migration process including format verification, sidecar file transfer, timing verification after LMS import, and the priority framework for deciding which modules need recaptioning versus which can be migrated as-is.

Our training content is entirely soft-skills and communication modules with high-quality studio audio. Our LMS auto-captions hit 91% on a sample we tested. At 91% accuracy, how much correction time should we budget per module?

At 91% accuracy on a 10-minute module with clean studio audio and approximately 1,500 words: 9% error rate × 1,500 words = 135 word errors per module. At a 4× real-time correction rate for a trained caption reviewer: 4 × 10 minutes = 40 minutes to review and correct the full module. In practice, soft-skills content at 91% accuracy tends toward isolated single-word substitutions rather than structural errors, so correction usually runs faster than the 4× figure and may be achievable in 30–35 minutes for a fluent reviewer. At 30 modules per month at 10 minutes average duration: 30 modules × 35 minutes correction = 1,050 minutes = 17.5 hours/month. At a coordinator cost of $45/hour fully loaded: $787.50/month in correction labour. Compare this to a professional service at $1.50/minute × 300 minutes = $450/month at higher accuracy with no coordinator correction time required. The auto-caption-plus-correction workflow is still more expensive in total cost at this volume even at 91% accuracy (by $337.50/month), but the difference is small enough that the workflow choice may be determined by factors other than cost. For example, if your coordinator prefers to retain caption editing as a quality control step, the labour cost difference is modest and the quality outcome may be better. If you have more than 100 modules per month, the cost gap widens significantly in favour of the professional service.
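To rerun this comparison with your own volumes and rates, the arithmetic can be sketched in a few lines of Python. The class and function names here are illustrative, not part of any LMS or vendor tool, and the defaults are simply the figures from the worked example above (30 modules, 10 minutes each, 35 minutes of correction per module, $45/hour, $1.50/minute).

```python
from dataclasses import dataclass


@dataclass
class CaptionWorkload:
    """Monthly captioning volume (hypothetical helper type)."""
    modules_per_month: int
    minutes_per_module: float


def correction_cost(workload: CaptionWorkload,
                    correction_minutes_per_module: float,
                    hourly_rate: float) -> float:
    """Monthly labour cost of correcting auto-captions in-house."""
    hours = workload.modules_per_month * correction_minutes_per_module / 60
    return hours * hourly_rate


def service_cost(workload: CaptionWorkload, rate_per_minute: float) -> float:
    """Monthly cost of a per-minute professional captioning service."""
    total_minutes = workload.modules_per_month * workload.minutes_per_module
    return total_minutes * rate_per_minute


# Figures taken from the worked example in the text.
workload = CaptionWorkload(modules_per_month=30, minutes_per_module=10)
in_house = correction_cost(workload, correction_minutes_per_module=35, hourly_rate=45.0)
outsourced = service_cost(workload, rate_per_minute=1.50)

print(f"Auto-caption + correction: ${in_house:,.2f}/month")
print(f"Professional service:      ${outsourced:,.2f}/month")
print(f"Difference:                ${in_house - outsourced:,.2f}/month")
```

Both costs scale linearly with module count under these assumptions, so the absolute gap grows with volume while the ratio stays the same. If your service offers volume discounts or your correction time drops with reviewer experience, adjust the inputs accordingly.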

We use Canvas with Kaltura. How do we get the accuracy benchmarks for our specific configuration rather than relying on general Canvas figures?

Run the evaluation on your specific deployment using the following methodology: (1) Select 10–15 videos representative of your content library: 3–4 soft-skills modules, 3–4 compliance modules, 3–4 technical or product training modules. (2) Produce a ground-truth transcript for each video. This means listening to the video and typing exactly what is said — including filler words, false starts, and re-starts — without editing for clarity. This is the reference transcript. Budget 3–4 hours per 10 minutes of content. (3) Enable auto-captions on each video through the Kaltura media gallery → order machine captions workflow. Export the resulting SRT or VTT file. (4) Align the auto-caption text with the reference transcript using any of the standard WER calculation tools (sclite, jiwer in Python, or a manual alignment spreadsheet). Calculate WER as: (substitutions + insertions + deletions) ÷ reference word count × 100%. (5) Record results by content category. The pattern across your content types is more informative than a single average figure. The caption QA methodology post describes the DCMP spot-check protocol as an alternative to full WER calculation if you want a faster estimate using a random sample rather than a full module evaluation. The spot-check approach requires less reference transcript preparation time and still produces a reliable accuracy estimate for compliance assessment purposes.
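The WER calculation in step (4) can be sketched with a standard word-level edit distance. This is a minimal illustration using only the Python standard library, not a replacement for sclite or jiwer; the text normalisation (lowercasing and stripping punctuation) is an assumption, and you should match whatever normalisation your compliance methodology specifies. Word accuracy is commonly reported as 100% minus WER.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens (assumed normalisation)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + insertions + deletions) / reference words * 100."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref) * 100


reference = "All staff must complete HIPAA training annually."
auto_caption = "all staff must complete hippo training annually"
wer = word_error_rate(reference, auto_caption)
print(f"WER: {wer:.1f}%  accuracy: {100 - wer:.1f}%")
```

Run this per video and aggregate by content category, as step (5) recommends. Note that WER can exceed 100% when the hypothesis contains many insertions, so treat "100% minus WER" as an accuracy figure only for reasonably good drafts.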

We have a mixed content library — some modules are soft-skills, some are technical, some are compliance. Is there a way to set a platform-level policy rather than making case-by-case decisions?

Yes. A tiered content categorisation policy is the standard approach for organizations with mixed content libraries. The policy assigns each module to one of three tiers at the course catalogue level: (1) Tier A — LMS native auto-captions with mandatory pre-publication review: soft-skills content, communication training, general business skills, content with no acronyms, product names, or domain-specific vocabulary. Review gate required before learner access. (2) Tier B — professional captioning service with glossary: compliance training with regulatory terminology, product training with feature names, any content with an existing glossary term set, content that will receive regulatory compliance audit documentation. (3) Tier C — professional captioning service with dedicated glossary and DCMP accuracy documentation: medical and clinical training, legal and financial services training, OSHA and safety procedure training, government and Section 508-covered content, any content where a captioning failure creates regulatory liability beyond ADA accessibility obligation. This tiered policy eliminates case-by-case decisions by classifying content at intake rather than evaluating each module individually. New content is assigned to a tier in the content inventory system when the module is registered. The caption compliance programme design post covers how to implement this tiered classification in a formal caption programme with RACI, governance, and audit trail requirements.
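As a rough illustration, the intake classification can be expressed as a small rule function. The field names below (`regulated_domain`, `has_domain_vocabulary`, `audit_documented`) are hypothetical flags you would capture in your content inventory, not fields any LMS provides out of the box, and the rule order (most restrictive tier first) is an assumption about how overlapping criteria should resolve.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    A = "LMS native auto-captions with mandatory pre-publication review"
    B = "Professional captioning service with glossary"
    C = "Professional captioning service with dedicated glossary and DCMP accuracy documentation"


@dataclass
class Module:
    """Intake metadata for one module (hypothetical inventory fields)."""
    title: str
    regulated_domain: bool = False       # medical, legal, financial, OSHA, Section 508
    has_domain_vocabulary: bool = False  # acronyms, product names, glossary terms
    audit_documented: bool = False       # needed for regulatory audit documentation


def assign_tier(module: Module) -> Tier:
    """Classify a module at intake, checking the strictest tier first."""
    if module.regulated_domain:
        return Tier.C
    if module.has_domain_vocabulary or module.audit_documented:
        return Tier.B
    return Tier.A


catalogue = [
    Module("Giving constructive feedback"),
    Module("New product feature walkthrough", has_domain_vocabulary=True),
    Module("Lockout/tagout safety procedure", regulated_domain=True,
           has_domain_vocabulary=True),
]
for module in catalogue:
    tier = assign_tier(module)
    print(f"{module.title}: Tier {tier.name} ({tier.value})")
```

Checking the strictest tier first means a module that matches several criteria always lands in the tier with the strongest controls, which is usually the safer default for a compliance policy.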

Can we use the auto-caption tracks from our LMS as a starting point for training a custom ASR model that performs better on our content?

Custom ASR model training requires ground-truth transcripts (reference transcripts with verified word-level accuracy) as training data, not auto-generated caption tracks. Auto-generated tracks contain errors at the frequency documented in this post; training a model on erroneous transcripts produces a model with a learned bias toward those errors rather than a model that corrects them. To use your content library as ASR fine-tuning data, you would need to produce human-reviewed, DCMP-accuracy-verified transcripts for a minimum viable fine-tuning corpus (typically 50–200 hours of domain-specific audio with verified transcripts, depending on the model architecture). The practical path for most L&D organizations is not model fine-tuning (which requires ML engineering capability and ongoing maintenance) but vocabulary glossary configuration at a professional captioning service API layer. A captioning service glossary with 200–500 domain-specific terms achieves the practical benefits of model domain adaptation — correct product names, regulatory acronyms, and technical terms — in a matter of hours rather than months and without model training infrastructure. The glossary-biased captioning post covers the technical mechanism by which vocabulary glossaries improve accuracy on domain-specific content without model retraining, and the glossary architecture post covers how to structure the term set for maximum accuracy gain with minimal maintenance overhead.
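To make the glossary idea concrete, here is a deliberately simplified sketch of a text-level glossary pass over caption text. It is an assumption-laden illustration, not how any particular captioning service implements glossary biasing: real services typically influence recognition itself or match phonetically, whereas this sketch only rewrites known wrong strings after the fact. The function name and the two example entries are illustrative.

```python
import re


def apply_glossary(caption_text: str, glossary: dict[str, str]) -> str:
    """Replace known ASR mis-transcriptions with the correct term.

    Matching is case-insensitive and limited to whole words or phrases.
    Longer phrases are applied first so a multi-word entry is not
    pre-empted by a shorter entry that overlaps it.
    """
    for wrong in sorted(glossary, key=len, reverse=True):
        correct = glossary[wrong]
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash processing in the term.
        caption_text = pattern.sub(lambda _match, term=correct: term, caption_text)
    return caption_text


# Example entries: ASR output on the left, correct term on the right.
glossary = {
    "cornerstone on demand": "Cornerstone OnDemand",
    "hippo": "HIPAA",
}

raw = "Log in to cornerstone on demand to start your hippo training."
print(apply_glossary(raw, glossary))
```

A blind substitution like this would also rewrite a genuine mention of a hippo, which is one reason production glossary systems rely on context or recognition-time biasing rather than plain string replacement.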

Get 99%+ accuracy on your LMS training content with a glossary built from your vocabulary

The accuracy gap between LMS native auto-captions and the WCAG 2.1 AA threshold (5 to 27 percentage points depending on content type and platform) is not a gap that human correction can close efficiently at scale. A reviewer correcting 80% accurate captions on a 30-minute technical training module spends more time than the module took to record. The gap is closed by a vocabulary glossary: a mapping of the phoneme sequences your ASR engine fails on to the correct domain-specific terms, applied at the post-processing layer before the caption file is delivered. Every session your L&D coordinator spends correcting “cornerstone on demand” to “Cornerstone OnDemand,” or “hippo” to “HIPAA,” or some phonetic approximation of your product name back to the real name, is a session that a vocabulary glossary would have made unnecessary on every subsequent module.

GlossCap’s per-customer glossary is built from your organization’s terminology sources: your LMS course catalogue (product names and system names that appear in course titles), your internal documentation (acronym dictionaries, policy manuals, procedure guides), your product changelog and release notes (feature names in their correct spelling and capitalization), and your historical caption correction log (patterns of errors that have required correction on past modules). A 300-term glossary covering your organization’s core vocabulary raises accuracy from the 73–85% LMS native baseline on technical content to 97–99% — without model fine-tuning, without manual correction sessions, and without the LMS platform knowing anything about your vocabulary. The glossary updates on your content production cadence: when you launch a new product feature, add the feature name to the glossary once, and every subsequent module that mentions the feature receives a correct caption. Compare the approach at Rev vs GlossCap and 3Play vs GlossCap, see glossary-corrected caption output on technical training content in the embed widget, or start with the Team plan at $99/month which includes per-customer glossary configuration, LMS sidecar delivery (SRT and VTT), Cornerstone, Workday, Docebo, TalentLMS, Canvas, Blackboard, and Brightspace integrations, and a DCMP-protocol accuracy report on the first batch of your content.

Replace your LMS auto-captions with a glossary-corrected pipeline