Inclusion Operations · Published 2026-06-21
Captioning DEI training programmes: inclusive language vocabulary, pronoun representation, and the glossary architecture for culturally sensitive content
DEI training video carries the same WCAG 2.1 AA captioning obligation as any other mandatory workplace training distributed through an employer-controlled channel. What makes DEI training different — the reason it merits a dedicated captioning playbook — is not the subject matter itself but the vocabulary accuracy profile it presents to automatic speech recognition systems. DEI training video contains a specific set of vocabulary categories that combine in ways not found in any other L&D content type: community identity acronyms coined after most ASR training corpora were assembled, pronoun constructions that post-processing tools may actively degrade, register shifts between formal professional speech and vernacular language varieties that ASR handles unevenly, cultural and geographic proper nouns from communities underrepresented in training data, and a rapidly growing ecosystem of DEI software product names that have entered the market recently enough to be absent from every major ASR model's training distribution. The consequence is that DEI training captions, when produced using a standard captioning workflow without DEI-specific vocabulary preparation, will have higher error rates than the same organisation's technical training captions — and the errors will fall disproportionately on the terms that carry the most meaning in the content.
The vocabulary profile of DEI training is distinct from both general corporate content and technical training content in a specific structural way: the high-failure-rate terms in DEI content are not arbitrary technical jargon that a learner might reasonably interpret from context. They are identity terms (names for communities, pronouns for individuals, cultural practices and events) where a mistranscription is not merely a comprehension problem but a representation failure. When a speaker says "BIPOC" and the caption reads "by poke," the learner reading the caption receives a phonetic approximation of an identity acronym that the speaker used with specific precision. When a speaker's pronoun introduction is transcribed with the wrong pronouns, training content designed to model inclusive behaviour demonstrates the opposite through its accessibility layer. When a tribal nation name is phonetically approximated, the cultural specificity the speaker intended is erased in the text version of the content. These error types differ from a drug name misspelled in a healthcare training caption or a MITRE ATT&CK technique ID garbled in a cybersecurity training caption: they are errors in the identity vocabulary of the people the training is designed to discuss, and they carry a signal about organisational competence in that domain that is hard to ignore.
This post is written for L&D directors and accessibility coordinators who have completed the initial phase of their caption compliance programme — technical training, compliance training, onboarding content are captioned and performing at or near the WCAG 2.1 AA 99% accuracy target — and are now rolling out captioning for their DEI training library, or auditing existing DEI captions for the first time. It covers: what makes DEI vocabulary specifically challenging for ASR (not DEI ideology — vocabulary accuracy mechanics); which inclusive language terms reliably pass current ASR and which consistently fail; how pronoun representation works and fails in caption workflows; code-switching and register variation accuracy gaps; cultural and geographic specificity; the growing DEI software product name vocabulary; glossary architecture decisions for DEI content; QA protocols adapted for DEI-specific failure modes; the compliance framework; eight failure modes; and seven frequently asked questions. The internal links throughout the post connect to the broader GlossCap playbook where specific topics — glossary architecture, QA methodology, compliance frameworks — are covered in depth.
TL;DR — five vocabulary failure categories in DEI training captions
- Community identity acronyms (BIPOC, LGBTQIA+, AAPI, MENA): these are the highest-failure-rate vocabulary items in DEI content. They are formed from letters that correspond to words in the DEI domain, not to commonly co-occurring words in ASR training corpora, and they were coined after most ASR training distributions were assembled. Every community identity acronym used in a DEI training programme must be added to the caption glossary before production starts.
- Pronoun constructions — especially neopronouns: singular they/them is transcribed accurately by current ASR but can be degraded by post-processing grammar normalisation tools. Neopronouns (ze/zir, xe/xem, ey/em) have near-zero representation in ASR training data and will be phonetically approximated — "sir" for "zir," "hem" for "xem," "a" for "ey" — unless handled through glossary injection or human transcription for the relevant segments.
- Code-switching accuracy gaps: DEI content frequently features speakers shifting between formal professional speech and vernacular language varieties. Research and ongoing evaluation of Whisper-class models show persistent accuracy gaps on African American English features compared to Standard American English. This creates an equity-within-equity problem: the content designed to address racial equity may deliver a lower-accuracy caption experience for content featuring Black speakers using AAE features.
- Cultural and geographic specificity: tribal nation names, diaspora community vocabulary, religious community terms, and historical event references that are underrepresented in ASR training data require glossary support. The failure mode here is phonetic approximation that may significantly distort the name of the community being referenced.
- DEI software product names: the DEI tech stack (CultureAmp, Textio, Workday Peakon, Leapsome, Betterworks, 15Five, Diversio, Seramount) has been built largely after major ASR training corpora were assembled. These names fail at rates comparable to other recently coined software product names, with compound-word splitting and phonetic approximation as the dominant failure modes.
What makes DEI training different for captioning
DEI training video has the same captioning obligation as any other mandatory workplace training under ADA Title I. If a DEI module (unconscious bias training, allyship fundamentals, inclusive leadership, belonging at work, anti-discrimination code of conduct) is distributed to employees through an employer-controlled LMS as part of mandatory or strongly encouraged professional development, the WCAG 2.1 AA captioning requirement applies, along with the 99% accuracy target this playbook uses to measure it, on the same terms as for compliance training, technical training, or any other pre-recorded video content. There is no DEI exemption from caption accuracy requirements, and there is no mechanism by which the content subject matter modifies the accessibility obligation. What makes DEI training different from other content categories is not its legal treatment. It is the specific vocabulary accuracy challenges it presents to automatic speech recognition systems and the specific consequences when those challenges are unaddressed.
Five distinct vocabulary characteristics of DEI training content
DEI training content has five vocabulary characteristics that combine to create an accuracy profile distinct from any other L&D content type.
1. Inclusive language terminology with variable ASR coverage. DEI training vocabulary includes a range of inclusive language terms with dramatically different ASR performance profiles. Established terms — "intersectionality," "microaggression," "allyship," "implicit bias," "systemic racism" — are well-represented in recent ASR training corpora and perform reliably. But a specific set of community identity acronyms — "BIPOC," "LGBTQIA+," "AAPI," "MENA" — were coined after most ASR training distributions were assembled, are formed from letters corresponding to DEI-domain words rather than commonly co-occurring speech patterns, and fail at high rates in current models. The L&D team reviewing a DEI training caption file may not immediately identify these failures because the phonetic output ("by poke," "LGBTQ eye a") passes a casual reading but fails a precision accuracy check.
2. Pronoun patterns outside standard ASR training distributions. DEI panel discussions, speaker introductions, and facilitated discussions frequently include pronoun introductions ("my pronouns are they/them"), singular they/them usage, and, where featured speakers use them, neopronouns (ze/zir, xe/xem, ey/em). Singular they/them is transcribed accurately by current ASR but can be corrupted by post-processing grammar normalisation tools that "correct" it. Neopronouns have near-zero representation in any ASR training corpus and will be phonetically approximated to the nearest common English word ("sir," "zero," "hem," "hey"), producing caption text that is both inaccurate and potentially offensive in context.
3. Code-switching between language registers. DEI training content, particularly facilitated discussions, panel conversations, and testimonial-format modules, frequently features speakers who shift between formal professional speech and vernacular language varieties within the same content or within the same sentence. This register variation is a feature of authentic speech from many communities, not a defect to be corrected. ASR systems are trained primarily on formal speech corpora, and their accuracy on vernacular varieties — particularly African American English — is systematically lower than their accuracy on Standard American English. This differential accuracy creates a specific equity concern for DEI captioning: the training content designed to address racial equity may itself deliver a less accurate caption experience for content featuring Black speakers using AAE features.
4. Cultural and geographic specificity underrepresented in ASR training data. DEI training content frequently references cultural communities, geographic communities, historical events, and cultural practices whose names may not appear in ASR training distributions with sufficient frequency to produce accurate transcriptions. Tribal nation names (Haudenosaunee, Anishinaabe, Mvskoke, Diné), diaspora community vocabulary (Afro-Latinx, Filipinx), religious holiday names (Nowruz, Vesak), and LGBTQ+ cultural vocabulary (two-spirit) all have variable accuracy profiles that require pre-production assessment and glossary support.
5. DEI programme product names absent from ASR training distributions. The DEI technology ecosystem — engagement survey platforms, job description bias analysis tools, pulse survey tools, pay equity analytics, inclusion measurement platforms — has been built largely over the past five to eight years, after the training corpora underlying most current ASR models were assembled. Product names like CultureAmp, Textio, Workday Peakon, Leapsome, Betterworks, 15Five, Diversio, and Seramount will be unknown to most ASR models and will be rendered as phonetic approximations ("culture camp," "text I.O.," "peak on," "Sarah mount") that are comprehensible to no one.
Why the accuracy consequences are different from technical content
In technical training content — software product training, cybersecurity awareness, compliance regulation review — ASR errors on technical terms are comprehension failures. When "CVE-2024-21762" is misrendered, a learner may not understand the specific vulnerability being discussed. That is a significant accuracy problem, but the error falls on a technical identifier that most learners are encountering for the first time and cannot self-correct from context.
In DEI training content, ASR errors on vocabulary terms are frequently representation failures. Even a generally reliable term can fail: when a speaker says "intersectionality" and the caption reads "intersection ality" (a word-boundary error), the technical precision of a conceptual term that Kimberlé Crenshaw developed with specific analytical intent is lost. When a speaker's pronoun introduction ("my pronouns are ze/zir") is transcribed as "my pronouns are see sir" or "my pronouns are C, zero," the training content that was designed to model inclusive pronoun practice demonstrates, through its accessibility layer, exactly the failure mode it was designed to address: a person's pronoun being replaced with something that is not their pronoun. When a tribal nation name is reduced to a rough phonetic approximation, the cultural specificity of that community's self-identification is erased in the text version of content that was designed to support cultural awareness. The errors are not neutral. They occur on the most meaning-bearing terms in the content, and they carry an implicit signal about whether the organisation's DEI commitment extends to its caption production practices.
Who this post is for
The intended reader is an L&D director or accessibility coordinator at an organisation that has completed the initial phase of its caption compliance programme — technical training, compliance training, onboarding, and sales enablement content are captioned and meeting WCAG 2.1 AA accuracy targets — and is now extending captioning to the DEI training library. Many organisations captioned their high-volume, high-visibility training content first and are now reaching the DEI library as part of a second-phase rollout. Others are conducting a full LMS caption audit that has surfaced accuracy problems in existing DEI captions. Either path leads to the same set of technical questions: which DEI vocabulary items need glossary support, how do pronoun patterns work in captioning workflows, what does QA look like for DEI-specific failure modes, and how does the glossary architecture for DEI content fit with the existing caption infrastructure.
Inclusive language vocabulary: what breaks and what doesn't
The most important operational insight for DEI captioning is that inclusive language terminology does not have a uniform ASR accuracy profile. A significant portion of the DEI vocabulary canon performs reliably in current ASR models — Whisper-large and its descendants, as well as competing foundation-model-based ASR systems — because these terms are established enough to have substantial representation in the training corpora assembled over the past several years. Another portion fails consistently, for specific structural reasons. Understanding which terms fall into which category allows an L&D team to focus glossary and QA resources on actual failure points rather than attempting to add the entire DEI vocabulary to the caption glossary, which would create latency overhead with no accuracy benefit for terms that already transcribe correctly.
For a broader analysis of how vocabulary characteristics predict ASR failure rates across content categories, see the Whisper accuracy benchmarks by vertical post, which covers the methodology for predicting per-vocabulary-category failure rates from training corpus representation data.
Terms that perform well in current ASR
The following DEI terms are well-represented in recent ASR training corpora and transcribe accurately without glossary support in current models. Do not add these to the caption glossary — unnecessary glossary entries create substitution risk and processing overhead with no accuracy benefit.
- Core DEI terminology: "diversity," "equity," "inclusion," "belonging" — extremely high frequency across all content types; robust in any ASR model. "DEI" as a three-letter acronym generally performs well, rendered as "DEI" or "D-E-I" consistently.
- Concept vocabulary: "allyship," "microaggression," "intersectionality," "privilege," "implicit bias," "unconscious bias," "bystander intervention," "psychological safety" — all well-established in training content vocabulary and well-represented in recent corpora. These terms entered mainstream professional vocabulary in the 2010s with sufficient lead time to appear in ASR training data.
- Structural discrimination vocabulary: "systemic racism," "structural racism," "institutional racism," "systemic bias" — high frequency in professional and media content since 2015; robust in current models.
- Identity-based discrimination terms: "ableism," "sexism," "racism," "homophobia," "transphobia," "xenophobia," "antisemitism" — all well-represented in current models; no glossary support needed.
- Workplace conduct vocabulary: "harassment," "discrimination," "retaliation," "accommodation," "reasonable accommodation" — these appear in both compliance training and DEI training content and are thoroughly represented in legal and HR content corpora.
- LGBTQ+ established vocabulary: "lesbian," "gay," "bisexual," "transgender," "queer" (as identity term), "nonbinary," "gender identity," "sexual orientation." One important exception: "cisgender" is established vocabulary but does not transcribe reliably, because "cis-" prefix constructions are variable; see the failing terms list below.
Terms that consistently fail in current ASR
The following terms have documented high failure rates and require glossary entries before DEI training content is submitted for captioning. For each term, the common ASR output is specified — these are not hypothetical; they represent the actual outputs produced by Whisper-large and comparable models on DEI training audio.
BIPOC (Black, Indigenous, and People of Color): This acronym has an ASR failure rate that is among the highest for any DEI term. Common outputs include "by poke," "bipock," "B-I-P-O-C" (spelled out as individual letters), and "be pock." The phonetic sequence /baɪpɒk/ maps poorly to any common English word, and the acronym's emergence in mainstream usage circa 2020 means it postdates most ASR training corpora. The glossary entry should specify the canonical form "BIPOC" and, where the captioning system supports it, provide the phonetic hint /baɪpɒk/.
LGBTQIA+: The full expansion including QIA and the plus sign almost never renders correctly. Common outputs include "LGBTQ" (truncated), "LGBT QIA" (split), "LGBTQ I A" (individual letter rendering), and "LGBTQ plus" (partial with explicit plus). The glossary canonical form should be "LGBTQIA+" and the L&D team's style guide should specify whether "LGBTQ+" (shorter form) is acceptable in contexts where the full expansion is not used by the speaker.
Cisgender: Although "cisgender" is established LGBTQ+ vocabulary, the "cis-" prefix creates consistent transcription problems. Common ASR outputs include "this gender," "this center," and "cis gender" (a word-boundary error splitting the compound). The root issue is that "cis-" as a gender-identity prefix is underrepresented in ASR training data relative to the same syllable in other contexts (cistern, cisternae, cis-regulatory). The word "cisgender" should be in the DEI glossary. Note that "cisnormative," "cissexism," and "cis-het" will also fail if they appear in the content.
Latinx: A phonetically unusual construction — the "-x" suffix applied to a Spanish loanword — that renders as "Latin ex," "Latino X," "Latin X," or "latinks." The word has also been partly superseded by "Latine" in some usage contexts (see below), which introduces additional transcription variability because both forms now appear in DEI content. The glossary entry for "Latinx" should specify canonical capitalisation ("Latinx," not "LatinX") and the L&D team's style guide should clarify whether the org's current usage preference is "Latinx," "Latine," "Latino/a," or context-dependent.
Latine: An even more recent term than Latinx, used in some Spanish-speaking communities as a gender-neutral alternative to Latino/Latina that uses the "-e" ending rather than the "-x" ending (which is not native to Spanish phonology). Current ASR output for "Latine" is typically "Latin," "la teen," or "la teen eh." This term requires a glossary entry and also requires that the QA reviewer be aware that it is a distinct term from "Latina," "Latino," and "Latinx."
Neurodiverse / Neurodivergent: These compound words are frequently split at the word boundary: "neuro diverse" and "neuro divergent" rather than "neurodiverse" and "neurodivergent." The word-boundary error matters because it changes searchability (a learner searching the transcript for "neurodiverse" will not find "neuro diverse"), affects how screen readers render the word, and may affect downstream processing in learning management systems that parse caption text for metadata. Both terms should be in the glossary.
Desi: As a community self-identification term for people of South Asian origin, "Desi" is rendered by current ASR as "daisy," "desi" (occasionally correct), or "D.C." (heard as two letter names). The phonetic sequence /ˈdeɪsi/ differs from the name Daisy (/ˈdeɪzi/) only in the voicing of one consonant, and Daisy is far more common in ASR training data, so reliable disambiguation is unlikely without context injection via glossary.
AAPI (Asian American and Pacific Islander): Rendered as "A-A-P-I" (spelled out), "happy" (a compressed phonetic approximation of /eɪeɪpiːaɪ/), or "a API" (treating the last three letters as the API acronym). The glossary entry for "AAPI" should specify the canonical form and expansion.
MENA (Middle East and North Africa): Rendered as "mean a," "Mina" (a proper name), or "M.E.N.A." (individual letters). The four-letter sequence has multiple phonetic interpretations and low frequency in ASR training data with the DEI meaning.
AuDHD (autism + ADHD co-occurrence): A very recent coinage — the term has entered mainstream DEI and disability community vocabulary only in the last three to four years — that is not present in most ASR training distributions. Common outputs: "odd HD," "OD HD," "audHD." If this term appears in the DEI training content (particularly in content addressing neurodiversity and disability inclusion), it must be in the glossary.
AFAB/AMAB (Assigned Female At Birth / Assigned Male At Birth): These terms, used primarily in DEI content addressing transgender and intersex identity, are rendered as phonetic guesses. The four-letter acronym sequences are not in most ASR training distributions and will produce unpredictable outputs. Human transcription for segments containing these terms is a viable alternative to glossary injection where the captioning vendor's glossary system does not reliably handle rare acronyms.
Neopronouns — ze/zir, xe/xem, ey/em: These are covered in detail in the pronoun representation section below. Summary failure modes: "ze" → "zee," "C," or "B"; "zir" → "sir," "zero," "seer"; "xe" → "zee," "he," "C"; "xem" → "them," "hem," "stem"; "ey" → "they," "a," "hey"; "em" → "him," "them," "am." Every neopronoun used by a featured speaker must be addressed before production through glossary injection or human transcription.
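The mis-renderings listed above can be collected into a correction table that doubles as a QA search list. The following is a minimal sketch, not a description of any vendor's glossary system: `GLOSSARY`, `build_rules`, and `apply_glossary` are hypothetical names, and post-hoc text substitution is assumed here as a stand-in for whatever glossary injection the captioning tool actually supports. Ambiguous outputs that are also ordinary words ("happy," "Mina," "daisy") are deliberately left out, because blind replacement would corrupt correct sentences; those need review against the audio.

```python
import re

# Hypothetical glossary: canonical form -> mis-renderings quoted in this post.
# Ambiguous outputs such as "happy", "Mina", or "daisy" are left out on purpose:
# replacing them blindly would corrupt ordinary sentences.
GLOSSARY = {
    "BIPOC": ["by poke", "bipock", "be pock"],
    "Latinx": ["Latin ex", "Latino X", "Latin X", "latinks"],
    "neurodiverse": ["neuro diverse"],
    "neurodivergent": ["neuro divergent"],
    "AuDHD": ["odd HD", "OD HD"],
}


def build_rules(glossary):
    """Compile one case-insensitive, whole-phrase pattern per variant."""
    rules = []
    for canonical, variants in glossary.items():
        for variant in variants:
            # Lookarounds keep the match from starting or ending mid-word.
            pattern = re.compile(
                r"(?<!\w)" + re.escape(variant) + r"(?!\w)", re.IGNORECASE
            )
            rules.append((pattern, canonical))
    return rules


def apply_glossary(text, rules):
    """Return corrected text plus a (canonical, count) log for QA review."""
    log = []
    for pattern, canonical in rules:
        text, count = pattern.subn(lambda _m, c=canonical: c, text)
        if count:
            log.append((canonical, count))
    return text, log


rules = build_rules(GLOSSARY)
fixed, log = apply_glossary(
    "Our by poke and Latin ex colleagues include neuro divergent staff.", rules
)
print(fixed)
print(log)
```

The returned log matters as much as the corrected text: every substitution is a place where the reviewer should confirm against the audio that the speaker actually used the canonical term.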
Terms with variable performance
The following terms have inconsistent ASR performance — accurate in some contexts, inaccurate in others — which makes them harder to address with a simple glossary add/don't-add decision.
DEIB (Diversity, Equity, Inclusion, and Belonging): "DEI" performs well; the extension to "DEIB" is less consistent, rendered as "D-E-I-B" (sometimes correctly), "de-ib," or "D.E.I.B." If the organisation's programme uses "DEIB" as its programme acronym, the glossary entry with canonical form is recommended.
JEDI (Justice, Equity, Diversity, Inclusion): The spoken form /ˈdʒɛdaɪ/ is identical to the Star Wars term, which is extremely high-frequency in ASR training data. Without context injection, "JEDI" in a DEI training context will be rendered as "Jedi" (Star Wars capitalisation). When the organisation's DEI programme uses "JEDI" as its acronym, the glossary entry with the capitalisation and expansion "JEDI (Justice, Equity, Diversity, Inclusion)" disambiguates from the Star Wars reference.
ERG (Employee Resource Group): Rendered as "erg" (the physics unit, as in ergs of energy) or "E-R-G" (individual letter spelling). The physics unit "erg" is more common in text corpora than the DEI programme acronym, making the physics interpretation the default. The glossary entry for "ERG" with the expansion prevents the physics substitution in DEI training content. Similarly, BRG (Business Resource Group) renders as "B-R-G" or "brig" (nautical jail) and should be in the glossary.
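Because terms like "JEDI" and "ERG" collide with legitimate words, one way to handle them is to scope the glossary entry to DEI content rather than applying it to every captioning job. The sketch below assumes a hypothetical `SCOPED_GLOSSARY` structure and an `apply_scoped` helper keyed on a content-type label; neither reflects a specific captioning product.

```python
import re

# Hypothetical scoped entries: each applies only to the listed content types,
# so a physics lecture keeps "erg" while a DEI module gets "ERG".
SCOPED_GLOSSARY = [
    {"heard": "jedi", "canonical": "JEDI", "scopes": {"dei"}},
    {"heard": "erg", "canonical": "ERG", "scopes": {"dei"}},
    {"heard": "ergs", "canonical": "ERGs", "scopes": {"dei"}},
    {"heard": "brig", "canonical": "BRG", "scopes": {"dei"}},
]


def apply_scoped(text, content_type):
    """Apply only the entries whose scope includes this content type."""
    for entry in SCOPED_GLOSSARY:
        if content_type not in entry["scopes"]:
            continue
        pattern = re.compile(
            r"\b" + re.escape(entry["heard"]) + r"\b", re.IGNORECASE
        )
        text = pattern.sub(entry["canonical"], text)
    return text


print(apply_scoped("The Jedi council sponsors every erg.", "dei"))
print(apply_scoped("One erg equals 1e-7 joules.", "physics"))
```

Scoping keeps the variable-performance terms out of the organisation-wide glossary, where they would introduce the substitution risk described earlier for unnecessary entries.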
The structural pattern underlying the consistently failing terms: acronyms formed from community or identity names (BIPOC, LGBTQIA+, AAPI, MENA) consistently underperform because the letters correspond to words in the DEI vocabulary domain rather than to commonly co-occurring words in ASR training corpora. The model has no basis for predicting that the spoken sequence /baɪpɒk/ should render as "BIPOC" rather than as phonetically similar common words. This is the same structural failure mode that produces CVE identifier errors in cybersecurity training and chemical CAS number errors in OSHA training: identifier-class tokens that exist outside normal linguistic prediction. For DEI content, the remedy is the same as for those content categories: pre-production glossary population before the first captioning job is submitted. See the proper noun failure modes post for the full taxonomy of identifier-class vocabulary failures across content categories.
Pronoun representation in caption text
Pronoun representation in DEI training captions involves three distinct technical challenges that require different operational responses: singular they/them transcription, pronoun introductions in panel settings, and neopronouns. Each challenge has a different root cause and a different solution.
Singular they/them: the transcription is accurate; the risk is post-processing
Singular "they" when used as a third-person pronoun is grammatically standard in contemporary English and has been used in singular contexts for centuries — the Oxford English Dictionary cites singular "they" examples from the fourteenth century, and its use as a gender-neutral pronoun for non-binary people is now documented as standard in the AP Stylebook, the Chicago Manual of Style, and the APA Publication Manual. Current ASR systems transcribe singular "they" correctly because the transcription task is literal: the model outputs the word that appears in the audio. "Alex shared their experience. They told us about their journey." — if that is what the speaker said, current ASR will produce that output. The pronoun itself is phonetically identical regardless of whether it is used in the singular or plural sense, so there is no acoustic difference for the model to handle differently.
The transcription is accurate. The risk is in post-processing. Two post-processing failure modes corrupt accurate singular they/them transcriptions after ASR has completed its work:
Grammar correction tools applied after ASR: Any captioning workflow that includes an automated grammar normalisation step risks "correcting" accurate singular they/them constructions. Grammar-checking APIs, word processors used in manual review workflows (Microsoft Word's grammar checker, for example, has flagged singular "they" as a potential error in some versions), and some CAT (computer-assisted translation) tools used in multilingual captioning workflows may suggest changing "they shared their experience" to "he or she shared their experience" or to "he shared his experience" when the software attempts to resolve what it interprets as an antecedent-agreement error. The result is a caption that says something the speaker did not say, involving pronouns the speaker did not use, for a person the speaker was not describing in those terms. Before processing any DEI training content, audit the captioning workflow for any grammar normalisation step and confirm that singular they/them is excluded from that step. If the grammar correction tool does not support selective rule disabling, the grammar correction step should be removed from the DEI content workflow entirely.
Manual reviewer bias: A caption reviewer who is unfamiliar with singular they/them as grammatically correct — or who is aware of it but applies a style guide that does not endorse it — may "correct" accurate caption text during the manual review stage. The QA reviewer training for DEI content must explicitly establish singular they/them as grammatically correct standard English that should not be modified in caption review. The training should include specific examples so that reviewers recognise the construction and understand the organisational policy: accurate singular they/them in caption text is not an error to be fixed; it is an accurate transcription that must be preserved.
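Both failure modes above change pronouns after ASR has finished, so a simple guard is to compare pronoun counts in the raw ASR text against the text that leaves the post-processing or review stage. The following is a rough sketch under that assumption; `PRONOUNS`, `pronoun_counts`, and `pronoun_drift` are illustrative names, and a non-empty result is a prompt for human review, not proof of an error.

```python
import re
from collections import Counter

# Pronoun forms to track; extend with any neopronouns used by featured speakers.
PRONOUNS = {
    "they", "them", "their", "theirs", "themself", "themselves",
    "he", "him", "his", "she", "her", "hers",
}


def pronoun_counts(text):
    """Count tracked pronoun tokens, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t in PRONOUNS)


def pronoun_drift(raw_asr, post_processed):
    """Return {pronoun: change in count} for every pronoun whose count moved."""
    before = pronoun_counts(raw_asr)
    after = pronoun_counts(post_processed)
    return {
        p: after[p] - before[p]
        for p in sorted(set(before) | set(after))
        if after[p] != before[p]
    }


raw = "Alex shared their experience. They told us about their journey."
edited = "Alex shared his experience. He told us about his journey."
print(pronoun_drift(raw, edited))
```

An empty dictionary means the pronoun inventory survived the step unchanged. Any drift on a DEI file is worth a look, because a legitimate caption edit rarely needs to swap one pronoun for another.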
Pronoun introductions in panel settings
DEI panel discussions, facilitated workshops, and community listening sessions frequently include pronoun introductions as part of the speaker's self-introduction: "My name is Jordan Collins, and my pronouns are they/them." The ASR challenge in this sentence is not "my name is Jordan Collins" (ASR handles common names reliably) and not "my pronouns are" (standard phrase, well-represented in training data). The challenge is the pronoun pair itself.
The pair "they/them," spoken aloud, sounds like "they them" or "they, them" (with a brief pause between the two words). There is no audible slash. The ASR model must decide how to render two words that the speaker said in a specific sequence, in a context that implies they are a linked pair. Common outputs include "they them" (no separator), "they, them" (comma separator), and occasionally "they/them" (forward slash, the correct conventional form) when the model has seen this construction frequently enough in training data. The organisation's caption style guide should specify the canonical rendering for pronoun pairs: "they/them" with a forward slash is the most widely used convention in professional DEI content, and specifying it in the style guide ensures consistent rendering across all DEI captions regardless of what the ASR model produced. The QA step for DEI captions includes confirming that all pronoun pair constructions use the canonical separator.
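The separator rule lends itself to a small normalisation pass. The sketch below is one possible approach, assuming the style guide chooses the forward slash and that pairs should be rewritten only when they directly follow the phrase "pronouns are"; `PRONOUN_PAIRS` and `normalise_pronoun_pairs` are hypothetical names. Restricting the match to the introduction phrase is a deliberate choice so that ordinary sentences such as "They told them the plan" are left alone.

```python
import re

# Subject/object pairs to normalise; extend to match the caption style guide.
PRONOUN_PAIRS = [
    ("they", "them"), ("she", "her"), ("he", "him"),
    ("ze", "zir"), ("xe", "xem"), ("ey", "em"),
]


def normalise_pronoun_pairs(text):
    """Rewrite 'pronouns are X Y' / 'X, Y' as 'X/Y', keeping the original casing."""
    for subject, obj in PRONOUN_PAIRS:
        pattern = re.compile(
            rf"(\bpronouns\s+are\s+)({subject})(?:\s*[,/]\s*|\s+)({obj})\b",
            re.IGNORECASE,
        )
        text = pattern.sub(
            lambda m: f"{m.group(1)}{m.group(2)}/{m.group(3)}", text
        )
    return text


print(normalise_pronoun_pairs("My name is Jordan Collins, and my pronouns are they, them."))
```

Introductions phrased differently ("I use she and her") fall outside this pattern and still need the manual QA check.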
A secondary challenge in multi-speaker DEI panel content: when the same pronoun pair appears multiple times across different speakers' introductions, the QA reviewer must verify that each instance is associated with the correct speaker in the speaker-labelled caption file. Auto-speaker-diarization systems identify speakers by acoustic features, not by name or pronoun. If the caption file uses speaker labels ("SPEAKER 1:", "JORDAN:", etc.), those labels must be assigned correctly to each speaker's audio, and the label assignment should not infer a pronoun from the speaker's name. "Jordan" is a gender-neutral name; the label "JORDAN:" does not imply any pronoun, and the QA step should confirm that no speaker label in a DEI panel caption file is being used to infer pronoun assignment where none was made by the speaker.
Neopronouns: the highest-failure-rate vocabulary category in DEI captions
Neopronouns (gender-neutral pronoun sets other than the standard "he/him," "she/her," or "they/them") are the highest-failure-rate vocabulary category in DEI captions. The most commonly encountered neopronouns in DEI training content are ze/zir (sometimes ze/hir), xe/xem, and ey/em. All of them share a structural problem that cannot be addressed through context-based prediction: they are short, phonetically simple words that sound identical or nearly identical to other common English words.
- "Ze" is /ziː/, identical to the American English name of the letter Z and acoustically close to "see." Common ASR outputs: "zee," "C," "B," "the."
- "Zir" is typically /zɪər/ or /zɜːr/, very close to "sir" (/sɜːr/); in the second pronunciation only the voicing of the initial fricative differs. Common ASR outputs: "sir," "zero," "seer," "stir."
- "Xe" is /ziː/, phonetically identical to "ze" in most usage, so both pronouns produce the same set of ASR approximations.
- "Xem" is /zɛm/ or /hɛm/ depending on the speaker's pronunciation. Common ASR outputs: "them," "hem," "stem," "gem."
- "Ey" is /eɪ/, identical to the name of the letter A, or to "hey" without the h. Common ASR outputs: "they," "a," "hey," "A."
- "Em" is /ɛm/, identical to the name of the letter M and close to "him" without the h. Common ASR outputs: "him," "them," "am," "m."
The only reliable solution for neopronoun transcription is pre-production intervention. There are two viable approaches: glossary injection and human transcription for neopronoun segments. Glossary injection requires that the captioning vendor's glossary system support short phonetically ambiguous entries — not all glossary implementations can reliably distinguish /ziː/ as "ze" rather than "zee" in context, because the disambiguation requires either phonetic context or semantic context that may not be available to the substitution system. Before relying on glossary injection for neopronouns, test the vendor's glossary system with a sample clip featuring the specific neopronouns used by featured speakers in the DEI content. If the glossary system cannot reliably produce the correct output, human transcription for segments where neopronouns appear is the operationally safer choice.
The pre-production protocol for DEI content featuring neopronoun-using speakers:
- Identify all speakers whose pronouns are neopronouns before production starts — not after the caption file comes back with errors.
- Consult with the DEI team on the canonical spelling for each neopronoun set used (ze/zir vs. ze/hir; xe/xem vs. xyr/xym).
- Add glossary entries for each neopronoun with the canonical form, or specify human transcription for segments where they appear.
- Include an explicit neopronoun accuracy check in the QA step — verify each instance of each neopronoun against the audio.
- Add the neopronoun forms to the caption style guide so that future DEI content is handled consistently.
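The accuracy check in the protocol above can be partly supported by a text scan that tells the reviewer where to listen. The following is a minimal sketch, not a vendor feature: the function name `review_list`, the cue format, and the `TOO_COMMON` exclusion list are assumptions made for illustration. The confusion sets are taken from the ASR outputs listed earlier. Confusions that are very common words ("the," "them") are deliberately excluded, because flagging them would flag nearly every cue; those can only be caught by checking the audio.

```python
import re

# ASR confusions for each neopronoun, taken from the outputs listed in the text.
CONFUSIONS = {
    "ze": ["zee", "c", "b", "the"],
    "zir": ["sir", "zero", "seer", "stir"],
    "xem": ["them", "hem", "stem", "gem"],
}
# Assumption: words too frequent to be useful as warning signs.
TOO_COMMON = {"the", "them", "they", "a", "him", "am", "c", "b", "m"}

def review_list(cues, pronouns_in_use):
    """Return (timestamp, matched_words) for cues a reviewer should check against audio.

    cues: list of (timestamp, caption_text) tuples.
    pronouns_in_use: neopronouns confirmed with the speakers before production.
    """
    watch = set()
    for pronoun in pronouns_in_use:
        watch.add(pronoun)  # correct renderings still get verified
        watch.update(w for w in CONFUSIONS.get(pronoun, []) if w not in TOO_COMMON)
    flagged = []
    for timestamp, text in cues:
        words = set(re.findall(r"[a-z']+", text.lower()))
        hits = sorted(words & watch)
        if hits:
            flagged.append((timestamp, hits))
    return flagged

cues = [
    ("00:01:12", "Alex uses ze and zir pronouns."),
    ("00:04:40", "I asked sir for feedback on the draft."),
    ("00:06:03", "The panel starts at noon."),
]
flagged = review_list(cues, ["ze", "zir"])
```

Here the first cue is flagged so the correct rendering can be confirmed, the second because "sir" is a known confusion for "zir," and the third is not flagged. The scan narrows the reviewer's listening; it does not replace it.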
The glossary architecture post covers the technical implementation of phonetically ambiguous glossary entries, including the approach for single-syllable terms where the phonetic form is shared with common English words.
Code-switching and register variation in DEI facilitation
Code-switching refers to the practice of shifting between language varieties, registers, or languages within the same interaction or content segment. In DEI training content, code-switching is most commonly observed as shifts between formal professional speech and vernacular language varieties — particularly African American English (AAE) — within the same speaker's contributions to a panel discussion, facilitated workshop, or testimonial module. The term also covers the use of non-English words and phrases by bilingual speakers in multilingual community contexts: Spanish words used by Latinx speakers, Indigenous language greetings at the opening of a session, Yiddish terms in discussions of Jewish identity, and similar cases where a speaker's authentic expression includes vocabulary from a language other than English.
ASR accuracy and African American English
The most consequential code-switching accuracy challenge for DEI training captioning involves African American English. AAE is a rule-governed dialect of American English with its own phonological, morphological, syntactic, and lexical features that are distinct from Standard American English. Features that appear in DEI training content include: habitual "be" marking an ongoing or habitual state ("they be doing that," meaning "they habitually do that," which is distinct from "they are doing that"); copula deletion ("they late" for "they are late"); specific patterns of vowel production; final consonant cluster reduction; negative concord; and particular intonation patterns. These are features of a well-documented language variety, not errors to be corrected.
Current ASR systems perform with systematically lower accuracy on AAE features than on equivalent content in Standard American English. The 2020 study by Koenecke et al., "Racial Disparities in Automated Speech Recognition," published in the Proceedings of the National Academy of Sciences, found that five major ASR systems had approximately twice the word error rate (WER) for Black speakers as for white speakers in matched samples. The authors traced the gap primarily to the systems' acoustic models, that is, to the phonological, phonetic, and prosodic features of AAE, rather than to its grammatical or lexical features. The study evaluated commercial ASR systems from Amazon, Apple, Google, IBM, and Microsoft as they existed at the time of publication.
Since that study, foundation-model-based ASR systems — particularly OpenAI Whisper and its descendants — have been trained on substantially larger and more diverse corpora, and more recent evaluations have found reduced but persistent accuracy gaps on AAE and other non-prestige English varieties compared to Standard American English. The gap has narrowed; it has not closed. For DEI training captioning, this means that content featuring Black speakers using AAE features will, on average, produce lower-accuracy caption output than equivalent content featuring speakers using Standard American English phonology and grammar. The magnitude of the gap varies by content, by the specific AAE features present, and by audio quality.
The practical implication for DEI programme captioning creates a specific equity concern: the training content whose explicit purpose is to address racial equity and inclusion may itself deliver a less accurate caption experience for learners who are deaf or hard-of-hearing and depend on captions to access content featuring Black speakers. This is not a hypothetical concern — it is a structural consequence of how current ASR systems were trained, and it requires a specific operational response.
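For readers who have not computed it before, the word error rate cited above is the number of word-level substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the number of words in the reference. The sketch below is a plain edit-distance implementation for illustration; the function name `wer` and the lowercase, whitespace-based tokenisation are assumptions, and production scoring tools normalise punctuation and numerals more carefully.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "they be working late every friday"
normalised = "they are working late every friday"   # habitual "be" rewritten
score = wer(reference, normalised)
```

In this example a single substituted word in a six-word reference gives a WER of one in six. The example also shows why the error matters beyond the number: replacing habitual "be" with "are" changes what the speaker said.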
What to do about differential accuracy by speaker
The caption QA methodology post covers the standard DCMP spot-check protocol. For DEI training content, that protocol should be stratified: run the QA spot-check with a sample that represents the full range of speakers in the DEI training library, not just a random sample. If the DEI library includes content featuring speakers with diverse language backgrounds and dialect usage, the QA sample should proportionally represent that diversity so that accuracy can be assessed across the full range of content.
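One way to build the stratified sample described above is to group the library by stratum and draw the same proportion from each, with a floor of one item so that small strata are never skipped. This is a sketch under assumptions: the stratum labels, the `stratified_sample` name, and the dictionary format are hypothetical, and how content is assigned to strata is a decision for the DEI and L&D teams rather than something code can infer.

```python
import random
from collections import defaultdict

def stratified_sample(videos, rate, seed=0):
    """Pick `rate` of each stratum (at least one item) for caption QA.

    videos: list of dicts with an "id" and a "stratum" label.
    Returns {stratum: sorted list of sampled ids}.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    by_stratum = defaultdict(list)
    for video in videos:
        by_stratum[video["stratum"]].append(video["id"])
    sample = {}
    for stratum, ids in by_stratum.items():
        k = max(1, round(len(ids) * rate))
        sample[stratum] = sorted(rng.sample(ids, k))
    return sample

# Hypothetical library: the stratum labels are illustrative only.
videos = (
    [{"id": f"fac-{i}", "stratum": "facilitator_led"} for i in range(40)]
    + [{"id": f"pan-{i}", "stratum": "panel_with_aae"} for i in range(10)]
    + [{"id": f"bil-{i}", "stratum": "bilingual_testimonial"} for i in range(5)]
)
sample = stratified_sample(videos, rate=0.10)
```

With a 10% rate this draws four, one, and one items respectively. A simple random 10% draw from the same library of 55 could easily contain no bilingual testimonial content at all, which is the gap stratification is meant to close.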
If stratified QA reveals a systematic accuracy gap by speaker — content featuring certain speaker groups consistently produces lower accuracy — the next step is investigation before intervention. Audio quality is a significant confounding factor. Content featuring Black speakers that was recorded in less controlled acoustic environments (a community listening session recorded with omnidirectional microphones rather than a studio production with directional lavalier microphones) will have lower accuracy because of the audio quality difference, not because of the speaker's language variety. The first step in investigation is controlling for audio quality. If content with equivalent audio quality shows systematic accuracy differences by speaker, the difference is more likely attributable to ASR training distribution bias.
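Controlling for audio quality, as described above, amounts to comparing speaker groups only within the same audio tier. The sketch below assumes each QA result has already been labelled with an audio tier and a speaker group; the labels, the accuracy figures, and the function names are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def accuracy_by_cell(results):
    """Mean accuracy for each (audio_tier, speaker_group) combination."""
    cells = defaultdict(list)
    for tier, group, accuracy in results:
        cells[(tier, group)].append(accuracy)
    return {cell: mean(values) for cell, values in cells.items()}

def gaps_within_tier(results):
    """Spread between the best and worst group mean, per audio tier."""
    by_tier = defaultdict(dict)
    for (tier, group), value in accuracy_by_cell(results).items():
        by_tier[tier][group] = value
    return {
        tier: max(groups.values()) - min(groups.values())
        for tier, groups in by_tier.items()
        if len(groups) > 1  # a gap needs at least two groups to compare
    }

# Invented figures, for illustration only.
results = [
    ("studio", "group_a", 0.99), ("studio", "group_a", 0.98),
    ("studio", "group_b", 0.96), ("studio", "group_b", 0.95),
    ("field", "group_a", 0.94), ("field", "group_b", 0.93),
]
gaps = gaps_within_tier(results)
```

In this invented data the studio tier shows a gap of about three points between groups even though the recording conditions are equivalent, which is the pattern that points to the ASR model rather than to the microphones.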
The intervention options, in order of increasing cost:
- Enhanced QA for affected content: Identify content with persistent accuracy gaps and apply a higher QA sampling rate — 10% or 20% spot-check rather than the standard 3–5% — and more rigorous correction for any errors found.
- Human review for affected segments: For content where ASR output consistently falls below the 99% accuracy benchmark commonly applied to WCAG 2.1 AA captioning (WCAG itself requires captions but sets no numeric accuracy figure), designate those segments for human review rather than relying on the standard post-ASR correction workflow.
- Vendor escalation: If the accuracy gap is contractually significant — below the SLA threshold for accuracy — escalate with the captioning vendor. Document the specific content, the speaker profile, the audio quality, and the measured accuracy. If the vendor's SLA guarantees 99% accuracy and DEI content featuring certain speaker groups is systematically producing 95% or 96%, the SLA breach is a contractual matter.
For the audio quality dimension, the guidance from the remote and hybrid workforce captioning post applies equally here: the single most impactful improvement in ASR accuracy for any content is audio quality at recording time. DEI training content that is recorded in controlled acoustic environments with quality directional microphones will produce better ASR accuracy for all speakers, which is both a caption quality improvement and an equity improvement. The audio quality investment benefits all speakers, but it specifically reduces the environmental acoustic gap that may otherwise amplify (or be confounded with) the ASR training distribution gap.
Code-switching to non-English languages
DEI training content frequently includes non-English words or phrases used naturally by speakers who are bilingual or who are drawing on a specific cultural or linguistic tradition. Common patterns include: Spanish words and phrases used by Latinx speakers in discussions of cultural identity; Indigenous language greetings (Haudenosaunee: "Sgënö," Diné: "Yá'át'ééh," Anishinaabe: "Aaniin") at the opening of a session or when a speaker introduces themselves; Yiddish terms ("chutzpah," "mensch," "shtetl") in discussions of Jewish identity and culture; and Arabic phrases ("mashallah," "inshallah") in discussions of Muslim identity.
ASR systems trained on English handle non-English audio segments in one of three ways: (a) phonetically approximate the non-English audio as the nearest-sounding English words — producing false transcription that neither matches the audio nor provides intelligible English text; (b) produce garbled or fragmented output; or (c) in the case of Whisper-based systems, attempt language identification for the segment and either switch to transcribing in the detected language (producing correct text in the non-English language, which may not be what the learner needs for an English-language caption track) or insert a marker indicating an unrecognised segment.
None of these default outputs are acceptable for an accessible DEI training caption. The correct treatment for non-English audio segments in otherwise English-language DEI training content depends on the nature and length of the segment. For brief code-switching — a single word or phrase used naturally within an English sentence — the recommended caption treatment is to include the word in the correct spelling of the language of origin with, where appropriate, an English translation or gloss in brackets: "[Diné greeting: Yá'át'ééh]." For longer non-English segments, the content should be handled by a human translator/captioner for those segments, or the multilingual captioning workflow described in the multilingual caption workflow post should be applied at the segment level. The same principle applies here as for any multilingual content: automatic English-language ASR applied to non-English audio produces unusable output, and the production workflow must account for non-English segments before the content is submitted for captioning.
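The bracketed treatment for brief code-switching can be applied mechanically once a human has confirmed the spelling in the caption text. The following sketch assumes a small gloss table supplied by the DEI team; the table contents, the `gloss_brief_phrases` name, and the decision to wrap every occurrence are assumptions for illustration.

```python
import re

# Hypothetical gloss table. Spellings should come from the DEI team or
# community partners, never from raw ASR output.
GLOSSES = {
    "Yá'át'ééh": "Diné greeting",
    "Aaniin": "Anishinaabe greeting",
}

def gloss_brief_phrases(caption, glosses=GLOSSES):
    """Wrap each known phrase as "[label: phrase]" in a caption line."""
    for phrase, label in glosses.items():
        # Lookarounds stop the phrase matching inside a longer word.
        pattern = re.compile(rf"(?<!\w){re.escape(phrase)}(?!\w)")
        caption = pattern.sub(lambda m: f"[{label}: {m.group(0)}]", caption)
    return caption

line = gloss_brief_phrases("Yá'át'ééh, and welcome to the session.")
```

The function is meant to run once over human-verified text; running it twice would wrap the phrase a second time. Whether a gloss is appropriate at all for a given phrase remains an editorial decision.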
Cultural and geographic specificity
DEI training content references cultural communities, geographic communities, historical events, and cultural practices with a frequency and specificity that exceeds most other corporate L&D content types. The accurate transcription of community names, cultural terms, and historical references matters in DEI training for the same reason that accurate pronoun transcription matters: these terms are the names of the communities, traditions, and events that the training is designed to illuminate, and a phonetic approximation or outright mistranscription communicates, through the caption layer, a failure to accurately represent the community being discussed.
Tribal nation and Indigenous community names
Tribal nation names and Indigenous community names present some of the most technically challenging transcription problems in DEI captioning. Many of these names are from languages other than English with phonological features (consonant clusters, vowel qualities, tonal marking, glottal stops) that do not map directly to English phonology and that are entirely absent from English-language ASR training corpora. The standard failure mode is phonetic approximation — the ASR model produces the nearest phonetically plausible English sequence — which may result in output that is significantly different from the correct name.
Specific examples and their common ASR failure modes:
- Haudenosaunee (the name used by the peoples of the Iroquois Confederacy for themselves): a five-syllable word from the Seneca language that does not have a phonetic equivalent in English. Common ASR approximations: "Haw-dee-no-SAW-nee," rendered as multiple separate words ("Ho deno show knee," "how den a sonic," "haw den a sonny"). Best practice: add to the glossary with the phonetic hint /hɔːdɛnɔːˈʃoʊni/ before any DEI training content referencing this community is captioned.
- Anishinaabe (the people of the Ojibwe, Odawa, Potawatomi, and related nations): five syllables. Common ASR approximations: "Anish-in-abe" (with word-boundary errors), "Annie shin abbey," "uh-nish-uh-nah-bee." The terminal "e" is frequently dropped. Glossary entry required.
- Mvskoke (the self-designation of the Muscogee Creek Nation): three syllables in the Muskogean language, pronounced approximately /mʌskoʊɡiː/. Common ASR outputs: "miss-coke," "mask-oh-kee," "miss cookie," "Muskogee" (the anglicised form). The anglicised form "Muskogee" is in ASR training data and is the most common accurate-adjacent output; when "Mvskoke" is the form the speaker uses, the glossary entry should specify "Mvskoke" and note that "Muskogee" is a distinct (anglicised) form.
- Diné (the name the Navajo people use for themselves, meaning "the People"): two syllables, /dɪˈneɪ/. Common ASR outputs: "Dinah" (the name), "duh-nay," "din eh," "den ay." The name "Dinah" is significantly more common in ASR training data, making the correct transcription unlikely without glossary support.
- Lakȟóta (the Sioux Nation self-designation): three syllables. The "ȟ" is a velar (or uvular) fricative that has no direct English equivalent; in spoken English, it is typically pronounced as a plain /k/, producing "La-KO-ta." The approximation "Lakota" (without diacritics) is the standard anglicised spelling and is well-represented in ASR training data. For DEI training content, "Lakota" without diacritics is an acceptable caption rendering if the speaker is using the standard English pronunciation; if the speaker is using the Lakȟóta language pronunciation with the fricative, the QA reviewer should confirm that the caption uses the spelling with diacritics agreed with the DEI team.
Best practice for tribal and Indigenous nation names: compile a complete list of every tribal nation, Indigenous community, and Indigenous language term referenced in the DEI training library and add all of them to the caption glossary before the first content is captioned. The DEI team, in consultation with Indigenous partners or consultants involved in the training design, is the appropriate source for canonical spellings and diacritics.
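Where the captioning vendor's glossary cannot be relied on, a post-ASR pass can map known mis-renderings back to the canonical name. The sketch below uses the approximations listed above as variants; the dictionary layout and the `apply_glossary` name are assumptions, not a vendor format. "Dinah" and "Muskogee" are deliberately left out of the variant lists because both are legitimate words in their own right, so replacing them automatically would introduce new errors; those cases need a human decision.

```python
import re

# Canonical form -> ASR mis-renderings taken from the examples above.
NATION_GLOSSARY = {
    "Haudenosaunee": ["ho deno show knee", "how den a sonic", "haw den a sonny"],
    "Anishinaabe": ["annie shin abbey"],
    "Mvskoke": ["miss cookie", "miss-coke"],
    "Diné": ["din eh", "den ay", "duh-nay"],
}

def apply_glossary(text, glossary=NATION_GLOSSARY):
    """Replace known mis-renderings; return the text and a correction log."""
    corrections = []
    for canonical, variants in glossary.items():
        # Longest variants first, so a short variant cannot clip a longer one.
        for variant in sorted(variants, key=len, reverse=True):
            pattern = re.compile(rf"\b{re.escape(variant)}\b", re.IGNORECASE)
            text, count = pattern.subn(lambda m: canonical, text)
            if count:
                corrections.append((variant, canonical, count))
    return text, corrections

fixed, log = apply_glossary(
    "The how den a sonic Confederacy and the Annie shin abbey nations."
)
```

The correction log gives the QA reviewer a list of every substitution to confirm against the audio, which keeps the pass auditable rather than silent.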
Historical event and place names
Most standard historical event names that appear in DEI training content are well-represented in ASR training data because they have high frequency in news, academic, and cultural content corpora. "Selma," "the Civil Rights Act," "the Voting Rights Act," "the Stonewall Uprising," "the Tulsa Race Massacre," "the Harlem Renaissance," "the March on Washington" — these are in ASR training data with sufficient frequency that they transcribe accurately without glossary support. The failure risk is lower for events with high media frequency than for community names with limited English-language media coverage.
Less commonly referenced historical events and places should be verified before content goes to production: community-specific events, local history references, events that are significant within a community but that have not achieved broad media coverage may not be in ASR training distributions. The pre-production vocabulary audit — reviewing the training script or transcript outline and flagging any proper noun whose ASR performance is uncertain — should include historical references as well as community names and product names.
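A rough first pass at the vocabulary audit described above can be automated by listing capitalised words in the script that are neither in the glossary nor on a list of names already known to transcribe well. This is a crude heuristic offered as a sketch: the `audit_script` name and both input lists are hypothetical, multi-word names are reported word by word, and proper nouns at the start of a sentence are missed, so the output is a starting point for a human reviewer rather than a complete audit.

```python
import re

def audit_script(script, glossary_terms, known_safe):
    """List capitalised mid-sentence words not covered by either list."""
    covered = {t.lower() for t in glossary_terms} | {t.lower() for t in known_safe}
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", script):
        # Skip the first word: a sentence-initial capital says nothing
        # about whether the word is a proper noun.
        for word in sentence.split()[1:]:
            token = word.strip(".,;:!?\"'()")
            if token[:1].isupper() and token.lower() not in covered and token not in flagged:
                flagged.append(token)
    return flagged

script = (
    "Our session covers the Tulsa Race Massacre. "
    "Next we discuss the Greenwood District and the work of the Mvskoke Nation."
)
to_verify = audit_script(
    script,
    glossary_terms=["Mvskoke"],
    known_safe=["Tulsa", "Race", "Massacre", "Nation"],
)
```

In this example "Greenwood" and "District" are returned for verification, while the terms already covered by the glossary or the known-safe list are not.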
Diaspora vocabulary
Terms used by diaspora communities to describe themselves or their cultural context have variable accuracy profiles in current ASR.
- Afro-Latinx / Afro-Latino: The compound form "Afro-Latinx" combines the already-failing "Latinx" with the prefix "Afro-" that is also less common in casual speech. Common output: "Afro Latin ex," "Afro Latino X," "afro latinx" (inconsistent capitalisation). Glossary entry for "Afro-Latinx" and "Afro-Latino" (as distinct terms with different usage contexts) is recommended if these terms appear in the content.
- Filipino/a/x and Filipinx: "Filipino" and "Filipina" are well-represented in ASR training data and generally transcribe correctly. "Filipinx" — the gender-neutral form using the "-x" suffix — is a recent term that will fail in the same pattern as "Latinx": "Filipino X," "Filipina X," or phonetic approximations. Glossary entry required for "Filipinx" if it appears in the content.
- Desi: Covered in the inclusive language vocabulary section above. It is phonetically very close to "Daisy" and likely to be transcribed as such.
Religious community vocabulary
DEI training content that addresses religious diversity and inclusion may include terminology from multiple faith traditions. Most major religious holiday and observance names that have achieved broad English-language media coverage are well-represented in ASR training data.
- Eid al-Fitr / Eid al-Adha: The Arabic article "al-" creates inconsistent rendering. "Eid al-Adha" may be rendered as "Eid ul-Adha," "Eid el-Adha," or with the article dropped entirely ("Eid Adha"). The canonical form in the org's style guide should be specified, and the glossary entry should use that canonical form. "Eid" alone is generally transcribed correctly.
- Diwali: Well-represented in recent models following increased media coverage of the Hindu festival. Generally fine without glossary support.
- Juneteenth: High media frequency post-2020 following the establishment of Juneteenth as a U.S. federal holiday. Well-represented in recent ASR training corpora; generally transcribes correctly.
- Passover, Rosh Hashanah, Yom Kippur, Hanukkah: All well-represented in English-language media corpora. Generally transcribe correctly without glossary support.
- Nowruz (Persian New Year, observed by Iranian, Afghan, Kurdish, and Central Asian communities): Less commonly represented in ASR training data. Common outputs: "no roos," "no rooz," "no ruz." Glossary entry recommended if the term appears in content addressing Persian, Iranian, or Central Asian cultural identity.
- Vesak (Buddhist observance commemorating the birth, enlightenment, and death of Gautama Buddha): Low frequency in English-language media. Common outputs: "Vasak," "vesk," "Vee-sak." Glossary entry recommended for DEI training content addressing Buddhist identity and practice.
LGBTQ+ cultural vocabulary
"Stonewall," "Stonewall Uprising," and "Stonewall Inn" are well-represented in ASR training data. "Pride" in the context of LGBTQ+ Pride observances is transcribed correctly (context does not alter the transcription). "Queer" as a reclaimed identity term is transcribed correctly — the word itself is common in ASR training data, and the specific usage context in DEI training (positive identity term) does not affect transcription. "Two-spirit" — an umbrella term used by some Indigenous North American peoples for persons who fulfil a traditional third-gender or other gender-variant ceremonial and social role — is sometimes split: "to spirit," "two spirit," or "to-spirit." If the term appears in the DEI content, the glossary entry specifying "two-spirit" with hyphen ensures consistent rendering. The cultural significance of the term — it is an Indigenous cultural concept, not a synonym for non-binary — should be addressed in the DEI reviewer training so that reviewers understand why accurate transcription matters specifically for this term.
DEI programme product names and vendor terminology
The DEI technology ecosystem has expanded substantially over the past five to eight years. Engagement survey platforms, job description bias analysis tools, pulse survey tools, pay equity analytics, inclusion measurement platforms, and DEI programme management software have proliferated as organisations have invested in measuring and managing DEI outcomes. Training videos about these tools — "how to use CultureAmp for team check-ins," "onboarding with Textio for job description review," "interpreting your Workday Peakon dashboard" — are increasingly common in DEI training libraries. And because most of these products were founded and named after the major ASR training corpora were assembled, their product names are absent from most ASR model training distributions and will fail at high rates.
The failure modes follow the same patterns documented across other recently coined software product names: compound word splitting (treating a portmanteau or two-word compound as separate words), phonetic approximation of proper nouns, and inconsistent capitalisation. The proper noun failure modes post covers the taxonomy in detail; what follows is the DEI-specific product name failure catalogue.
| Product name | Common ASR failure modes |
|---|---|
| CultureAmp | "culture amp" (lowercase, compound), "culture camp," "culture and," "Kull Tramp" (phonetic distortion of compound) |
| Textio | "text e.o.," "text I.O.," "text to," "Textico," "text-io" (hyphenated) |
| Workday Peakon | "Peakon" → "peak on," "peekon," "P. Khan," "peek in," "peek-on" |
| Leapsome | "leap some," "Lee P. Some," "leap sum," "Lee Psalm" (compound split with phonetic distortion) |
| Betterworks | "better works" (compound split, inconsistent), "better-works," "betterworks" (correct but lowercase) |
| 15Five | "fifteen five," "15 five," "fifteen, five," "1 5 5" (digit rendering) |
| Lattice | Usually correct — "lattice" is a common English word with high frequency in corpora; no glossary entry needed unless used in a context where the common-word interpretation would be confusing |
| Rippling | Usually correct — "rippling" is a common English word; same caveat as Lattice |
| Glint (now Microsoft Viva Glint) | "Glint" alone usually fine; "Viva Glint" — "Viva" may be rendered as "vibe a," "viva" (lowercase), or "V.I.V.A."; the full "Microsoft Viva Glint" is more likely to be rendered accurately because "Microsoft" anchors the context |
| Paradigm | Usually correct — "paradigm" is a well-established English word with high academic and professional frequency |
| Diversio | "di versio," "diversion" (phonetically adjacent), "Divot," "diverse E.O.," "diverse io" |
| Seramount (formerly Working Mother Media, which included Diversity Best Practices) | "Sarah Mount," "serement," "sere mount," "Sira mount"; the rebrand is recent enough that the name is not in most ASR distributions |
| Catalyst | Usually correct — "catalyst" is a well-established English word with high frequency in scientific, business, and media content |
| CompTIA (when used in DEI-in-tech contexts) | "comp tia," "comp T I A," "comp-tee-a"; same failure mode as in cybersecurity training content |
| SHRM (Society for Human Resource Management) | "sherm" — generally fine in most ASR contexts because the phonetic form maps to a plausible English word/name; add to glossary for canonical expansion |
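Not every failure mode in the table is safe to correct automatically. Some renderings, such as "culture amp," are almost never intended literally, while others, such as "peak on" or "diversion," are ordinary English and may be exactly what the speaker said. A two-tier pass that fixes the first kind and only flags the second is sketched below. Which tier a variant belongs in is an assumption of this sketch, and the names `AUTO_FIX`, `NEEDS_REVIEW`, and `correct_product_names` are hypothetical.

```python
import re

# Renderings that are unlikely to be intended literally: corrected directly.
AUTO_FIX = {
    "culture amp": "CultureAmp",
    "leap some": "Leapsome",
    "fifteen five": "15Five",
}
# Renderings that are also ordinary English: flagged, never auto-corrected.
NEEDS_REVIEW = {
    "culture camp": "CultureAmp",
    "peak on": "Peakon",
    "better works": "Betterworks",
    "diversion": "Diversio",
}

def correct_product_names(text):
    """Apply safe corrections and return ambiguous renderings for human review."""
    for variant, canonical in AUTO_FIX.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text, flags=re.IGNORECASE)
    review = [
        (variant, canonical)
        for variant, canonical in NEEDS_REVIEW.items()
        if re.search(rf"\b{re.escape(variant)}\b", text, flags=re.IGNORECASE)
    ]
    return text, review

corrected, review = correct_product_names(
    "Open culture amp, then check the peak on dashboard."
)
```

Here "culture amp" is corrected and "peak on" is returned for a reviewer to confirm against the audio. Common-word product names such as Lattice and Rippling are left out of both tiers, in line with the table.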
Regulatory agency acronyms
DEI training content that addresses employment law, anti-discrimination regulation, and EEO compliance will reference regulatory agencies by their acronyms. These acronyms have generally adequate ASR performance but benefit from glossary entries for canonical form and expansion.
- EEOC (Equal Employment Opportunity Commission): Usually rendered as "E-E-O-C" (with hyphens) or "eec." The acronym is well-represented in legal and HR content; the failure mode is inconsistent formatting rather than incorrect transcription. The glossary entry should specify the canonical rendering ("EEOC" without hyphens or periods) and the full expansion.
- OFCCP (Office of Federal Contract Compliance Programs): A five-letter acronym that is less commonly represented in ASR training data than EEOC. Common output: "O-F-C-C-P" (individual letters). Glossary entry recommended if the content addresses federal contractor compliance and affirmative action obligations.
- OSHA: Well-represented in ASR training data across safety training, HR content, and news media. No glossary entry needed.
- HHS (Department of Health and Human Services): Well-represented in government and healthcare content corpora. No glossary entry needed.
DEI programme acronyms
Beyond the community identity acronyms covered in the inclusive language vocabulary section, DEI training content uses several programme-level acronyms that require glossary entries.
- DEIB (Diversity, Equity, Inclusion, and Belonging): "DEI" performs well; "DEIB" is less consistent. If the organisation's programme uses "DEIB" as its canonical acronym, the glossary entry with the expansion ensures consistent rendering.
- JEDI (Justice, Equity, Diversity, Inclusion): Phonetically identical to the Star Wars reference. Without context injection via glossary, "JEDI" in a DEI training context will render as "Jedi." The glossary entry should specify capitalisation and expansion: "JEDI (Justice, Equity, Diversity, Inclusion)."
- ERG (Employee Resource Group): Rendered as "erg" (the physics unit of energy, equal to 10⁻⁷ joules) or "E-R-G." The physics unit meaning is more common in text corpora than the DEI programme acronym. Glossary entry with expansion required. Note that the organisation's own ERG names — "Spectrum" for the LGBTQ+ ERG, "Mosaic" for a cultural diversity ERG, "Enable" for a disability ERG, etc. — are proper nouns that should also be in the glossary if they appear in DEI training content. These are common English words used as organisation-specific proper nouns, and ASR will transcribe them correctly as common words but not necessarily with the capitalisation that indicates they are programme names.
- BRG (Business Resource Group): Rendered as "B-R-G" or "brig" (nautical jail). Glossary entry with expansion recommended.
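The acronym entries above pair a canonical form with an expansion and a set of known ASR renderings. One way to represent that, sketched here with hypothetical names (`AcronymEntry`, `canonicalise`), is to rewrite the renderings to the canonical form and report which acronyms appear, together with their expansions, so the reviewer can confirm the intended meaning. The sketch assumes the text is DEI training content; in a physics lecture, rewriting "erg" would be wrong.

```python
import re
from dataclasses import dataclass, field

@dataclass
class AcronymEntry:
    canonical: str
    expansion: str
    renderings: list = field(default_factory=list)  # known ASR outputs

ENTRIES = [
    AcronymEntry("JEDI", "Justice, Equity, Diversity, Inclusion", ["Jedi"]),
    AcronymEntry("ERG", "Employee Resource Group", ["E-R-G", "erg"]),
    AcronymEntry("BRG", "Business Resource Group", ["B-R-G"]),
]

def canonicalise(text, entries=ENTRIES):
    """Rewrite known renderings; return text and {acronym: expansion} found."""
    found = {}
    for entry in entries:
        for rendering in entry.renderings:
            # Case-sensitive on purpose: "Jedi" is rewritten, "JEDI" is left alone.
            text = re.sub(rf"\b{re.escape(rendering)}\b", entry.canonical, text)
        if re.search(rf"\b{re.escape(entry.canonical)}\b", text):
            found[entry.canonical] = entry.expansion
    return text, found

caption, found = canonicalise("Our Jedi council sponsors each E-R-G.")
```

The caption text itself stays verbatim apart from the canonical rendering; the expansions are returned separately rather than inserted into the caption.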
The operational principle for DEI product name and acronym management: before any DEI training video that references a specific DEI technology product, regulatory agency, or programme acronym is submitted for captioning, the relevant product names and acronyms should be in the glossary. The failure cost of not adding them — a training video about CultureAmp that is captioned before the glossary is updated, producing "Culture camp" throughout — is higher than the overhead of adding a glossary entry before production. See the hidden FTE cost of caption correction for the downstream cost calculation when caption errors propagate through a training library before they are caught. The same product-name glossary principle that applies to technical training content applies with equal force to DEI training — as documented in the why 99% caption accuracy matters post, the accuracy threshold is set at the level it is because errors on the most meaning-bearing words in content — which for DEI training are identity and product terms — compound into comprehension failures that defeat the purpose of the training.
Glossary architecture for DEI content
The glossary architecture decision for DEI captioning is whether to maintain a separate DEI glossary or integrate DEI terms into the existing L&D caption glossary. This is not a purely technical decision — it has organisational ownership, update cadence, and governance dimensions that the technical architecture should reflect. The glossary architecture post covers the full decision framework; what follows is the DEI-specific application of that framework.
Arguments for a separate DEI glossary
Different update cadence. DEI terminology evolves substantially faster than technical product vocabulary. A new gender identity term may emerge, gain community acceptance, and achieve mainstream DEI training usage within a year. A contested community name may shift — for example, the evolving usage debate around "Latinx," "Latine," and "Latino/a" — creating a need to update glossary entries on a timeline that is significantly faster than the software product update cycle. A product SDK name may be stable for three to five years; a DEI terminology convention may shift within one year. Separate glossaries with separate update triggers allow the DEI glossary to be updated on the faster cadence it requires without affecting the more stable technical product glossary.
Different organisational ownership. DEI terminology decisions involve the DEI team, HR leadership, and ERG leadership — not just the L&D team. The term "Latinx" vs "Latine" is not a technical question that the L&D caption manager can resolve unilaterally — it is a question that should be decided by the DEI team in consultation with the relevant community within the organisation. A separate glossary where DEI-tagged entries are managed by the DEI team, with L&D team access to view and propose but not to approve, respects those ownership boundaries. A single integrated glossary where the L&D team controls all entries creates the risk that terminology decisions are made by people who do not have the organisational standing to make them.
Different sensitivity. Some DEI terms have contested spellings, contested usage, or community-specific meaning that requires deliberate decision-making rather than default technical choices. A DEI-team-owned glossary entry for contested terms ensures that the decision is documented, reviewed, and approved by people with the relevant context. When a glossary entry for "Latinx" specifies the canonical form, that specification should reflect a decision made by the DEI team and ERG leaders, not a default imposed by the L&D caption production workflow.
Arguments for an integrated glossary
Single source of truth. A single integrated glossary eliminates the risk of contradictory entries across glossary systems. If "ERG" is defined with the expansion "Employee Resource Group" in the DEI glossary but is undefined in the main L&D glossary, the captioning system may handle "ERG" differently in content that spans both domains — for example, an onboarding video that introduces both the compliance training requirements and the ERG structure. Contradictory or inconsistently applied entries produce inconsistent caption output across the training library, which creates confusion for learners who encounter the same term rendered differently in different training modules.
Lower maintenance overhead. A single glossary system has lower overall maintenance overhead than two separate systems. Duplicate data entry, duplicate review processes, and the need to synchronise any shared terms across two systems all add operational cost. If the organisation's L&D team is already managing a caption glossary, adding DEI entries to that system is lower overhead than standing up a second system.
Compound accuracy effect. The caption feedback loop and iterative accuracy improvement post describes how the compound accuracy effect works: each glossary entry improves accuracy across all content that uses that term, and the accuracy improvement compounds over time as more content is captioned with the improved glossary. If DEI glossary terms are in a separate system that is not applied when non-DEI content is captioned, the accuracy compound effect from other content types does not benefit DEI vocabulary — even when the same terms appear in onboarding content, compliance training, or internal communications.
Recommended approach: integrated glossary with DEI-ownership tagging
The recommended architecture is an integrated glossary with DEI-ownership tagging rather than separate systems. Every glossary entry includes an owner field. DEI-tagged entries require DEI team approval for creation, modification, or deletion. The L&D team can propose changes to DEI-tagged entries; the DEI team reviews and approves or declines. The L&D team manages all non-DEI-tagged entries. This preserves the single-source-of-truth benefits and the compound accuracy effect from an integrated system while implementing the governance controls that respect DEI team ownership of sensitive terminology decisions. The update cadence for DEI-tagged entries is managed separately — the DEI glossary review cycle operates on a more frequent trigger than the technical product glossary review — but within the same system.
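The ownership rule described above can be expressed as a small approval gate: a change is applied immediately when the submitter owns the entry, and is otherwise queued for the owning team. The sketch below is one possible shape for that rule, not a description of any particular glossary product; the class names, the `"DEI"` and `"L&D"` owner labels, and the in-memory storage are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class GlossaryEntry:
    term: str        # lookup key, e.g. the spoken or mis-rendered form
    canonical: str   # form that should appear in captions
    owner: str       # "DEI" or "L&D" in this sketch

class Glossary:
    def __init__(self):
        self.entries = {}
        self.pending = []  # (owning_team, proposed_entry) awaiting approval

    def submit(self, entry, submitted_by):
        """Apply the change if the submitter owns the entry, else queue it."""
        existing = self.entries.get(entry.term)
        # An existing entry's owner governs, so a proposal cannot retag
        # a DEI-owned entry in order to bypass review.
        governing_owner = existing.owner if existing else entry.owner
        if submitted_by == governing_owner:
            self.entries[entry.term] = entry
            return "applied"
        self.pending.append((governing_owner, entry))
        return "pending"

    def approve(self, term, approver):
        """Approve a queued proposal; only the owning team may do so."""
        for index, (owner, entry) in enumerate(self.pending):
            if entry.term == term and owner == approver:
                self.entries[term] = entry
                del self.pending[index]
                return True
        return False

glossary = Glossary()
status_dei = glossary.submit(GlossaryEntry("latinx", "Latinx", "DEI"), submitted_by="L&D")
status_lnd = glossary.submit(GlossaryEntry("culture amp", "CultureAmp", "L&D"), submitted_by="L&D")
```

In this example the L&D proposal for a DEI-tagged term is held as pending until the DEI team approves it, while the L&D-owned product entry is applied at once. Both live in the same glossary, which is the point of the tagged architecture.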
Term sourcing for the DEI glossary
Where do DEI glossary terms come from? The term sourcing process for DEI content has more organisational complexity than the term sourcing for technical product glossaries, which can draw on release notes, SDK documentation, and product naming guides. DEI term sourcing requires engagement with community standards, style guides, and internal organisational decisions.
- The DEI team's internal style guide (if one exists): the canonical source for org-approved terminology. If the DEI team has documented a preferred term for any contested usage (Latinx vs. Latine, person-first vs. identity-first language for disability), the style guide is the authoritative source and the glossary entry should match it.
- GLAAD Media Reference Guide: the most widely used reference for LGBTQ+ terminology in professional and media contexts, revised periodically in new editions. The GLAAD guide specifies preferred terminology for sexual orientation and gender identity terms, including which terms are considered outdated or offensive. DEI glossary entries for LGBTQ+ terminology should be reviewed against the current GLAAD guide on the organisation's glossary update cycle.
- NCDJ Disability Language Style Guide (National Center on Disability and Journalism): the standard reference for disability-related vocabulary in professional contexts. The NCDJ guide addresses the person-first vs. identity-first debate (some disability communities prefer "disabled person" over "person with a disability") and provides guidance on specific terminology. DEI glossary entries for disability vocabulary should be reviewed against the current NCDJ guide.
- AP Stylebook race and identity section: the AP Stylebook has updated its race and identity guidance in recent years to address evolving usage on terms including "Black," "white," "Indigenous," "Native American," and specific community identity terms. For DEI training content distributed broadly across the organisation, the AP Stylebook provides a widely recognised standard that is updated annually.
- The organisation's DEI software vendor list: every product name in the DEI tech stack should be in the glossary before the first training video about that product is captioned. The vendor procurement process — when a new DEI platform is evaluated and selected — should include a step that adds the product name to the caption glossary. This prevents the failure mode where a training video about a new DEI tool is captioned before the glossary is updated.
- ERG and BRG names: the organisation's own ERG and BRG names are proper nouns that must be in the glossary. Every ERG name — including informal ERG names and programme-specific ERG names that may not appear in official documentation — should be added when the ERG is established. When an ERG changes its name, the glossary entry should be updated before the next DEI training content referencing that ERG is captioned.
What NOT to include in the DEI glossary
Glossary efficiency matters. Every unnecessary glossary entry creates overhead: processing time, potential for unintended substitutions, and maintenance burden during reviews. Terms that ASR handles accurately without glossary support should not be added to the DEI glossary. The following DEI terms typically transcribe correctly in current models and do not need glossary entries unless the pre-production vocabulary audit shows otherwise:
- diversity, equity, inclusion, belonging
- allyship, microaggression, intersectionality, privilege
- bias, implicit bias, unconscious bias, systemic, structural
- racism, sexism, ableism, homophobia, transphobia, xenophobia
- harassment, discrimination, retaliation
- accommodation, accessibility, WCAG
- psychological safety, bystander intervention
Adding these terms to the glossary does not improve caption accuracy because they already transcribe correctly. A glossary should contain terms that are failing, not terms that are passing. The pre-production vocabulary audit — described in the QA methodology post — provides a systematic way to identify which terms in a specific content batch actually need glossary support, rather than making glossary decisions based on assumptions about which terms might fail.
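One way to run that check is to caption a short test clip in which each script term is spoken, then list the terms that did not come back in their canonical form. The sketch below assumes the team already has the ASR output as plain text; `is_rendered` and `terms_needing_glossary` are hypothetical helper names, and the matching rule (exact case for terms containing capitals, case-insensitive otherwise) is an assumption rather than an established standard.

```python
import re


def is_rendered(term: str, transcript: str) -> bool:
    """True if the term appears as a whole token in the transcript."""
    # Terms containing capitals (acronyms, product names) must match exactly;
    # all-lowercase terms may be capitalised at the start of a sentence.
    flags = 0 if term != term.lower() else re.IGNORECASE
    # Lookarounds instead of \b so terms ending in "+" still match.
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, transcript, flags) is not None


def terms_needing_glossary(script_terms, asr_transcript):
    """Return the script terms that the ASR output failed to render."""
    return [t for t in script_terms if not is_rendered(t, asr_transcript)]


asr_output = ("Intersectionality matters to by poke and LGBTQIA+ "
              "colleagues who value allyship.")
script_terms = ["intersectionality", "BIPOC", "LGBTQIA+", "allyship"]
print(terms_needing_glossary(script_terms, asr_output))
```

Only the terms returned by the audit become glossary candidates; terms that pass stay out.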
DEI glossary update triggers
Unlike the technical product glossary, which updates primarily when new software products are released or renamed, the DEI glossary requires a more responsive update process. Recommended triggers for a DEI glossary update review:
- Any DEI programme refresh (annual DEI programme review, updated DEI strategy rollout)
- New DEI software vendor onboarding — the product name should be in the glossary before the first training video is captioned
- ERG or BRG name change — the new name should be in the glossary and the old name entry updated before the transition date
- Any guidance update from GLAAD, NCDJ, or AP Stylebook on terminology that affects the organisation's DEI training content
- Any community feedback on terminology in existing captioned DEI content — if a learner or ERG member flags that a specific term in a DEI training caption is outdated or inaccurate, the flag should trigger a glossary review for that term and any related terms
- New DEI training content production start — before any new DEI training module goes to production, the pre-production vocabulary audit should verify that all new proper nouns, acronyms, and community identity terms in the script are in the glossary
The caption programme annual review post covers the full annual review process, including the glossary maintenance component. For DEI content, the annual review should include a specific DEI glossary audit: compare the current glossary entries against the current GLAAD guide, NCDJ guide, and AP Stylebook, and flag any entries that need updating based on evolved guidance.
QA protocol for DEI captions
The standard caption QA protocol for L&D content — described in detail in the caption QA methodology post — uses the DCMP (Described and Captioned Media Program) framework for measuring accuracy across four error types: omission, substitution, insertion, and typographical. The same protocol applies to DEI training content, with two additions that address the specific failure modes of this content category: a DEI terminology review step and a modified error taxonomy that distinguishes accuracy errors from terminology advisory flags.
Who reviews DEI captions
Standard caption QA assigns review to a subject-matter-knowledgeable reviewer who can assess whether the caption text accurately reflects what was said and whether domain-specific vocabulary is correctly rendered. For technical training content, this is a reviewer who knows product names and SDK vocabulary. For compliance training, this is a reviewer who knows regulatory citation formats and legal terminology. For DEI training, the relevant domain knowledge spans two distinct competency areas, and a single reviewer rarely has both.
The recommended two-reviewer model for DEI captions:
- L&D QA reviewer: follows the standard DCMP protocol. This reviewer spot-checks the caption text against the audio, marks omissions, substitutions, insertions, and typographical errors, calculates the accuracy percentage, and confirms that the accuracy meets the WCAG 2.1 AA 99% threshold. This reviewer does not need specific DEI vocabulary knowledge; they need audio accuracy assessment skills.
- DEI terminology reviewer: reviews the caption text specifically for DEI vocabulary accuracy — are all acronyms for community identities correctly rendered? Are pronoun constructions accurate? Are product names, tribal nation names, and cultural terms correct? This reviewer uses a shorter checklist targeting the known failure categories rather than the full DCMP spot-check protocol.
The DEI terminology reviewer's checklist for each piece of DEI training content:
- All community identity and programme acronyms (BIPOC, LGBTQIA+, AAPI, MENA, DEIB, ERG, BRG, JEDI): verify correct form and capitalisation
- All pronoun constructions — verify singular they/them is preserved; verify neopronoun transcription against the audio; verify pronoun pair separators match the canonical form in the style guide
- All tribal nation and Indigenous community names — verify against the glossary or the DEI team's style guide
- All DEI product names (CultureAmp, Textio, Workday Peakon, etc.) — verify correct capitalisation and compound handling
- All diaspora community vocabulary — verify that Latinx, Filipinx, Afro-Latinx, and similar terms are correctly rendered
- Any terms the DEI team has flagged as high-priority for this content batch during the pre-production vocabulary audit
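Part of the capitalisation check in this list can be pre-screened automatically before the DEI terminology reviewer opens the file. The following is a minimal sketch under the assumption that the caption text and the glossary's canonical forms are available as plain strings; `casing_mismatches` is a hypothetical helper, not a feature of any captioning tool.

```python
import re


def casing_mismatches(caption: str, canonical_terms):
    """Find glossary terms that appear with non-canonical capitalisation.

    Returns (character offset, text as captioned, canonical form) tuples.
    """
    issues = []
    for term in canonical_terms:
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, caption, re.IGNORECASE):
            if match.group(0) != term:
                issues.append((match.start(), match.group(0), term))
    return issues


caption = "Our Bipoc ERG uses textio and Textio reports."
for offset, found, expected in casing_mismatches(caption, ["BIPOC", "ERG", "Textio"]):
    print(f"offset {offset}: '{found}' should be '{expected}'")
```

A screen like this only catches casing drift. It will not catch phonetic approximations such as "by poke" or split compounds, so it supplements the reviewer's read-through against the audio rather than replacing it.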
The editorial boundary in DEI caption review
The QA reviewer for DEI captions operates under a strict editorial boundary that must be explicitly stated in reviewer training: the caption is a transcription of the audio, not an editorial revision of the content. This boundary is more operationally significant for DEI captions than for other content types because DEI terminology is contested and evolving, and a reviewer with DEI vocabulary knowledge may be tempted to "improve" what a speaker said by substituting the currently preferred term for the term the speaker used.
The specific scenario: a DEI training video from two years ago features a guest speaker who uses "Latinx" throughout. The organisation's DEI team has since updated its style guide to prefer "Latine." The QA reviewer, aware of the style guide update, considers substituting "Latine" for "Latinx" in the caption text. This is a word substitution error. The caption should accurately reflect what the speaker said. If the speaker said "Latinx," the caption reads "Latinx." If the organisation has concluded that "Latine" is the preferred current term, the appropriate response is to flag the content for re-recording — not to edit the caption to say something the speaker did not say.
The same principle applies to any terminology where the speaker's usage and the organisation's current style guide diverge. The caption reviewer's job is accuracy; the content currency decision is a programme governance decision. The caption file that accurately transcribes outdated terminology is not an error — it is an accurate record of what was said. The outdated content is the problem; the accurate caption of that content is not.
Error taxonomy for DEI captions
The standard DCMP error taxonomy — omission, substitution, insertion, typographical — applies to DEI captions. There is one addition: a "DEI terminology flag" category for instances where the accurate transcription of what the speaker said contains a term the DEI team wants to review for programme governance purposes. DEI terminology flags are advisory, not accuracy errors. They are documented separately from the error count and do not affect the accuracy percentage calculation.
Example: a speaker in a DEI training video uses a term that the GLAAD guide has recently retired as outdated. The caption accurately transcribes the term. The DEI terminology flag marks that instance for the DEI team's attention — they may decide to re-record the segment, add a content note to the module, or leave it with a note in the programme review file. The flag does not constitute an accuracy error; it is an advisory notification that the content may need programme-level attention.
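The separation between accuracy errors and advisory flags can be made concrete in the QA tally. This sketch assumes accuracy is computed as words checked minus errors, divided by words checked; the category labels and the `summarise` function are hypothetical names for illustration.

```python
from collections import Counter

ACCURACY_ERRORS = {"omission", "substitution", "insertion", "typographical"}
ADVISORY_FLAG = "dei_terminology_flag"  # hypothetical label for advisory items


def summarise(findings, words_checked):
    """Tally reviewer findings; advisory flags never reduce accuracy."""
    counts = Counter(findings)
    errors = sum(counts[kind] for kind in ACCURACY_ERRORS)
    # Assumed formula: share of checked words that carry no accuracy error.
    accuracy_pct = 100.0 * (words_checked - errors) / words_checked
    return {
        "accuracy_pct": accuracy_pct,
        "errors": errors,
        "advisory_flags": counts[ADVISORY_FLAG],
    }


findings = ["substitution", "omission",
            "dei_terminology_flag", "dei_terminology_flag"]
print(summarise(findings, words_checked=1000))
```

The two advisory flags in the example are reported alongside the accuracy figure but do not change it, which keeps the programme governance queue separate from the pass/fail accuracy result.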
Sample size and frequency
Follow the same protocol as standard caption QA: spot-check 3–5% of the DEI library, or 20 videos minimum, whichever is larger. Run the spot-check annually as part of the caption programme annual review. For newly produced DEI content, apply the spot-check to the first batch of content from any new DEI programme or new speaker type before approving the full production run — this is the same first-batch verification protocol recommended in the QA methodology post for any new content type.
The stratification recommendation from the code-switching section applies here: the QA sample for DEI captions should be drawn from across the speaker diversity of the DEI training library, not as a purely random sample from the overall content volume. A random sample from a DEI library that has more content featuring some speaker groups than others will produce a QA result that is weighted toward the higher-represented speaker groups, potentially masking accuracy gaps for underrepresented content types.
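The sample-size rule and the stratification can be sketched together. The example takes the upper end of the 3–5% range and splits the sample equally across speaker groups; both choices are assumptions made for illustration, and the QA methodology post remains the reference for the statistical approach. `sample_size` and `stratified_sample` are hypothetical helper names.

```python
import math
import random


def sample_size(library_size, percent=5, minimum=20):
    """Larger of percent-of-library and the minimum, capped at library size."""
    by_percent = math.ceil(library_size * percent / 100)
    return min(library_size, max(minimum, by_percent))


def stratified_sample(videos_by_group, total, seed=0):
    """Draw an equal share of the sample from each speaker group.

    videos_by_group maps a group label to a list of video IDs. Groups with
    fewer videos than their share are sampled in full.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    per_group = max(1, total // len(videos_by_group))
    return {
        group: rng.sample(videos, min(per_group, len(videos)))
        for group, videos in videos_by_group.items()
    }


library = {
    "group_a": [f"a{i}" for i in range(240)],
    "group_b": [f"b{i}" for i in range(10)],
}
total = sample_size(sum(len(v) for v in library.values()))
sample = stratified_sample(library, total)
print(total, {group: len(ids) for group, ids in sample.items()})
```

With equal allocation the smaller group contributes as many videos as it can, rather than the one video a purely random draw would be expected to give it.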
For the ROI framing of the DEI caption QA investment — how to present the cost of the two-reviewer model to finance and legal leadership — see the caption ROI framing post. The argument is not complicated: the cost of systematic inaccuracy in DEI training captions (reputational, governance, and ADA Title I exposure) substantially exceeds the marginal cost of the DEI terminology review step.
Compliance framework for DEI captioning
The compliance framework for DEI training captioning does not require a separate legal analysis from the framework for any other mandatory workplace training. The same obligations apply. What is worth making explicit is that there is no DEI exception, and there is a governance alignment argument that is specific to this content category.
ADA Title I: the same obligation as any other training
DEI training video has the same captioning obligation under ADA Title I as any other mandatory or strongly encouraged workplace training content distributed through an employer-controlled channel. The WCAG 2.1 AA 99% accuracy standard applies. The WCAG prerecorded captions requirements apply from the date the content is made available to employees. The captioning obligation is not modified by the content's subject matter.
If DEI training is labelled "voluntary" but employees are strongly encouraged to complete it, or if completion is tracked and factored into performance reviews or recognition, the practical accessibility obligation applies even if the training is not technically "mandatory." The auto-captions WCAG compliance post covers the distinction between automatic caption quality and WCAG 2.1 AA compliance: platform-generated auto-captions (YouTube, Zoom, Teams, Vimeo) should not be assumed to meet the WCAG 2.1 AA 99% accuracy standard for DEI training content, particularly given the specific vocabulary failure modes documented throughout this post. Auto-captions on a DEI training video are prone to the same errors described here (BIPOC rendered as "by poke," pronouns mangled) because they rely on general-purpose ASR without glossary support.
The governance alignment argument
Beyond the legal obligation, there is a governance argument specific to DEI training content that L&D directors and DEI leaders should be aware of when making the case for DEI captioning investment.
An organisation whose DEI training programme is not accessible to deaf or hard-of-hearing employees is communicating, through its operational decisions, that accessibility is not a genuine priority, and it is doing so in the very content that discusses inclusivity. The learner who is deaf and encounters an uncaptioned DEI training module, or a DEI training module whose captions render their community's name as a phonetic approximation, receives a clear signal about the organisation's operational commitment to inclusion. That signal is delivered by the accessibility layer of the very content whose purpose is to build a more inclusive culture. This is not a separate legal argument from ADA Title I; it is a governance argument about the alignment between stated DEI commitments and programme design decisions.
The practical implication: L&D directors and DEI leaders who are jointly designing a DEI training programme should address captioning in the programme design phase, not as an accessibility retrofit after the content is produced. Caption production should be part of the content production plan from the beginning, with the glossary preparation process (DEI vocabulary audit, term sourcing, glossary population) built into the pre-production phase. Retrofitting captions onto a completed DEI training library that was designed without accessibility in mind is more expensive and produces lower-quality results than building caption production into the design phase. See the building a caption compliance programme post for the programme design framework.
New employee onboarding and DEI content
Many organisations integrate DEI training into new employee onboarding: values, culture, code of conduct, unconscious bias training, ERG introduction, and similar content is delivered as part of the onboarding experience in the first days or weeks of employment. The captioning obligation for onboarding content applies from the first day of employment — a new employee who is deaf or hard-of-hearing should have captioned access to all onboarding content, including DEI training, from day one of their employment.
The ADA Title I accommodation request clock starts on day one. An employee with a documented hearing loss who cannot access DEI onboarding content because it is uncaptioned or has inaccurate captions has an ADA Title I accessibility barrier from the moment of onboarding. The onboarding captioning post covers the full operational framework for accessible onboarding content; the DEI-specific vocabulary considerations in this post apply with equal force to DEI content that is part of the onboarding programme.
Live DEI events and recorded archives
Town halls, listening sessions, panel discussions, community conversations, and DEI workshops are frequently held as live events and subsequently recorded for distribution as training or reference content. The captioning treatment of these two forms is different.
For live events: the ILT and virtual classroom captioning playbook covers live event captioning mechanics. Real-time captioning, whether delivered by a human CART (Communication Access Realtime Translation) provider or by live ASR, has different accuracy parameters than post-production captioning and typically will not meet the WCAG 2.1 AA 99% accuracy standard for recorded content. Live captioning is a reasonable accommodation during the live event; it is not an acceptable substitute for post-production captions on the archived recording.
For recorded archives of live DEI events: when a recording of a live DEI town hall, listening session, or panel discussion is subsequently distributed through the LMS or intranet as training content, the pre-recorded WCAG standard applies. The live-event captions, which may be significantly less accurate because they were produced in real time, must be replaced with post-production captions before the archive is distributed. Distributing the live-event captions as the caption track on the archived version is a compliance risk: those captions are unlikely to meet the 99% accuracy standard, and for DEI content the time pressure of real-time transcription, whether by a human CART provider or by live ASR, compounds the vocabulary failure modes documented here.
Compliance training with DEI elements
Some compliance training content uses DEI vocabulary extensively: harassment prevention training, anti-discrimination training, code of conduct modules, and EEO training are all mandatory compliance training categories that also address DEI vocabulary. These modules sit at the intersection of compliance training obligations — covered under ADA Title I mandatory-training analysis — and DEI vocabulary accuracy. The compliance training captions page covers the captioning obligations for mandatory compliance training; the DEI-specific vocabulary considerations in this post apply with equal force to harassment prevention and anti-discrimination training that uses community identity acronyms, pronoun language, and cultural vocabulary. The glossary preparation process should treat these compliance-DEI hybrid modules the same way as pure DEI training: pre-production vocabulary audit, glossary population, and DEI terminology review in the QA step.
Section 508 for federal contractors
Federal contractors and subcontractors with DEI training programmes distributed through internal LMS systems have Section 508 captioning obligations in addition to ADA Title I obligations. The Section 508 standard maps to WCAG 2.0 AA for pre-recorded video content. The Section 508, ADA, and WCAG compliance matrix post covers the full compliance mapping for federal contractors. For DEI training programmes at federal contractor organisations, the same vocabulary accuracy requirements apply — the Section 508 standard does not provide a lower accuracy threshold for DEI content, and the specific failure modes documented in this post affect federal contractor DEI training captioning as fully as private-sector DEI training.
The ADA Title II post covers the captioning obligations for state and local government entities and, from April 2026 enforcement, public university and government programme digital content. Public universities with DEI training programmes for faculty and staff are subject to both ADA Title I (as employers of faculty and staff) and ADA Title II (as public entities) captioning obligations; the DEI training vocabulary accuracy requirements in this post apply to both obligation sources.
For DEI content distributed through internal video channels — all-hands recordings, town hall archives, DEI listening session recordings shared via intranet — the internal video captioning post covers the captioning workflow for employer-hosted video that is not in the LMS but is still subject to ADA Title I obligations as employer-distributed content. DEI internal communications are a common gap in accessibility programmes that have focused primarily on formal LMS training content. For the broader compliance reporting framework — how to report DEI caption accuracy to legal, compliance, and DEI leadership, and how to structure the caption compliance data to demonstrate ADA Title I adherence — see the caption compliance reporting post. The data structure for DEI training caption compliance reporting should include stratified accuracy data by content category, not just an aggregate accuracy percentage for the full training library.
Eight failure modes in DEI training captioning
- 1. Treating DEI vocabulary as "general enough" without verifying
- The assumption that inclusive language terms are well-represented in ASR training data is accurate for some terms (intersectionality, allyship, implicit bias) and significantly wrong for others (BIPOC, LGBTQIA+, cisgender, Latinx, AAPI, MENA, AuDHD). The L&D team that assumes its ASR vendor handles DEI vocabulary without running a pre-production vocabulary audit will discover the errors after the fact — in the caption file review, or worse, after the content has been distributed to the full learner population. The pre-production vocabulary audit — reviewing the training script and flagging any term whose ASR performance is uncertain, then testing against the glossary — prevents this failure. The audit takes a fraction of the cost of post-production correction across a large content library. See the hidden FTE cost post for the downstream cost of errors that propagate before they are caught.
- 2. Grammar correction post-processing that modifies singular they/them
- Any captioning workflow that includes an automated grammar normalisation step risks "correcting" accurate singular they/them constructions. The result is a caption that says "he or she" where the speaker said "they," or "he shared his experience" where the speaker said "they shared their experience." This is both an accuracy error (the caption does not match the audio) and a pronoun error (the speaker's pronoun choice has been overridden by the grammar correction tool). Grammar correction tools that have been applied without specific singular-they exclusion rules are a hidden risk in captioning workflows that include manual review steps involving word processors or grammar-checking APIs. Audit the captioning workflow before processing any DEI content and confirm that any grammar normalisation step is either disabled for singular they/them or removed from the workflow for DEI content entirely.
- 3. QA reviewer without DEI vocabulary knowledge
- A standard L&D caption QA reviewer can assess audio accuracy — does the caption text match what the speaker said — but cannot reliably assess terminology accuracy for DEI vocabulary without domain knowledge. A QA reviewer who does not know that "BIPOC" is an acronym for Black, Indigenous, and People of Color may pass "by poke" as phonetically close enough. A reviewer who does not know that "Textio" is a DEI software product name may accept "text I.O." as accurate. The two-reviewer model — L&D QA reviewer for audio accuracy, DEI terminology reviewer for vocabulary accuracy — addresses this directly. The additional reviewer step is not a full second review of the entire caption file; the DEI terminology reviewer uses the abbreviated checklist focused on the known failure categories. The marginal time investment is low relative to the accuracy improvement.
- 4. No glossary update when a new DEI vendor product is deployed
- When the organisation rolls out a new DEI platform — CultureAmp for engagement surveys, Textio for job description bias analysis, Workday Peakon for pulse surveys, Leapsome for performance and recognition — the product name should be added to the caption glossary before the first training video about that product is captioned. The failure mode: the DEI team selects a new platform, procurement finalises the contract, IT begins implementation, and L&D produces a training video for the new tool — all before anyone updates the caption glossary. The training video is captioned and distributed before the glossary is updated, producing "Culture camp" or "culture amp" (inconsistently cased) throughout the video. The error propagates through the DEI training library until the annual QA review identifies it, at which point the correction cost is multiplied by the number of affected caption files. The fix: add product name glossary update to the DEI vendor procurement checklist as a required step before the first training video is submitted for captioning.
- 5. Editorial substitution for speaker terminology
- A caption reviewer who substitutes a style-guide-preferred term for what a speaker actually said has committed a word substitution error, potentially misrepresented the speaker, and created a transcript that does not match the audio. This failure mode is specific to DEI captions because DEI terminology is contested and evolving — a reviewer who knows the "preferred" current term may be tempted to "improve" what a speaker said. The failure has three distinct problems: it is an accuracy error by the DCMP standard (the caption contains a word that differs from what was said), it creates an audio-caption discrepancy that deaf learners who also lip-read or who are using the caption file as a text record will notice, and it potentially misrepresents the speaker by attributing to them a specific term they did not choose to use. The organisational response is explicit reviewer training: the caption is a transcription of the audio, not an editorial revision. If the terminology in the content is out of date, the content governance decision — re-record, add a content note, or decommission — is a programme level decision that happens outside the caption review workflow. The caption reviewer does not have the authority to make that decision through caption editing.
- 6. No protocol for neopronoun transcription before production
- When a speaker who uses neopronouns is featured in a DEI training video, the captioning workflow needs to account for neopronoun transcription before production starts — not discover the problem when the caption file comes back with "sir" where the audio had "zir," "them" where the audio had "xem," and "hey" where the audio had "ey." Without pre-production intervention, the neopronoun transcription will be a phonetic approximation that may be significantly wrong and that will require manual correction on every instance in the caption file. The pre-production protocol: (1) identify all speakers whose pronouns include neopronouns before production; (2) consult with the DEI team or the speakers directly on the canonical spelling for each neopronoun set; (3) add glossary entries for each neopronoun, or specify human transcription for segments where they appear; (4) define the canonical pronoun pair separator in the style guide (ze/zir with forward slash); (5) include an explicit neopronoun accuracy check in the QA step. The protocol adds minimal pre-production overhead and prevents a category of errors that is both technically significant (every neopronoun instance in the content is wrong) and reputationally significant (the training content designed to support inclusive pronoun practice is demonstrating inaccurate pronoun transcription).
- 7. No re-caption trigger for terminology updates
- When the organisation's DEI style guide is updated — a term is retired, a new term is adopted, a community name changes — existing captioned DEI content that uses the old terminology is not automatically updated. The old caption files remain accurate as transcriptions (they captured what was said), but the content is now distributing outdated terminology. The governance question is: does the organisation re-caption existing content when terminology changes? For DEI content in active mandatory distribution — annual harassment prevention training, onboarding values modules — the recommended answer is yes, with the same priority weighting as compliance content revisions. For archived reference content that is infrequently accessed, a content note acknowledging that the terminology reflects usage at the time of recording may be sufficient. The governance decision should be documented in the caption programme governance policy so that the response to terminology updates is consistent and does not depend on individual decisions by whoever happens to manage the caption file at the time. Add a DEI terminology update trigger to the list of events that prompt a content revision audit as part of the annual review process.
- 8. Applying uniform accuracy targets without stratified QA
- If the organisation's caption accuracy target is 99% across the DEI training library, that target should be verified with QA data stratified by content category and by speaker profile, not just as an aggregate accuracy result for the full library. An aggregate 99% result for the DEI training library may mask 97% accuracy on content featuring Black speakers using AAE features, 99.2% on content featuring white speakers using Standard American English, and 98.5% on content featuring neopronoun-using speakers. The aggregate result passes the WCAG 2.1 AA threshold; the stratified data reveals systematic accuracy gaps that are invisible at the aggregate level. Stratified QA is the only way to identify differential accuracy before it becomes a pattern embedded in the full DEI training library. For content designed to address equity and inclusion, such a pattern represents a specific operational failure of the equity commitment the content is meant to support. The QA methodology post covers the statistical approach to stratified sampling; apply the same methodology to the speaker-profile stratification for DEI content.
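The masking effect described in failure mode 8 is simple weighted-average arithmetic. The sketch below uses the three accuracy figures from that example; the word counts per stratum are invented purely to illustrate the effect, and `aggregate_accuracy` is a hypothetical helper name.

```python
TARGET_PCT = 99.0

# (words checked, accuracy %) per stratum. Accuracy figures follow the
# example in failure mode 8; the word counts are illustrative assumptions.
strata = {
    "Black speakers, AAE features": (4_000, 97.0),
    "white speakers, Standard American English": (92_000, 99.2),
    "neopronoun-using speakers": (4_000, 98.5),
}


def aggregate_accuracy(strata):
    """Word-weighted mean accuracy across all strata."""
    total_words = sum(words for words, _ in strata.values())
    weighted = sum(words * acc for words, acc in strata.values())
    return weighted / total_words


overall = aggregate_accuracy(strata)
below_target = [name for name, (_, acc) in strata.items() if acc < TARGET_PCT]
print(f"aggregate: {overall:.3f}%  passes: {overall >= TARGET_PCT}")
print("strata below target:", below_target)
```

Because the largest stratum dominates the word count, the library clears the target in aggregate while two of the three strata sit below it, which is why the compliance report needs the per-stratum rows and not only the total.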
FAQ
- Should caption text reflect what a speaker said or what the organisation's style guide says?
- Caption text must reflect what the speaker said — always. A caption is a transcription of audio, not an editorial revision. If a speaker uses terminology that the organisation's style guide has moved away from — an older term for a community identity, a word whose usage context has shifted — the caption accurately transcribes what was said. The caption reviewer does not have the authority to substitute the currently preferred term for what was actually spoken. If the terminology in the content is sufficiently outdated that the content needs revision, the correct response is to re-record the audio. Editing caption text to substitute preferred terminology for what was actually spoken is a word substitution captioning error, creates an audio-caption discrepancy that deaf viewers will notice when they compare the caption text to what they can lip-read or to the audio that hearing assistive devices may partially amplify, and potentially misrepresents the speaker. The caption QA reviewer's job is accuracy, not editorial alignment with the current style guide.
- How do we handle neopronoun transcription in DEI content?
- Before production: add all neopronouns used by featured speakers to the caption glossary with their canonical spelling — ze/zir, xe/xem, ey/em, or whichever forms the speakers use, confirmed with the speakers or the DEI team. If the captioning vendor's glossary system does not reliably support single-syllable phonetically ambiguous entries (test this before relying on it), specify human transcription for segments where neopronouns appear rather than relying on ASR plus glossary injection. In the QA step, include an explicit neopronoun accuracy check: verify every instance of each neopronoun against the audio, confirming the correct spelling and the correct pronoun pair separator as specified in the organisation's caption style guide. Document the neopronoun handling approach in the style guide so that future DEI content featuring neopronoun-using speakers is handled consistently. Neopronouns are the highest-failure-rate vocabulary category in DEI captions; without pre-production intervention, the output will be a phonetic approximation that may be significantly different from the correct form and that will require manual correction on every instance.
- Are current ASR systems less accurate on content featuring Black speakers?
- Yes, in measurable ways, though the gap has narrowed with foundation-model-based ASR compared to earlier systems. The 2020 Koenecke et al. study published in the Proceedings of the National Academy of Sciences found that five major ASR systems had approximately twice the word error rate on Black speakers compared to white speakers in matched samples. The authors traced the disparity primarily to the systems' acoustic models, that is, to African American English pronunciation and prosody being underrepresented in training audio, rather than to grammatical or vocabulary differences. More recent evaluations of Whisper-class models show a reduced but persistent gap, particularly on content with strong AAE features such as habitual "be" marking and copula deletion. The practical implication for DEI training captioning: content featuring Black speakers using AAE features may produce systematically lower-accuracy output than equivalent content featuring speakers using Standard American English. Detecting this requires stratified QA. The appropriate operational responses are audio quality improvement before captioning, enhanced QA sampling for affected content, and human review for segments where ASR consistently underperforms. The gap does not mean that DEI content featuring Black speakers is impossible to caption accurately; it means that accuracy requires more targeted QA verification than a uniform sampling approach provides.
- Should the DEI glossary be separate from the main L&D caption glossary?
- The recommended approach is an integrated glossary with DEI-ownership tagging rather than separate systems. Separate systems risk contradictory entries (the same acronym defined inconsistently in two glossaries) and create maintenance overhead with two parallel review processes. In an integrated glossary with DEI-tagged entries, the DEI team approves changes to DEI-tagged terms and the L&D team manages all other entries. This preserves the single-source-of-truth benefit and the compounding accuracy effect of the feedback loop, while respecting the DEI team's organisational standing to make authoritative decisions about DEI terminology. The update cadence for DEI-tagged entries can be managed separately within the same system: DEI glossary entries are reviewed on a more frequent trigger cycle than technical product entries, reflecting the faster evolution of DEI terminology. For the full decision framework, including the scenarios where a separate DEI glossary is appropriate, see the glossary architecture post.
- Who should perform QA review of DEI captions?
- A two-reviewer model addresses the dual competency requirement of DEI caption QA: the standard L&D QA reviewer checks audio accuracy using the DCMP protocol (does the caption text match what was said? are there omissions, substitutions, insertions, or typographical errors?), and a DEI-team reviewer checks terminology accuracy using a focused checklist (are community identity acronyms correctly rendered? are pronoun constructions accurate? are product names, tribal nation names, and cultural terms correct?). The DEI-team reviewer does not need to follow the full DCMP spot-check protocol — that is the L&D reviewer's task. The DEI terminology review uses a shorter, targeted checklist that can typically be completed in less time than the full audio accuracy review. The two reviews can be conducted sequentially in any order; the DEI terminology review does not depend on the audio accuracy review being complete first. The key operational point: a single reviewer who has only one of these competency areas will miss systematic errors in the other. The DEI training caption error that most frequently escapes notice is the terminology accuracy error that passes the phonetic audio check — "by poke" for "BIPOC" — and requires DEI vocabulary knowledge to identify as an error at all.
- What do we do with archived DEI content that uses outdated terminology?
- This is a content governance decision, not a captioning decision. The caption accurately transcribes what was said; the terminology is out of date because the audio is out of date. The caption file for archived DEI content that uses terminology the organisation has since moved away from should not be edited to substitute preferred terminology — the caption is an accurate record of the audio, and editing it to say something the speaker did not say creates a false record. The governance options are: (a) re-record the content with current terminology and re-caption — appropriate for content in active mandatory distribution where the terminology gap is significant enough to undermine the training purpose; (b) add a content note to the module acknowledging that the terminology reflects usage at the time of recording and directing learners to the current DEI style guide or a more current version of the content; (c) decommission and replace — appropriate for content where the terminology is so outdated that the content is no longer serving its purpose. For content in active mandatory distribution (annual harassment prevention training, onboarding values content), re-recording is recommended. For archived reference content that is infrequently accessed, a content note may be sufficient pending a full content revision. The accessibility coordinator and DEI team should make this determination jointly; it is not a unilateral L&D decision, and it should be documented in the caption programme governance policy as a formal decision record.
- Does the ADA captioning obligation apply differently to voluntary DEI training vs. mandatory compliance training?
- No. The ADA Title I captioning obligation applies to any training content that is distributed to employees as part of employment, whether it is labelled mandatory or voluntary. If the content is accessible to employees through an employer-controlled distribution channel (LMS, intranet, SharePoint, email link), and the employer's expectation, explicit or implicit, is that employees engage with it, the accessibility obligation applies. "Voluntary" DEI training that employees are strongly encouraged to complete, that managers track completion of, or that is factored into performance feedback is not materially different from mandatory training for captioning purposes. The functional question is not what the training is labelled but whether a deaf employee has equivalent access to the content the employer is making available. If the answer is no, the obligation is not being met, regardless of the mandatory or voluntary label on the content. The same analysis applies to DEI content distributed through an internal communications channel (a DEI podcast series, a recorded town hall about inclusion strategy, a facilitated listening session recording) that is made available to all employees through an intranet page or email distribution. For the full legal analysis of which content types trigger which obligations, see the building a caption compliance programme post.
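The integrated glossary with DEI-ownership tagging and the DEI reviewer's terminology check, both described in the FAQ above, could be modelled roughly as follows. This is a sketch under stated assumptions, not a description of any vendor's system: the `GlossaryEntry` fields, the 90-day and 365-day review intervals, and the `AcmeLMS` entry are hypothetical. Only the "by poke" for "BIPOC" pairing comes from the text.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical cadences: DEI-tagged terms are reviewed more often.
REVIEW_INTERVAL_DAYS = {"dei": 90, "lnd": 365}


@dataclass(frozen=True)
class GlossaryEntry:
    term: str                  # canonical spelling used in captions
    owner: str                 # "dei" or "lnd": the team that approves changes
    last_reviewed: date
    misrenderings: tuple = ()  # known ASR outputs for this term


def review_due(entry, today):
    """True when the entry's owner-specific review interval has elapsed."""
    interval = timedelta(days=REVIEW_INTERVAL_DAYS[entry.owner])
    return today - entry.last_reviewed >= interval


def flag_misrenderings(caption, glossary):
    """Return (found_text, canonical_term) pairs for known ASR misrenderings."""
    hits = []
    for entry in glossary:
        for wrong in entry.misrenderings:
            pattern = r"\b" + re.escape(wrong) + r"\b"
            if re.search(pattern, caption, flags=re.IGNORECASE):
                hits.append((wrong, entry.term))
    return hits


# One glossary, two owners (all entries illustrative).
GLOSSARY = [
    GlossaryEntry("BIPOC", "dei", date(2026, 1, 10), ("by poke",)),
    GlossaryEntry("AcmeLMS", "lnd", date(2026, 1, 10), ("acme elms",)),
]

caption = "Our by poke colleagues lead this session."
print(flag_misrenderings(caption, GLOSSARY))
print([e.term for e in GLOSSARY if review_due(e, date(2026, 6, 21))])
```

Keeping the owner as a field on each entry, rather than splitting entries into two stores, is what lets a single lookup serve both reviewers while routing approvals and review cadence by team.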
Caption your DEI training library with vocabulary accuracy that matches your commitment to inclusion
GlossCap applies your organisation's DEI glossary — community identity acronyms, pronoun forms, product names, tribal nation names, ERG names, and all the vocabulary that standard ASR gets wrong — automatically to every DEI training video you submit. The result is caption accuracy that meets the WCAG 2.1 AA 99% standard for content where accuracy on the right terms matters most. Your DEI team owns the glossary entries for DEI-tagged terms. Your L&D team manages the production workflow. The feedback loop compounds accuracy across every subsequent DEI training video you caption.