Multilingual

Multilingual conversations, code-switching, and accents: what transcription tools fumble

Most transcription tools are English-first and quietly poor at code-switching, accented speech, and bilingual research. Here is what to demand from a tool — and what to expect.

February 4, 2026 · 9 min read · 6 sections

The English-first ceiling

Almost every transcription product on the market launched English-first and added other languages later. The depth of investment shows up in subtle places — accent variety in training data, idiomatic phrase handling, spelling and grammar conventions, named-entity recognition. A tool that calls itself multilingual but treats English as the default everywhere is a tool that will quietly degrade for any work that is not English.

For users in bilingual or non-English-dominant work, this becomes visible at the boundaries: a Spanish word transcribed phonetically in an English session, an English brand name mangled in a Portuguese transcript, names of non-Western interviewees rendered as roughly similar English words. The tool is not "wrong" — it is operating at the edge of its training distribution, where errors compound.

  • 60+: languages claimed by the top consumer tools
  • ~6: languages with research-grade quality in those same tools
  • ~1.5B: bilingual speakers worldwide, underserved by current tools

Code-switching: when one sentence has two languages

Code-switching — alternating between two languages within a single conversation, often within a single sentence — is the daily lived experience of bilingual speakers. "Tengo que finish this report antes del weekend" is a normal English-Spanish utterance. Most transcription tools require you to set the language up front and stay in it. The result is that one half of every code-switched sentence comes out as phonetic mush.

A small but growing set of models — Gladia's Solaria-1 is the most prominent — handles intra-sentence code-switching natively. Whisper Large v3, with prompt engineering, handles it imperfectly. Most consumer tools still do not. If your work involves bilingual sources or participants — research, journalism, clinical work in immigrant communities — this is the single feature that decides whether a tool will work.
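
A minimal sketch of that prompt-engineering approach with the open-source openai-whisper package is below. The audio path and the bilingual seed prompt are placeholders, and the nudge biases the decoder toward mixed output rather than guaranteeing accurate code-switched transcription.

```python
# Sketch: nudging Whisper Large v3 toward mixed Spanish/English output.
# Assumes the open-source `openai-whisper` package and ffmpeg are installed;
# "interview.wav" and the seed prompt are placeholders for your own material.
import whisper

model = whisper.load_model("large-v3")

result = model.transcribe(
    "interview.wav",
    language=None,      # let the model pick instead of pinning one language
    initial_prompt=(    # a bilingual prompt biases decoding toward code-switched text
        "Entrevista bilingüe: tengo que finish this report antes del weekend."
    ),
    temperature=0.0,
)

for seg in result["segments"]:
    print(f"[{seg['start']:7.2f}s] {seg['text'].strip()}")
```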

Why tools force you to "set the language"

The "set the language" requirement most tools impose is an artifact of how older speech-recognition pipelines worked. Different language packs, different acoustic models, different post-processing. Switching mid-stream was technically expensive and rarely tested.

Newer end-to-end models do not have this constraint architecturally — they predict tokens that span all the languages they were trained on simultaneously. The reason consumer tools still expose the "set the language" UI is that auto-detection adds latency and occasionally picks wrong on short audio. The tools that have invested in this — almost always the ones built for non-Anglophone markets first — handle it gracefully. The ones that have not still ship the legacy UI.
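
For a concrete picture of the short-audio failure mode, here is a rough sketch with openai-whisper, which scores candidate languages from a single 30-second window; the clip path is a placeholder, and other tools' detectors may work differently.

```python
# Sketch: why auto-detection can guess wrong on short clips.
# Uses the open-source `openai-whisper` package; "clip.wav" is a placeholder path.
import whisper

model = whisper.load_model("large-v3")

# Whisper scores languages from one 30-second window; a short or
# mixed-language clip gives the detector very little evidence.
audio = whisper.load_audio("clip.wav")
audio = whisper.pad_or_trim(audio)  # pad or cut to exactly 30 seconds
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

_, probs = model.detect_language(mel)
for lang, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{lang}: {p:.2%}")
```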

Accent fairness — the WER gap nobody publishes

Independent fairness benchmarks have repeatedly shown that ASR accuracy varies by speaker demographic — accent, dialect, gender, age — by 10-30 percentage points on the same model. This is the WER gap nobody publishes voluntarily. The headline number is measured on a curated test set; the long tail of speakers gets substantially worse accuracy, and almost no vendor reports that gap transparently.

"99% accurate"

  • On a curated standard-English test set
  • Single-speaker, studio audio
  • Most major tools claim this
  • Tells you very little about your audio

What you should ask for

  • WER broken out by accent group
  • WER on conversational, multi-speaker audio
  • WER on your specific recordings (real test)
  • Code-switching benchmark, if applicable

How to interpret a vendor accuracy claim

If a vendor cannot answer the second set of questions, that is itself information. The vendors that have invested in accent fairness have specific numbers ready — by accent group, with caveats. The ones that have not will pivot back to the marketing number. That gap is your signal.
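
What "WER broken out by accent group" looks like as a concrete artifact is sketched below, using the jiwer library. The accent groups and reference/hypothesis pairs are invented placeholders; in a real test they would come from your own recordings and the vendor's output.

```python
# Sketch: per-accent-group WER on your own test set, using the `jiwer` package.
# The groups and transcripts below are placeholders, not real benchmark data.
from collections import defaultdict
import jiwer

# (accent group, human reference transcript, tool's hypothesis)
samples = [
    ("us_english",       "please send the report before friday",
                         "please send the report before friday"),
    ("indian_english",   "please send the report before friday",
                         "please send the report before fried day"),
    ("spanish_accented", "tengo que finish this report antes del weekend",
                         "tengo k finish this report antes the weekend"),
]

by_group = defaultdict(lambda: {"refs": [], "hyps": []})
for group, ref, hyp in samples:
    by_group[group]["refs"].append(ref)
    by_group[group]["hyps"].append(hyp)

for group, pair in sorted(by_group.items()):
    wer = jiwer.wer(pair["refs"], pair["hyps"])  # aggregate WER for the group
    print(f"{group:18s} WER = {wer:.1%}")
```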

Choosing a tool for bilingual research and journalism

  1. Test on a real recording from your work, not the vendor’s demo.
  2. Test on the worst recording you have — accented, noisy, code-switched. The tool that handles your worst case is the one to pick.
  3. Confirm code-switching support in writing. "Yes, supports both languages" is not the same as "yes, handles code-switching mid-sentence."
  4. Verify that exports preserve diacritics, non-Latin characters, and right-to-left scripts where relevant (a quick check is sketched after this list).
  5. For research, confirm the participant intake form covers consent for transcription and voice enrollment in the language the participant prefers.
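
For step 4, one quick check is to search the exported file for strings you know appear in the recording. A rough sketch, assuming a UTF-8 .srt export; the filename and the expected fragments are placeholders:

```python
# Sketch: spot-checking an exported transcript for diacritics, non-Latin
# characters, and right-to-left scripts. "export.srt" and the fragments
# below are placeholders for your own export and known content.
import unicodedata
from pathlib import Path

expected_fragments = [
    "señora",    # Latin diacritics
    "العربية",    # Arabic, a right-to-left script
    "日本語",     # non-Latin (CJK)
]

text = unicodedata.normalize("NFC", Path("export.srt").read_text(encoding="utf-8"))

for fragment in expected_fragments:
    found = unicodedata.normalize("NFC", fragment) in text
    print(f"{fragment!r}: {'ok' if found else 'MISSING or mangled'}")

# U+FFFD replacement characters are a tell that the export dropped bytes.
if "\ufffd" in text:
    print("warning: replacement characters found in export")
```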

A starting checklist for non-English work

Claim | What to ask | What to test
"Supports 60+ languages" | Which 6 are research-grade? | Run a real recording
"99% accuracy" | On which accent group? | Test on accented audio
"Auto language detection" | In a code-switched recording? | Mid-sentence Spanish/English
"International support" | Data residency in EU? LATAM? | Check storage region
"Speaker labels in any language" | Same persistence guarantees? | Two-session test

Vendor claim vs. real-world question

The tool that survives this checklist is the right tool for non-English work. The other 95% are perfectly fine for English meetings and English podcasts, but they will fail you the moment your audio looks like the world your sources actually live in. Pay the procurement cost up front; the alternative is rework, every recording, forever.

Multilingual is not a feature you bolt on. It is a posture about whose conversations the product was built to take seriously.
