Translate audio to text: multilingual transcription that actually works
How to translate audio to text and translate voice to text reliably across languages — accents, code-switching, and the limits you should know.
Two different jobs that share a phrase
When someone searches "translate audio to text," they could mean one of two different operations, and the result they want depends entirely on which one. The first is transcribe-then-translate: capture the original-language words faithfully, then produce a separate translated version. The second is translate-on-the-fly: take Spanish audio in, get English text out directly. Both are useful; both have different accuracy profiles; the marketing pages happily blur them together.
The same ambiguity hits "translate voice to text." A voice-to-text app on a phone that turns spoken Spanish into English text is doing translate-on-the-fly; a transcription service that captures the Spanish and then translates a copy to English is doing transcribe-then-translate. The choice between them matters more than people realize.
Transcribe-then-translate: the safer default
Transcribe-then-translate is the workflow most professional pipelines use. You capture the source language with a speech model, then run a separate translation step on the resulting text. The advantage is that you keep both versions: the faithful transcript in the original language and the translation as a derivative. Errors in either layer are inspectable separately, and you can re-translate without re-transcribing if you change your mind about the target language.
Modern speech models do this well in dozens of languages. Spanish, French, Mandarin, Arabic, Hindi, German, Portuguese, Japanese — all standard. The transcription pass produces text in the source language, the translation pass produces text in the target language, and a good UI shows them side by side.
- Pro: original-language transcript preserved for audit.
- Pro: translation is a separate pass, swappable for a different target.
- Pro: errors localize to one of the two passes.
- Con: running two steps means slightly longer wall-clock time.
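To make the two-pass shape concrete, here is a minimal sketch, assuming the open-source whisper package for the first pass; the model size, file names, and the translate_fn hook are illustrative placeholders, not a specific vendor's pipeline.

```python
import whisper

def transcribe_then_translate(audio_path: str, source_lang: str, translate_fn):
    """Pass 1: faithful source-language transcript. Pass 2: translation as a
    derivative of that text. Both are returned so either layer can be audited
    or redone on its own."""
    model = whisper.load_model("small")          # model size is a placeholder
    result = model.transcribe(audio_path, language=source_lang)
    source_text = result["text"]
    target_text = translate_fn(source_text)      # any translation service or LLM
    return source_text, target_text

# Hypothetical wiring: a Spanish interview, with whatever translator you prefer.
# es_text, en_text = transcribe_then_translate("interview_es.mp3", "es", my_translator)
```

Because translate_fn is just a function, re-translating to a different target language means re-running pass 2 on the saved source text, never re-transcribing the audio.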
Translate-on-the-fly: when it actually fits
Translate-on-the-fly skips the source-language transcript entirely. Spanish audio goes in; English text comes out. Whisper offers this as a built-in mode (translation to English specifically, supported by its multilingual training data). It is faster than the two-step pipeline, and the experience is cleaner if you only ever wanted the target language. The downside: if the translation has an error, you have no source-language reference to compare against.
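For reference, this is what the built-in mode looks like with the open-source whisper package; the model size and file name are placeholders, and task="translate" targets English only.

```python
import whisper

model = whisper.load_model("small")
# task="translate" asks the model to emit English text directly;
# there is no Spanish transcript to fall back on afterwards.
result = model.transcribe("interview_es.mp3", task="translate")
print(result["text"])
```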
In practice, translate-on-the-fly works well for low-stakes use cases — captioning a foreign-language video for personal use, getting the gist of a conversation, translate voice to text in a chat app — and is risky for anything legally or editorially important. If a single mistranslated word matters, the transcribe-then-translate path is the safer choice.
Accents and code-switching: the hard cases
Even within a single language, accents and code-switching are where transcription accuracy drops fastest. A Spanish-language transcript of someone speaking Castilian Spanish reads differently from one of Mexico-City Spanish, and the difference between models on these dialects is often larger than the difference between models on different languages. Code-switching — alternating between two languages mid-sentence — is the worst case; most models commit to one language per recording and produce visibly degraded output where the speaker switches.
- 7-10% WER on a standard accent (same language as the training data)
- 20-35% WER on a heavy accent (same language, different region)
- 40%+ WER on code-switched speech (English/Spanish in one sentence)
Accent-handling has gotten meaningfully better in 2025 and 2026 with multilingual training data, but it is still the hardest problem in transcription. If your audio has heavy accents or frequent code-switching, do not assume the marketing-page accuracy numbers apply. Run your own 60-second test before committing.
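One way to run that test: hand-transcribe a 60-second clip yourself, run the same clip through the model, and score the output. A minimal sketch, assuming the jiwer package and hypothetical file names:

```python
import jiwer

# reference.txt: your own careful transcript of the clip.
# hypothesis.txt: the model's output for the same clip.
reference = open("reference.txt", encoding="utf-8").read()
hypothesis = open("hypothesis.txt", encoding="utf-8").read()

print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")  # compare to the ranges above
```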
A practical workflow for multilingual recordings
Three steps that get most multilingual jobs to a usable result without overengineering.
1. Set the source language explicitly. Auto-detect is convenient but wrong often enough that you should specify when you can.
2. Transcribe in the source language first. Faithful capture is the foundation; everything else builds on it.
3. Translate as a second pass to the target language. Keep both files. If you ever need to verify a translation choice, you have the source.
For the second pass, modern translation services (DeepL, Google Translate, GPT-4o, Claude, etc.) handle paragraph-level translation extremely well. You get a faithful translation that respects the structure of the transcript instead of a literal word-for-word dump.
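As one illustration of that second pass, here is a sketch using the official deepl package; the auth-key environment variable and file names are assumptions, and any of the services above would slot in the same way.

```python
import os
import deepl

translator = deepl.Translator(os.environ["DEEPL_AUTH_KEY"])  # assumed env var

with open("interview_es.txt", encoding="utf-8") as f:        # pass-1 output
    source_text = f.read()

result = translator.translate_text(source_text, target_lang="EN-US")

with open("interview_en.txt", "w", encoding="utf-8") as f:   # keep both files
    f.write(result.text)
```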