Translate audio to text: multilingual transcription that actually works
How to translate audio to text and translate voice to text reliably across languages — accents, code-switching, and the limits you should know.
Two different jobs that share a phrase
When someone searches "translate audio to text," they could mean one of two different operations, and the result they want depends entirely on which one. The first is transcribe-then-translate: capture the original-language words faithfully, then produce a separate translated version. The second is translate-on-the-fly: take Spanish audio in, get English text out directly. Both are useful; both have different accuracy profiles; the marketing pages happily blur them together.
The same ambiguity hits "translate voice to text." A voice-to-text app on a phone that turns spoken Spanish into English text is doing translate-on-the-fly; a transcription service that captures the Spanish and then translates a copy to English is doing transcribe-then-translate. The choice between them matters more than people realize.
Transcribe-then-translate: the safer default
Transcribe-then-translate is the workflow most professional pipelines use. You capture the source language with a speech model, then run a separate translation step on the resulting text. The advantage is that you keep both versions: the faithful transcript in the original language and the translation as a derivative. Errors in either layer are inspectable separately, and you can re-translate without re-transcribing if you change your mind about the target language.
Modern speech models do this well in dozens of languages. Spanish, French, Mandarin, Arabic, Hindi, German, Portuguese, Japanese — all standard. The transcription pass produces text in the source language, the translation pass produces text in the target language, and a good UI shows them side by side.
- Pro: original-language transcript preserved for audit.
- Pro: translation is a separate pass, swappable for a different target.
- Pro: errors localize to one of the two passes.
- Con: running two steps means slightly longer wall-clock time.
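To make the two-pass shape concrete, here is a minimal sketch, assuming the open-source whisper package for the first pass; the model size, file names, and the translate_fn hook are illustrative placeholders, not a specific vendor's pipeline.

```python
import whisper

def transcribe_then_translate(audio_path: str, source_lang: str, translate_fn):
    """Pass 1: faithful source-language transcript. Pass 2: translation as a
    derivative of that text. Both are returned so either layer can be audited
    or redone on its own."""
    model = whisper.load_model("small")          # model size is a placeholder
    result = model.transcribe(audio_path, language=source_lang)
    source_text = result["text"]
    target_text = translate_fn(source_text)      # any translation service or LLM
    return source_text, target_text

# Hypothetical wiring: a Spanish interview, with whatever translator you prefer.
# es_text, en_text = transcribe_then_translate("interview_es.mp3", "es", my_translator)
```

Because translate_fn is just a function, re-translating to a different target language means re-running pass 2 on the saved source text, never re-transcribing the audio.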
Translate-on-the-fly: when it actually fits
Translate-on-the-fly skips the source-language transcript entirely. Spanish audio goes in; English text comes out. Whisper offers this as a built-in mode (translation to English specifically, supported by its multilingual training data). It is faster than the two-step pipeline, and the experience is cleaner if you only ever wanted the target language. The downside: if the translation has an error, you have no source-language reference to compare against.
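For reference, this is what the built-in mode looks like with the open-source whisper package; the model size and file name are placeholders, and task="translate" targets English only.

```python
import whisper

model = whisper.load_model("small")
# task="translate" asks the model to emit English text directly;
# there is no Spanish transcript to fall back on afterwards.
result = model.transcribe("interview_es.mp3", task="translate")
print(result["text"])
```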
In practice, translate-on-the-fly works well for low-stakes use cases — captioning a foreign-language video for personal use, getting the gist of a conversation, translate voice to text in a chat app — and is risky for anything legally or editorially important. If a single mistranslated word matters, the transcribe-then-translate path is the safer choice.
Accents and code-switching: the hard cases
Even within a single language, accents and code-switching are where transcription accuracy drops fastest. A Spanish-language transcript of someone speaking Castilian Spanish reads differently from one of Mexico-City Spanish, and the difference between models on these dialects is often larger than the difference between models on different languages. Code-switching — alternating between two languages mid-sentence — is the worst case; most models commit to one language per recording and produce visibly degraded output where the speaker switches.
- 7-10% WER on a standard accent (same language as the training data)
- 20-35% WER on a heavy accent (same language, different region)
- 40%+ WER on code-switched speech (English/Spanish in one sentence)
Accent-handling has gotten meaningfully better in 2025 and 2026 with multilingual training data, but it is still the hardest problem in transcription. If your audio has heavy accents or frequent code-switching, do not assume the marketing-page accuracy numbers apply. Run your own 60-second test before committing.
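One way to run that test: hand-transcribe a 60-second clip yourself, run the same clip through the model, and score the output. A minimal sketch, assuming the jiwer package and hypothetical file names:

```python
import jiwer

# reference.txt: your own careful transcript of the clip.
# hypothesis.txt: the model's output for the same clip.
reference = open("reference.txt", encoding="utf-8").read()
hypothesis = open("hypothesis.txt", encoding="utf-8").read()

print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")  # compare to the ranges above
```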
A practical workflow for multilingual recordings
Three steps that get most multilingual jobs to a usable result without overengineering.
1. Set the source language explicitly. Auto-detect is convenient but wrong often enough that you should specify when you can.
2. Transcribe in the source language first. Faithful capture is the foundation; everything else builds on it.
3. Translate as a second pass to the target language. Keep both files. If you ever need to verify a translation choice, you have the source.
For the second pass, modern translation services (DeepL, Google Translate, GPT-4o, Claude, etc.) handle paragraph-level translation extremely well. You get a faithful translation that respects the structure of the transcript instead of a literal word-for-word dump.
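As one illustration of that second pass, here is a sketch using the official deepl package; the auth-key environment variable and file names are assumptions, and any of the services above would slot in the same way.

```python
import os
import deepl

translator = deepl.Translator(os.environ["DEEPL_AUTH_KEY"])  # assumed env var

with open("interview_es.txt", encoding="utf-8") as f:        # pass-1 output
    source_text = f.read()

result = translator.translate_text(source_text, target_lang="EN-US")

with open("interview_en.txt", "w", encoding="utf-8") as f:   # keep both files
    f.write(result.text)
```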