STT vs TTS
Speech to text vs text to speech: the 2026 disambiguation guide for users who landed in the wrong product
Speech to text and text to speech share words but solve opposite problems. Here is the honest disambiguation, when each makes sense, and the tools that answer each direction in 2026.
Opposite products, overlapping vocabulary
A surprisingly large fraction of people who land on transcription product pages are looking for the opposite operation: they want to convert text to speech, not speech to text. The vocabulary overlaps so heavily that search engines routinely surface the wrong result. "Voice," "audio," "speech," and "text" appear in both directions, and the verbs "convert," "turn," "generate," and "make" are used identically. So a user who needs an AI voice generator to read a script aloud ends up on a transcription page, and a user who recorded an interview ends up reading about text-to-speech AI. Both leave frustrated.
This guide is the disambiguation. We will say plainly which direction is which, the tools that solve each one, and how to recognise — from your own search query — which product you actually need. We are a transcription company (speech-to-text); we do not sell text-to-speech. So this article will not pitch you on either; it will help you find the right tool whichever direction you came from.
The two directions, named clearly
Speech to text (STT, transcription)
- Input: audio file or live mic recording
- Output: text (transcript, captions, .srt)
- Examples: meeting notes, podcast transcripts, lecture notes
- Tools: TigerScribe, Otter, Whisper, AssemblyAI, Deepgram, Rev
Text to speech (TTS, voice synthesis)
- Input: written text (paragraph, article, document)
- Output: audio file (MP3, WAV) of synthetic voice reading the text
- Examples: audiobook narration, AI voice over for video, accessibility
- Tools: ElevenLabs, NaturalReader, Murf, Play.ht, Google Cloud TTS
A useful test: if you have a recording and want to read what was said, you need transcription (speech to text). If you have something written and want to hear it spoken, you need text to speech. The verbs "convert" and "turn" sit on both sides; they tell you nothing about direction. The objects do — "audio to text" is transcription, "text to audio" is synthesis.
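As a rough illustration of that test, the TypeScript sketch below reads the direction from the objects on either side of "to" and ignores the verb entirely. The word lists and the names `directionOf` and `kindOf` are assumptions made for this example; they are not how any search engine or product actually classifies queries.

```typescript
type Direction = "speech-to-text" | "text-to-speech" | "unclear";

// Words that name each kind of object. Illustrative lists, not exhaustive.
const AUDIO_WORDS = ["audio", "speech", "voice", "recording", "mp3", "wav"];
const TEXT_WORDS = ["text", "transcript", "script", "document", "article"];

// Decide whether a group of words talks about audio, text, or neither/both.
function kindOf(words: string[]): "audio" | "text" | null {
  const hasAudio = words.some((w) => AUDIO_WORDS.indexOf(w) !== -1);
  const hasText = words.some((w) => TEXT_WORDS.indexOf(w) !== -1);
  if (hasAudio === hasText) return null; // neither or both: no clear signal
  return hasAudio ? "audio" : "text";
}

// Classify a query by what stands before and after the first "to" / "into".
function directionOf(query: string): Direction {
  const words = query
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0)
    .map((w) => (w === "into" ? "to" : w));

  const pivot = words.indexOf("to");
  if (pivot === -1) return "unclear"; // no direction stated at all

  const source = kindOf(words.slice(0, pivot));
  const target = kindOf(words.slice(pivot + 1));

  if (source === "audio" && target === "text") return "speech-to-text";
  if (source === "text" && target === "audio") return "text-to-speech";
  return "unclear";
}

// directionOf("convert audio to text")  returns "speech-to-text"
// directionOf("turn text into speech")  returns "text-to-speech"
// directionOf("ai voice generator")     returns "unclear"
```

Note that the verb never enters the decision: swapping "convert" for "turn" or "make" changes nothing, while swapping the two objects flips the result.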
Where text to speech belongs
Text to speech is a real product market with legitimate uses. The most common use cases for TTS in 2026 are video voice-overs (YouTube creators using an AI voice-over to narrate without recording themselves), audiobook narration (publishers turning manuscripts into audio without paying voice actors), accessibility (screen readers reading text aloud for blind or dyslexic users), e-learning (course narration), IVR and phone systems (the voice you hear when calling customer support), in-car navigation, and, increasingly, conversational AI assistants. Each of these wants the opposite of transcription: text in, audio out.
The leading text to speech tools as of 2026 include: ElevenLabs (widely treated as the benchmark for realistic AI voices and voice cloning, with paid plans and a limited free tier), NaturalReader (popular for accessibility and document reading, with a free tier and a paid Pro plan), Murf (focused on marketing and video voice-overs, with a clean web UI), Play.ht (similar to Murf, more developer-focused), Google Cloud Text-to-Speech (an API for developers, billed per character, very high quality), Apple Speech (built into macOS and iOS for system narration), Microsoft Azure Speech (the equivalent API on Azure), and an ecosystem of mobile apps that wrap the major APIs for end users. For "best ai voice generator" or "free ai voice generator," ElevenLabs is the most-cited starting point; for free use without account hassle, the browser TTS API (window.speechSynthesis in Chrome) is genuinely capable for short snippets.
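For readers who want to try the browser route, here is a minimal sketch of calling window.speechSynthesis from TypeScript. `readAloud`, its options, and the `env` parameter are names invented for this example; `SpeechSynthesisUtterance`, `speak`, and `cancel` are the standard Web Speech API members. Voice availability and quality vary by browser and operating system, and some browsers only allow speech that is triggered by a user action such as a click.

```typescript
// Options for this sketch; the names are illustrative, not part of any library.
interface ReadAloudOptions {
  lang?: string; // BCP 47 language tag, e.g. "en-US"
  rate?: number; // 1 is normal speed; the spec allows 0.1 to 10
}

// Speaks `text` with the browser's built-in synthesiser.
// Returns false when the Web Speech API is not available (for example in Node).
// `env` defaults to the global object; it is a parameter only so the function
// can be exercised outside a browser with a stand-in object.
function readAloud(
  text: string,
  options: ReadAloudOptions = {},
  env: any = globalThis
): boolean {
  const synth = env.speechSynthesis;
  const Utterance = env.SpeechSynthesisUtterance;
  if (!synth || typeof Utterance !== "function") return false;

  const utterance = new Utterance(text);
  utterance.lang = options.lang ?? "en-US";
  utterance.rate = options.rate ?? 1;

  synth.cancel(); // drop anything still queued from an earlier call
  synth.speak(utterance); // queue the new utterance; playback is asynchronous
  return true;
}
```

In a page this might be wired to a button, for example `button.addEventListener("click", () => readAloud("Hello from the browser"))`. Outside a browser the function simply returns false instead of throwing.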
Free text to speech: what genuinely exists
For "free text to speech," the honest options in 2026 fall into three groups: engines already built into your browser or operating system, web tools such as TTSReader that wrap the browser API in a paste-and-play UI (free with ads), and the free tiers of the major commercial tools. For "best free text to speech online" specifically, the trade-off is voice quality versus usage limits: the browser-native APIs are unlimited but the voices sound robotic; the commercial free tiers give realistic voices but cap usage. The main options:
- Browser Web Speech API (Chrome, Edge): unlimited, robotic, instant.
- ElevenLabs free tier: 10K characters per month, realistic voices, premium voice cloning paywalled.
- NaturalReader free tier: daily limit, decent voices, document upload supported.
- Murf 10-minute trial: production quality, then paid.
- Apple Speech: built into macOS and iOS, free, unlimited, decent quality.
- Microsoft Read Aloud: built into Word and Edge, free, unlimited.
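To judge whether a script fits a capped free tier, a quick character count is usually enough. The sketch below uses the 10K characters per month figure quoted above for ElevenLabs; `checkQuota` is a hypothetical helper, and it assumes every character (spaces and punctuation included) counts toward the limit, which may not match how a given provider meters usage.

```typescript
interface QuotaCheck {
  characters: number; // length of the script as counted here
  remaining: number; // characters left after this script (negative if over)
  fits: boolean;
}

// Compare a script against a monthly character allowance.
// The default limit is the free-tier figure quoted in the list above;
// check the provider's current plan page before relying on it.
function checkQuota(
  script: string,
  usedThisMonth: number = 0,
  monthlyLimit: number = 10000
): QuotaCheck {
  const characters = script.length; // counts UTF-16 code units
  const remaining = monthlyLimit - usedThisMonth - characters;
  return { characters, remaining, fits: remaining >= 0 };
}
```

Called as `checkQuota(script, 4000)`, it reports whether the script still fits after 4,000 characters have already been used. As a rule of thumb, at roughly six characters per English word including the space, 10K characters is on the order of 1,500 to 2,000 words.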
Voice cloning — a third adjacent product
Voice cloning is a subset of text to speech where you provide a sample of a real human voice (your own, or an actor's with permission) and the tool generates new speech in that voice. ElevenLabs is the dominant tool here; Resemble.ai and a few others compete. "Voice cloning free" exists in limited form on most platforms, typically one to three cloned voices on free tiers. The output is now often hard to tell apart from the real speaker, which is why the ethical and legal context matters: cloning a voice without consent is increasingly restricted in many jurisdictions (for example Tennessee's ELVIS Act and California's digital-replica laws AB 2602 and AB 1836), while the EU AI Act and California's SB 942 add disclosure requirements for AI-generated content.
For users searching "ai voice clone" or "voice cloning free," the legitimate uses include creating a personalised audiobook narrator from your own voice, generating a consistent narrator voice for a YouTube channel, or preserving a voice for accessibility (someone losing speech to ALS banking their own voice for future communication). Voice cloning is not transcription; if your goal is to convert recorded audio into text, you want transcription instead.
When the search "text to speech" actually means transcription
A surprisingly common confusion: users search for "text to speech" when they actually mean transcription. The underlying need is usually something like "I have a voice recording of a meeting and want to convert it to text." That describes transcription (audio → text), but the two phrases use the same words in opposite order, so "text to speech" is what gets typed into Google. If you have a recording and want text out, you need a transcription tool, not a text to speech tool; the names point in opposite directions.
The reverse confusion happens too: "I have a document I want read aloud" gets typed as "audio to text" by users who are thinking about the audio they want rather than the text they have. If you have writing and want spoken audio, you need text to speech. The product test: what is your starting point, text or audio? If text, you want TTS. If audio, you want STT (transcription).