Transcription for dissertation interviews: what holds up to a defense
PhD-grade transcription is held to a different bar than business meetings. Your committee can ask to verify a quote on the spot. This guide walks through accuracy thresholds, IRB compatibility, citation conventions, and budget-realistic options for dissertation work.
Why dissertation interviews are different from any other transcript
Most transcription is meant to be ephemeral. A meeting transcript is read once and forgotten. A podcast transcript is published as a courtesy. A dissertation interview transcript, by contrast, will be read carefully by a committee that may pause on a single sentence and ask "where exactly did the participant say that, and can you play me that moment from the recording?" The transcript is part of the audit trail of your dissertation. It cannot be approximate.
That puts dissertation interview transcription under different pressure than business transcription. The accuracy threshold is higher because the wording itself becomes evidence: a participant who says "I almost left the program" is making a different claim than one who says "I left the program," and a sloppy auto-transcript can erase that distinction. Speaker attribution has to be flawless because mis-attributed quotes are worse than missing data. And the data has to be deletable, anonymizable, and IRB-compatible because every grad student lives under a research-ethics protocol.
Most generic transcription products were never designed for this. The tools that hold up to dissertation work share a specific feature set, covered section by section below. None of it is rocket science; the gap is that meeting-bot products are not built for this audience and treat every requirement as an afterthought.
The accuracy threshold that survives a defense
Word error rate (WER) is the most-quoted accuracy metric and the most misleading one. Vendors report accuracy in the 95-99% range (1-5% WER) for clean studio audio, which sounds reassuring until you ask what audio they tested on. Dissertation interviews are usually recorded in offices, coffee shops, kitchens, and over Zoom; none of those conditions match the vendor benchmark.
Realistic accuracy on dissertation-quality audio sits in the 91-96% range for the better tools, dropping to the mid-80s for accented speech, technical vocabulary (medical, legal, engineering), and noisy environments. A transcript at 92% accuracy (8% WER) has roughly 80 errors per thousand words: manageable for thematic analysis, frustrating for direct-quote work where every misheard word becomes a footnote.
| Condition | Best-in-class WER | Median WER |
|---|---|---|
| Studio recording, single speaker | 2-4% | 5-8% |
| Quiet office, two speakers | 4-7% | 8-12% |
| Zoom call, native speakers | 5-8% | 10-15% |
| Zoom call, accented English | 8-15% | 15-25% |
| Coffee shop, ambient noise | 12-20% | 20-30% |
| Field recording, technical jargon | 15-25% | 25-40% |
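To make those percentages concrete, here is a minimal sketch in plain Python that converts a WER figure into the number of corrections a recording implies. The 150-words-per-minute speaking rate is an assumption; time a pilot recording for a better estimate.

```python
# Rough correction workload implied by a word-error rate.
# Assumes ~150 spoken words per minute (typical conversational pace).

def expected_errors(wer: float, words_per_minute: int, audio_minutes: int) -> int:
    """Approximate count of misrecognized words at a given WER."""
    total_words = words_per_minute * audio_minutes
    return round(wer * total_words)

# One 60-minute interview under three conditions from the table above:
for condition, wer in [("quiet office", 0.06),
                       ("Zoom, accented English", 0.12),
                       ("coffee shop", 0.16)]:
    print(f"{condition}: ~{expected_errors(wer, 150, 60)} words to re-check")
# quiet office: ~540 words to re-check
# Zoom, accented English: ~1080 words to re-check
# coffee shop: ~1440 words to re-check
```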
For dissertation work, the right strategy is not to chase the lowest WER number. It is to choose a tool that exposes confidence per word, lets you quickly review the low-confidence spans against the original audio, and supports a clickable timestamp from any word back to the audio moment. That combination is what holds up under committee scrutiny: not "trust the AI," but "verify any word in two clicks."
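As an illustration of that review loop, the sketch below queues every word under a confidence threshold with a timestamp to jump to. The export format here (word, confidence, start time in seconds) and the 0.85 cutoff are assumptions; per-word confidence fields vary by vendor, so adapt the field names to whatever your tool actually emits.

```python
from typing import TypedDict

class Word(TypedDict):
    word: str
    confidence: float  # 0.0-1.0, per-word score from the STT engine
    start: float       # seconds into the recording

def review_queue(words: list[Word], threshold: float = 0.85) -> list[str]:
    """Every low-confidence word, with a timestamp for jumping to the audio."""
    queue = []
    for w in words:
        if w["confidence"] < threshold:
            m, s = divmod(int(w["start"]), 60)
            queue.append(f'{m:02d}:{s:02d}  "{w["word"]}"  (conf {w["confidence"]:.2f})')
    return queue

sample: list[Word] = [
    {"word": "almost", "confidence": 0.62, "start": 1394.2},
    {"word": "left", "confidence": 0.97, "start": 1394.8},
]
print("\n".join(review_queue(sample)))
# 23:14  "almost"  (conf 0.62)
```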
Verbatim vs clean: the methodological choice the tool must let you make
Verbatim transcription preserves everything: every "um," "uh," false start, repetition, hesitation, laugh, sigh, and pause. Clean (or "intelligent") verbatim removes filler words, normalizes false starts, and produces a more readable transcript at the cost of fidelity. Different qualitative methodologies require different choices. Discourse analysis needs full verbatim — the pauses are the data. Thematic analysis usually wants clean verbatim because the meaning, not the cadence, is what gets coded.
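To see what a silent clean-verbatim pass does to your data, here is a toy sketch; it is not any vendor's actual pipeline, but stripping filler tokens and collapsing immediate repetitions is roughly the transformation those modes apply.

```python
import re

FILLERS = re.compile(r"\b(um+|uh+|erm?)\b[,.]?\s*", re.IGNORECASE)
REPEATS = re.compile(r"\b(\w+)(\s+\1\b)+", re.IGNORECASE)  # "I I I" -> "I"

def clean_verbatim(verbatim: str) -> str:
    """Strip fillers and repeated words; collapse leftover whitespace."""
    text = FILLERS.sub("", verbatim)
    text = REPEATS.sub(r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_verbatim("Um, I I almost, uh, left the program."))
# "I almost, left the program."
```

Note the direction of the loss: the clean version can always be derived from the verbatim one, never the reverse, which is why the toggle has to be explicit and the verbatim layer preserved when your methodology needs it.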
The methodological mistake is letting the tool make this choice silently. Many AI transcription products default to clean verbatim and do not expose a toggle. If your methodology requires full verbatim and the tool stripped the fillers without telling you, your data is contaminated before analysis starts. The fix is straightforward: pick a tool that has an explicit verbatim/clean toggle per recording and document the setting in your methods chapter.
For mixed-methodology dissertations — say, three verbatim discourse-analysis recordings and twenty thematic-analysis interviews — make sure the toggle is per-recording, not project-global. Otherwise you will end up with the wrong setting on at least one batch.
IRB approvals and AI transcription
Every university IRB now asks whether AI transcription will be used and, if so, what the data flow looks like. Several universities have published explicit guidance: AI transcription must be disclosed in the data-management plan, the vendor must offer a written guarantee that participant audio will not be used to train models, and retention policies must align with the protocol's specified destruction date.
Concretely, the IRB reviewer wants answers to five questions:

- Where is the audio processed (US, EU, on-device)?
- Who has access (vendor employees, sub-processors, no one)?
- Is the audio used to train models (yes, no, opt-out)?
- Can files be deleted on demand (yes, within X hours; no; only at the end of the subscription)?
- What encryption is in place (TLS in transit, AES-256 at rest, both, neither)?

Tools designed for the researcher market answer these in writing on a public-facing page; tools designed for sales calls answer them in legal jargon buried in section 7.4 of the EULA.
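One way to collect those answers before the first interview is a plain record you fill in per candidate vendor and paste into the data-management plan. A sketch below; the field names are mine, not an IRB standard.

```python
from dataclasses import dataclass, asdict

@dataclass
class VendorIRBAnswers:
    processing_region: str   # "US", "EU", or "on-device"
    who_has_access: str      # e.g. "no human access", "named sub-processors"
    trains_on_audio: str     # "no", "yes", or "opt-out", in writing
    deletion_on_demand: str  # e.g. "yes, within 24 hours"
    encryption: str          # e.g. "TLS in transit, AES-256 at rest"

# Hypothetical vendor answers, filled in from a public-facing policy page:
example = VendorIRBAnswers(
    processing_region="EU",
    who_has_access="no human access; no sub-processors",
    trains_on_audio="no, guaranteed in writing",
    deletion_on_demand="yes, within 24 hours",
    encryption="TLS in transit, AES-256 at rest",
)
print(asdict(example))
```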
Doctoral students who have to navigate this for the first time often underestimate how much friction comes from a vendor whose privacy posture does not survive an IRB reviewer's first read. Pick a tool whose IRB-relevant answers are documented before the first interview, not after.
Affordable for grad students
Doctoral students do not have research-ops budgets. A 50-hour dissertation interview corpus, transcribed by a human service at $1.50 per audio minute, costs $4,500, which is more than a month's stipend in most programs. AI transcription brings that to under $50 for the same corpus. The savings are real, and they are the reason almost every dissertation written after 2024 uses AI transcription somewhere in the workflow.
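The arithmetic, using the human rate quoted above and an assumed AI rate of $0.015 per audio minute (check your vendor's actual pricing):

```python
corpus_minutes = 50 * 60  # 50 hours of interviews

human_rate = 1.50   # USD per audio minute, human transcription service
ai_rate = 0.015     # USD per audio minute, assumed AI tier

print(f"Human service: ${corpus_minutes * human_rate:,.0f}")  # Human service: $4,500
print(f"AI service:    ${corpus_minutes * ai_rate:,.0f}")     # AI service:    $45
```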
That said, grad-student budgets still matter. Tools priced for enterprise — $30/seat/month, three-seat minimum — punish the solo dissertation researcher. Look for student or academic tiers, "free under N hours per month" plans, or per-minute pricing without seat minimums. Several vendors offer a 50% academic discount with a verifiable .edu email; ask before you pay full price.
One more honest note: do not buy the cheapest option if it cannot pass the IRB checklist. The savings on a non-compliant vendor evaporate the moment you have to redo your interviews because the IRB rejects your data-management plan. Cheap and IRB-friendly is the only target worth aiming at.
Citing AI-transcribed quotes
The methods chapter should disclose that AI transcription was used, name the tool and version, describe the verbatim/clean choice, and report the human-review process applied. A typical disclosure reads: "Audio recordings were transcribed using TigerScribe v2.1 in clean-verbatim mode. The author reviewed every transcript against the original audio, correcting attribution errors and low-confidence spans." That is enough for most committees.
For direct quotes in the dissertation body, include a timestamp reference (e.g., "P14, 23:14") so any reader — including the external examiner — can verify the quote against the recording. The audit trail is part of the rigor case for AI-transcribed qualitative data; do not hide it.
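If your tool exports word-level start times, generating those references is mechanical. A minimal sketch, using the participant-code and minutes:seconds convention from the example above; adjust to whatever format your committee expects:

```python
def quote_citation(participant: str, start_seconds: float) -> str:
    """Format a verifiable pointer from a quote back into the recording."""
    minutes, seconds = divmod(int(start_seconds), 60)
    return f"{participant}, {minutes}:{seconds:02d}"

print(quote_citation("P14", 1394.2))  # P14, 23:14
```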
Non-English interviews and translation
Many dissertation projects involve interviews in languages other than English — fieldwork abroad, multilingual diaspora studies, or simply participants who are more comfortable in their native language. The transcription strategy here has two layers: source-language transcription (the verbatim audio in the spoken language) and translated transcription (the English version your committee will likely read). Both layers matter for rigor, and AI-assisted workflows can handle each, but they fail at different rates.
Source-language transcription accuracy varies widely by language. Spanish, French, German, and Mandarin sit in roughly the same accuracy band as English on modern STT engines. Less-resourced languages (Swahili, Tamil, Quechua, regional dialects) drop to 70-85% accuracy, i.e. 15-30% WER, which is workable for thematic analysis but unreliable for direct-quote extraction. Always pilot the tool on a sample recording in your target language before committing the entire study corpus.
Translation should always be a separate step from transcription, with the source-language transcript preserved as a primary document. Combining transcription and translation in one AI pass loses fidelity — small mistranscriptions compound into translation errors that no committee will catch. Best practice: AI-transcribe in the source language, human-review for accuracy, then translate (AI or human) into English with the source-language transcript visible alongside. Cite both versions in the methods chapter and store both in the data archive.
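The shape of that pipeline, as a sketch with placeholder stages: the actual engine, reviewer, and translator are up to you, and only the ordering and the two archived artifacts are the point.

```python
from pathlib import Path

# Placeholder stages; swap in your STT engine, your own review pass,
# and your AI or human translator.
def transcribe(audio: Path, language: str) -> str:
    raise NotImplementedError("call your STT engine here")

def human_review(transcript: str, audio: Path) -> str:
    raise NotImplementedError("correct low-confidence spans against the audio")

def translate(transcript: str, target: str) -> str:
    raise NotImplementedError("separate translation pass, AI or human")

def process_interview(audio: Path, source_lang: str, archive: Path) -> None:
    source = transcribe(audio, language=source_lang)  # layer 1: source language
    source = human_review(source, audio)              # accuracy check comes first
    english = translate(source, target="en")          # layer 2: English

    # Archive BOTH layers; the source-language transcript is the primary document.
    (archive / f"{audio.stem}.{source_lang}.txt").write_text(source)
    (archive / f"{audio.stem}.en.txt").write_text(english)
```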
A workflow that scales from N=8 to N=80
The workflow this guide has been building is short: record, AI-transcribe with the verbatim/clean setting your methodology requires, spend roughly 15 minutes per interview reviewing low-confidence spans and speaker attributions against the audio, then anonymize and archive with timestamps intact. That workflow holds for an N=8 pilot study and scales linearly to an N=80 multi-site dissertation. The bottleneck is always the same: the 15-minute human review, which you cannot skip without sacrificing rigor. Plan for it in your timeline, as the sketch below shows. The savings from skipping a human transcription service still leave you weeks ahead; just do not pretend the AI is the only checker.
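For the timeline math, one last sketch using the 15-minute figure from the text; substitute the per-interview review time you measure in your own pilot:

```python
def review_hours(n_interviews: int, minutes_per_review: int = 15) -> float:
    """Total human-review time to schedule, in hours."""
    return n_interviews * minutes_per_review / 60

for n in (8, 80):
    print(f"N={n}: ~{review_hours(n):.0f} hours of human review")
# N=8: ~2 hours of human review
# N=80: ~20 hours of human review
```

Twenty hours is a week of evenings, not a semester. Budget it explicitly and the rigor case for AI-assisted transcription largely takes care of itself.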