Transcription for qualitative research: what UX teams keep getting wrong
Generic transcription tools betray researchers in ways that only show up at the analysis stage. Here is what UX teams keep getting wrong, and the stack that fixes it.
Why generic transcription tools betray researchers
Most transcription tools are built for meetings: short, structured, with formal turn-taking. Research interviews break that mold. Sessions run 45-90 minutes. There are often 2-4 people on the moderator side and 1-3 participants. Cross-talk is normal; moderators jump in with clarifying questions while participants are mid-story. Generic tools were not trained on this material, and they fail predictably.
| Generic tool defaults | What researchers need |
|---|---|
| Speaker labels reset every file | Persistent speaker IDs across sessions |
| Cross-talk collapses attribution | Robust handling of conversational overlap |
| Cannot search across studies | Search across the entire participant pool |
| Loses context at 60+ min length | Stable accuracy on long-form sessions |
| No tagging or annotation surface | Native annotation and code surface |
Longitudinal studies and the persistent-speaker problem
Longitudinal research — multiple sessions with the same participant over weeks or months — is where the persistent-speaker problem hits hardest. Without persistent voice IDs, each session relabels Participant 4 from scratch, and you end up renaming the same person manually for every session over the course of a 12-week study.
With persistent voiceprints, the math reverses. You enroll a participant once at the start of week one. Every subsequent session, recorded weeks or months apart, automatically attributes their voice. Multiply that by 8 participants in a longitudinal study and you have eliminated 80% of the relabeling work that has historically been baked into research ops.
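The mechanics behind "enroll once, attribute forever" are simple at their core: store one voiceprint embedding per participant and match new audio against it by similarity. Here is a minimal sketch of that idea; the class name, the embedding vectors, and the 0.75 threshold are all illustrative, not any vendor's actual API.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SpeakerRegistry:
    """Hypothetical persistent-speaker store: enroll a participant once,
    then attribute their voice in every later session."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.voiceprints = {}  # name -> enrolled embedding

    def enroll(self, name, embedding):
        self.voiceprints[name] = np.asarray(embedding, dtype=float)

    def identify(self, embedding):
        # Return the enrolled speaker whose voiceprint is most similar,
        # or None when nothing clears the threshold (an unknown voice).
        query = np.asarray(embedding, dtype=float)
        best_name, best_score = None, self.threshold
        for name, voiceprint in self.voiceprints.items():
            score = cosine(voiceprint, query)
            if score > best_score:
                best_name, best_score = name, score
        return best_name

# Week 1: enroll once. Week 7: the same voice still matches.
registry = SpeakerRegistry()
registry.enroll("Participant 4", [1.0, 0.0, 0.2])
who = registry.identify([0.9, 0.1, 0.2])  # a later session's embedding
```

Real systems derive the embeddings from the audio itself, but the registry pattern is the same: the match happens against a study-wide store, not against a single file.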
Coding, themes, and the “search every interview” workflow
Once your transcripts are clean and speaker-attributed, the analytical workflow that emerges is "search across the entire interview corpus." Instead of asking "what did Participant 4 say about pricing?" — which is what the manual coding workflow asks — you can ask "what did anyone say about pricing across all 47 interviews?" That is a different category of insight, and it depends entirely on having a clean, searchable, speaker-attributed corpus.
The tools that close this loop pair transcription with thematic search and coding. Dovetail, Marvin, EnjoyHQ, and Notably are the established players. Each has different strengths, but they all share a dependency: the input transcripts must be clean and attributed, or the analysis surfaces are useless. Transcription quality is upstream of analytical quality, full stop.
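Under the hood, "search everything" is just filtering over speaker-attributed segments. A toy sketch makes the dependency concrete (the `Segment` shape and study names are invented for illustration): if the `speaker` field is wrong, the second query silently returns the wrong evidence.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    study: str
    session: str
    speaker: str
    text: str

def search(corpus, term, speaker=None):
    """Naive cross-study search. Real tools use full-text indexes,
    but the filtering logic is the same idea."""
    term = term.lower()
    return [
        s for s in corpus
        if term in s.text.lower()
        and (speaker is None or s.speaker == speaker)
    ]

corpus = [
    Segment("pricing-study", "wk1-p4", "Participant 4", "The pricing page confused me."),
    Segment("onboarding", "wk2-p7", "Participant 7", "I never saw pricing until checkout."),
    Segment("pricing-study", "wk1-p4", "Moderator", "Tell me more about that."),
]

hits = search(corpus, "pricing")                        # any speaker, any study
p4 = search(corpus, "pricing", speaker="Participant 4") # one person's view
```

Both query shapes from the paragraph above fall out of the same corpus; the only prerequisite is that attribution was correct at transcription time.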
Privacy posture for participant data
Research participants are a regulated audience. IRBs at universities, ethics boards at clinical research orgs, and increasingly procurement at corporate research teams require clear answers to four questions: where is the data stored, who has access, how long is it retained, and is it used for any purpose other than the research it was collected for? Transcription vendors that cannot answer all four cleanly will fail any serious review.
- Audio retention: short-by-default, with a published policy. 30 days is a reasonable maximum for research workflows.
- Voiceprints: opt-in at the participant level, with a one-click delete and audit trail.
- Model training: the vendor must commit to not training on participant audio, in writing.
- BAAs: required for any work touching health-related research, even adjacent (mental health, accessibility, chronic conditions).
- Subprocessor list: every party with access to the audio should be named — your IRB will ask.
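A retention policy is only real if something enforces it. As a sketch of the "short-by-default" rule from the list above, a scheduled job can flag recordings past the window; the data shape here is assumed, not any particular vendor's storage model.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # short-by-default, matching the published policy

def expired(recordings, now=None):
    """Return the IDs of recordings past the retention window.
    `recordings` maps recording ID -> upload timestamp (UTC).
    A real job would delete the audio and write an audit-log entry."""
    now = now or datetime.now(timezone.utc)
    return [rid for rid, uploaded in recordings.items() if now - uploaded > RETENTION]
```

The point of the sketch is auditability: a reviewer can read the policy constant and the enforcement path in one place, which is exactly what an IRB asks for.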
A pragmatic stack we would actually run
| Layer | Tool | Why |
|---|---|---|
| Capture | Zoom Pro (per-track) | Isolated tracks make diarization trivial |
| Transcription | Voice-ID-first tool | Persistent speakers, low DER, BAA |
| Analysis | Dovetail / Notably | Speaker-aware coding and themes |
| Archive | Encrypted research storage | Retention controls, access logs |
| Export | Markdown + CSV | Future-proof, easy migration |
That stack costs less than most teams currently spend on per-minute human transcription, eliminates the cleanup tax, and survives any IRB review. It is also boringly composable — if you need to swap any layer, the data flows are clean. That is the right shape for research infrastructure: small, replaceable parts, well-defined contracts.
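The export layer is the cheapest place to prove the "replaceable parts" claim: plain Markdown and CSV need nothing but the standard library to produce or parse. A minimal sketch, assuming transcripts arrive as (speaker, text) pairs:

```python
import csv
import io

def to_markdown(segments):
    """Render (speaker, text) pairs as plain Markdown, one turn per block."""
    return "\n\n".join(f"**{speaker}:** {text}" for speaker, text in segments)

def to_csv(segments):
    """Render the same pairs as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["speaker", "text"])
    writer.writerows(segments)
    return buf.getvalue()

segments = [
    ("Moderator", "What did you expect to happen?"),
    ("Participant 4", "Honestly, I expected a free tier."),
]

md = to_markdown(segments)
table = to_csv(segments)
```

Any analysis tool that can read a spreadsheet can ingest the CSV, and the Markdown stays human-readable in a decade, which is the whole argument for this export format.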