How to convert YouTube videos into structured notes without watching them twice
Long-form video is the worst format for retention. A good YouTube-to-notes pipeline turns 90 minutes of watching into 5 minutes of skimmable, searchable notes — without losing the gold.
Why you keep re-watching the same video
Long-form YouTube video is one of the worst formats ever invented for information retention. Lectures average 8% recall after a week. Conversational podcasts average less. The reason you keep going back to the same Andrej Karpathy video, the same Lex Fridman interview, is not that the material was unmemorable — it is that the format makes retention almost impossible without external scaffolding.
The fix is to treat the video as a source and turn it into something you can actually skim. That something is a structured notes document: an outline, key claims, named speakers, and timestamps for the moments you will want to revisit. The tooling to produce this is now cheap and fast. The bottleneck has shifted from producing notes to whether you can trust the notes your tool produces.
- 8%: lecture recall after 7 days with no notes
- 5x: skim speed, notes vs. video
- 90 min: average length of long-form video
The three layers of a good notes workflow
Flat transcript
- 90 minutes as a wall of words
- No structure, no outline
- Speakers labeled "Speaker 1, 2"
- Easy to search, hard to study
Structured notes
- Outline with skimmable headings
- Bullet-pointed key claims
- Named speakers + timestamps
- Linked back to source for verification
Notes that work for retention have three layers underneath the surface: a clean transcript, an outline that compresses each ten-minute chunk into a headline, and a key-claims pass that surfaces the 8-15 statements you will actually want to remember. The first two are now generated automatically by most tools. The third is where quality varies wildly.
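The outline layer starts from a simple grouping step, which can be sketched as follows. This is a minimal illustration, not any particular tool's implementation: the `Segment` shape, the `chunk_segments` helper, and the sample lines are all assumptions made up for the example. It only buckets transcript segments into ten-minute windows; writing the headline for each window is left to whatever summarizer you use.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds from the start of the video
    speaker: str
    text: str


def chunk_segments(segments, window_seconds=600):
    """Group transcript segments into fixed windows (600 s = ten minutes)."""
    chunks = {}
    for seg in segments:
        index = int(seg.start // window_seconds)
        chunks.setdefault(index, []).append(seg)
    # Return the non-empty windows in time order as (window_start, segments).
    return [(i * window_seconds, chunks[i]) for i in sorted(chunks)]


# Hypothetical sample data, for illustration only.
segments = [
    Segment(12.0, "Host", "Welcome to the show."),
    Segment(595.5, "Guest", "That is the core idea."),
    Segment(640.0, "Host", "Let's move on to training."),
    Segment(1900.0, "Guest", "Here is the caveat."),
]

for start, group in chunk_segments(segments):
    print(f"{int(start // 60):02d}:00  {len(group)} segment(s)")
```

Windows with no speech are skipped rather than emitted empty, so a long silence does not produce a blank heading.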
Step-by-step: from URL to outline
1. Paste the YouTube URL into your tool, or download the video and audio if your tool ingests files.
2. Run transcription with diarization on. Confirm speaker labels in the first 90 seconds, which is when speakers tend to introduce themselves.
3. Generate an outline pass: 5-9 chapter headings with timestamps. Edit headings ruthlessly; the goal is skimmable, not exhaustive.
4. Run a key-claims pass: 8-15 bullet points across the whole video. Each one should be a statement, not a topic.
5. Tag any "I need to come back to this" moments with timestamps. These are the 5-second loops you will revisit weeks later.
6. Export to your notes app of choice with a link back to the YouTube URL and timestamps preserved.
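The targets in the steps above (5-9 headings, 8-15 claims, statements rather than topics) are easy to check mechanically before you export. The sketch below is one hypothetical way to do that; the `Notes` structure, the `review` function, and the four-word heuristic for spotting topic labels are assumptions for illustration, not part of any tool.

```python
from dataclasses import dataclass, field


@dataclass
class Notes:
    url: str
    headings: list = field(default_factory=list)  # (seconds, heading) pairs
    claims: list = field(default_factory=list)    # (seconds, statement) pairs
    revisit: list = field(default_factory=list)   # "come back to this" timestamps


def review(notes):
    """Return a list of warnings; an empty list means the notes pass."""
    warnings = []
    if not 5 <= len(notes.headings) <= 9:
        warnings.append(f"outline has {len(notes.headings)} headings, target is 5-9")
    if not 8 <= len(notes.claims) <= 15:
        warnings.append(f"{len(notes.claims)} key claims, target is 8-15")
    for seconds, statement in notes.claims:
        # Crude heuristic (an assumption): a statement needs at least four
        # words, while a topic label such as "Scaling laws" does not.
        if len(statement.split()) < 4:
            warnings.append(f"claim at {seconds}s reads like a topic: {statement!r}")
    return warnings
```

Running `review` on a draft with three headings and a claim like "Scaling laws" would return warnings for both, which is a cheap prompt to go back and edit before the notes land in your archive.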
Speaker labels for multi-host videos
For interview podcasts, panel discussions, and shows with recurring guests, speaker labels are not optional. Half the value of an interview transcript is "what did the guest say versus what did the host say" — collapse that distinction and you have a wall of opinions with no source attribution.
Tools that maintain persistent voice IDs across videos are dramatically more useful here. Once you have transcribed two episodes of "Acquired" or three Lex Fridman interviews, the recurring host is already named. New guests get tagged on first appearance and stay named for every future episode they appear in. That compounds — your notes archive gets more useful the more videos you process.
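How a tool keeps voice IDs persistent is not specified here, but one common approach is to compare a voice embedding from the new video against embeddings saved from earlier ones. The sketch below assumes that approach. The `SpeakerRegistry` class, the 0.85 threshold, and the three-number embeddings are all hypothetical; real embeddings come from a speaker-recognition model and have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SpeakerRegistry:
    def __init__(self, threshold=0.85):
        self.threshold = threshold  # assumed cut-off, tune for your model
        self.known = {}             # name -> reference voice embedding

    def enroll(self, name, embedding):
        """Remember a named voice so later videos can reuse the label."""
        self.known[name] = embedding

    def identify(self, embedding):
        """Return the best-matching known name, or None for an unseen voice."""
        best_name, best_score = None, self.threshold
        for name, reference in self.known.items():
            score = cosine(embedding, reference)
            if score >= best_score:
                best_name, best_score = name, score
        return best_name


registry = SpeakerRegistry()
registry.enroll("Host", [0.9, 0.1, 0.2])        # named in an earlier episode
print(registry.identify([0.88, 0.12, 0.21]))    # close to the enrolled host
print(registry.identify([0.1, 0.9, 0.3]))       # a new guest, not yet enrolled
```

A voice that returns `None` is the cue to ask for a name once and call `enroll`, after which that guest stays named in later videos.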
Exporting to Notion, Obsidian, and Apple Notes
| Notes app | Best format | Speaker tags? | Timestamps as links? |
|---|---|---|---|
| Notion | Markdown or direct API | Yes (callouts) | Yes (web links) |
| Obsidian | Markdown with YAML | Yes (formatted) | Yes |
| Apple Notes | Rich text | Limited | Limited |
| Roam / Logseq | Markdown blocks | Yes | Yes |
| Google Docs | .docx | Yes | Yes (hyperlinks) |
One export practice that pays off forever: include the original YouTube URL with the timestamp in every key claim. When future-you wants to verify a quote or re-listen to a 30-second segment, you click the link and you are at the exact moment. This single pattern is the difference between notes that age well and notes that become unverifiable folklore.
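A timestamped link is just the watch URL with a `t` parameter giving the start time in seconds. The sketch below builds those links and renders key claims as Markdown with a small YAML header, roughly the shape an Obsidian note would take. The helper names, the `VIDEO_ID` placeholder, and the sample claim are assumptions for illustration, and the title is assumed to contain no double quotes.

```python
from urllib.parse import urlencode


def timestamp_url(video_id, seconds):
    """Build a watch URL that starts playback at the given second."""
    query = urlencode({"v": video_id, "t": f"{int(seconds)}s"})
    return "https://www.youtube.com/watch?" + query


def hms(seconds):
    """Format a number of seconds as H:MM:SS for display."""
    seconds = int(seconds)
    return f"{seconds // 3600}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


def to_markdown(title, video_id, claims):
    """Render (seconds, speaker, statement) claims as a Markdown note."""
    lines = [
        "---",
        f'title: "{title}"',
        f"source: https://www.youtube.com/watch?v={video_id}",
        "---",
        "",
        "## Key claims",
        "",
    ]
    for seconds, speaker, statement in claims:
        link = timestamp_url(video_id, seconds)
        lines.append(f"- **{speaker}** [{hms(seconds)}]({link}): {statement}")
    return "\n".join(lines)


# VIDEO_ID and the claim below are placeholders, not real content.
print(to_markdown(
    "Example interview",
    "VIDEO_ID",
    [(754, "Guest", "Most of the gain came from better data, not a bigger model.")],
))
```

Because every bullet carries its own link, a claim stays verifiable even when it is copied out of the note and pasted somewhere else.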
Studying lectures vs. consuming explainers
Two different goals call for two different tunings of the same workflow. Studying a lecture means you are going to come back, you want full claim-level extraction, and you want OCR on slides. Consuming an explainer means you want the takeaways, you are not going back, and a tight summary of 8-15 bullets is enough. Same tool, different settings.
- Study mode: full transcript + outline + claims + slide OCR + linked timestamps.
- Consume mode: outline + 8-15 bullet takeaways + a single "save for later" timestamp.
- Reference mode (for the videos you cite in writing): full transcript with speaker tags, ready to drop a quote into a draft.
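If your workflow is scripted, the modes above reduce to named presets that select which passes to run. The sketch below is one hypothetical way to encode them; the pass names and the `passes_for` helper are invented for illustration and do not correspond to settings in any specific tool.

```python
# Each mode maps to the passes listed for it above. Pass names are made up.
MODES = {
    "study": ["transcript", "outline", "claims", "slide_ocr", "linked_timestamps"],
    "consume": ["outline", "takeaways", "save_for_later_timestamp"],
    "reference": ["transcript", "speaker_tags"],
}


def passes_for(mode):
    """Return the list of passes for a mode, or raise on an unknown mode."""
    try:
        return MODES[mode]
    except KeyError:
        known = ", ".join(sorted(MODES))
        raise ValueError(f"unknown mode {mode!r}, expected one of: {known}") from None


print(passes_for("consume"))
```

Keeping the presets in one table makes it obvious what each mode costs: study mode runs five passes, reference mode only two.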