mcptube
MCP server that turns YouTube videos into a persistent, merging wiki rather than ephemeral vector chunks. Scene-change frame extraction + vision analysis captures slides, code, and diagrams that transcripts miss. 25+ MCP tools, FTS5+LLM hybrid retrieval, version history with source attribution per claim.
mcptube takes Karpathy’s “your agent maintains an LLM wiki” idea and points it at YouTube. Each video you ingest doesn’t go into a vector store as ephemeral chunks - it gets turned into entities, topics, and concepts that merge with everything already in the wiki. Watch ten talks about MCP and you end up with a coherent MCP article, not ten unconnected transcripts.
The other half of the design is treating video as video, not just transcript. Scene-change detection picks frames where something actually changed (a new slide, a code panel, a diagram), and a vision pass extracts the visual content most transcript-only tools miss.
The architecture worth understanding
Three components stacked:
- Ingest layer - downloads the video, transcribes it, and runs scene-change frame detection
- WikiEngine - merges new content into existing entities; keeps version history; preserves source attribution per claim
- MCP server - exposes 25+ tools for query, edit, and discovery
The merge step is what differentiates this from “another RAG over YouTube.” A vector chunk store with ten MCP talks gives you ten near-duplicate chunks. mcptube’s wiki gives you one MCP article with citations to ten sources - and when you ask “how do MCP servers handle auth?” it answers from the merged article, not from whichever chunk happens to score highest on cosine similarity.
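To make the merge idea concrete, here is a minimal sketch of claims accumulating into one article with per-claim source attribution and a change history. It is not mcptube's WikiEngine: the `Claim` and `Article` classes, the `merge` method, and the video IDs are all hypothetical, and exact-match deduplication on normalized text stands in for whatever LLM-driven merging the real tool performs.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    sources: list[str] = field(default_factory=list)  # video IDs backing this claim


@dataclass
class Article:
    title: str
    claims: dict[str, Claim] = field(default_factory=dict)  # keyed by normalized text
    history: list[str] = field(default_factory=list)  # one entry per change

    def merge(self, video_id: str, new_claims: list[str]) -> None:
        """Fold claims from one video into the article, tracking attribution."""
        for text in new_claims:
            # Naive normalization (an assumption): lowercase, collapse whitespace.
            key = " ".join(text.lower().split())
            claim = self.claims.get(key)
            if claim is None:
                # First time this claim appears: record it with its source.
                self.claims[key] = Claim(text, [video_id])
                self.history.append(f"{video_id}: added '{text}'")
            elif video_id not in claim.sources:
                # Known claim from a new video: add a citation, not a duplicate.
                claim.sources.append(video_id)
                self.history.append(f"{video_id}: confirmed '{claim.text}'")


# Toy data: two talks that overlap on one claim.
article = Article("MCP")
article.merge("talk-1", ["Servers expose tools to clients", "Clients discover tools at runtime"])
article.merge("talk-2", ["servers expose tools to clients"])

for claim in article.claims.values():
    print(claim.text, "<-", claim.sources)
```

The second talk adds a citation to an existing claim instead of creating a near-duplicate, which is the behaviour the paragraph above contrasts with a chunk store. The `history` list is the hook for auditing what each video changed.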
Tools the MCP exposes
The headline subset:
- add_video, list_videos, discover_videos - ingest and inventory
- wiki_list, wiki_show, wiki_search, wiki_ask - read-side
- get_frame, classify_video, generate_report - the vision-aware bits
- Plus a bunch more for batch ops, attribution lookup, and history
Hybrid retrieval: FTS5 keyword search narrows the candidate set, then the LLM reasons over the narrowed view. This is the right shape for a wiki - keyword search for “where does this concept live” plus LLM reasoning for “what does it actually say” - and avoids the embedding-only failure mode of confidently retrieving the wrong chunk.
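The two-stage shape can be sketched with Python's built-in `sqlite3` module, assuming the underlying SQLite was compiled with FTS5 (most CPython builds are). The table layout, the `build_index`, `narrow`, and `build_prompt` helpers, and the sample pages are hypothetical and do not reflect mcptube's actual schema. The LLM call itself is left out, and so is turning a natural-language question into keywords.

```python
import sqlite3


def build_index(pages: dict[str, str]) -> sqlite3.Connection:
    """Load wiki pages into an in-memory FTS5 table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE wiki USING fts5(title, body)")
    conn.executemany("INSERT INTO wiki (title, body) VALUES (?, ?)", pages.items())
    return conn


def narrow(conn: sqlite3.Connection, keywords: str, limit: int = 3) -> list[tuple[str, str]]:
    """Stage 1: keyword search picks a small candidate set, best match first."""
    rows = conn.execute(
        "SELECT title, body FROM wiki WHERE wiki MATCH ? ORDER BY rank LIMIT ?",
        (keywords, limit),
    )
    return rows.fetchall()


def build_prompt(question: str, candidates: list[tuple[str, str]]) -> str:
    """Stage 2 input: the LLM reasons only over the narrowed pages."""
    context = "\n\n".join(f"## {title}\n{body}" for title, body in candidates)
    return f"Answer using only these wiki pages.\n\n{context}\n\nQuestion: {question}"


# Toy pages standing in for merged wiki articles.
conn = build_index({
    "MCP": "MCP servers handle auth with tokens.",
    "ffmpeg": "ffmpeg extracts frames from video.",
})
candidates = narrow(conn, "auth")
prompt = build_prompt("How do MCP servers handle auth?", candidates)
print(prompt)
```

Only the page that mentions the keyword reaches the prompt, so the model never sees the unrelated ffmpeg page.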
Install
pipx install mcptube --python python3.12
mcptube --help
Requires Python 3.12+ and ffmpeg on the PATH. MCP client integrations for Claude Desktop, VS Code Copilot, Cursor, Windsurf, and Gemini CLI are wired up; other clients use a standard MCP config block.
Why scene-change beats fixed-interval frame extraction
The lazy approach - sample one frame every 30 seconds - misses content. A talk where the speaker pulls up a code panel for 8 seconds, scrolls through three diffs, and switches back to slides will usually lose all three diffs: an 8-second window falls between 30-second samples most of the time, and even a lucky sample catches only one of the three.
Scene-change detection catches them. Visual content (code, slides, architecture diagrams, terminal output) survives into the wiki, which is the difference between “a transcript with timestamps” and “a knowledge base that knows what was on screen.”
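The difference shows up in a toy simulation. The sketch below models a 90-second video at one frame per second, with each "frame" reduced to four grayscale pixels, and compares fixed-interval sampling against a simple mean-absolute-difference detector. The function names, the threshold, and the pixel values are illustrative assumptions. A real pipeline would run on decoded frames, and this is not a description of mcptube's actual detector.

```python
def mean_abs_diff(a: list[int], b: list[int]) -> float:
    """Average per-pixel brightness difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def scene_change_frames(frames: list[list[int]], threshold: float = 30.0) -> list[int]:
    """Keep the first frame plus every frame that differs sharply from its predecessor."""
    picked = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            picked.append(i)
    return picked


def fixed_interval_frames(frames: list[list[int]], every: int = 30) -> list[int]:
    """Keep one frame every `every` seconds, regardless of content."""
    return list(range(0, len(frames), every))


# Toy video: 40 s of slides, an 8 s code panel showing three diffs, then slides again.
slide = [200, 200, 200, 200]
diff_a = [20, 20, 200, 200]
diff_b = [20, 200, 20, 200]
diff_c = [200, 20, 20, 20]
video = [slide] * 40 + [diff_a] * 3 + [diff_b] * 3 + [diff_c] * 2 + [slide] * 42

fixed = fixed_interval_frames(video)
scene = scene_change_frames(video)
print("fixed interval:", fixed)  # indices 0, 30, 60: all slide frames
print("scene change:  ", scene)  # indices 0, 40, 43, 46, 48: each diff plus the return to slides
```

Fixed-interval sampling lands on the slide three times and never sees the code panel. The change detector keeps one frame per diff and one for the return to slides.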
When to reach for it
- You watch a lot of technical talks and want them to compound into something searchable. The merge-into-wiki design is the right shape for that.
- You ingest video where the visual matters - conference talks with slides, code walkthroughs, architecture reviews. Scene-change extraction is what makes those usable.
- You want an MCP that does something model-native rather than wrapping an existing API. The WikiEngine is the differentiated piece.
When not to
- You want to summarise a single video. mcptube is overkill for one-off summarisation; standard transcript tools or yt-dlp + an LLM call do that fine.
- Your videos are talking-head only (podcasts, interviews). Without visual content, you’re paying for vision that surfaces nothing.
- You need real-time results. Ingestion is batch - download, transcribe, scene-detect, merge - and the wiki compounds in value over time, not on the first video.
Trade-offs
The wiki merge is the value and also the failure mode. If the merge step gets a fact wrong, that wrong fact propagates - subsequent queries see the merged article, not the source. Version history is on by default, which lets you audit, but you do need to actually look. Trust but verify, especially on the first few videos.
ffmpeg + scene-change detection is heavier than transcript-only ingestion. A 90-minute talk takes real wall-clock time to process. Don’t expect “drop a URL, get answers in 10 seconds” - the ingest pipeline is where the latency lives.
The wiki is local. Persistent, but not multi-user out of the box; if you want a team-shared wiki you’re wiring sync up yourself. For a single operator that’s fine; for “the team’s video knowledge base” it’s a project.
Similar tools
- claude-memory-compiler
Hooks capture Claude Code sessions, the Agent SDK extracts decisions and lessons, and an LLM compiler organizes them into cross-referenced knowledge articles. Memory that grows with the repo.
- PostTrainBench
Benchmark measuring whether Claude Code, Codex CLI, Gemini CLI, and OpenCode can autonomously improve 4 small base models (Qwen3-1.7B/4B, SmolLM3-3B, Gemma-3-4B) on 7 evals (AIME, BFCL, GPQA, GSM8K, HealthBench, HumanEval, Arena Hard) within a single H100 GPU and 10 hours. Includes agent-as-judge anti-reward-hacking and baseline-replacement penalties for tampering.
- trace-mcp
MCP server with 138 tools and cross-language framework awareness (58 integrations across 81 languages). Indexes Laravel/Inertia/Vue, Rails/Hotwire, Django/HTMX edges so agents skip re-deriving call graphs. Decision memory links architectural choices to the code they're about. Local-first ONNX embeddings, optional LSP enrichment.
- google-docs-mcp
MCP server for Google Docs that uses pattern-matching search-and-replace (like file editors) instead of character offsets, which LLMs are notoriously bad at. Fixes the broken existing options.