mcptube

MCP server that turns YouTube videos into a persistent, merging wiki rather than ephemeral vector chunks. Scene-change frame extraction + vision analysis captures slides, code, and diagrams that transcripts miss. 25+ MCP tools, FTS5+LLM hybrid retrieval, version history with source attribution per claim.

Tags: mcp, claude-code, knowledge-graph, agent-memory, python

mcptube takes Karpathy’s “your agent maintains an LLM wiki” idea and points it at YouTube. Each video you ingest doesn’t go into a vector store as ephemeral chunks - it gets turned into entities, topics, and concepts that merge with everything already in the wiki. Watch ten talks about MCP and you end up with a coherent MCP article, not ten unconnected transcripts.

The other half of the design is treating video as video, not just transcript. Scene-change detection picks frames where something actually changed (a new slide, a code panel, a diagram), and a vision pass extracts the visual content most transcript-only tools miss.

The architecture worth understanding

Three components stacked:

Ingest layer - downloads + transcribes + scene-change frame detection
WikiEngine - merges new content into existing entities; keeps version history; preserves source attribution per claim
MCP server - exposes 25+ tools for query, edit, and discovery

The merge step is what differentiates this from “another RAG over YouTube.” A vector chunk store with ten MCP talks gives you ten near-duplicate chunks. mcptube’s wiki gives you one MCP article with citations to ten sources - and when you ask “how do MCP servers handle auth?” it answers from the merged article, not from whichever chunk happens to score highest on cosine similarity.
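To make the merge idea concrete, here is a minimal sketch of claim-level merging with per-claim source attribution and a history log. It is not mcptube's WikiEngine: `Claim`, `Article`, `normalize`, and `merge` are hypothetical names, and the string-normalization dedup stands in for whatever (presumably LLM-driven) matching the real engine does.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    sources: set[str] = field(default_factory=set)  # video IDs backing this claim


@dataclass
class Article:
    title: str
    claims: dict[str, Claim] = field(default_factory=dict)  # keyed by normalized text
    # One entry per ingest: (video_id, keys of claims that were new in that ingest)
    history: list[tuple[str, list[str]]] = field(default_factory=list)


def normalize(text: str) -> str:
    # Naive dedup key (assumption): lowercase, collapse whitespace, drop trailing period
    return " ".join(text.lower().split()).rstrip(".")


def merge(article: Article, video_id: str, new_claims: list[str]) -> Article:
    added = []
    for text in new_claims:
        key = normalize(text)
        if key in article.claims:
            # Known claim: record one more source instead of storing a duplicate
            article.claims[key].sources.add(video_id)
        else:
            article.claims[key] = Claim(text, {video_id})
            added.append(key)
    article.history.append((video_id, added))
    return article


mcp = Article("MCP")
merge(mcp, "video_a", ["MCP servers expose tools.", "Clients call tools over JSON-RPC."])
merge(mcp, "video_b", ["mcp servers expose tools", "Transports include stdio."])
```

After both ingests the article holds three claims rather than four, the shared claim cites both videos, and `history` records which video introduced which claim. That last part is what makes a bad merge auditable later.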

Tools the MCP exposes

The headline subset:

add_video, list_videos, discover_videos - ingest and inventory
wiki_list, wiki_show, wiki_search, wiki_ask - read-side
get_frame, classify_video, generate_report - the vision-aware bits
Plus a bunch more for batch ops, attribution lookup, and history
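For a sense of what a client sends when it invokes one of these, here is a sketch of an MCP `tools/call` request built by hand. The JSON-RPC envelope follows the MCP convention; the `question` argument name is an assumption, since the actual input schema for `wiki_ask` is not shown here, and `tool_call` is a hypothetical helper.

```python
import json


def tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Build an MCP `tools/call` JSON-RPC 2.0 request as a JSON string."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# "question" is a hypothetical argument name; check the tool's real schema
request = tool_call(1, "wiki_ask", {"question": "How do MCP servers handle auth?"})
print(request)
```

In practice the MCP client builds this for you; the sketch only shows the shape of the message on the wire.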

Hybrid retrieval: FTS5 keyword search narrows the candidate set, then the LLM reasons over the narrowed view. This is the right shape for a wiki - keyword search for “where does this concept live” plus LLM reasoning for “what does it actually say” - and avoids the embedding-only failure mode of confidently retrieving the wrong chunk.
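The two-stage shape can be sketched with Python's built-in `sqlite3` module, assuming the SQLite build has FTS5 enabled (most do). The `wiki` table, its columns, the sample rows, and both helper functions are illustrative assumptions, not mcptube's schema; the LLM call itself is left out and only the prompt it would receive is built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE wiki USING fts5(title, body)")
conn.executemany(
    "INSERT INTO wiki (title, body) VALUES (?, ?)",
    [
        ("MCP", "MCP servers expose tools that clients call over JSON-RPC."),
        ("Scene detection", "Scene-change detection keeps frames where the picture changed."),
        ("FTS5", "FTS5 is a full-text search extension for SQLite."),
    ],
)


def candidates(query: str, limit: int = 5) -> list[tuple[str, str]]:
    # Stage 1: keyword search. bm25() is lower for better matches,
    # so the default ascending order puts the best match first.
    rows = conn.execute(
        "SELECT title, body FROM wiki WHERE wiki MATCH ? ORDER BY bm25(wiki) LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


def build_prompt(question: str, query: str) -> str:
    # Stage 2 input: only the narrowed candidate set reaches the LLM
    context = "\n\n".join(f"## {title}\n{body}" for title, body in candidates(query))
    return f"Answer using only these wiki articles:\n\n{context}\n\nQuestion: {question}"


prompt = build_prompt("How do clients call MCP tools?", "tools")
print(prompt)
```

The point of the split is that the keyword stage is cheap and predictable, so the LLM only ever reasons over articles that literally mention the term.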

Install

pipx install mcptube --python python3.12
mcptube --help

Requires Python 3.12+ and ffmpeg on your PATH. Integrations for Claude Desktop, VS Code Copilot, Cursor, Windsurf, and Gemini CLI are wired up; other MCP clients can use a standard MCP config block.

Why scene-change beats fixed-interval frame extraction

The lazy approach - sample one frame every 30 seconds - misses content. A talk where the speaker pulls up a code panel for 8 seconds, scrolls through three diffs, and switches back to slides will lose most or all of those diffs: an 8-second window contains at most one 30-second sample point, and usually none.

Scene-change detection catches them. Visual content (code, slides, architecture diagrams, terminal output) survives into the wiki, which is the difference between “a transcript with timestamps” and “a knowledge base that knows what was on screen.”
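A toy simulation shows the gap. The timeline below is hypothetical (an 8-second code panel with three diffs between two long slides), and `scene_change` models an idealized detector that fires on every visual change; real detectors threshold a frame-difference score and will be noisier.

```python
# Hypothetical timeline of what is on screen: (start_sec, end_sec, label)
timeline = [
    (0, 95, "slide 1"),
    (95, 98, "diff 1"),
    (98, 101, "diff 2"),
    (101, 103, "diff 3"),
    (103, 200, "slide 2"),
]


def on_screen(t: float) -> str:
    """Return the label of whatever is visible at time t."""
    for start, end, label in timeline:
        if start <= t < end:
            return label
    raise ValueError(f"{t} is outside the timeline")


def fixed_interval(step: int) -> list[str]:
    """Distinct labels captured by sampling one frame every `step` seconds."""
    duration = timeline[-1][1]
    seen = []
    for t in range(0, duration, step):
        label = on_screen(t)
        if label not in seen:
            seen.append(label)
    return seen


def scene_change() -> list[str]:
    """Labels captured by grabbing one frame at each visual change (idealized)."""
    return [on_screen(start) for start, _, _ in timeline]


print(fixed_interval(30))  # samples land at 0, 30, 60, 90, 120, ...
print(scene_change())
```

With a 30-second step the samples at 90 and 120 seconds straddle the code panel, so fixed-interval sampling sees only the two slides while the change-driven version captures all three diffs. For real footage, ffmpeg's `select` filter exposes a `scene` score that can be thresholded; whether mcptube uses that exact mechanism is not stated here.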

When to reach for it

You watch a lot of technical talks and want them to compound into something searchable. The merge-into-wiki design is the right shape for that.
You ingest video where the visual matters - conference talks with slides, code walkthroughs, architecture reviews. Scene-change extraction is what makes those usable.
You want an MCP server that does something model-native rather than wrapping an existing API. The WikiEngine is the differentiated piece.

When not to

You want to summarise a single video. mcptube is overkill for one-off summarisation; standard transcript tools or yt-dlp + an LLM call do that fine.
Your videos are talking-head only (podcasts, interviews). Without visual content, you’re paying for vision that surfaces nothing.
You need real-time results. Ingestion is batch - download, transcribe, scene-detect, merge - and the wiki compounds in value over time, not on the first video.

Trade-offs

The wiki merge is the value and also the failure mode. If the merge step gets a fact wrong, that wrong fact propagates - subsequent queries see the merged article, not the source. Version history is on by default, which lets you audit, but you do need to actually look. Trust but verify, especially on the first few videos.

ffmpeg + scene-change detection is heavier than transcript-only ingestion. A 90-minute talk takes real wall-clock time to process. Don’t expect “drop a URL, get answers in 10 seconds” - the ingest pipeline is where the latency lives.

The wiki is local. Persistent, but not multi-user out of the box; if you want a team-shared wiki you’re wiring sync up yourself. For a single operator that’s fine; for “the team’s video knowledge base” it’s a project.