Rapid-MLX
MLX-native inference engine with OpenAI-compatible API. The novel piece: DeltaNet state snapshots bring prompt caching to non-trimmable architectures (Qwen3.5 hybrids), restoring RNN state in ~0.1ms. 2-5x faster TTFT, native Metal kernels, continuous batching.
Rapid-MLX is a local inference engine for Apple Silicon that exposes an OpenAI-compatible API and runs LLMs 2-4x faster than Ollama or llama.cpp on the same hardware. The headline trick - and the part worth understanding before the install steps - is DeltaNet state snapshots: a prompt-caching technique designed for hybrid RNN-attention architectures (Qwen3.5 hybrids and the like) that previously couldn’t be cached at all.
For non-trimmable architectures, traditional prompt caches don’t work because there’s no contiguous KV-cache prefix you can lop off. Rapid-MLX snapshots the RNN state at prompt boundaries and restores it in ~0.1ms. The README claims “the first technique to bring prompt cache to non-trimmable architectures on MLX” - which isn’t a claim Ollama or llama.cpp can match today.
The numbers worth quoting
- 2-4x speedup vs Ollama and llama.cpp on Apple Silicon
- 2-5x faster TTFT (time-to-first-token) across architectures via state snapshots
- ~0.1ms state restore on hybrid models
- 607 GitHub stars at time of writing - the local-LLM-on-Mac space is crowded, this one stands out
Why it’s actually faster
Three things stack:
- DeltaNet snapshots for hybrid RNN-attention models - the novel piece
- Native Metal compute kernels via Apple’s MLX framework, built specifically for unified memory (no Metal-shader-meets-CUDA-shaped-API impedance mismatch)
- Continuous batching + optimized prefill chunking - standard inference-stack tricks, but tuned
If you’ve been running llama.cpp on a Mac because it’s the default, Rapid-MLX is the first project that’s a clear capability upgrade rather than an alternative.
Install
Three paths, pick one:
# Homebrew
brew install raullenchai/rapid-mlx/rapid-mlx
# pip
pip install rapid-mlx
# automated installer
curl -sSfL https://raw.githubusercontent.com/raullenchai/Rapid-MLX/main/install.sh | bash
Requires Python 3.10+. Once running, point any OpenAI-compatible client at the local endpoint - Cursor, Claude Code, Aider, PydanticAI, LangChain are all called out as tested integrations.
When to reach for it
- You’re on Apple Silicon and your inference workload is large enough that 2-4x matters - long-context coding agents, repeated document processing, anything that re-prompts the same prefix.
- You’re running a Qwen3.5 hybrid or any RNN-attention model and have given up on prompt caching. Rapid-MLX is the path that gets it back.
- You want a drop-in OpenAI endpoint for local dev without changing client code. The compatibility layer is the boring-but-correct part.
When not to
- You’re on a non-Apple machine. MLX is unified-memory-first; on a discrete-GPU box, llama.cpp or vLLM are the right calls.
- Your bottleneck is model quality, not throughput. A faster runtime doesn’t change what the model knows.
- You need batched serving for many users. The continuous batching is solid, but production multi-tenant inference is a different problem class - look at vLLM or TGI.
Trade-offs
The DeltaNet snapshot technique is specific to hybrid architectures. For pure-attention models (most of the Qwen3 lineup, Llama, Mistral) the gains come from Metal kernels and prefill tuning - still real, but not the dramatic 5x TTFT figure.
The MLX dependency is a feature on Mac and a wall everywhere else. If your team mixes Mac and Linux dev environments, you can’t standardise on Rapid-MLX without parallel infrastructure.
The OpenAI-compatible layer covers chat completions and basic streaming. If your client uses non-standard fields (function calling shape varies across providers), check that the round-trip behaves the way you expect before betting a workflow on it.
Similar tools
- vulnhawk
Static analysis scanner that finds auth bypass, IDOR, and business logic bugs that Semgrep and CodeQL miss. Ships as a free GitHub Action covering Python, JS/TS, Go, PHP, and Ruby.
- Claudraband
Wraps the real Claude Code TUI with a session lifecycle layer. Resumable non-interactive workflows, HTTP daemon for remote/headless control, ACP server for editor integrations (Zed, Toad). Drives your existing Claude Code install rather than reimplementing it - keeps skills, hooks, MCPs, and approvals intact.
- mcptube
MCP server that turns YouTube videos into a persistent, merging wiki rather than ephemeral vector chunks. Scene-change frame extraction + vision analysis captures slides, code, and diagrams that transcripts miss. 25+ MCP tools, FTS5+LLM hybrid retrieval, version history with source attribution per claim.
- PostTrainBench
Benchmark measuring whether Claude Code, Codex CLI, Gemini CLI, and OpenCode can autonomously improve 4 small base models (Qwen3-1.7B/4B, SmolLM3-3B, Gemma-3-4B) on 7 evals (AIME, BFCL, GPQA, GSM8K, HealthBench, HumanEval, Arena Hard) within a single H100 GPU and 10 hours. Includes agent-as-judge anti-reward-hacking and baseline-replacement penalties for tampering.