awesome-harness-engineering
Curated awesome list for AI agent harness engineering: tools, patterns, evals, memory, MCP, permissions, observability, and orchestration.
There are many “awesome AI agent” lists. Most of them collect frameworks. This one collects something more specific and more useful: the harness - the scaffolding around the model that determines whether an agent actually works on real tasks.
The framing the maintainers picked is the right one: harness engineering is the discipline of designing context delivery, tool interfaces, planning artifacts, verification loops, memory systems, and sandboxes. Models can’t do these things alone, and the best harnesses are designed knowing those components will become unnecessary as models improve. That insight is what keeps the list from becoming a frameworks dump.
How it’s organized
The list is structured around the problem each component solves, not the vendor that built it. The top-level sections:
- Foundations - canonical essays from OpenAI, Anthropic, Google, Microsoft, Meta, Red Hat, LangChain, Martin Fowler that define what harness engineering actually is.
- Design Primitives - the components a harness is composed of:
- Agent Loop - ReAct, LangGraph, the Codex agent loop, middleware
- Planning & Task Decomposition - Plan.md / Implement.md patterns, plan-and-execute, multi-agent topologies
- Context Delivery & Compaction - what the agent sees, when, and how it shrinks
- Tool Design - schemas, naming, error surfaces (the “tool design is agent UX” school)
- Skills & MCP - protocol-level integration
- Permissions & Authorization - structured permission systems vs natural-language prompts
- Memory & State - episodic, long-term, cross-session
- Task Runners & Orchestration - the pieces that drive multiple agents
- Verification & CI Integration - getting the agent to check its own work
- Observability & Tracing - knowing what happened
- Debugging & Developer Experience - inspecting the trace
- Human-in-the-Loop - approval flows and intervention triggers
- Reference Implementations - tutorials, generators/meta-harnesses, demo harnesses, adjacent collections.
- Security, Sandbox & Permissions - the layer most teams under-invest in until it bites.
- Evals & Verification - measuring what you’ve built.
- Templates - drop-in artifacts.
Each entry is annotated with what makes it worth reading, not just a one-line description. This is the part that’s hard to maintain and the reason the list is more useful than a search-engine query for the same terms.
Why “harness engineering” as a separate discipline
Three pieces of writing in the Foundations section explain it best:
- OpenAI’s “Harness Engineering” - the framing piece. Defines harness engineering as the design of the scaffolding that lets agents operate reliably.
- Martin Fowler’s synthesis - reframes the discipline as three interlocking systems: context engineering (curating what the agent knows), architectural constraints (deterministic linters and structural tests), and entropy management (periodic agents that repair documentation drift). The “humans on the loop” framing is the clearest conceptual map of what the discipline actually is.
- LangChain’s “Anatomy of an Agent Harness” - structural breakdown into five primitives: filesystem, code execution, sandbox, memory, context management. Includes the co-evolution warning: models trained against specific harnesses can become overfitted to those designs - a reason architecture choices have lasting consequences.
If you’ve felt the difference between “the model is smart but the agent is unreliable” and “the model is the same and now the agent works,” you’ve been doing harness engineering whether you called it that or not.
The papers worth your time
The list pulls together a surprising amount of recent peer-reviewed and industry research:
- “Building AI Coding Agents for the Terminal” - the first systematic practitioner paper on terminal-native coding agent harness design. Eager-construction scaffolding, compound multi-model architectures, schema-filtered planning subagents.
- “A Scheduler-Theoretic Framework for LLM Agent Execution” (April 2026) - 70 open-source agent projects analysed; 60% adopt the Agent Loop pattern. Maps execution patterns onto a unified control model so the trade-offs become explicit.
- “The Design Space of Today’s and Future AI Agent Systems” - reverse-engineering of Claude Code: five-stage progressive compaction, subagent isolation with rebuilt permission contexts, 27-event-type hook pipeline.
- “Improving Deep Agents with Harness Engineering” (LangChain case study) - harness-only changes moved their coding agent from rank 30 to top 5 on Terminal Bench 2.0 with no model swap. The strongest published demonstration that harness design is the primary performance lever.
- Microsoft’s Azure SRE Agent - 35,000+ production incidents handled autonomously, time-to-mitigation cut from 40.5 hours to 3 minutes. Most data-backed production case study published in 2026.
That’s not an exhaustive sample. It’s the kind of mix the list is good at.
When to use it
- You’re building a serious agent system and want to know what’s been tried before you spend a quarter rediscovering it.
- You’re catching up after a few months away - the list moves fast and the recent additions are usually the most interesting.
- You’re hiring or onboarding for harness work and want a reading list that isn’t “skim our docs.”
When it’s not the right resource
- You want a quick API tutorial. This is a depth resource, not a how-to.
- You’re looking for marketing-style recommendations between specific frameworks. The list deliberately classifies by problem solved, not vendor.
Practical notes
CC0 licensed. The maintainers actively curate - check the commit log to confirm freshness on whatever section you’re reading. Translations exist in nine languages on zdoc.app. The list is hosted on GitHub with the standard awesome-list contribution path: open a PR with the entry and a real annotation, not a one-liner.
If you only have time for one resource on this page, start with Anthropic’s “Harness Design for Long-Running Application Development” or OpenAI’s “Unrolling the Codex Agent Loop.” Either gives you a vocabulary you’ll keep using.
Similar tools
- mcptube
MCP server that turns YouTube videos into a persistent, merging wiki rather than ephemeral vector chunks. Scene-change frame extraction + vision analysis captures slides, code, and diagrams that transcripts miss. 25+ MCP tools, FTS5+LLM hybrid retrieval, version history with source attribution per claim.
- PostTrainBench
Benchmark measuring whether Claude Code, Codex CLI, Gemini CLI, and OpenCode can autonomously improve 4 small base models (Qwen3-1.7B/4B, SmolLM3-3B, Gemma-3-4B) on 7 evals (AIME, BFCL, GPQA, GSM8K, HealthBench, HumanEval, Arena Hard) within a single H100 GPU and 10 hours. Includes agent-as-judge anti-reward-hacking and baseline-replacement penalties for tampering.
- trace-mcp
MCP server with 138 tools and cross-language framework awareness (58 integrations across 81 languages). Indexes Laravel/Inertia/Vue, Rails/Hotwire, Django/HTMX edges so agents skip re-deriving call graphs. Decision memory links architectural choices to the code they're about. Local-first ONNX embeddings, optional LSP enrichment.
- claude-memory-compiler
Hooks capture Claude Code sessions, the Agent SDK extracts decisions and lessons, and an LLM compiler organizes them into cross-referenced knowledge articles. Memory that grows with the repo.