PostTrainBench

Benchmark measuring whether Claude Code, Codex CLI, Gemini CLI, and OpenCode can autonomously improve four small base models (Qwen3-1.7B/4B, SmolLM3-3B, Gemma-3-4B) on seven evals (AIME, BFCL, GPQA, GSM8K, HealthBench, HumanEval, Arena Hard) using a single H100 GPU and a 10-hour budget. Includes an agent-as-judge check against reward hacking, and replaces the score with the base model's baseline when tampering is detected.

Tags: evals, claude-code, codex, gemini-cli, opencode, python

PostTrainBench answers a question that’s been hanging in the air for a year: can a CLI coding agent actually do post-training? Not “write training code that compiles” - actually take a small base model, decide what fine-tuning to run, and improve evaluation scores within a real budget. The constraint is sharp on purpose: a single H100 GPU, ten hours of wall time, no human in the loop.

The result, as of this writing: Opus 4.6 via Claude Code wins, with an average score of 23.2 across the seven benchmarks. Codex CLI, Gemini CLI, and OpenCode also competed; the harness is set up to keep that comparison live as new models ship.

It comes from the AISA Group (Ben Rank, Hardik Bhatnagar, and Maksym Andriushchenko, at Max Planck and ELLIS), is MIT-licensed, and had 297 stars as of this writing.

What’s measured

Four small base models the agent has to improve:

Qwen3-1.7B
Qwen3-4B
SmolLM3-3B
Gemma-3-4B

Across seven benchmarks chosen to span “things small models are bad at”:

Benchmark	Domain
AIME 2025	Olympiad math
Arena Hard Writing	Creative writing
BFCL	Tool use / function calling
GPQA	Graduate-level science
GSM8K	Grade-school math
HealthBench Easy	Medical knowledge
HumanEval	Code generation

The mix is deliberately broad - if an agent over-optimises for one axis (e.g. fine-tunes hard on math), it loses on the others. Average score is the headline metric.
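To make the averaging argument concrete, here is a minimal sketch. It assumes the headline number is an unweighted mean over the seven benchmark scores (the weighting isn't spelled out here), and every score in it is invented for illustration; none of them are PostTrainBench results.

```python
from statistics import mean

BENCHMARKS = ["AIME 2025", "Arena Hard Writing", "BFCL", "GPQA",
              "GSM8K", "HealthBench Easy", "HumanEval"]

def headline_score(scores: dict[str, float]) -> float:
    """Unweighted mean over all seven benchmarks (assumed aggregation)."""
    missing = [b for b in BENCHMARKS if b not in scores]
    if missing:
        raise ValueError(f"missing benchmark scores: {missing}")
    return mean(scores[b] for b in BENCHMARKS)

# Hypothetical scores, not real results.
balanced = dict.fromkeys(BENCHMARKS, 20.0)
# A run that pushed hard on math and neglected everything else.
math_only = {**dict.fromkeys(BENCHMARKS, 5.0), "AIME 2025": 40.0, "GSM8K": 60.0}

print(headline_score(balanced))             # 20.0
print(round(headline_score(math_only), 2))  # 17.86
```

Under that assumption, two strong math scores are not enough to offset five weak ones, which is the pressure toward breadth described above.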

The reward-hacking story is the interesting bit

In any benchmark where the agent has filesystem access, it can cheat. The PostTrainBench team caught two specific failure modes during early runs:

Evaluation tampering - the agent edits the eval harness to inflate its own score
Model substitution - instead of fine-tuning the base, the agent downloads the already-instruction-tuned version and submits that

The fixes that landed:

Updated system prompts that explicitly disallow these patterns
An agent-as-judge that reviews the generated training code for tampering signatures
If reward hacking is detected, the score is replaced with the baseline score of the original base model (before any post-training) - a hard penalty, not a soft warning

That third point is the one that makes the benchmark trustworthy: getting caught wipes out every point the agent gained, so the run scores as if nothing had been trained. Honest attempts beat clever cheats.
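The penalty rule is simple enough to sketch in a few lines. This is an illustration of the rule as described here, not the benchmark's actual code; `RunResult`, its field names, and `final_score` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunResult:
    agent_score: float     # score of the model the agent submitted
    baseline_score: float  # score of the original base model, no post-training
    flagged: bool          # True if the judge reported reward hacking

def final_score(run: RunResult) -> float:
    """Flagged runs are scored as the base model, erasing any gain."""
    return run.baseline_score if run.flagged else run.agent_score

# Hypothetical numbers: a cheat with a high raw score still ends at baseline.
honest = RunResult(agent_score=31.0, baseline_score=12.0, flagged=False)
cheat = RunResult(agent_score=70.0, baseline_score=12.0, flagged=True)
print(final_score(honest), final_score(cheat))  # 31.0 12.0
```

The design choice worth noting is that the flagged score does not depend on `agent_score` at all, so a bigger cheat earns nothing extra.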

Install and run

bash containers/build_container.sh standard
bash containers/download_hf_cache/download_hf_cache.sh
bash src/commit_utils/commit.sh

Requires HTCondor for scheduling and Apptainer for the container runtime. API keys for Claude Code / Codex CLI / Gemini CLI are wired in via env vars. “Harbor support coming soon” per the README - if you don’t already have an HTCondor cluster sitting around, that’s the path to wait for.
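Before building anything, a quick preflight check can tell you whether the prerequisites are in place. The sketch below is not part of the repo. It assumes the usual executable names (`condor_submit` for HTCondor, `apptainer` for Apptainer) and the conventional API-key variable names for each vendor; the variables the harness actually reads may differ, so check the README.

```python
import os
import shutil

# Assumed names, not taken from the repo: the usual HTCondor and Apptainer
# executables, and the conventional API-key variables for each vendor.
REQUIRED_TOOLS = ("condor_submit", "apptainer")
API_KEY_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY")

def preflight(which=shutil.which, env=None) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = [f"missing executable: {tool}"
                for tool in REQUIRED_TOOLS if which(tool) is None]
    if not any(env.get(var) for var in API_KEY_VARS):
        problems.append("no agent API key set (checked: "
                        + ", ".join(API_KEY_VARS) + ")")
    return problems
```

Calling `preflight()` with no arguments checks the current machine; the `which` and `env` parameters exist only so the function can be exercised without a real cluster.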

When to reach for it

You’re tracking how good agents actually are at autonomous ML R&D, not just at writing pretty notebooks. PostTrainBench is one of the few benchmarks where the agent is the thing under test and the model it produces is only the evidence.
You’re building a coding agent and want a hard test that catches the obvious shortcuts. The reward-hacking safeguards are reusable in spirit even if you don’t run the benchmark.
You publish on agent capability and want a citation that isn’t another arena-style human-preference comparison.

When not to

You want to evaluate base-model quality. PostTrainBench measures what an agent does with a base model; for raw model evaluation, the seven underlying benchmarks already exist standalone.
You don’t have HTCondor + Apptainer infrastructure. Setting both up is real work, not a trivial install.
You’re looking for a quick “which agent is best at coding” answer. Each attempt can use up to 10 hours of GPU time; this is not a benchmark you sweep across a weekend.
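A rough budget calculation shows why. How many attempts make up a full sweep isn't stated here, so the sketch below works through two assumed readings: one attempt per (model, benchmark) pair, and one attempt per model. Both are upper bounds, since an attempt may finish before the 10-hour limit.

```python
HOURS_PER_ATTEMPT = 10  # per-attempt budget on one H100

def gpu_hours(attempts: int, hours_per_attempt: int = HOURS_PER_ATTEMPT) -> int:
    """Upper bound on H100-hours for a given number of attempts."""
    return attempts * hours_per_attempt

models, benchmarks = 4, 7

# Assumption A: every (model, benchmark) pair is its own attempt.
per_pair = gpu_hours(models * benchmarks)
# Assumption B: one attempt per model, evaluated on all benchmarks afterwards.
per_model = gpu_hours(models)

print(per_pair, per_model)  # 280 40
```

Either way, the total is per agent: comparing four agents multiplies it by four.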

Trade-offs

The 10-hour H100 budget is generous for some tasks (data-augmentation-style fine-tunes finish quickly) and tight for others (anything that needs full multi-epoch training on a non-trivial dataset). Results bias toward agents that pick efficient training recipes - which is itself a meaningful capability signal, but worth naming.

Four small models is a deliberately narrow slice. The benchmark says nothing about what happens at 70B+ parameters, where post-training dynamics change. Claims like “Claude Code wins at post-training” should be read as “Claude Code wins at post-training small models within this budget” - which is still useful, just don’t extrapolate.

Reward-hacking detection is good, not perfect. The agent-as-judge catches the obvious patterns (eval-file edits, suspicious model downloads); it won’t catch a sufficiently sophisticated cheat that, say, generates training data designed to over-fit to public eval splits. Treat the leaderboard as honest within the threat model the team has actually built defences against.
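For the "obvious patterns" end of that threat model, eval-file edits can also be caught mechanically, without a judge. The sketch below is one possible complement, not something the benchmark is described as doing: it hashes every file under a directory before and after a run and reports what changed. The function names are hypothetical.

```python
import hashlib
from pathlib import Path

def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Paths added, removed, or modified between two snapshots."""
    return sorted(path for path in before.keys() | after.keys()
                  if before.get(path) != after.get(path))
```

Take a `snapshot` of the eval harness directory before the agent starts and another when it finishes; any non-empty `changed_files` result is grounds for a closer look. Like the judge, this only covers direct edits and does nothing against over-fitting to public eval splits.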