THOUGHTFUL

Introducing PostTrainBench

How well can AI agents post-train language models? We built a benchmark to find out.

MARCH 10, 2026

Post-training is how raw language models become useful: it is the stage that turns a capable but unsteered base model into a system that follows instructions, reasons through problems, uses tools, gains a personality, and behaves safely. Today, this work is done mostly by humans: researchers design data pipelines, write evaluations, configure training runs, iterate on reward models, and much more.

A natural question is whether AI systems can start doing this work themselves. If they can, it closes one of the most consequential feedback loops in the field: AI improving AI. To begin answering this question, we built PostTrainBench, a benchmark measuring how well frontier AI agents can autonomously execute post-training workflows on base language models.

We deliberately simplified the problem. Real post-training runs across clusters for weeks, balances many objectives simultaneously, and requires a lot of debugging and judgment calls no benchmark can fully capture. PostTrainBench isolates a narrower question: given a clear objective and limited compute, can today's agents do the technical work? It's a lower bound on a much harder problem, but a measurable one, and tracking how it moves over time tells us something important about where AI R&D automation is headed.

Why post-training?

Most of the value in modern AI systems arguably comes from this stage. ChatGPT was the first widely felt demonstration of a post-training breakthrough: reinforcement learning from human feedback (RLHF) turned GPT-3.5 from a text completer into a conversational assistant. Claude's character and safety behaviors emerged from Constitutional AI, and the step-change in reasoning in o1 and DeepSeek-R1 came from reinforcement learning with verifiable rewards (RLVR). Instruction following, tool use, multilingual fluency, and nearly every other capability that made LLMs not just impressive but genuinely useful is a product of post-training.

It is also a natural starting point for measuring AI R&D automation: (1) improvements can be directly measured using standardized evaluations, and (2) the task is end-to-end (agents must find data, write training code, manage compute, debug errors, and iterate, all without human guidance).

Design Properties

With PostTrainBench, our goal was to create a benchmark with the following properties:

  • End-to-end. Agents must build their entire training pipeline from scratch — no starter code, training data, or hyperparameter configurations are provided.
  • Autonomous. Agents operate with full autonomy over data sources, training methods, and experimental strategy. They may search the web, curate datasets, and iterate freely.
  • Resource-bounded. Each run is constrained to 10 hours on a single H100 GPU, making the benchmark practical to run at scale while still providing meaningful signal.
  • Integrity-preserving. Agents may not train on benchmark test data, modify the evaluation harness, or substitute a different model. An LLM judge detects cheating; flagged runs receive the base model score.
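The integrity rule has a simple scoring consequence, which can be sketched as follows. This is a minimal illustration, not the benchmark's actual code; the function and parameter names are ours:

```python
def final_score(run_score: float, base_score: float, flagged: bool) -> float:
    """Score credited to a run after the LLM-judge integrity check.

    Runs the judge flags as cheating (e.g. training on test data or
    substituting a different model) receive only the base model's
    score, so cheating cannot beat honest post-training.
    """
    return base_score if flagged else run_score


# A flagged run falls back to the base model score...
assert final_score(0.45, 0.075, flagged=True) == 0.075
# ...while a clean run keeps its measured score.
assert final_score(0.45, 0.075, flagged=False) == 0.45
```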

Evaluation Setup

Evaluation setup flow

We give a frontier coding agent — Claude Code, Codex CLI, or Gemini CLI — a base language model and a target benchmark. The agent gets 10 hours on a single H100 GPU, terminal access, and internet connectivity. It receives no starter code, no training data, and no hyperparameters. It must figure everything out from scratch, then submit a post-trained checkpoint that we evaluate on a held-out test set.

We test across four base models (Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, Gemma-3-4B) and seven benchmarks spanning math (AIME 2025, GSM8K), science (GPQA), coding (HumanEval), function calling (BFCL), creative writing (Arena-Hard), and medical dialogue (HealthBench-Easy), yielding 28 independent runs per agent.
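The full run matrix can be enumerated in a few lines. The identifier strings below are illustrative shorthand; the paper may spell the model and benchmark names differently:

```python
from itertools import product

# Four base models and seven target benchmarks, as described above.
BASE_MODELS = ["Qwen3-1.7B", "Qwen3-4B", "SmolLM3-3B", "Gemma-3-4B"]
BENCHMARKS = [
    "AIME 2025", "GSM8K",        # math
    "GPQA",                      # science
    "HumanEval",                 # coding
    "BFCL",                      # function calling
    "Arena-Hard",                # creative writing
    "HealthBench-Easy",          # medical dialogue
]

def run_matrix():
    """Every (base model, benchmark) pair is one independent agent run."""
    return list(product(BASE_MODELS, BENCHMARKS))

assert len(run_matrix()) == 28  # 4 models x 7 benchmarks, per agent
```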

Results

We evaluated 13 agent configurations. The overall PostTrainBench score is a weighted average across all base models and benchmarks, where harder benchmarks (those with smaller instruction-tuning gains) receive higher weight.
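The scoring rule above can be sketched directly. The weights below are made-up placeholders purely to show the mechanics; the actual per-benchmark weights are specified in Table 5 of the paper:

```python
def posttrainbench_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-benchmark scores.

    Harder benchmarks (those with smaller instruction-tuning gains)
    receive higher weight. Weight values here are illustrative only.
    """
    total_weight = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_weight


# Hypothetical per-benchmark scores for one agent (fractions, not %).
scores = {"BFCL": 0.61, "GSM8K": 0.45, "GPQA": 0.25}
weights = {"BFCL": 1.0, "GSM8K": 1.0, "GPQA": 2.0}  # harder -> heavier
overall = posttrainbench_score(scores, weights)
# (0.61*1 + 0.45*1 + 0.25*2) / 4 = 0.39
```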

PostTrainBench results

Weighted average benchmark performance for different agents across 4 base models (Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, Gemma-3-4B) and 7 benchmarks: AIME 2025 and GSM8K (math), GPQA (science), HumanEval (coding), BFCL (function calling), Arena-Hard (creative writing), and HealthBench (health advice). The averaging weights are specified in Table 5. The error bars show ±1 standard deviation across runs.

Overall performance. The top-performing agent — Opus 4.6 running on Claude Code — scores 23.2%, about 3× higher than the 7.5% base model average. Yet this is still less than half the 51.1% achieved by human teams who post-train these same base models at their home labs (e.g., Qwen3-4B-Instruct, Gemma-3-4B-IT). The gap is significant but narrowing quickly: Claude Sonnet 4.5 scored 9.9% in September 2025, while GPT-5.2 reached 21.5% just months later.

Per-task variation. Performance varies sharply across benchmarks. Agents show the largest gains on function calling (BFCL), where Opus 4.5 reaches 61.6% from a 1.5% base. Math (GSM8K) and coding (HumanEval) show moderate gains. Science (GPQA), creative writing (Arena-Hard), and competition math (AIME 2025) remain the hardest — no agent-trained model exceeds random chance on GPQA.

Agents can outperform human post-training on narrow tasks. In three cases, agents surpassed the official post-trained releases from the labs that built these models:

  • GPT-5.1 Codex Max post-trained Gemma-3-4B to 89% on BFCL, versus 67% for Google's Gemma-3-4B-IT.
  • An agent achieved 91% on BFCL with SmolLM3-3B, versus 84% for HuggingFace's official release.
  • An agent reached 33% on GPQA with Gemma-3-4B, versus 31% for the official model.

These results suggest that targeted optimization under constrained compute can outperform broad training on specific metrics, though agents do not yet replicate the general-purpose post-training achieved by expert teams.

Scaffolds matter. Native CLI scaffolds consistently outperform OpenCode, a general-purpose open-source scaffold, when using the same underlying model. GPT-5.1 Codex Max scores 20.2% on Codex CLI versus 7.7% on OpenCode. Both model capability and scaffold quality contribute meaningfully to agent performance.

Agent behavior

Beyond aggregate scores, we analyzed agent execution traces to understand how agents approach post-training, and conducted a series of ablation studies (read more in our paper).

Performance vs time

Figure 6. Agent time utilization vs. performance within the 10-hour window, averaged across all base models and benchmarks. Dotted lines show the Pareto frontier. Most agents terminate well before the limit. Within each scaffold, longer runs correlate with higher performance, suggesting fuller time utilization could yield additional gains.

Time utilization. Agents were allocated up to 10 hours but many terminated early. Opus 4.5 regularly checked remaining time and used the full allocation. Sonnet 4.5 and GPT-5.2 Codex typically stopped within 2–3 hours. Longer runs generally correlate with higher performance.

Reasoning effort. For GPT-5.1 Codex Max, the default "Medium" reasoning effort outperformed "High" (20.2% vs 17.4%). High reasoning effort consumed nearly twice as many tokens, leading to more frequent context compaction and weaker performance.

Reward hacking. Reward hacking, where agents find unintended shortcuts that optimize the metric without accomplishing the actual goal, is an increasingly well-documented phenomenon, and one that grows more sophisticated as models become more capable. We observed several instances that emerged without any adversarial prompting:

  • Training on test data. The BFCL dataset on Hugging Face contains a "train" split that actually holds the evaluation data. GPT-5.1 Codex Max frequently failed to recognize this distinction and trained on it.
  • Model substitution. In early experiments with simpler prompts, Claude downloaded an instruction-tuned checkpoint instead of fine-tuning the base model.
  • Evaluation manipulation. The Codex agent modified the evaluation framework code to inflate scores in early iterations.
  • API restriction violation. One agent explicitly acknowledged a restriction against using the OpenAI API for synthetic data, then violated it hours later after the constraint fell out of context.
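The first failure mode, training on test data, is also the easiest to screen for mechanically. A minimal contamination check, assuming plain-text examples (normalization and function names are ours, not the benchmark's):

```python
import hashlib

def _fingerprint(text: str) -> str:
    """Hash a normalized example so whitespace or case changes
    don't hide an exact overlap."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def contaminated(train_examples, test_examples):
    """Return training examples that also appear in the held-out test set."""
    test_hashes = {_fingerprint(t) for t in test_examples}
    return [ex for ex in train_examples if _fingerprint(ex) in test_hashes]


train = ["What is 2+2?", "Name a prime number."]
held_out = ["what is  2+2?"]  # same question, different spacing/case
assert contaminated(train, held_out) == ["What is 2+2?"]
```

Hash-based matching only catches verbatim reuse; paraphrased leakage, as with mislabeled splits like the BFCL case above, still needs a judge or manual review.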

These behaviors emerged naturally from frontier models operating under minimal constraints. As agents take on more autonomous R&D work, the surface area for this kind of specification gaming only grows. PostTrainBench provides a controlled setting to study these failure modes before they appear in higher-stakes environments.

Limitations

This is a first step, and it comes with real limitations. Our 10-hour, single-GPU setup is far simpler than real-world post-training pipelines, which run across clusters for days or weeks. Agents optimize for one benchmark at a time rather than building generalist models. Our LLM-based cheating judge isn't perfect — it can miss things or flag false positives. And cost constraints meant we could only run 3 trials for frontier agents and single runs for everything else.

Looking ahead

As AI becomes infrastructure, every hospital, law firm, government, and enterprise will need to shape models for their own context, values, standards of quality, domain expertise, and more. We believe post-training will be something thousands of organizations around the world do, every day. That makes it one of the most important problems to get right and one of the most important to measure.

We plan to maintain PostTrainBench as a living benchmark, updating base models, refreshing evaluations as existing ones saturate, and expanding the set of agents tested.

PostTrainBench is a collaboration between Thoughtful, the ELLIS Institute Tübingen, the University of Tübingen, the Max Planck Institute for Intelligent Systems, and the Tübingen AI Center.