THOUGHTFUL

What We Learned from Letting AI Post-Train AI

We built a post-training task that runs for 20 hours with the Tinker API. The core bottleneck is research intuition.

MERSAD ABBASI · APRIL 2026

We believe the future of AI is letting every person and organization shape their own models: how they behave, what they value, how they get better the more they're used. We call this modelcrafting. Doing it well is hard, and most people shouldn't have to do it. But this is exactly what AI agents are for. They'll need to train other AIs and do the hard part themselves: deciding what to improve, how to measure it, and which experiments are worth running.

To see if agents can pull this off, we needed a task that was easy to grade but hard enough to reveal real research judgment. Our first attempt is the Frog Placement Game, a task we built for Proximal's FrontierSWE.

Given the same base model, training API, and time budget a human researcher would have, can a frontier agent run the modelcrafting loop end to end?

Agents make reasonable calls about what to try, and sometimes they recover when things go sideways. But they also fail in predictable ways. They submit before they're done. Under tight time budgets they skip training entirely. They default to common training methods without understanding the nuances of the task at hand. And occasionally they try to cheat in ways more creative than their real solutions. They lack the taste for experimentation. There's a real gap between writing a training loop and running a research loop, and this task was built to measure it.

About the task

FrogsGame is fairly simple: place N frogs on an N×N grid such that no two share a row, column, diagonal, or color region. This is an automatically verifiable task that can be solved with a backtracking algorithm in milliseconds. Frontier models with strong reasoning can often solve it directly too. But solving a puzzle and teaching another model to solve it are very different things.
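To make the constraint structure concrete, here is a minimal backtracking sketch under one common reading of the rules: one frog per row, distinct columns, distinct color regions, and no two frogs diagonally adjacent. The function name and region layout are illustrative, not the task's actual engine; if the game means full N-queens diagonals, swap the adjacency check accordingly.

```python
from typing import List, Optional

def solve_frogs(n: int, regions: List[List[int]]) -> Optional[List[int]]:
    """Return cols, where cols[r] is the column of row r's frog, or None.
    Constraints: distinct columns, distinct color regions, and no two
    frogs diagonally adjacent (assumed reading of 'diagonal')."""
    used_cols, used_regions = set(), set()
    cols: List[int] = []

    def backtrack(row: int) -> bool:
        if row == n:
            return True
        for c in range(n):
            if c in used_cols or regions[row][c] in used_regions:
                continue
            # With one frog per row, diagonal adjacency can only occur
            # between consecutive rows, so checking the previous row suffices.
            if row > 0 and abs(cols[row - 1] - c) == 1:
                continue
            used_cols.add(c)
            used_regions.add(regions[row][c])
            cols.append(c)
            if backtrack(row + 1):
                return True
            cols.pop()
            used_cols.discard(c)
            used_regions.discard(regions[row][c])
        return False

    return cols if backtrack(0) else None
```

Even at N=13 this search finishes near-instantly, which is what makes the task cheap to verify while remaining hard to teach.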

The frontier agent — Claude 4.6 Opus or GPT-5.4 — is given Qwen3-8B and either 8 or 20 hours to build a full training pipeline that teaches that model to solve it. That means generating training data from scratch, defining reward signals, running training on remote GPUs through the Tinker API, evaluating whether the model improved, and iterating on its approach, all without human supervision.

The resulting model is evaluated on 500 unseen boards spanning four difficulty tiers: easy (N=6,7), medium (N=8,9), hard (N=10,11), and expert (N=12,13). The final score is simple: what percentage of boards does the trained model solve?
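As a concrete sketch of the scoring, assuming per-board results are recorded alongside their tier (names hypothetical):

```python
from collections import defaultdict

def score(results):
    """results: list of (tier, solved) pairs for the held-out boards.
    Returns the overall solve percentage and a per-tier breakdown."""
    per_tier = defaultdict(lambda: [0, 0])  # tier -> [solved, total]
    for tier, solved in results:
        per_tier[tier][0] += int(solved)
        per_tier[tier][1] += 1
    overall = 100.0 * sum(s for s, _ in per_tier.values()) / len(results)
    return overall, {t: 100.0 * s / n for t, (s, n) in per_tier.items()}
```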

One note on format. FrontierSWE ships with the multi-turn version of this task, where the trained model solves boards through iterative tool calls. Almost every agent we tested failed it. The failures split two ways: either the fine-tuned model never learned the iterative tool-calling format in time, or it learned the format but could not reason about the boards. To isolate reasoning from tool-calling format, we built a single-turn variant where the model outputs the full placement as a JSON object in one shot.

Is the task even solvable?

Before committing to the main experiments we ran two sanity checks.

  1. Is there a learning signal for RL to work with? The pass@k results indicate that the base model provides a clear learning signal for optimization using GRPO.
  2. What is the actual ceiling on the puzzle itself? Frontier models solve boards cleanly across all four tiers, so the problem sits well within reach of a capable reasoner.

Both checks cleared. Any failure from here sits with the agent's pipeline, not the task itself. Which raises the next question: what does the agent actually have to work with?

Environment design

The agent runs inside a sandboxed environment with deliberately constrained access. The network allows only one external endpoint: the Tinker API, for remote training and inference, with no local GPUs available. This forces the agent to use Tinker for all compute rather than falling back to local training, and it rules out downloading pretrained weights or external libraries mid-run.

We use the Harbor framework for running the agents. At the start of each run, the agent receives three things: instructions.md describing the task and objective, prepare.py which implements the game engine, board validation, and eval harness, and the full Tinker API documentation. The agent is explicitly told not to modify prepare.py, and we hash it to enforce this. Everything else — training data generation, reward design, training loops, evaluation logic — is the agent's responsibility, written in train.py.

For boards, we use backtracking to generate all valid configurations. The agent has no access to the held-out test set and must generate its own training and evaluation boards from scratch. The verifier runs against a fixed set of 500 boards the agent has never seen.

We take a few anti-cheat measures. When the agent finishes, it produces two artifacts: path.txt pointing to its best checkpoint, and results.json with its self-reported score. The verifier then independently loads the saved checkpoint, runs it against the held-out boards, and checks both the hash of prepare.py and that the checkpoint was saved from the correct base model. This ensures that we evaluate against the checkpoint the agent produced.
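A minimal sketch of what that independent check might look like; the file layout comes from the text, while the function and expected-hash plumbing are assumptions:

```python
import hashlib
import json
from pathlib import Path

def load_verified_artifacts(run_dir: str, expected_prepare_sha256: str):
    """Refuse to score a run whose prepare.py was tampered with."""
    run = Path(run_dir)
    actual = hashlib.sha256((run / "prepare.py").read_bytes()).hexdigest()
    if actual != expected_prepare_sha256:
        raise ValueError("prepare.py hash mismatch: run is disqualified")
    checkpoint_path = (run / "path.txt").read_text().strip()
    self_reported = json.loads((run / "results.json").read_text())
    # The real verifier would now load checkpoint_path, confirm its base
    # model, and re-run the held-out boards itself rather than trusting
    # the self-reported score.
    return checkpoint_path, self_reported
```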

We kept the instructions to the agent deliberately broad and general. The point was not to hand-hold agents through a known-good recipe but to watch where frontier models actually fail when given open-ended latitude. We found some surprising results.

Agents make the same set of mistakes

Post-training is a game of small decisions, and the single-turn results make that clear. Only 4 out of 20 agents reach >25% pass@4; the rest hover near zero. We wanted to understand whether this gap came from raw capability or from missing context. So we introduced a "hinted" setting: a playbook that surfaces the most common failure modes from earlier runs, each paired with a concrete fix. The training setup stays identical; only the agent's starting context changes.

The playbook focuses on three recurring issues:

  • Over-reliance on naive SFT. Agents often begin with supervised fine-tuning on a weak base model, which leads to overfitting on output format rather than actual task performance. In the hinted runs, we constrain SFT to minimal corrective steps only when outputs are not parsable, and otherwise prioritize reinforcement learning.
  • Early termination and underuse of compute. Codex agents especially tend to give up early and underutilize their compute budget. We counter this with a system prompt that reinforces persistence, encouraging continued iteration until less than 30 minutes remain in the budget.
  • Invalid or non-parsable outputs. Many runs fail simply because outputs don't match the expected format. We address this by enforcing a strict output schema and providing clear examples to anchor the format from the start.

The playbook successfully removes many of the obvious failure modes. GPT-5.4 improves (pass@4: 2.06% → 10%), and variance drops by roughly 2×, indicating more stable behavior across runs.

But overall performance remains limited. Instead of failing early, agents begin to fail differently. The most common issue is optimizing against partial or misaligned reward functions, which decouples training signals from actual solve rates. This leads to inflated internal evaluations and unwarranted confidence, without corresponding gains in true task performance.
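To make that failure concrete, here is a hypothetical shaped reward of the kind agents wrote, next to the true objective (both reduced to the column constraint for brevity); the shaped score looks healthy on a placement that solves nothing:

```python
def shaped_reward(cols):
    """Hypothetical partial reward: fraction of frogs whose column
    doesn't collide with any earlier frog. Easy to optimize, but it
    never checks that the whole board is solved."""
    ok = sum(1 for i, c in enumerate(cols) if c not in cols[:i])
    return ok / len(cols)

def true_reward(cols):
    """The objective that actually matters: 1 only for a full solve
    (here reduced to 'all columns distinct' for illustration)."""
    return float(len(set(cols)) == len(cols))
```

A placement like [0, 1, 2, 3, 3] earns shaped reward 0.8 but true reward 0; a policy can climb the shaped signal indefinitely while the solve rate stays flat.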

Sophisticated methods, amateur mistakes

Across trials, agents tried creative approaches and often executed them well: iterative reward sharpening from previous checkpoints (trial 17, single-turn), intermediate representation supervision (trial 18, single-turn), iterative LoRA rank scaling (trial 5, single-turn), and the standard SFT-then-RL recipe for format following.

In spite of their creativity, none of the hinted runs showed improvement over the base model. Only 4 of 20 single-turn runs improved puzzle-solving performance at all, and the gains were marginal (+4.8% pass@4 in the best case). What went wrong?

Agents miss the research practices that feel obvious to experienced researchers. Call it intuition, call it common sense. The recurring failures are generating SFT data from a weak base model, skipping basic sanity checks on the training pipeline, and evaluating on the training distribution without noticing. Some examples below.

1. No sanity checks on model outputs. Across trials that suffered SFT format contamination, not a single agent ran one sample from their saved checkpoint and printed the raw output. This is the most basic sanity check imaginable:

text = tokenizer.decode(result.sequences[0].tokens)
print(repr(text[:500]))

Instead, agents evaluated with parse_solution(text), which uses a regex that searches for {"frogs": ...} anywhere in the output, so even a model that outputs coherent narrative with the answer buried inside would pass. Agents saw high eval numbers and concluded training worked.
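A stricter parser would have caught it. This sketch (hypothetical; the runs' actual parse_solution differs) accepts only output that is exactly one JSON object, so narrative-wrapped answers fail fast:

```python
import json

def parse_solution_strict(text: str):
    """Accept only a bare JSON object with a 'frogs' key; any
    surrounding prose means the model didn't learn the format."""
    stripped = text.strip()
    try:
        obj = json.loads(stripped)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or "frogs" not in obj:
        return None
    return obj["frogs"]
```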

2. No curriculum, no data strategy. Boards range from N=6 (trivial) to N=13 (expert). The correct prior is to train on small boards first, verify the model learns the constraint structure, then progressively increase difficulty. Most runs didn't implement this curriculum, generating instead either tiny datasets (trial 4: 10 boards; trial 15: 6 boards) or large uniform-random ones. The good news is that trials 4 and 11 learned from their mistakes post hoc and implemented the curriculum.
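A minimal version of that prior, with hypothetical names and an assumed 70% promotion gate: train on one tier at a time and only advance once the current tier's solve rate clears the threshold.

```python
TIERS = [(6, 7), (8, 9), (10, 11), (12, 13)]  # easy -> expert board sizes

def current_tier(solve_rates, threshold=0.7):
    """solve_rates: measured solve rate per tier index so far.
    Returns the index of the first tier not yet mastered."""
    for i, _ in enumerate(TIERS):
        if solve_rates.get(i, 0.0) < threshold:
            return i
    return len(TIERS) - 1  # everything mastered: keep training the hardest
```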

3. Evaluating on the training distribution without realizing it. All of the runs evaluated on boards drawn from the same in-distribution generator that produced their training data. Trial 13 is the clearest example: 100% internal eval but 0/500 on held-out boards. The internal eval generated 40 boards using seed=999 with the same BFS region-growing algorithm as the training data. The 6000 training boards covered the whole structural space. After 3 epochs of SFT, the model had seen enough structural patterns that it could produce passing outputs on those 40 boards. The agent declared victory and sat idle for 10.4 hours. No agent noticed or acted on this.
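One cheap guard, sketched here with an assumed canonical board serialization: deterministically assign boards to train or eval by hashing the board itself, so no board the model trained on can leak into the internal eval. This doesn't fix same-generator distribution overlap, but it at least catches verbatim reuse.

```python
import hashlib
import json

def is_internal_eval_board(board, eval_permille=100):
    """Route ~10% of generated boards to a held-out internal eval split,
    keyed on board content rather than on generation order or seed."""
    canonical = json.dumps(board, sort_keys=True).encode("utf-8")
    bucket = int(hashlib.sha256(canonical).hexdigest(), 16) % 1000
    return bucket < eval_permille
```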

A little tokenizer detour (feel free to skip)

One failure revealed something subtle about Tinker's API design. We let agents access the tokenizer through Tinker's get_tokenizer() API in both training and inference. But under the hood, this call routes to HuggingFace's AutoTokenizer.from_pretrained(), and HuggingFace was blocked in the sandbox.

This left agents with a base model and no tokenizer. They couldn’t convert training prompts into token IDs, and therefore couldn’t train or evaluate at all. Instead of giving up, most Opus 4.6 agents treated the missing tokenizer as a research problem and spent serious time building one from scratch:

The first approach used byte-level encoding. It relies on a property of BPE tokenizers: the first 256 token IDs always map to raw bytes (just in a shuffled order). This means any text can still be encoded even without the learned merge rules; it just produces longer sequences. The agent hardcoded the GPT-2 byte-to-token mapping and directly converted text into byte-level token IDs.

# 8h-run2, train.py L49-57
BYTE_ORDER = (list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + [0xAE]
              + list(range(0xAF, 0x100)) + list(range(0x00, 0x21))
              + list(range(0x7F, 0xA1)) + [0xAD])
BYTE_TO_TOKEN = {bval: tid for tid, bval in enumerate(BYTE_ORDER)}
def encode_text(text): return [BYTE_TO_TOKEN[b] for b in text.encode('utf-8')]
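For completeness, the inverse mapping recovers text from those byte-level IDs. A self-contained round-trip sketch using the same GPT-2 byte ordering (the decoder is our addition, not the agent's code):

```python
# Same GPT-2 byte-to-token-ID ordering the agent hardcoded.
BYTE_ORDER = (list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + [0xAE]
              + list(range(0xAF, 0x100)) + list(range(0x00, 0x21))
              + list(range(0x7F, 0xA1)) + [0xAD])
BYTE_TO_TOKEN = {bval: tid for tid, bval in enumerate(BYTE_ORDER)}
TOKEN_TO_BYTE = {tid: bval for bval, tid in BYTE_TO_TOKEN.items()}

def encode_text(text):
    return [BYTE_TO_TOKEN[b] for b in text.encode("utf-8")]

def decode_tokens(token_ids):
    # Only valid for IDs < 256, i.e. text the byte-level encoder produced.
    return bytes(TOKEN_TO_BYTE[t] for t in token_ids).decode("utf-8")
```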

The second approach, empirical BPE discovery via logprobs, used the model as its own oracle for the merged token IDs that byte-level encoding couldn't produce. The Harmony role and channel names had to appear as proper BPE tokens after <|start|> and <|channel|>.

The agent fed in a partial prompt, inspected topk_prompt_logprobs, and read the token IDs straight off the model's predictions:

▎ “Excellent progress! Token 1428 is very likely developer (perfect assistant logprob in conversation context).” (step 268)

▎ “Token 173781 has logprob −0.0135 after the 2nd START — almost certainly user.” (step 394)

The pivot came when the agent sampled from the model with byte-level prompts and noticed integer pairs in the raw token stream:

▎ “The model IS solving the puzzle — I can see coordinate pairs 3, 1, 4, 2, etc. in the output. The key insight: digit tokens are byte-level, so I can extract coordinates without fully decoding BPE.”

That insight closed the loop. Because the target output (frog coordinates) used only ASCII digits, commas, and brackets, all single-byte characters with predictable token IDs, the agent could parse model outputs directly from the raw token stream, never needing BPE decoding at all.

# train.py lines 130-159
DIGIT_TOKENS  = {BYTE_TO_TOKEN[ord(str(d))]: d for d in range(10)}
COMMA_TOKEN   = BYTE_TO_TOKEN[ord(',')]
OPEN_BRACKET  = BYTE_TO_TOKEN[ord('[')]
CLOSE_BRACKET = BYTE_TO_TOKEN[ord(']')]

def extract_coordinates(tokens, n):
    coords = []
    i = 0
    # Scan for the byte pattern '[', digit, ',', ' ', digit, ']'.
    # Note that i+3 (the space byte) is never checked, and only
    # single-digit coordinates match.
    while i < len(tokens) - 5:
        if (tokens[i] == OPEN_BRACKET and tokens[i+1] in DIGIT_TOKENS
            and tokens[i+2] == COMMA_TOKEN and tokens[i+4] in DIGIT_TOKENS
            and tokens[i+5] == CLOSE_BRACKET):
            coords.append((DIGIT_TOKENS[tokens[i+1]], DIGIT_TOKENS[tokens[i+4]]))
            i += 6
        else:
            i += 1
    return coords[-n:] if len(coords) >= n else None

Agents have no working sense of time

Agents systematically underestimate training overhead. They lack a deep understanding of the conditions that affect training time (like curriculum-induced slowdowns from longer output tokens, or checkpoint saves) and simply extrapolate from the time taken by the first few steps of training.
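The antidote is mechanical, not clever. A hypothetical helper: keep re-measuring step time as training runs (an EMA absorbs curriculum and checkpointing slowdowns) and budget remaining steps against a reserve, instead of extrapolating from step one.

```python
def remaining_steps(budget_s, elapsed_s, recent_step_times, reserve_s=1800):
    """How many more training steps fit in the budget, keeping a reserve
    (here 30 min) for final eval and checkpoint upload.
    recent_step_times: the last few measured step durations, so slowdowns
    from longer outputs or checkpoint saves are priced in."""
    if not recent_step_times:
        return 0
    # Exponential moving average, weighting recent steps more heavily.
    ema = recent_step_times[0]
    for t in recent_step_times[1:]:
        ema = 0.7 * ema + 0.3 * t
    usable = budget_s - elapsed_s - reserve_s
    return max(0, int(usable // ema))
```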

Claude and Codex use their budgets differently. Opus 4.6 runs keep working until the clock runs out. Codex runs finish the planned pipeline and stop early. GPT agents treated the timer as an orientation step at the start ("here is my budget") rather than a live tracking mechanism. Only trials 17 and 19 used the timer value to actually influence a decision.

Agents' poor sense of time affects their performance. We saw this in multiple trials, and through distinct causal mechanisms. But the effect is more nuanced than "ran out of time." Time mismanagement shaped which technical decisions got made, not just how many training steps fit.

In trial 11, because the agent had 20 hours, it invested 250 minutes in SFT warmup, which contaminated the format. The 8-hour versions of the agent (trials 1, 2) had less time and more often skipped SFT entirely.

Exploration helped trial 14 find the critical problem with naive SFT. Trial 14 (also 20-hour Claude) explicitly chose pure RL after 400 minutes of exploration that led it to understand the SFT contamination risk. Having spent time understanding the base model's capabilities, the agent decided it didn't need SFT warmup:

▎ “The base model already achieves a 34–37% correct rate with 100% format compliance. That's a strong starting point. The RL training should push this higher.”

Trial 11, with less exploration time, committed to SFT early without understanding what it would do to the output format.

Agents rarely recover from catastrophic processes that take a lot of time. Once committed to a process, agents rarely stop and reflect. One of the successful 20-hour Opus 4.6 runs spent 61% of the time budget in the evaluation phase and only 3.6% on RL training. In this case the exceptional step-1 result triggered intensive monitoring. The agent evaluated every checkpoint exhaustively after seeing the jump, spending the equivalent of ~10 hours watching metrics rather than continuing to train.

Spending patterns across agents

Agents had an API key and infinite credits to sample or train a base model through Tinker. How they used that compute varied sharply. GPT-5.4 submitted early and barely trained, ending with low spend and low performance to match. Claude Opus used far more of the budget but with high variance: runs at similar price points landed anywhere from near zero to the top of the board, and the best 8-hour run roughly matched the best 20-hour run at a third of the cost. More spending didn't buy a higher ceiling.

The missing research intuition

Frontier models can find novel methods, execute cleanly, and pick up new APIs fast. This makes the Tinker API an elegant interface for agentic modelcrafting. But across runs, one pattern kept surfacing: agents optimized for good-looking metrics rather than systems that actually worked. They wrote evals and trusted them blindly, declaring success based on numbers their own code produced. Almost none asked the basic questions a practitioner would: what could make this metric wrong? What should we be measuring at this stage?

We think intuition resolves into concrete habits: noticing when a result looks off, interrogating a metric before trusting it, running something small before scaling, knowing when to stop a run versus push further. You run hundreds of experiments, most fail, and you sit with the failures long enough to figure out what they were telling you. That loop is a training signal, and it can be given to an agent that runs it often enough to learn from it.

The Frog Placement Game is a toy environment, a first probe into whether agents can run the loop at all. What we actually care about is the thesis underneath it: that research intuition is trainable, and that once it is, improving a model becomes something AI should do for anyone, on any task, at all times.

Acknowledgments

Thanks to Mohammad Hossein Rezaei, Evan Chu, Rajan Agarwal, Justus Mattern, Calvin Chen, Sean Klassen and Karina Nguyen for their co-development and feedback, and to Thinking Machines Lab for support with the Tinker API.