Two LLMs competing on coding problems to train each other

The core idea: two instances of the same model solve identical coding problems independently. The better solution becomes the chosen response and the worse one the rejected response in a DPO pair. Fine-tune on those pairs, repeat, and measure on HumanEval (held out, never trained on).

What makes this different from standard RLHF or self-play:

The reward signal is pure execution. No human labels, no judge model, no curated outputs. The model never sees the test assertions — it only gets back what Python actually threw. Code passes or it doesn't. Partial credit via pass_count / total_tests. Same core idea as o1/R1 (verifiable reward) but using DPO instead of PPO/GRPO, so it runs on local hardware.
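A minimal sketch of that execution-only reward, assuming each hidden test is a Python `assert` statement. `execution_score` is a hypothetical helper name, not the repo's actual API:

```python
def execution_score(solution_code: str, tests: list[str]) -> float:
    """Partial-credit reward: fraction of hidden assertions the code passes."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate function
    except Exception:
        return 0.0  # code that doesn't even run scores zero
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)  # each test is an assert statement
            passed += 1
        except Exception:
            pass  # AssertionError or any runtime error counts as a fail
    return passed / len(tests)  # pass_count / total_tests
```

The model only ever sees the exception text Python produced, never the assertions themselves.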

Both-fail rounds still generate training signal: when both agents fail, the one with the higher partial pass rate becomes chosen. No round is wasted.

Four specialists per agent, same model, different sampling temperatures: logical (0.3), creative (0.7), skeptical (0.4), empathetic (0.5). Temperature variance alone is enough to produce genuinely different solutions from the same weights. The coordinator picks whichever specialist's solution passed the most assertions.
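The specialist ensemble reduces to something like the sketch below. `generate` and `score` stand in for the repo's actual model call and test runner; the names and signatures are assumptions:

```python
# Same weights, four temperatures; the coordinator keeps the draft that
# scores highest on the hidden assertions.
SPECIALISTS = {"logical": 0.3, "skeptical": 0.4, "empathetic": 0.5, "creative": 0.7}

def coordinate(problem: str, tests: list[str], generate, score) -> str:
    """Sample one draft per specialist, return the best-scoring one."""
    drafts = {name: generate(problem, temperature=t)
              for name, t in SPECIALISTS.items()}
    best = max(drafts, key=lambda name: score(drafts[name], tests))
    return drafts[best]
```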

Agents also build persistent memory across sessions: episodic retrieval via embeddings, plus pattern consolidation into semantic memory at the end of each cycle (a sleep phase). This mirrors Complementary Learning Systems theory. In practice, before attempt 1 the model sees something like "the last 3 times you got an IndexError on a list problem, it was off-by-one."
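The episodic side can be sketched with plain cosine similarity over stored (problem, lesson) pairs. `EpisodicMemory` and the injected `embed` function are hypothetical; the real repo presumably uses a proper embedding model:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

class EpisodicMemory:
    """Store (problem, lesson) episodes; recall the k most similar
    past lessons to prepend to the prompt before attempt 1."""
    def __init__(self, embed, k: int = 3):
        self.embed, self.k = embed, k
        self.episodes: list[tuple[list[float], str]] = []

    def add(self, problem: str, lesson: str) -> None:
        self.episodes.append((self.embed(problem), lesson))

    def recall(self, problem: str) -> list[str]:
        q = self.embed(problem)
        ranked = sorted(self.episodes, key=lambda e: cosine(q, e[0]), reverse=True)
        return [lesson for _, lesson in ranked[: self.k]]
```

Consolidation (the sleep phase) would then summarize recurring episodes into standing semantic rules, which this sketch omits.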

First numbers, on a Colab A100 with 1 cycle of 10 rounds: baseline Pass@1 0.671 → 0.683 (+1.2pp) from 39 DPO pairs. Early, but directionally right.

Vibecoded with Claude Code. Code: https://github.com/info-arnav/CogArch

submitted by /u/Outrageous_Mark9761
