/u/Outrageous_Mark9761

Two LLMs competing on coding problems to train each other

/u/Outrageous_Mark9761 / April 16, 2026

The core idea: two instances of the same model independently solve the same coding problems. For each problem, the better solution becomes the chosen response and the worse one becomes the rejected response in a DPO pair. Fine-tune on those pairs, then repeat. Progress is measured on HumanEval, which is never trained on. What makes this different fr…
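To make the pairing step concrete, here is a rough sketch in Python. The post preview does not say how "better" is decided, so this assumes the fraction of unit tests passed as the score. The names `pass_rate`, `build_pair`, and `PreferencePair` are made up for illustration and are not from the original project. Ties are dropped because they carry no preference signal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    """One DPO training example: same prompt, preferred vs. dispreferred answer."""
    prompt: str
    chosen: str
    rejected: str


def pass_rate(solution_code: str, tests: list[str]) -> float:
    """Assumed scoring rule: fraction of assert-style tests the solution passes.

    WARNING: exec() runs arbitrary code. Model-generated solutions should be
    run in a sandbox in any real setup; this is only a toy scorer.
    """
    if not tests:
        return 0.0
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate's functions
    except Exception:
        return 0.0  # code that does not even load scores zero
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)  # a failing assert raises AssertionError
            passed += 1
        except Exception:
            pass
    return passed / len(tests)


def build_pair(
    prompt: str, solution_a: str, solution_b: str, tests: list[str]
) -> Optional[PreferencePair]:
    """Turn two independent solutions into a chosen/rejected pair, or None on a tie."""
    score_a = pass_rate(solution_a, tests)
    score_b = pass_rate(solution_b, tests)
    if score_a == score_b:
        return None  # no preference signal, skip this problem
    if score_a > score_b:
        return PreferencePair(prompt, chosen=solution_a, rejected=solution_b)
    return PreferencePair(prompt, chosen=solution_b, rejected=solution_a)


# Toy example: one correct and one buggy solution to the same problem.
prompt = "Write add(a, b) that returns the sum of a and b."
good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]

pair = build_pair(prompt, buggy, good, tests)
```

In this toy run the buggy solution passes only one of the two tests, so `pair.chosen` ends up being the correct solution no matter which instance produced it. Pairs collected this way across many problems would then feed one DPO fine-tuning round before the loop repeats.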
