LocalLLaMA

Two LLMs competing on coding problems to train each other

The core idea: two instances of the same model independently solve the same coding problems. For each problem, the better solution becomes the chosen response and the worse one becomes the rejected response in a DPO pair. Fine-tune on those pairs, repeat, and measure on HumanEval, which is never used for training. What makes this different fr…
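
To make the pairing step concrete, here is a minimal sketch of turning two candidate solutions into one DPO pair. The post does not say how "better" is judged, so this assumes the simplest criterion: whichever solution passes more unit tests wins, and ties are skipped. The names `score_solution` and `make_dpo_pair` are hypothetical, and the `prompt`/`chosen`/`rejected` keys are the usual preference-pair layout, not something confirmed by the post.

```python
from typing import Any, List, Optional, Tuple

TestCase = Tuple[tuple, Any]  # (positional args, expected return value)


def score_solution(code: str, func_name: str, tests: List[TestCase]) -> int:
    """Count how many test cases a candidate solution passes.

    Assumption: each solution defines a function called `func_name`.
    exec() on model output is unsafe; a real run would need a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
    except Exception:
        return 0  # code that fails to load passes nothing
    func = namespace.get(func_name)
    if not callable(func):
        return 0
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on this case counts as a failure
    return passed


def make_dpo_pair(
    prompt: str,
    solution_a: str,
    solution_b: str,
    func_name: str,
    tests: List[TestCase],
) -> Optional[dict]:
    """Return a prompt/chosen/rejected record, or None on a tie."""
    score_a = score_solution(solution_a, func_name, tests)
    score_b = score_solution(solution_b, func_name, tests)
    if score_a == score_b:
        return None  # no preference signal, so skip this problem
    if score_a > score_b:
        chosen, rejected = solution_a, solution_b
    else:
        chosen, rejected = solution_b, solution_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Toy problem with two hand-written "model outputs" for illustration.
prompt = "Write add(a, b) that returns the sum of a and b."
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = [((1, 2), 3), ((0, 0), 0)]

pair = make_dpo_pair(prompt, bad, good, "add", tests)
```

Skipping ties is a design choice in this sketch: a pair where both solutions score the same carries no preference signal, so including it would only add noise to the fine-tuning set.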