LocalLLaMA

I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math

A few months ago, I got stuck on one line in the DeepSeek-R1 paper. It said models could improve through verifiable rewards. That sounded almost magical to me, not because it seemed impossible, but because it made me wonder something very simple: Wh…