cs.LG

Hard Negative Sample-Augmented DPO Post-Training for Small Language Models

arXiv:2512.19728v2 Announce Type: replace
Abstract: Large language models (LLMs) continue to struggle with mathematical reasoning, and common post-training pipelines often reduce each generated solution to a binary outcome: correct or incorrect. This …