A Study on Hidden Layer Distillation for Large Language Model Pre-Training
arXiv:2605.11513v1
Abstract: Knowledge Distillation (KD) is a critical tool for training Large Language Models (LLMs), yet the majority of research focuses on approaches that rely solely on output logits, neglecting semantic information…
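
To make the distinction the abstract draws concrete, the sketch below contrasts logit-only KD (a KL divergence between temperature-softened output distributions) with hidden-layer distillation (matching intermediate representations). This is a minimal PyTorch sketch under assumed names (logit_kd_loss, hidden_kd_loss, proj, temperature); it illustrates the two loss families in general, not the specific method of this paper.

    # Minimal sketch: logit-only KD vs. hidden-layer KD.
    # All names below are illustrative assumptions, not the paper's method.
    import torch
    import torch.nn.functional as F

    def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
        """Classic KD: KL divergence between softened output distributions."""
        s = F.log_softmax(student_logits / temperature, dim=-1)
        t = F.softmax(teacher_logits / temperature, dim=-1)
        # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
        return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

    def hidden_kd_loss(student_hidden, teacher_hidden, proj):
        """Hidden-layer KD: match intermediate representations.

        `proj` is a learned linear map aligning the student's hidden width
        with the teacher's (needed when the two widths differ).
        """
        return F.mse_loss(proj(student_hidden), teacher_hidden)

In practice the two losses are typically combined with a weighting coefficient, since hidden-state matching supplies a denser training signal while the logit term keeps the student's output distribution anchored to the teacher's.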