cs.CL, cs.LG

Don’t Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation

arXiv:2602.20816v2 Announce Type: replace
Abstract: The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be domi…
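For context, below is a minimal sketch of the standard forward-KL distillation objective the abstract refers to, written in a PyTorch style. The function name, temperature parameter, and reduction choice are illustrative assumptions, not details from the paper; the point is only to show the loss whose head-dominated behavior the paper critiques.

import torch
import torch.nn.functional as F

def forward_kl_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 temperature: float = 1.0) -> torch.Tensor:
    """Standard forward KL(teacher || student) over the vocabulary.

    Both logit tensors have shape (batch, vocab_size). Because the loss
    weights each token by the teacher's probability, high-probability
    (head) tokens dominate the gradient, while the long tail of
    low-probability tokens contributes little.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(p_T || p_S) = sum_v p_T(v) * (log p_T(v) - log p_S(v));
    # F.kl_div takes the student term in log space and the teacher
    # term in probability space.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # The t**2 factor is the conventional rescaling when distilling
    # with a softened temperature (Hinton et al., 2015).
    return kl * (t ** 2)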