cs.CL, cs.LG

Don’t Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation

arXiv:2602.20816v2 Announce Type: replace
Abstract: The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be domi…
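For context, below is a minimal sketch of the standard forward-KL distillation objective the abstract refers to, written in a PyTorch style. The function name, temperature parameter, and reduction choice are illustrative assumptions, not details from the paper; the point is only to show the loss whose head-dominated behavior the paper critiques.

import torch
import torch.nn.functional as F

def forward_kl_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 temperature: float = 1.0) -> torch.Tensor:
    """Standard forward KL(teacher || student) over the vocabulary.

    Both logit tensors have shape (batch, vocab_size). Because the loss
    weights each token by the teacher's probability, high-probability
    (head) tokens dominate the gradient, while the long tail of
    low-probability tokens contributes little.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(p_T || p_S) = sum_v p_T(v) * (log p_T(v) - log p_S(v));
    # F.kl_div takes the student term in log space and the teacher
    # term in probability space.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # The t**2 factor is the conventional rescaling when distilling
    # with a softened temperature (Hinton et al., 2015).
    return kl * (t ** 2)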