Cat-DPO: Category-Adaptive Safety Alignment
arXiv:2604.17299v1 Announce Type: new
Abstract: Aligning large language models with human preferences must balance two competing goals: responding helpfully to legitimate requests and reliably refusing harmful ones. Most preference-based safety alignm…
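The abstract is cut off before the method is described, but the title suggests a DPO-style preference objective whose behavior adapts to the safety category of each example. The sketch below is a hypothetical illustration of one way such an objective could look, assuming "category-adaptive" means a per-category temperature β in the standard DPO loss; the function name, the category scheme, and all tensors are invented for illustration and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def cat_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 category_ids, betas):
    """Hypothetical category-adaptive DPO loss.

    Standard DPO applies a single global temperature beta to the
    policy/reference log-ratio margin; here each example's beta is
    looked up from its safety-category id. This is an assumption
    about what "category-adaptive" could mean, not the paper's method.
    """
    # Per-example log-ratio of policy to reference, for chosen and rejected
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    margins = chosen_rewards - rejected_rewards

    # Per-category temperature instead of one global scalar
    beta_per_example = betas[category_ids]

    # Standard DPO negative log-sigmoid objective, scaled per category
    return -F.logsigmoid(beta_per_example * margins).mean()


# Usage sketch with made-up categories (0 = benign request, 1 = harmful request)
betas = torch.tensor([0.1, 0.5])                       # hypothetical per-category betas
category_ids = torch.tensor([0, 1, 1])
policy_chosen = torch.tensor([-12.0, -15.0, -14.0])    # log-probs under the policy
policy_rejected = torch.tensor([-13.0, -14.0, -13.5])
ref_chosen = torch.tensor([-12.5, -15.5, -14.2])       # log-probs under the reference
ref_rejected = torch.tensor([-12.8, -14.2, -13.6])

loss = cat_dpo_loss(policy_chosen, policy_rejected,
                    ref_chosen, ref_rejected,
                    category_ids, betas)
```

One intuition for such a design: a larger β on harmful-request categories would penalize deviations from the preferred (refusing) response more sharply, while a smaller β on benign categories would leave more room for helpful variation. Whether the paper takes this route cannot be confirmed from the truncated abstract.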