Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
arXiv:2605.14517v1 Announce Type: cross
Abstract: Holistic evaluation scores capture overall output quality but do not distinguish between two outcomes: a model reproducing the structural form of a user’s request, and a model preserving the user’s specific intent….