Evalet: Evaluating Large Language Models through Functional Fragmentation
arXiv:2509.11206v4 Announce Type: replace-cross
Abstract: Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through “LLM-as-a-Judge” approaches. However, these methods produce holistic scores that obscur…