From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
arXiv:2604.14137v2 Announce Type: replace-cross
Abstract: Evaluating LLMs is challenging because benchmark scores often fail to capture models’ real-world usefulness. Instead, users frequently rely on “vibe-testing”: informal, experience-based evaluation, suc…