Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation
arXiv:2603.28769v1 Announce Type: cross
Abstract: Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets gr…