Frameworks For Supporting LLM/Agentic Benchmarking [P]
I think the way we are approaching benchmarking is a bit problematic. From reading about how frontier labs benchmark their models, they essentially create a new model, configure a harness, and then run a massive benchmarking suite just to demonstrate m…