Stop benchmarking inference providers: a guide to easy evaluation

Hey! Nathan from Hugging Face here. I maintained the Open LLM Leaderboard, and in that time I evaluated around 10k models. I think there's a pretty big misconception in how people benchmark LLMs.

Most setups I see rely on inference providers like OpenRouter or Hugging Face's inference providers.

That's convenient, but there's a catch:

You’re often not actually benchmarking the model. You’re benchmarking the provider.

Between quantization, hidden system prompts, routing, and even silent model swaps, the results can be far from the model's actual performance.

The actual “source of truth” for open source models is transformers.

So instead of evaluating through providers, I switched to:

  • Running models via transformers serve (OpenAI-compatible server)
  • Using inspect-ai as the eval harness
  • Spinning everything up with HF Jobs (on-demand GPUs)
  • Publishing results back to the hub

This way:

  • You control exactly what model is being run
  • You get reproducible results
  • You can scale to a lot of models without too much infra pain

Once everything is wired up, benchmarking becomes almost trivial.

You can run something like:

hf jobs uv run script.py \
  --flavor l4x1 \
  --secrets HF_TOKEN \
  -e TRANSFORMERS_SERVE_API_KEY="1234"
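As a rough sketch of what `script.py` could contain (the model id, task name, and port below are placeholders, and flag names may differ across versions — the real script is in the article linked further down), it can declare its dependencies inline for uv, start the server, and run the harness against it:

```python
# /// script
# dependencies = ["inspect-ai", "transformers"]
# ///
"""Hypothetical sketch of a self-contained eval script for `hf jobs uv run`."""
import os
import subprocess


def inspect_cmd(model: str, task: str, base_url: str) -> list[str]:
    """Build the inspect-ai CLI invocation targeting a local OpenAI-compatible server."""
    return [
        "inspect", "eval", task,
        "--model", f"openai/{model}",
        "--model-base-url", base_url,
    ]


if __name__ == "__main__":
    model = os.environ.get("MODEL_ID", "Qwen/Qwen2.5-7B-Instruct")
    # 1) Start the OpenAI-compatible server (check `transformers serve --help`
    #    for the exact flags on your version).
    server = subprocess.Popen(["transformers", "serve"])
    try:
        # 2) Run the benchmark against the local endpoint.
        subprocess.run(
            inspect_cmd(model, "inspect_evals/gpqa_diamond", "http://localhost:8000/v1"),
            check=True,
        )
    finally:
        server.terminate()
```

Keeping dependencies in the inline `# /// script` header is what lets `hf jobs uv run` execute the file on a fresh GPU with no extra setup.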

And just swap:

  • the model
  • the hardware
  • the benchmark (GPQA, SWE-bench, AIME, etc.)

You can then push eval results back to model repos and have them show up in community leaderboards on Hugging Face.
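One hedged way to get results back onto the Hub is to upload a results file into the model repo with `huggingface_hub` (the repo id, file paths, and record layout here are invented for illustration; the exact format leaderboards expect is described in the article):

```python
"""Sketch: write a minimal results record and upload it to a model repo."""
import json
from pathlib import Path


def results_payload(model_id: str, task: str, score: float) -> dict:
    """Shape a minimal, self-describing results record (layout is my own, not a Hub standard)."""
    return {"model": model_id, "task": task, "score": score}


if __name__ == "__main__":
    payload = results_payload("me/my-finetune", "gpqa_diamond", 0.42)
    Path("results.json").write_text(json.dumps(payload, indent=2))

    # Requires an HF_TOKEN with write access to the target repo.
    from huggingface_hub import HfApi  # third-party: pip install huggingface_hub

    HfApi().upload_file(
        path_or_fileobj="results.json",
        path_in_repo="evals/gpqa_diamond.json",
        repo_id="me/my-finetune",
    )
```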

Here is a more detailed article I wrote describing the process: https://huggingface.co/blog/SaylorTwift/benchmarking-on-the-hub

Curious to hear your thoughts!

  • Are you benchmarking via providers or self-hosted?
  • Have you run into inconsistencies between endpoints?
  • Any better setups/tools I should look at?

Happy to share more details if people are interested.

submitted by /u/HauntingMoment
