kaivu - Provide.ai


Benchmarking Real Work

kaivu / May 16, 2026

Thanks to Megan Kinniment for helpful comments and discussion. TL;DR: Benchmarks like HCAST undersample fuzzy (hard-to-evaluate) tasks, so they might overestimate capability on long-horizon work. To sample fuzzy tasks we need to increase judge capa…

Author name: kaivu
