Benchmarking Real Work

Thanks to Megan Kinniment for helpful comments and discussion.

TL;DR: Benchmarks like HCAST undersample fuzzy (hard-to-evaluate) tasks, meaning they might overestimate capability on long-horizon work. To sample fuzzy tasks we need to increase judge capa…