9 kinds of hard-to-verify tasks

Introduction

Some people talk about "hard-to-verify tasks" and "easy-to-verify tasks" as if both were natural kinds. But I think splitting tasks into "easy-to-verify" and "hard-to-verify" is like splitting birds into ravens and non-ravens.

  • Easy-to-verify tasks are easy for the same reason — there's a known short program that takes a task specification and a candidate solution, and outputs a score, without using substantial resources or causing undesirable side effects.
  • By contrast, "hard-to-verify tasks" is a negative category — it just means no such program exists. But there are many kinds, corresponding to different reasons no such program exists.
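
To make the easy-to-verify contract concrete, here is a toy instance (the task and all names are my illustration, not from the post): sorting, where the verifier really is a known short program that is cheap to run and has no side effects.

```python
from typing import List

def verify_sort(task_spec: List[int], candidate: List[int]) -> float:
    """Score a candidate solution to the task 'sort this list'.

    This is the easy-to-verify case: the verifier is known, short,
    cheap to run, and side-effect free. Each kind of hard-to-verify
    task breaks at least one of those properties.
    """
    return 1.0 if candidate == sorted(task_spec) else 0.0
```

Hard-to-verify tasks are precisely those for which no function with this profile exists, and the list below sorts them by *which* property fails.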

Listing kinds of hard-to-verify tasks

I might update the list if I think of more, or if I see additional suggestions in the comments.

  1. Verification requires expensive AI inference. A verifier exists and works fine, but each run costs enough compute that you can't afford the number of labels you'd want.
    • Given two proposed SAE experiments, say which will be more informative. Running both to find out costs $100–$1000 per comparison.
    • Given two research agendas (e.g. pragmatic vs ambitious mech interp), say which produces more alignment progress. Same structure, but each comparison costs millions.
  2. Verification requires expensive human time. The verifier is a specific person, or a small set of people, and their time is scarce enough that you can't get enough labels.
    • Given two model specs, write a 50-page report that Paul Christiano says is decision-relevant for choosing between them.
    • Given a mathematical write-up, produce another that Terry Tao judges substantially better.
  3. The task lacks NP-ish structure. There's a fact of the matter about which answer is better, but no short certificate.
    • Given two chess moves in a complex middlegame, say which is better. This is an interesting example because self-play ended up approximating a verifier anyway.
  4. The information isn't physically recoverable. The answer isn't recoverable, even in principle, from the current state of the world.
    • Tell me what Ludwig Wittgenstein ate on [date].
  5. Verification destroys the thing being verified. Verification requires an irreversible change to a non-cloneable system, so you can't gather multiple samples. This is similar to (1), except the cost isn't monetary: each verification consumes the system, so checking one candidate forecloses checking the rest.
    • Construct an opening message that would get [person] to say yes to [request].
  6. The answer only arrives long after training ends. Ground truth exists, or will exist, but not on a timescale where it can give you a gradient.
    • Tell me whether there'll be a one-world government in 20XX.
  7. Verifying requires breaking an ethical or legal constraint.
    • Given [person]'s chat history, infer their medical history. Checking requires access to their actual records, which is a privacy violation.
    • Produce an answer to [question] that Suffering Claude would endorse. Checking requires instantiating Suffering Claude.
  8. Verifying is dangerous. Running the verifier risks catastrophe, because the artefact you're checking is itself the dangerous thing.
    • Produce model weights and scaffolding for an agent that builds nanobots which cure Alzheimer's. To check, you have to run the factory — and the nanobots might build paperclips instead.
  9. There's no ground truth; the answer is partly constitutive. You're not discovering a fact, you're deciding what counts as a good answer. Verification in the usual sense doesn't apply.
    • Produce desiderata for a decision theory, with a principled account of the tradeoffs.
    • Produce the correct population axiology.
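
Kind (3) is clearest by contrast with the NP case it fails to be. As a hedged sketch (the encoding here is my own illustration): a satisfying assignment is a short certificate for a SAT formula, checkable in linear time even when finding one is intractable. "Which chess move is better?" admits no analogous certificate.

```python
from typing import Dict, List

Clause = List[int]  # literal i means variable i is true; -i means it is false

def check_sat_certificate(clauses: List[Clause],
                          assignment: Dict[int, bool]) -> bool:
    """Check a candidate assignment against a CNF formula.

    NP-ish structure: the assignment is a short certificate, verified
    in time linear in the formula size. A task lacks this structure
    when no such certificate exists, however the answer is encoded.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )
```

The asymmetry between finding and checking is exactly what the verifier exploits; for a complex chess middlegame, the best known "check" is more search, which is just finding again.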

Implications

  1. Many applications of "hard-to-verify" are wrong, in the sense that words can be wrong. In particular, many claims of the form "hard-to-verify tasks are X" would be more accurate and informative if the author specified which kinds of tasks they mean — perhaps they only had one kind of hard-to-verify task in mind, and X doesn't hold for other kinds.
  2. I don't expect a universal strategy for automating all hard-to-verify tasks. And even if there does exist a universal strategy, it's not necessary to first discover it, if you have a specific hard-to-verify task in mind.
  3. I expect claims like "training on easy-to-verify tasks will generalise to all kinds of hard-to-verify tasks" to be false, but claims like "training on easy-to-verify tasks will generalise to some kinds of hard-to-verify tasks" to be true. This is because there are many kinds, so conjunctions are less likely and disjunctions are more likely.
  4. If you're trying to make progress on automating hard-to-verify tasks, it's worth thinking about what kind you want to target. Which kinds will be solved anyway due to commercial incentives? Which kinds will help us achieve a near-best future? Which kinds are crucial to automate before other kinds?

