Introduction
Some people talk about "hard-to-verify tasks" and "easy-to-verify tasks" like these are both natural kinds. But I think splitting tasks into "easy-to-verify" and "hard-to-verify" is like splitting birds into ravens and non-ravens.
- Easy-to-verify tasks are easy for the same reason — there's a known short program that takes a task specification and a candidate solution, and outputs a score, without using substantial resources or causing undesirable side effects.
- By contrast, "hard-to-verify tasks" is a negative category — it just means no such program exists. But there are many kinds, corresponding to different reasons no such program exists.
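The "known short program" notion can be made concrete with a toy sketch. A minimal, hypothetical example (the names `verify`, `spec`, and `candidate` are illustrative, not from any real system), using sorting as the stand-in easy-to-verify task:

```python
from typing import List

def verify(spec: List[int], candidate: List[int]) -> float:
    """Score a candidate solution to a sorting task.

    Cheap, deterministic, and side-effect free: the hallmarks of an
    easy-to-verify task. Returns a score in [0, 1].
    """
    correct = sorted(spec)
    if candidate == correct:
        return 1.0
    # Partial credit: fraction of positions matching the sorted order.
    matches = sum(1 for a, b in zip(candidate, correct) if a == b)
    return matches / max(len(correct), 1)

# A perfect candidate and an imperfect one.
print(verify([3, 1, 2], [1, 2, 3]))  # 1.0
print(verify([3, 1, 2], [1, 3, 2]))  # 0.333...
```

Every kind listed below breaks at least one clause of this picture: the program is not short, not cheap, not side-effect free, or does not exist at all.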
Listing kinds of hard-to-verify tasks
I might update the list if I think of more, or if I see additional suggestions in the comments.
1. Verification requires expensive AI inference. A verifier exists and works fine, but each run costs enough compute that you can't afford the number of labels you'd want.
    - Given two proposed SAE experiments, say which will be more informative. Running both to find out costs $100–$1000 per comparison.
    - Given two research agendas (e.g. pragmatic vs ambitious mech interp), say which produces more alignment progress. Same structure, but each comparison costs millions.
2. Verification requires expensive human time. The verifier is a specific person, or a small set of people, and their time is scarce enough that you can't get enough labels.
    - Given two model specs, write a 50-page report that Paul Christiano says is decision-relevant for choosing between them.
    - Given a mathematical write-up, produce another that Terry Tao judges substantially better.
3. The task lacks NP-ish structure. There's a fact of the matter about which answer is better, but no short certificate.
    - Given two chess moves in a complex middlegame, say which is better. This is an interesting example because self-play ended up approximating a verifier anyway.
4. The information isn't physically recoverable. The answer isn't recoverable, even in principle, from the current state of the world.
    - Tell me what Ludwig Wittgenstein ate on [date].
5. Verification destroys the thing being verified. Verification requires an irreversible change to a non-cloneable system, so you can't gather multiple samples. This is similar to (1), but rather than a monetary cost, it's the opportunity cost of verifying other samples instead.
    - Construct an opening message that would get [person] to say yes to [request].
6. The answer only arrives long after training ends. Ground truth exists, or will exist, but not on a timescale where it can give you a gradient.
    - Tell me whether there'll be a one-world government in 20XX.
7. Verifying requires breaking an ethical or legal constraint.
    - Given [person]'s chat history, estimate their medical record. Checking requires their actual records, which is a privacy violation.
    - Produce an answer to [question] that Suffering Claude would endorse. Checking requires instantiating Suffering Claude.
8. Verifying is dangerous. Running the verifier risks catastrophe, because the artefact you're checking is itself the dangerous thing.
    - Produce model weights and scaffolding for an agent that builds nanobots which cure Alzheimer's. To check, you have to run the factory, and the nanobots might build paperclips instead.
9. There's no ground truth; the answer is partly constitutive. You're not discovering a fact, you're deciding what counts as a good answer. Verification in the usual sense doesn't apply.
    - Produce desiderata for a decision theory, with a principled account of the tradeoffs.
    - Produce the correct population axiology.
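The "lacks NP-ish structure" kind is easiest to see by contrast with a task that has it. A minimal sketch, using subset-sum as a stand-in problem (the function names are mine, purely illustrative): finding a certificate is the expensive direction, while checking a proposed one is a few cheap lines.

```python
from itertools import combinations
from typing import List, Optional

def check_certificate(numbers: List[int], target: int, subset: List[int]) -> bool:
    """Checking an NP certificate is cheap: a membership check plus a sum."""
    pool = list(numbers)
    for x in subset:
        if x in pool:
            pool.remove(x)
        else:
            return False  # Certificate uses a number not in the instance.
    return sum(subset) == target

def find_certificate(numbers: List[int], target: int) -> Optional[List[int]]:
    """Finding a certificate is the hard direction: brute force here,
    exponential in the size of the instance."""
    for r in range(1, len(numbers) + 1):
        for combo in combinations(numbers, r):
            if sum(combo) == target:
                return list(combo)
    return None

cert = find_certificate([3, 34, 4, 12, 5, 2], 9)
print(cert, check_certificate([3, 34, 4, 12, 5, 2], 9, cert))
```

For "which chess move is better" or "which agenda produces more alignment progress", no analogue of `check_certificate` is known: there is nothing short you can hand over that settles the question.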
Implications
- Many applications of "hard-to-verify" are wrong, in the sense that words can be wrong. In particular, many claims of the form "hard-to-verify tasks are X" would be more accurate and informative if the author specified which kinds of tasks they mean — perhaps they only had one kind of hard-to-verify task in mind, and X doesn't hold for other kinds.
- I don't expect a universal strategy for automating all hard-to-verify tasks. And even if a universal strategy exists, you don't need to discover it first if you have a specific hard-to-verify task in mind.
- I expect claims like "performance on easy-to-verify tasks will generalise to all kinds of hard-to-verify tasks" are false, but claims like "performance on easy-to-verify tasks will generalise to some kinds of hard-to-verify tasks" are true. This is because there are many kinds, so conjunctions are less likely and disjunctions are more likely.
- If you're trying to make progress on automating hard-to-verify tasks, it's worth thinking about what kind you want to target. Which kinds will be solved anyway due to commercial incentives? Which kinds will help us achieve a near-best future? Which kinds are crucial to automate before other kinds?
Discuss