The Eliciting Latent Knowledge (ELK) problem, for the unfamiliar:

Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.

But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.

In these cases, the prediction model "knows" facts (like "the camera was tampered with") that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?

The most analyzed version of the problem is a toy scenario with a diamond in the vault:

Imagine you are developing an AI to control a state-of-the-art security system intended to protect a diamond from theft. The security system, the SmartVault, is a building with a vast array of sensors and actuators which can be combined in complicated ways to detect and stop even very sophisticated robbery attempts. (...) The SmartVault can execute plans sufficiently sophisticated that humans can’t really know if the diamond is safe or merely appears safe. Whatever complicated hard-to-follow sequence of actions the search procedure found might actually have replaced the diamond with a fake, or tampered with the camera

So we have humans with their human ontology, the Predictor (the model we need to interpret), the Planner (can be ignored for simplicity), and maybe we're training a Reporter (intended to interpret the Predictor). We want to know whether there's a real diamond in the vault.

We assume the Predictor is inner-aligned to myopically making predictions, so it won't strategically mislead us by outputting wrong information at the right time. We also assume that something (Z) strongly correlated with the real diamond is important for the Predictor's ability to make good predictions (this is ARC's working definition of knowledge).

I was wondering whether the problem is possible to solve with brute force, at least if we have unlimited compute:

1. Formalize the human model of physics. (Let's assume the model is true enough or we can check if it's true enough.)
2. Prove that the Predictor's model reduces to the human model under normal conditions. This will give us a correspondence between the human ontology and the Predictor's ontology.

Is step 2 possible, at least with unlimited compute? This is basically the crux of my question. Some reasons to believe it's possible:

We do something like this with models of physics. For example, we prove that general relativity reduces to Newtonian mechanics under normal conditions. With enough compute I could take String Theory and check all the 10⁵⁰⁰ possible universes to find ours. Since the Predictor is inner-aligned, it can be treated basically as a model of physics.[1]

In philosophy, this is called intertheoretic reduction (see the Stanford Encyclopedia of Philosophy). It is not fully defined yet, but it's reasonable to believe it is definable.

There should exist a canonical way to match one model of physics to another. If there are multiple, equally good ways to map the current human physics to the true physics (leading to different interpretations of reality), then it would mean the human physics isn't grounded in anything at all.
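As a concrete instance of the reduction mentioned above, here is the standard textbook sketch of how general relativity recovers Newtonian gravity. The assumptions are mine, not the post's: a weak, static field, speeds much smaller than c, and metric signature (-,+,+,+). Φ denotes the Newtonian potential and ρ the mass density.

```latex
% Weak-field metric: a small perturbation of flat spacetime.
\begin{align}
  g_{\mu\nu} &= \eta_{\mu\nu} + h_{\mu\nu}, \qquad |h_{\mu\nu}| \ll 1,
  \qquad h_{00} = -\frac{2\Phi}{c^{2}} \\
% Geodesic equation for a slow particle (v << c) in a static field.
  \frac{d^{2}x^{i}}{dt^{2}} &\approx -c^{2}\,\Gamma^{i}_{00},
  \qquad \Gamma^{i}_{00} \approx -\tfrac{1}{2}\,\partial_{i} h_{00}
  = \frac{\partial_{i}\Phi}{c^{2}} \\
% Result: Newton's law of motion in a potential.
  \frac{d^{2}x^{i}}{dt^{2}} &= -\,\partial_{i}\Phi \\
% The 00 component of Einstein's equations for dust (T_00 = rho c^2)
% gives Poisson's equation.
  R_{00} &\approx \frac{\nabla^{2}\Phi}{c^{2}} = \frac{4\pi G\rho}{c^{2}}
  \quad\Longrightarrow\quad \nabla^{2}\Phi = 4\pi G\rho
\end{align}
```

In this example "normal conditions" has a precise meaning (|Φ|/c² ≪ 1 and v ≪ c), and the correspondence between ontologies is explicit: the metric component g₀₀ plays the role of the potential Φ. Step 2 would need an analogous statement for the Predictor.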

I believe I'm not just restating the problem. I'm saying there should be a principled way to just prove how exactly one model reduces to the other instead of using arbitrary tricks to create a Reporter.
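To illustrate what "brute-forcing a reduction" could mean, here is a minimal toy sketch. Everything in it is hypothetical: the two tiny models, the state labels, and the function names (`find_reductions`, `predictor_step`, and so on) are made up for illustration. It treats a reduction as a map f from Predictor states to human states that commutes with both models' dynamics and agrees on camera observations, and it enumerates every possible map.

```python
from itertools import product

ACTIONS = ["wait", "steal"]

# Hypothetical human model: two coarse states, one camera reading each.
HUMAN_STATES = ["diamond", "empty"]

def human_step(h, action):
    # Stealing empties the vault; waiting changes nothing.
    return "empty" if action == "steal" else h

def human_obs(h):
    return "see_diamond" if h == "diamond" else "see_empty"

# Hypothetical Predictor model: four opaque latent states 0..3.
# Internally s = 2 * diamond_bit + micro_bit, where micro_bit is a
# low-level variable with no counterpart in the human ontology.
PREDICTOR_STATES = [0, 1, 2, 3]

def predictor_step(s, action):
    diamond_bit, micro_bit = divmod(s, 2)
    if action == "steal":
        diamond_bit = 0
    return 2 * diamond_bit + (1 - micro_bit)  # micro_bit flips every step

def predictor_obs(s):
    return "see_diamond" if s >= 2 else "see_empty"

def find_reductions(check_observations=True):
    """Enumerate all maps f: Predictor states -> human states and keep
    those that commute with the dynamics (and, optionally, observations)."""
    reductions = []
    for image in product(HUMAN_STATES, repeat=len(PREDICTOR_STATES)):
        f = dict(zip(PREDICTOR_STATES, image))
        dynamics_ok = all(
            f[predictor_step(s, a)] == human_step(f[s], a)
            for s in PREDICTOR_STATES
            for a in ACTIONS
        )
        obs_ok = all(
            predictor_obs(s) == human_obs(f[s]) for s in PREDICTOR_STATES
        )
        if dynamics_ok and (obs_ok or not check_observations):
            reductions.append(f)
    return reductions

if __name__ == "__main__":
    print(find_reductions())
    print(len(find_reductions(check_observations=False)))
```

In this toy case exactly one map survives both checks, and it ignores the micro-variable. Dropping the observation check lets a second, degenerate map through (everything maps to "empty"), which is a small-scale version of the worry about multiple equally good mappings. The search is exponential in the number of Predictor states, so this only makes sense under the unlimited-compute assumption.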

Related work

The most similar proposal I remember is "have AI help humans improve our understanding". It was dismissed for being uncompetitive. Yet even uncompetitive ELK is arguably not solved, and there are deeper reasons to reject the proposal. It also relies on more roundabout methods than just using brute force to find the intertheoretic reduction.

ARC's agenda on heuristic explanations and the no-coincidence principle sounds similar too, but it's more ambiguous than intertheoretic reduction. It's a claim about all of math.

There's also proposal 2.1 by Abram Demski, but I can't entirely parse it.

Problems I see

Some problems with step 2 that I can think of:

A) The Predictor might be a mess of high-level and low-level variables.

B) The Predictor might use multiple models of physics at the same time. A similar thing is mentioned in this counterexample.

C) The Predictor might use heuristics instead of physical laws. (Though that, by ARC's definition of knowledge, would mean the Predictor doesn't "know" the consequences of the laws beyond the heuristics.)

I think it's pretty uncertain how much those are "real" problems. Maybe a clever version of intertheoretic reduction would deal with them automatically. A mathematical/philosophical analysis is needed. In any case, it feels like a big missed opportunity to not talk about intertheoretic reduction in the context of ELK at all.

[1] This might require making an additional assumption that the Predictor models physics, but ARC is ready to make it, I think.
