Author name: Arjun Khandelwal

Uncategorised

Early-stage empirical work on “spillway motivations”

Arjun Khandelwal / May 1, 2026

Previously, we proposed spillway motivations as a way to mitigate misalignment induced by training a model on flawed reward signals. In this post, we present some early-stage empirical results showing how spillway motivations can be used to mitigat…

Uncategorised

To what extent is Qwen3-32B predicting its persona?

Arjun Khandelwal / April 30, 2026

TL;DR: We test to what extent Qwen3-32B behaves as though it is trying to predict what “Qwen3” would do. We do this by using Synthetic Document Finetuning (SDF) to instill meta-beliefs of the form “Qwen3 believes X, even though X is false”, then check w…