Architectural Observability Collapse in Transformers
arXiv:2604.24801v2 Announce Type: replace
Abstract: Activation monitoring can catch confident errors in autoregressive transformers only if training preserved an internal decision-quality signal that output confidence does not expose. Monitorability is an architectural property before it is a monitor-design problem. We define observability: the linear readability of per-token decision quality from frozen mid-layer activations after controlling for max-softmax confidence and activation norm. Confidence controls absorb, on average, 60.3% of the raw probe signal across 14 models in six families.
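The measurement the abstract describes can be sketched as a linear probe on frozen activations, followed by a partial correlation that residualizes both the probe readout and the quality labels against the confidence controls. All data below is synthetic and the variable names are hypothetical; this is a minimal illustration of the statistic, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (hypothetical): per-token mid-layer activations,
# binary decision-quality labels, and a confidence score that partly
# tracks quality (as max-softmax confidence typically does).
n, d = 2000, 64
acts = rng.normal(size=(n, d))
quality = (acts[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(float)
conf = 0.4 * quality + 0.3 * rng.normal(size=n)
act_norm = np.linalg.norm(acts, axis=1)

# Linear probe: least-squares readout of quality from frozen activations.
w, *_ = np.linalg.lstsq(acts, quality - quality.mean(), rcond=None)
probe = acts @ w

# Partial correlation: regress out the controls (confidence, activation
# norm) from both the probe output and the labels, then correlate residuals.
controls = np.column_stack([np.ones(n), conf, act_norm])

def residualize(y):
    beta, *_ = np.linalg.lstsq(controls, y, rcond=None)
    return y - controls @ beta

rho_raw = np.corrcoef(probe, quality)[0, 1]
rho_partial = np.corrcoef(residualize(probe), residualize(quality))[0, 1]
print(f"raw rho:     {rho_raw:.3f}")
print(f"partial rho: {rho_partial:.3f}")
```

The partial correlation is what survives after confidence controls absorb their share of the raw probe signal; the abstract's "60.3% absorbed" figure corresponds to the gap between the raw and partial statistics.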
Observability is not a generic property of transformers. In Pythia's controlled suite, all three tested runs at the 24-layer, 16-head configuration collapse to rho_partial ~0.10 across a 3.5x parameter gap and two Pile variants, while six other configurations occupy a separated healthy band from 0.21 to 0.38. The output-controlled residual r_OC collapses at the same points; neither nonlinear probes nor layer sweeps recover healthy-range signal. Checkpoint dynamics localize the cause: both matched-width configurations form the signal at the earliest measured checkpoint, and training erases it in the 1.4B even as it reaches lower final loss than the 1B.
Across independent recipes the collapse map changes but the phenomenon persists. Qwen 2.5 and Llama differ by 2.9x in observability at matched 3B scale, with probe-seed distributions that do not overlap. Mistral 7B v0.3 preserves observability where Llama 3.1 8B collapses at an identical 32-layer, 32-head, 4096-hidden shape. Within Qwen 2.5, observability persists from 0.5B through 32B. A WikiText-trained observer transfers to downstream QA without task-specific training: at a 20% flag rate, exclusive catch reaches 10.9-13.4% in seven of nine model-task cells, near the 12-15% language-modeling ceiling. Architecture selection is a monitoring decision.
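The exclusive-catch metric can be sketched as follows: flag the top 20% of tokens by observer score and by a confidence baseline, and count the errors only the observer catches. The scores and error labels here are synthetic and the names hypothetical; this illustrates the metric's shape under stated assumptions, not the paper's evaluation pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-ins (hypothetical): ground-truth error tokens plus two
# suspicion scores, where higher means "more likely an error".
errors = rng.random(n) < 0.15
baseline_score = rng.random(n) + 0.5 * errors   # confidence-derived baseline
observer_score = rng.random(n) + 0.8 * errors   # activation-probe observer

def flagged(scores, rate=0.20):
    # Flag the top `rate` fraction of tokens by score.
    k = int(rate * len(scores))
    mask = np.zeros(len(scores), dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

obs_flag = flagged(observer_score)
base_flag = flagged(baseline_score)

# Exclusive catch: errors the observer flags that the confidence baseline
# misses at the same flag budget, as a fraction of all errors.
exclusive_catch = (errors & obs_flag & ~base_flag).sum() / errors.sum()
print(f"exclusive catch @ 20% flag rate: {exclusive_catch:.3f}")
```

Holding the flag budget fixed for both methods is what makes the comparison fair: exclusive catch measures only the errors that output confidence could not have surfaced at the same cost.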