Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
arXiv:2604.02608v1 Announce Type: new
Abstract: Function vectors (FVs) — mean-difference directions extracted from in-context learning demonstrations — can steer large language model behavior when added to the residual stream. We hypothesized that F…
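As an illustration of the terms used in the abstract, the following minimal sketch computes a mean-difference direction, adds it to a residual-stream state, and reads it through an unembedding matrix in the style of the logit lens. It is not the paper's implementation. All function names, the toy activations, the scaling factor `alpha`, and the choice of a "baseline" prompt set as the contrast for the mean difference are assumptions made for illustration. A real logit-lens readout would typically also apply the model's final normalization layer, which is omitted here.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def function_vector(icl_acts: List[Vector], baseline_acts: List[Vector]) -> Vector:
    """Mean-difference direction: mean(ICL activations) - mean(baseline activations).

    Assumption: the contrast set is a collection of baseline prompts; the
    abstract does not specify what the difference is taken against.
    """
    mu_icl = mean_vector(icl_acts)
    mu_base = mean_vector(baseline_acts)
    return [a - b for a, b in zip(mu_icl, mu_base)]


def steer(residual: Vector, fv: Vector, alpha: float = 1.0) -> Vector:
    """Add the scaled function vector to a residual-stream state."""
    return [h + alpha * v for h, v in zip(residual, fv)]


def logit_lens(vec: Vector, unembed: List[Vector]) -> List[float]:
    """Project a hidden-size vector through an unembedding matrix (vocab x hidden)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in unembed]


# Toy data (made up for illustration): hidden size 3, vocabulary size 2.
icl_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
baseline_acts = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
unembed = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

fv = function_vector(icl_acts, baseline_acts)      # steering direction
steered = steer([0.5, 0.5, 0.5], fv, alpha=2.0)    # steered residual state
fv_logits = logit_lens(fv, unembed)                # per-token scores for the direction
```

In this toy setup, steering changes the residual state regardless of whether the per-token scores in `fv_logits` point to any interpretable token, which is the distinction the title draws between "steerable" and "decodable".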