Zero-Shot Alignment: Harm Detection via Incongruent Attention Mechanisms

Intro: I made a small adapter (~4.7M parameters) that sits on top of a frozen Phi-2 model and forces it through two mathematically opposing attention mechanisms. The initial result is that it generalizes past its sparse training, sometimes into surp…