Emergent misalignment evident in activations at low poisoning doses – long before behavioral checks flag it
Status: Exploration with adapters for Llama 3.2-3B + 3.1-8B at various poisoning dose levels using medical advice examples from Turner et al 2025 with a bare-bones compute budget. Conclusions are exploratory; replication on other models and bad data ty…