Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

Introduction

Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire).

Post-training can introduce undesired side effects that are difficult to detect and even harder to trace to specific training datapoints. We show that a probe-based metho…