Uncategorised

Training Model to Predict Its Own Generalization: A Preliminary Study

tl;drWe study how well LLMs can be trained to answer questions like “what will happen if I am trained on examples like XYZ”, focusing on emergent misalignment and other cases of surprising generalization.We see signs of life on the less surprising form…