Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study
arXiv:2603.24125v1
Abstract: During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typic…