MachineLearning

Could refusal layers be masking dialect-conditioned safety failures in MoE models? [D]

I set out to test whether AAVE-coded (African American Vernacular English) prompts cause MoE language models to route, deliberate, and respond differently from semantically matched AE (Academic English) prompts in safety-sensitive situations, especiall…