Sanjeeevan Selvaganapathy, Mehwish Nasim

Confident, Calibrated, or Complicit: Safety Alignment and Ideological Bias in LLM Hate Speech Detection

Sanjeeevan Selvaganapathy, Mehwish Nasim / May 5, 2026

arXiv:2509.00673v2
Abstract: We investigate the efficacy of Large Language Models (LLMs) in detecting implicit and explicit hate speech, examining how models with minimal safety alignment (uncensored) compare with more hea…
