cs.AI, cs.CL, cs.IR

Confident, Calibrated, or Complicit: Safety Alignment and Ideological Bias in LLM Hate Speech Detection

arXiv:2509.00673v2 Announce Type: replace-cross
Abstract: We investigate the efficacy of Large Language Models (LLMs) in detecting implicit and explicit hate speech, examining how models with minimal safety alignment (uncensored) compare with more hea…