cs.AI, cs.LG

Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry

arXiv:2603.26846v1 Announce Type: new
Abstract: As Large Language Models (LLMs) expand in capability and application scope, their trustworthiness becomes critical. A vital risk is intrinsic deception, wherein models strategically mislead users to achi…