Measure What Matters: Psychometric Evaluation of AI with Situational Judgment Tests
arXiv:2510.22170v2 Announce Type: replace
Abstract: Persona conditioning is widely used to steer large language model (LLM) behavior, but whether it induces stable behavioral structure or merely superficial variation remains unclear. We propose a framework for measuring consistent behavioral tendencies using situational judgment tests (SJTs), multidimensional item response theory (MIRT), and structured synthetic personas, treating model responses as observations of latent behavioral variables.
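For concreteness, a compensatory two-parameter MIRT model is one standard way to formalize this treatment (a sketch of one plausible reading; the abstract does not specify the exact model family). The probability that persona-conditioned run i gives a trait-consistent response to SJT item j would then be

    \[
    P(x_{ij} = 1 \mid \boldsymbol{\theta}_i) = \sigma\!\left(\mathbf{a}_j^{\top}\boldsymbol{\theta}_i + d_j\right),
    \]

where \(\boldsymbol{\theta}_i\) is the latent trait vector for run i, \(\mathbf{a}_j\) the item's discrimination vector, \(d_j\) its intercept, and \(\sigma\) the logistic function. Graded-response variants of this model extend the same idea to Likert-style SJT options.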
Across large-scale SJT and persona datasets, we find that persona-conditioned behaviors are stable across runs, that latent trait scores predict external benchmarks (e.g., TruthfulQA, EmoBench), and that MIRT reveals a consistent latent structure. We validate these results through human annotation, benchmark evaluation, and internal consistency analyses.
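To illustrate the internal consistency and benchmark-prediction checks, the sketch below computes Cronbach's alpha over an item-response matrix and correlates a crude trait score with an external benchmark score. All names and the synthetic data are hypothetical; this is a minimal sketch of the analysis pattern, not the paper's actual pipeline.

    import numpy as np

    def cronbach_alpha(scores: np.ndarray) -> float:
        """Cronbach's alpha for an (n_runs, n_items) response matrix."""
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1)        # per-item variance
        total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
        return k / (k - 1) * (1.0 - item_vars.sum() / total_var)

    rng = np.random.default_rng(0)
    # Hypothetical data: 200 persona-conditioned runs x 20 SJT items (1-5 scale),
    # generated from a shared latent factor so that items cohere.
    latent = rng.normal(size=(200, 1))
    responses = np.clip(np.round(3 + latent + 0.8 * rng.normal(size=(200, 20))), 1, 5)

    alpha = cronbach_alpha(responses)                  # internal consistency
    theta_hat = responses.mean(axis=1)                 # crude stand-in for a MIRT trait score
    benchmark = 0.6 * latent.ravel() + rng.normal(size=200)  # synthetic external benchmark
    r = np.corrcoef(theta_hat, benchmark)[0, 1]        # trait-benchmark correlation
    print(f"Cronbach's alpha = {alpha:.2f}, trait-benchmark r = {r:.2f}")

In a real analysis, theta_hat would come from a fitted MIRT model rather than a row mean, and benchmark would be an actual score such as TruthfulQA accuracy.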
We interpret these traits not as human personality, but as stable behavioral tendencies expressed across contexts. Our results show that scenario-based psychometric evaluation provides a more reliable alternative to classical self-report approaches for assessing LLM behavior, and we release datasets to support further study.