Author name: Solomon Messing

Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking

Solomon Messing / April 23, 2026

arXiv:2604.11581v3 Announce Type: replace
Abstract: LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet standard confidence intervals ignore variability from prompt phr…

cs.CL

Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking

Solomon Messing / April 17, 2026

arXiv:2604.11581v2 Announce Type: replace
Abstract: LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet these scores carry hidden uncertainty: rephrasing the prompt, sw…

cs.CL

Decomposing and Reducing Hidden Measurement Error in LLM Evaluation Pipelines

Solomon Messing / April 14, 2026

arXiv:2604.11581v1 Announce Type: new
Abstract: LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet these scores carry hidden uncertainty: rephrasing the prompt, switch…