Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
arXiv:2604.11581v3 Announce Type: replace
Abstract: LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet standard confidence intervals ignore variability from prompt phrasing, model temperature, and judge model choice. The omitted variance produces under-coverage that worsens with more data and can shift results enough to reverse conclusions. The same unmeasured variance opens benchmarks to exploitation: model developers can optimize against measurement noise instead of genuine performance, as Singh et al. (2025) document. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and uses design-study projections to reduce total error. We show that a small-sample pilot suffices both to derive confidence intervals that approach nominal coverage and to identify which design changes yield the largest precision gains. Applying the approach to ideology annotation, safety classification, MMLU benchmarking, and a human-validated propaganda audit reveals that the dominant variance source differs by domain and scoring method. We further show that optimized budget allocation halves estimation error at equivalent cost on MMLU, and that on our propaganda audit the recommended pipeline outperforms 73% of single-configuration alternatives against a human baseline.
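As a concrete illustration of the decomposition the abstract describes, here is a minimal sketch on synthetic pilot data (all names, sizes, and effect magnitudes are illustrative assumptions, not the paper's method). It separates the sampling variance of a benchmark mean, which shrinks as more items are scored, from the between-configuration variance induced by prompt phrasing or judge choice, which does not, and shows how a naive confidence interval omits the latter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pilot: one model scored under K pipeline configurations
# (e.g., prompt paraphrase x judge model), each on the same n items.
# Rows = configurations, columns = items (1 = scored correct).
K, n = 8, 200
true_acc = 0.72
config_shift = rng.normal(0.0, 0.04, size=K)  # design sensitivity (assumed)
p = np.clip(true_acc + config_shift, 0.0, 1.0)[:, None]
pilot = rng.binomial(1, p, size=(K, n)).astype(float)

config_means = pilot.mean(axis=1)
grand_mean = config_means.mean()

# Sampling variance of a single configuration's mean: shrinks as 1/n.
sampling_var = pilot.var(axis=1, ddof=1).mean() / n

# Between-configuration variance component: method-of-moments estimate.
# The spread of configuration means reflects both components, so we
# subtract the sampling part; this term does NOT shrink with more items.
between_var = max(config_means.var(ddof=1) - sampling_var, 0.0)

naive_se = np.sqrt(sampling_var)                # standard practice
total_se = np.sqrt(sampling_var + between_var)  # includes design variance

print(f"grand mean accuracy: {grand_mean:.3f}")
print(f"naive 95% CI half-width:     +/- {1.96 * naive_se:.3f}")
print(f"corrected 95% CI half-width: +/- {1.96 * total_se:.3f}")
```

The subtraction is a one-way random-effects variance decomposition: as n grows, the naive interval collapses while the corrected one floors at the between-configuration component, which is the under-coverage mechanism the abstract names.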