Test of Time: Rethinking Temporal Signal of Benchmark Contamination
arXiv:2509.00072v3 Announce Type: replace
Abstract: Post-cutoff performance decay has been widely interpreted as a temporal signal of benchmark contamination. We critically examine this belief and demonstrate that the temporal signal is highly sensitive to how benchmark questions are constructed. Specifically, we show that LLM-generated questions can produce remarkably different temporal patterns from fill-in-the-blank questions retrieved directly from the very same materials. We validate this finding on prior benchmarks that reported clear post-cutoff performance decay, such as LiveCodeBench, and further show that a simple LLM transformation can effectively remove this temporal pattern when the same models are evaluated. We also provide a mechanistic account of our observations using influence function analysis. Overall, this work offers a new perspective on the sensitivity of the temporal contamination signal and highlights the need for more robust contamination detection methods for reliable AI evaluation.