Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
arXiv:2603.21454v2
Abstract: LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods – paraphrase consistency, n-gram ove…