Inference-Time Code Selection via Symbolic Equivalence Partitioning
arXiv:2604.06485v2
Abstract: Sampling multiple candidate programs at inference time is an effective way to improve LLM code generation. However, its benefit depends on reliably selecting a correct solution from the generated pool. We observe that this selection problem has a distinctive semantic structure: correct solutions, despite differences in syntax, implementation, or algorithmic strategy, often converge to the same functional behavior over valid inputs. At the same time, consensus alone is not sufficient for correctness, because models can also produce correlated wrong solutions that implement the same mistaken behavior. We propose Symbolic Equivalence Partitioning (SEP), an inference-time selection framework that first uses problem-provided public examples as lightweight validity signals. SEP then uses symbolic execution to partition the remaining candidate programs into bounded functional equivalence classes and selects from the dominant equivalence class. Across HumanEval+ and LiveCodeBench, SEP consistently improves selection accuracy without auxiliary test generation, learned verifiers, or additional LLM inference. At $N=10$, SEP improves average accuracy from 0.754 to 0.826 on HumanEval+ and from 0.565 to 0.647 on LiveCodeBench, showing that symbolic functional agreement is an effective signal for inference-time code selection.
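The selection pipeline described in the abstract (filter by public examples, partition the survivors by functional behavior, pick from the largest class) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses symbolic execution to establish bounded equivalence, whereas the sketch below substitutes exhaustive evaluation over a small finite input domain as a stand-in. All names (`passes_public_examples`, `behavior_signature`, `select_candidate`) and the toy absolute-value task are hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple

Candidate = Callable[[int], int]


def passes_public_examples(prog: Candidate, examples: Sequence[Tuple[int, int]]) -> bool:
    """Validity filter: the candidate must reproduce every public example."""
    for x, expected in examples:
        try:
            if prog(x) != expected:
                return False
        except Exception:
            return False
    return True


def behavior_signature(prog: Candidate, domain: Iterable[int]) -> tuple:
    """Stand-in for symbolic execution (an assumption of this sketch):
    record the candidate's output, or exception type, on each input of a
    bounded domain. Equal signatures mean equivalence on that domain only."""
    sig = []
    for x in domain:
        try:
            sig.append(("ok", prog(x)))
        except Exception as exc:
            sig.append(("err", type(exc).__name__))
    return tuple(sig)


def select_candidate(
    candidates: Sequence[Candidate],
    examples: Sequence[Tuple[int, int]],
    domain: Sequence[int],
) -> Optional[Candidate]:
    """Filter by public examples, partition survivors into bounded
    equivalence classes, and return a member of the largest class."""
    survivors = [p for p in candidates if passes_public_examples(p, examples)]
    if not survivors:
        return None
    classes: Dict[tuple, List[Candidate]] = defaultdict(list)
    for p in survivors:
        classes[behavior_signature(p, domain)].append(p)
    dominant = max(classes.values(), key=len)  # ties go to the first class seen
    return dominant[0]


# Toy task: absolute value. Three syntactically different correct candidates,
# plus wrong candidates that either fail or pass the public examples.
candidates: List[Candidate] = [
    lambda x: x if x >= 0 else -x,  # correct
    lambda x: abs(x),               # correct
    lambda x: max(x, -x),           # correct
    lambda x: x,                    # wrong, but passes the public examples
    lambda x: x * x,                # wrong, but passes the public examples
    lambda x: 0,                    # wrong, rejected by the public examples
]
public_examples = [(0, 0), (1, 1)]
chosen = select_candidate(candidates, public_examples, range(-3, 4))
```

In this toy run, the public examples reject only the constant-zero candidate; the two remaining wrong candidates each form a singleton class, so the three-member class of correct programs is dominant. The sketch does not guard against the failure mode the abstract notes, where correlated wrong solutions form the largest class.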