cs.CL

Benchmarking Testing in Automated Theorem Proving

arXiv:2604.23698v1
Abstract: Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such a…