Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight
arXiv:2512.19691v3
Abstract: Reference labels for machine-learning benchmarks are increasingly synthesized with LLM assistance, but their reliability remains underexamined. We audit MedCalc-Bench, a clinical benchmark for medica…