cs.AI, stat.AP

Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

arXiv:2512.19691v3 Announce Type: replace
Abstract: Reference labels for machine-learning benchmarks are increasingly synthesized with LLM assistance, but their reliability remains underexamined. We audit MedCalc-Bench, a clinical benchmark for medica…