Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
arXiv:2509.22258v5
Abstract: Recent vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets pred…