RDMA: Cost Effective Agent-Driven Rare Disease Mining from Electronic Health Records
arXiv:2507.15867v2 Announce Type: replace
Abstract: Rare diseases affect 1 in 10 Americans yet remain systematically underdocumented in clinical records. ICD-based systems cannot capture their breadth, over 50\% of Orphanet codes lack a direct ICD mapping and only 2.2\% of HPO codes have matching ICD codes, leaving patient populations invisible and delaying diagnosis. Mining unstructured clinical notes offers a direct path forward, but real notes are long, noisy, and abbreviation-dense, and limited annotations make fine-tuning infeasible, demanding approaches that generalize without task-specific training. We present Rare Disease Mining Agents (RDMA), an agentic framework equipping smaller quantized LLMs with tools for abbreviation resolution, implicit phenotype reasoning, and ontology grounding against Orphanet and HPO. RDMA substantially outperforms fine-tuned and RAG-based baselines across benchmarks with different data characteristics, without any task-specific training. A small quantized model achieves maximal performance, reducing inference costs by up to 10x and local hardware costs by up to 17x, enabling private deployment on standard hardware without cloud-based PHI exposure. RDMA's uncertainty-flagging mechanism further reduces expert annotation burden while preserving agreement quality, supporting scalable rare disease documentation in clinical practice. Available at https://github.com/jhnwu3/RDMA.