Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis
arXiv:2506.08849v4 Announce Type: replace
Abstract: Vision-Language Foundation Models (VLFMs) exhibit remarkable generalization, yet their direct application to medical ultrasound is hindered by a profound modality gap. The unique acoustic physics of ultrasound, characterized by speckle noise, shadowing, and heterogeneous textures, often degrades the performance of off-the-shelf VLFMs. To bridge this gap, we propose a novel Hybrid Tuning (HT) strategy for the parameter-efficient adaptation of CLIP-based models to ultrasound analysis. Instead of updating the pre-trained weights, HT freezes the visual backbone and integrates a specialized lightweight adapter. This adapter features a Frequency Filtering module to suppress domain-specific periodic artifacts and a Noise Estimation module to dynamically calibrate feature representations. Extensive evaluations across six multi-center datasets demonstrate that our HT-enhanced models significantly outperform existing state-of-the-art adapters and medical VLFMs in both segmentation and classification tasks. Notably, HT exhibits exceptional data efficiency in few-shot scenarios and robust cross-dataset generalization. Our findings show that preserving pre-trained semantic priors while explicitly modeling ultrasound-specific noise is key to unlocking foundational intelligence in automated ultrasound diagnosis. The source code is available at https://github.com/jinggqu/NextGen-UIA.
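To make the adapter design concrete, below is a minimal PyTorch sketch of the pattern the abstract describes: a frozen visual backbone with a lightweight residual adapter combining a frequency-filtering step (a learnable soft mask in the 2D Fourier domain, to damp periodic speckle-like artifacts) and a noise-estimation branch (per-channel recalibration from global feature statistics). This is an illustrative reconstruction under stated assumptions, not the authors' released implementation; all module names (`FrequencyFilter`, `NoiseEstimator`, `UltrasoundAdapter`), layer sizes, and the residual wiring are assumptions — see the linked repository for the actual code.

```python
# Hypothetical sketch (not the authors' code): frozen backbone + lightweight
# adapter with frequency filtering and noise-aware channel recalibration.
import torch
import torch.nn as nn

class FrequencyFilter(nn.Module):
    """Learnable soft mask over the 2D FFT of feature maps (assumed design)."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # rfft2 keeps only width // 2 + 1 frequency bins along the last axis
        self.mask = nn.Parameter(torch.zeros(channels, height, width // 2 + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")
        freq = freq * torch.sigmoid(self.mask)  # attenuate selected frequencies
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")

class NoiseEstimator(nn.Module):
    """Predicts per-channel calibration factors from pooled statistics."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.net(x)  # dynamic per-channel recalibration

class UltrasoundAdapter(nn.Module):
    """Residual adapter: frequency filtering followed by noise calibration."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.freq = FrequencyFilter(channels, height, width)
        self.noise = NoiseEstimator(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the pre-trained semantic priors
        return x + self.noise(self.freq(x))

# Usage: freeze the backbone, train only the adapter (parameter-efficient).
backbone = nn.Conv2d(3, 64, 3, padding=1)  # stand-in for a CLIP visual encoder
for p in backbone.parameters():
    p.requires_grad = False
adapter = UltrasoundAdapter(channels=64, height=224, width=224)
feats = adapter(backbone(torch.randn(1, 3, 224, 224)))  # (1, 64, 224, 224)
```

The residual form (`x + adapter(x)`) reflects the abstract's core claim: the frozen backbone's semantic priors pass through unchanged, while the adapter only adds an ultrasound-specific correction.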