Md Zarif Hossain, Ahmed Imteaj

Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Md Zarif Hossain, Ahmed Imteaj / April 8, 2026

arXiv:2407.14971v3 Announce Type: replace
Abstract: Vision-Language Models (VLMs) rely heavily on pretrained vision encoders to support downstream tasks such as image captioning, visual question answering, and zero-shot classification. Despite their s…
