MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization
arXiv:2508.07833v3 Announce Type: replace
Abstract: Vision-Language Models (VLMs) encode multimodal inputs through large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model I…