TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
arXiv:2605.11572v2 Announce Type: replace
Abstract: Audio-visual understanding requires effective alignment between heterogeneous modalities, yet cross-modal correspondence remains challenging when temporally aligned audio and visual signals lack clea…