StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval
arXiv:2601.20597v2 Announce Type: replace
Abstract: Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text-video al…