Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
arXiv:2601.21078v3
Abstract: Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visua…