One Identity, Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition
arXiv:2604.23173v1
Abstract: Video Situation Recognition (VidSitu) addresses the challenging problem of “who did what to whom, with what, how, and where” in a video. It tests thorough video understanding by requiring identification …