cs.CV

HumanOmni-Speaker: Identifying Who said What and When

arXiv:2603.21664v2 Announce Type: replace
Abstract: While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversati…