Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
arXiv:2512.02650v2 Announce Type: replace
Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially cruci…