Junwon Lee, Juhan Nam, Jiyoung Lee

Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

Junwon Lee, Juhan Nam, Jiyoung Lee / March 30, 2026

arXiv:2512.02650v2
Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially cruci…
