cs.CV

Compositional Video Generation via Inference-Time Guidance

arXiv:2605.14988v1 Announce Type: new
Abstract: Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion…