cs.CV

Grounding Video Reasoning in Physical Signals

arXiv:2604.21873v1 Announce Type: new
Abstract: Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the…