What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction
arXiv:2407.08101v5 Announce Type: replace
Abstract: Vision-language models have shown impressive progress in recent years. However, existing models are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the…