cs.CV

Scaling Video Pretraining for Surgical Foundation Models

arXiv:2603.29966v2 Announce Type: replace
Abstract: Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent…

cs.AI, cs.LG, cs.SD

Woosh: A Sound Effects Foundation Model

arXiv:2604.01929v1 Announce Type: cross
Abstract: The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI’s publicly relea…

cs.CV

PLUME: Latent Reasoning Based Universal Multimodal Embedding

arXiv:2604.02073v1 Announce Type: new
Abstract: Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales be…
