Universal Skeleton Understanding via Differentiable Rendering and MLLMs
arXiv:2603.18003v3 Announce Type: replace
Abstract: Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human…