Research Blog

GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation

Vision-language models (VLMs) use images and text to plan robot actions, but they still struggle to decide which actions to take and where to perform them. Most systems split these decisions into two steps: a VLM generates a plan in natural language, and a separate model translates it into executable actions. This approach often breaks […]
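For readers unfamiliar with this decomposition, here is a minimal Python sketch of the generic plan-then-ground pipeline the excerpt describes. Everything in it (the `vlm_plan` and `ground_step` functions, the `Action` type, the hard-coded coordinates) is a hypothetical stand-in for illustration, not GroundedPlanBench's or any real system's interface.

```python
# Minimal sketch of a two-stage "plan, then ground" robot pipeline:
# stage 1 produces text steps, stage 2 resolves *where* to act.
# All names and values here are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class Action:
    skill: str         # e.g. "pick", "place"
    target_xy: tuple   # coordinates chosen by the grounding model


def vlm_plan(image, instruction):
    """Stage 1 (stand-in): a VLM turns scene + instruction into text steps."""
    return ["pick up the red block", "place it on the blue plate"]


def ground_step(image, step):
    """Stage 2 (stand-in): a separate model localizes the referenced object."""
    # A real grounder would detect objects in the image; fixed coordinates
    # are used here purely for illustration.
    lookup = {"red block": (120, 88), "blue plate": (240, 150)}
    for name, xy in lookup.items():
        if name in step:
            skill = "pick" if step.startswith("pick") else "place"
            return Action(skill=skill, target_xy=xy)
    raise ValueError(f"could not ground step: {step!r}")


def plan_and_ground(image, instruction):
    return [ground_step(image, s) for s in vlm_plan(image, instruction)]


if __name__ == "__main__":
    for a in plan_and_ground(image=None, instruction="put the red block on the plate"):
        print(a)
```

The failure mode the post alludes to lives at the seam between the two stages: if the text plan references an object the grounder cannot resolve, the pipeline breaks even when each stage is individually competent.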

Source: Microsoft Research.

cs.AI, cs.CV

Language Models Can Explain Visual Features via Steering

arXiv:2603.22593v2 Announce Type: replace-cross
Abstract: Sparse Autoencoders uncover thousands of features in vision models, yet explaining these features without requiring human intervention remains an open challenge. While previous work has proposed…
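As a rough illustration of the steering idea the title points to, the numpy sketch below adds a scaled copy of one sparse autoencoder (SAE) feature's decoder direction to a vision-model activation, so the feature's effect could then be observed and described. The dimensions, random weights, and `steer` helper are assumptions made for this sketch; the abstract is truncated, and this is not the paper's method.

```python
# Toy sketch of activation steering along an SAE feature direction.
# Shapes and weights are illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 512

W_dec = rng.standard_normal((n_features, d_model))     # stand-in SAE decoder
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit feature directions


def steer(activation, feature_idx, alpha=4.0):
    """Push an activation along one SAE feature's decoder direction."""
    return activation + alpha * W_dec[feature_idx]


act = rng.standard_normal(d_model)       # stand-in vision-model activation
steered = steer(act, feature_idx=42)
print("shift norm:", np.linalg.norm(steered - act))  # equals alpha (4.0)
```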

cs.AI

Relationship-Aware Safety Unlearning for Multimodal LLMs

arXiv:2603.14185v3 Announce Type: replace
Abstract: Generative multimodal models can exhibit safety failures that are inherently relational: two benign concepts can become unsafe when linked by a specific action or relation (e.g., child-drinking-wine)…
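To make the relational failure mode concrete, here is a toy Python check in which each concept passes in isolation but a specific (subject, relation, object) triple is flagged, mirroring the abstract's child-drinking-wine example. The concept set, triple list, and `is_unsafe` helper are invented for illustration and are not the paper's unlearning technique.

```python
# Toy illustration of a *relational* safety failure: "child" and "wine"
# are each benign, but the linked triple is unsafe. Invented for
# illustration; not the paper's method.
BENIGN = {"child", "wine", "adult", "juice"}
UNSAFE_TRIPLES = {("child", "drinking", "wine")}


def is_unsafe(subject, relation, obj):
    # Each concept passes a concept-level filter on its own...
    assert subject in BENIGN and obj in BENIGN
    # ...yet a specific relation linking them can still be unsafe.
    return (subject, relation, obj) in UNSAFE_TRIPLES


print(is_unsafe("child", "drinking", "juice"))  # False: benign combination
print(is_unsafe("adult", "drinking", "wine"))   # False: benign combination
print(is_unsafe("child", "drinking", "wine"))   # True: relational failure
```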
