Research Blog

GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation

Vision-language models (VLMs) use images and text to plan robot actions, but they still struggle to decide both what actions to take and where to take them. Most systems split these decisions into two steps: a VLM generates a plan in natural language, and a separate model translates that plan into executable actions. This approach often breaks […]

The post GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation appeared first on Microsoft Research.

cs.AI

Relationship-Aware Safety Unlearning for Multimodal LLMs

arXiv:2603.14185v3 Announce Type: replace
Abstract: Generative multimodal models can exhibit safety failures that are inherently relational: two benign concepts can become unsafe when linked by a specific action or relation (e.g., child-drinking-wine)…
