Research Blog

ADeLe: Predicting and explaining AI performance across tasks

Lexin Zhou, Xing Xie / April 1, 2026

AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into the underlying capabilities that drive that performance. They do not explain failures or reliably predict outcomes on new tasks. To address this, Microsoft researchers, in collaboration with Princeton University and Universitat Politècnica de València, introduce ADeLe (AI […]

The post ADeLe: Predicting and explaining AI performance across tasks appeared first on Microsoft Research.

Research Blog

AsgardBench: A benchmark for visually grounded interactive planning

Andrea Tupini, Lars Liden, Reuben Tan, Yu Wang, Jianfeng Gao / March 26, 2026

Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected, for example, when the mug it was tasked to wash is already clean, or the sink is full of other items. This is the domain of embodied AI: systems […]

GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation

March 26, 2026

Vision-language models (VLMs) use images and text to plan robot actions, but they still struggle to decide what actions to take and where to take them. Most systems split these decisions into two steps: a VLM generates a plan in natural language, and a separate model translates it into executable actions. This approach often breaks […]

Systematic debugging for AI agents: Introducing the AgentRx framework

Shraddha Barke, Arnav Goyal, Alind Khare, Chetan Bansal / March 12, 2026

As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency. When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or […]

PlugMem: Transforming raw agent interactions into reusable knowledge

Ke Yang, Michel Galley, Chenglong Wang, Jianfeng Gao, Jiawei Han, ChengXiang Zhai / March 10, 2026

It seems counterintuitive: giving AI agents more memory can make them less effective. As interaction logs accumulate, they grow large, fill with irrelevant content, and become increasingly difficult to use. More memory means that agents must search through larger volumes of past interactions to find information relevant to the current task. Without structure, these records mix […]

Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model

March 4, 2026

We are pleased to announce Phi-4-reasoning-vision-15B, a 15-billion-parameter open-weight multimodal reasoning model, available through Microsoft Foundry, HuggingFace, and GitHub. Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as image captioning, asking […]

CORPGEN advances AI agents for real work

February 26, 2026

By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one […]

Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions

Eric Horvitz, Andrew Jenks, Jessica Young / February 19, 2026

As synthetic media proliferates, verifying what is real and where content originated matters more than ever. Our latest report explores media integrity and authentication methods, their limits, and practical paths toward trustworthy provenance across images, audio, and video.
