This week in deep learning, we bring you MolmoAct 2: An open foundation for robots, How to Work and Compound with AI, and a paper on ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration.
You may also enjoy GPT-5.5 Instant, AI Chips: why they cost as much as a car, and why companies can’t get enough, a paper on OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
MolmoAct 2: An open foundation for robots that work in the real world
Ai2 releases MolmoAct 2, a fully open robotics foundation model running up to 37x faster than its predecessor, alongside the largest open-source bimanual manipulation dataset.
GPT-5.5 Instant: smarter, clearer, and more personalized
OpenAI releases GPT-5.5 Instant as ChatGPT’s new default model, cutting hallucinations by 52.5% on high-stakes prompts and using 30% fewer words while pulling context from past chats, files, and more.
New Compute Partnership with Anthropic
SpaceXAI signs a deal with rival Anthropic for full access to Colossus 1, with Anthropic also expressing interest in jointly developing multi-gigawatt orbital compute.
Blitzy raises $200M at $1.4B valuation to deploy thousands of coding agents in parallel
Blitzy raises $200M at $1.4B valuation to scale its enterprise platform that orchestrates thousands of parallel coding agents across 100M+ line legacy codebases, scoring 66.5% on SWE-Bench Pro.
Monday.com relaunches as an AI work platform with native agents
Monday.com relaunches as an “AI work platform” with native agents that draft campaigns, qualify leads, and triage tickets across its 250,000+ customers, plus one-click connectors to Claude, ChatGPT, Copilot, and Gemini.
MLOps/LLMOps/AgentOps
Introducing the Opik Agent Playground
Comet launches Opik Agent Playground, a UI-based environment for testing and tweaking full agent configurations (prompts, models, tools) without touching code, opening iteration to PMs and domain experts.
Learning
AI Chips: why they cost as much as a car, and why companies can’t get enough
A primer on AI chip economics and supply: the entire frontier flows through TSMC and a handful of designers, with total compute capacity doubling every 7 months while performance-per-dollar doubles every 2.5 years.
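For a quick sanity check of what those two doubling times imply per year (our arithmetic, assuming smooth exponential growth; not a figure from the article):

```latex
% Annual growth factor implied by a doubling time T (in years): 2^{1/T}
\text{compute capacity:}\quad 2^{12/7} \approx 3.28\times \text{ per year} \qquad (T = 7\ \text{months})
\text{perf-per-dollar:}\quad 2^{1/2.5} \approx 1.32\times \text{ per year} \qquad (T = 2.5\ \text{years})
```

In other words, if the article's numbers hold, capacity is growing roughly 2.5x faster per year than efficiency, which is the gap buyers pay for.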
How Agents Manage Other Agents: Four Subagents Patterns in 2026
A practical blog post about four subagent orchestration patterns—inline tool, fan-out, agent pool, and teams—each requiring progressively more capable models and offering different tradeoffs in control, lifetime, and result collection.
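To make one of these patterns concrete, here is a minimal sketch of fan-out, where a parent agent launches subagents on independent subtasks and merges their results. Everything here (run_subagent, the async framing) is a hypothetical illustration, not code from the post.

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    """Hypothetical stand-in for launching a subagent on one subtask.

    A real implementation would call an LLM with its own context
    window, tools, and system prompt, then return a summary.
    """
    await asyncio.sleep(0)  # placeholder for the actual model call
    return f"result for: {subtask}"

async def fan_out(task: str, subtasks: list[str]) -> str:
    # Spawn one subagent per subtask; they run concurrently and
    # cannot see each other's context (the key property of fan-out).
    results = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    # The parent is the single point of result collection: it merges
    # subagent outputs back into its own context.
    merged = "\n".join(results)
    return f"task: {task}\n{merged}"

if __name__ == "__main__":
    print(asyncio.run(fan_out(
        "survey agent frameworks",
        ["search docs", "scan GitHub issues", "read changelogs"],
    )))
```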
Don’t Outsource Your Understanding
An essay arguing that the real AI risk isn’t using AI but “cognitive surrender”: outsourcing not just the work but the verification, as evidenced by 1,300+ hallucinated court filings and 50 ICLR papers with fake citations.
Granite 4.1 LLMs: How They’re Built
A technical deep-dive on how Granite 4.1 was built: 15T-token, five-phase pretraining with long-context extension to 512K, SFT on 4.1M curated samples, and on-policy GRPO with the DAPO loss.
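For readers unfamiliar with the GRPO part of that recipe, the core computation is a group-relative advantage: each sampled completion's reward is normalized against the other completions drawn for the same prompt, with no value network. Below is a minimal sketch of that standard formulation (not IBM's actual training code).

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standard GRPO advantage estimate.

    rewards: shape (num_prompts, group_size), one scalar reward per
    sampled completion. Each completion is scored relative to the
    other completions in its group, so no learned critic is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
r = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                  [0.2, 0.9, 0.4, 0.5]])
print(group_relative_advantages(r))
```

DAPO then changes how the PPO-style clipped objective uses these advantages (for example, decoupled clipping ranges and dynamic sampling); that part is omitted here.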
How to Work and Compound with AI
A practitioner’s playbook for compounding with AI: treat context as infra, encode taste as config (CLAUDE.md, skills), make verification cheap, delegate bigger chunks in parallel, and mine transcripts to close the loop.
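As one concrete reading of "mine transcripts to close the loop", here is a hypothetical sketch that scans session logs for user corrections; the JSONL schema and the marker phrases are our assumptions, not the author's.

```python
import json
from collections import Counter
from pathlib import Path

# Phrases that often signal a human correcting the agent; purely
# illustrative, tune these to your own transcripts.
CORRECTION_MARKERS = ("no,", "that's wrong", "actually", "undo", "revert")

def mine_corrections(transcript_dir: str) -> Counter:
    """Count which markers appear in user turns, as a cheap proxy for
    where the agent's defaults disagree with your taste."""
    counts: Counter = Counter()
    for path in Path(transcript_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            turn = json.loads(line)  # assumed schema: {"role": ..., "text": ...}
            if turn.get("role") != "user":
                continue
            text = turn.get("text", "").lower()
            for marker in CORRECTION_MARKERS:
                if marker in text:
                    counts[marker] += 1
    return counts

if __name__ == "__main__":
    print(mine_corrections("./transcripts"))
```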
Libraries & Code
Opik is an open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
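For a sense of the tracing workflow, here is a minimal sketch using the SDK's @track decorator; the functions themselves are placeholders, and exact setup may differ by version, so check the Opik docs.

```python
from opik import track  # pip install opik

@track  # each call becomes a trace in the Opik dashboard
def retrieve(query: str) -> list[str]:
    # stand-in for your retriever
    return ["doc about " + query]

@track  # nested tracked calls appear as spans under the parent trace
def answer(query: str) -> str:
    docs = retrieve(query)
    # stand-in for your LLM call
    return f"answer based on {len(docs)} docs"

print(answer("vector databases"))
```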
Claude Context is an MCP plugin that adds semantic code search to Claude Code and other AI coding agents, giving them deep context from your entire codebase.
Papers & Publications
ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
Abstract:
This report describes ARIS (Auto-Research-in-sleep), an open-source research harness for autonomous research, including its architecture, assurance mechanisms, and early deployment experience. The performance of agent systems built on LLMs depends on both the model weights and the harness around them, which governs what information to store, retrieve, and present to the model. For long-horizon research workflows, the central failure mode is not a visible breakdown but a plausible unsupported success: a long-running agent can produce claims whose evidential support is incomplete, misreported, or silently inherited from the executor’s framing. Therefore, we present ARIS as a research harness that coordinates machine-learning research workflows through cross-model adversarial collaboration as a default configuration: an executor model drives forward progress while a reviewer from a different model family is recommended to critique intermediate artifacts and request revisions. ARIS has three architectural layers. The execution layer provides more than 65 reusable Markdown-defined skills, model integrations via MCP, a persistent research wiki for iterative reuse of prior findings, and deterministic figure generation. The orchestration layer coordinates five end-to-end workflows with adjustable effort settings and configurable routing to reviewer models. The assurance layer includes a three-stage process for checking whether experimental claims are supported by evidence: integrity verification, result-to-claim mapping, and claim auditing that cross-checks manuscript statements against the claim ledger and raw evidence, as well as a five-pass scientific-editing pipeline, mathematical-proof checks, and visual inspection of the rendered PDF. A prototype self-improvement loop records research traces and proposes harness improvements that are adopted only after reviewer approval.
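The executor/reviewer split at the heart of ARIS is easy to picture as a loop: one model family proposes, another critiques, and work proceeds only on approval. The sketch below is our hypothetical rendering of that protocol, not code from the ARIS harness.

```python
from dataclasses import dataclass

@dataclass
class Review:
    approved: bool
    critique: str

def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for an LLM call via your provider of choice."""
    return "APPROVE: claims match the cited evidence"  # replace with a real API call

def executor_step(artifact: str, critique: str) -> str:
    # Executor (model family A) revises the artifact given the critique.
    return call_model("executor-model", f"Revise:\n{artifact}\nCritique:\n{critique}")

def reviewer_step(artifact: str) -> Review:
    # Reviewer from a *different* model family critiques the artifact;
    # the premise is that it won't share the executor's blind spots.
    verdict = call_model("reviewer-model", f"Critique:\n{artifact}")
    return Review(approved=verdict.strip().lower().startswith("approve"),
                  critique=verdict)

def adversarial_loop(artifact: str, max_rounds: int = 3) -> str:
    critique = ""
    for _ in range(max_rounds):
        artifact = executor_step(artifact, critique)
        review = reviewer_step(artifact)
        if review.approved:
            break
        critique = review.critique
    return artifact

print(adversarial_loop("draft experiment report"))
```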
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
Abstract:
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). In this report, we show that when fueled with informative and high-difficulty trajectories, a simple SFT approach can be surprisingly powerful for training frontier search agents. By introducing three simple data synthesis modifications (scaling knowledge graph size for richer exploration, expanding the tool set for broader functionality, and strict low-step filtering), we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker-v2 achieves state-of-the-art performance across 4 benchmarks (among 30B-sized agents with the ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity’s Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch, which was trained with a heavy CPT+SFT+RL pipeline and achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker-v2 is the first state-of-the-art search agent at its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open-source the OpenSeeker-v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.
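Of the three data modifications, strict low-step filtering is the most self-contained: under our reading, trajectories the agent solves in only a few tool calls are dropped so the training data stays high-difficulty. A hypothetical sketch (schema and threshold are our assumptions, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    question: str
    steps: int      # number of tool calls (search, browse, ...) taken
    solved: bool

def low_step_filter(trajs: list[Trajectory], min_steps: int = 5) -> list[Trajectory]:
    """Keep only solved trajectories that needed at least `min_steps`
    tool calls; short ones are too easy to teach deep-search behavior.
    The threshold is illustrative, not the paper's."""
    return [t for t in trajs if t.solved and t.steps >= min_steps]

data = [Trajectory("q1", steps=2, solved=True),
        Trajectory("q2", steps=9, solved=True),
        Trajectory("q3", steps=12, solved=False)]
print(low_step_filter(data))  # only q2 survives
```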