Deep Learning Weekly: Issue 455

This week in deep learning, we bring you Interaction Models: A Scalable Approach to Human-AI Collaboration, Hidden Technical Debt of AI Systems: Agent Harness, and a paper on δ-mem: Efficient Online Memory for Large Language Models.

You may also enjoy Introducing Perceptron Mk1, Teaching Claude why, a paper on ProgramBench: Can Language Models Rebuild Programs From Scratch?, and more!

As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.

Until next week!


Industry

Introducing Perceptron Mk1

Perceptron AI launches Mk1, a video and embodied-reasoning vision-language model priced roughly 80–90% below Claude Sonnet 4.5, GPT-5, and Gemini 3.1 Pro.

Notion just turned its workspace into a hub for AI agents

Notion launches its Developer Platform, turning its workspace into an agent orchestration hub with custom code Workers, external database sync, and native integrations for Claude Code, Cursor, Codex, and Decagon.

Interaction Models: A Scalable Approach to Human-AI Collaboration

Thinking Machines unveils TML-Interaction-Small, a 276B MoE (12B active) interaction model trained from scratch with 200ms time-aligned micro-turns that natively handles concurrent audio, video, and text without VAD-style harnesses.

Unsloth Joins the PyTorch Ecosystem

Unsloth joins the PyTorch Ecosystem Landscape, recognizing its open-source contributions including 2× faster training with 70% less VRAM, FP8 RL for consumer GPUs, and 250M+ model downloads.

MLOps/LLMOps/AgentOps

Hidden Technical Debt of AI Systems: Agent Harness

A Hanchung Lee essay reframing Sculley’s 2015 ML technical debt diagram for the agent era, arguing the agent runtime (harness + state) — not the model — is where most spend, incidents, and architectural debt are now accumulating.

Building Blocks for Foundation Model Training and Inference on AWS

A reference guide from Amazon mapping AWS’s four-layer infrastructure stack to foundation model pre-training, post-training, and inference workloads.

How to Eliminate Pipeline Friction in AI Model Serving

A practical NVIDIA guide laying out 18 best practices to eliminate AI model-serving friction across export issues, unsupported ops, dynamic input shapes, and version mismatches.

Learning

Teaching Claude why

An Anthropic post detailing how teaching Claude why actions are aligned — via constitutional documents and ethical reasoning, not just demonstrations — drove blackmail rates from 96% (Opus 4) to 0% on every Claude model since Haiku 4.5.

Accelerating Gemma 4: faster inference with multi-token prediction drafters

Google releases Multi-Token Prediction (MTP) drafters for Gemma 4 models, delivering up to 3x faster inference via speculative decoding with zero quality degradation.
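For readers new to speculative decoding, the core draft-then-verify loop can be sketched in a few lines. This is a toy greedy-match version with stand-in model functions, not Gemma's actual MTP drafter; `target_next`, `drafter_next`, and the token rule are illustrative.

```python
def target_next(seq):
    # Stand-in "target model": greedy next token = sum(seq) mod 5.
    return sum(seq) % 5

def drafter_next(seq):
    # Stand-in cheap "drafter": mostly agrees with the target,
    # but makes a mistake whenever the last token is 4.
    return (sum(seq) + 1) % 5 if seq[-1] == 4 else sum(seq) % 5

def speculative_decode(seq, steps, k=4):
    """Greedy-match speculative decoding: the drafter proposes up to k
    tokens, the target verifies them in one pass. Output is identical
    to pure greedy decoding with the target, but with fewer target calls."""
    seq, calls = list(seq), 0
    while steps > 0:
        # 1. Drafter proposes up to k tokens cheaply.
        draft, s = [], list(seq)
        for _ in range(min(k, steps)):
            t = drafter_next(s)
            draft.append(t)
            s.append(t)
        # 2. Target verifies the draft (counted as one expensive call).
        calls += 1
        accepted = []
        for t in draft:
            if target_next(seq + accepted) == t:
                accepted.append(t)
            else:
                # First mismatch: keep the target's own token instead.
                accepted.append(target_next(seq + accepted))
                break
        seq += accepted
        steps -= len(accepted)
    return seq, calls
```

The speedup comes from step 2: one target pass can accept several drafted tokens at once, which is exactly where a multi-token-prediction drafter helps.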

Vibe coding and agentic engineering are getting closer than I’d like

A post by Simon Willison observing that vibe coding and agentic engineering are converging in his own workflow as he increasingly ships production code from Claude Code without reviewing every line.

How fast is autonomous AI cyber capability advancing?

UK AISI reports the length of cyber tasks frontier models can autonomously complete is doubling every 4.7 months — accelerating from 8 months last November — with Claude Mythos Preview and GPT-5.5 exceeding even that trend.

Reimagining the mouse pointer for the AI era

A design-principles post from Google DeepMind reframing the mouse pointer as a Gemini-powered context-aware partner, built on four principles: maintain the flow, show and tell, embrace “this/that” deixis, and turn pixels into actionable entities.

Full Text Search: Architecture and Design

A technical architecture post from Pinecone introducing full-text search built on Tantivy, delivering Lucene query syntax, BM25 scoring, 18-language tokenization, and 22.7ms p50 latency on 6.4M Wikipedia articles.
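BM25, the scoring function mentioned above, is compact enough to sketch directly. This is the textbook Okapi BM25 formulation with common parameter defaults, not Pinecone's or Tantivy's implementation; the function name and corpus are illustrative.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each doc (a list of tokens) against query tokens with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    # Document frequency: in how many docs each term appears.
    df = Counter()
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this doc
        s = 0.0
        for term in query:
            if df[term] == 0:
                continue  # term absent from the corpus contributes nothing
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            # Saturating tf weight, normalized by relative doc length.
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

`k1` controls how quickly repeated terms saturate and `b` how strongly long documents are penalized; production engines tune both.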

Libraries & Code

comet-ml/opik

An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

UKGovernmentBEIS/inspect_ai

Inspect: A framework for large language model evaluations

Papers & Publications

δ-mem: Efficient Online Memory for Large Language Models

Abstract:

Large language models increasingly need to accumulate and reuse historical information in long-term assistants and agent systems. Simply expanding the context window is costly and often fails to ensure effective context utilization. We propose δ-mem, a lightweight memory mechanism that augments a frozen full-attention backbone with a compact online state of associative memory. δ-mem compresses past information into a fixed-size state matrix updated by delta-rule learning, and uses its readout to generate low-rank corrections to the backbone’s attention computation during generation. With only an 8×8 online memory state, δ-mem improves the average score to 1.10× that of the frozen backbone and 1.15× that of the strongest non-δ-mem memory baseline. It achieves larger gains on memory-heavy benchmarks, reaching 1.31× on MemoryAgentBench and 1.20× on LoCoMo, while largely preserving general capabilities. These results show that effective memory can be realized through a compact online state directly coupled with attention computation, without full fine-tuning, backbone replacement, or explicit context extension.
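The delta-rule update the abstract leans on can be sketched as a generic associative memory write and readout. This is the standard delta rule, not the paper's exact formulation; `delta_update`, `read`, the learning rate, and the shapes are all illustrative.

```python
def delta_update(S, k, v, beta=0.5):
    """Delta-rule write: S <- S + beta * (v - S k) k^T.
    S is a d x d state matrix (list of rows); k (key) and v (value)
    are length-d vectors. Only the prediction error is written, so
    repeated writes of the same (k, v) pair converge instead of piling up."""
    d = len(k)
    pred = [sum(S[i][j] * k[j] for j in range(d)) for i in range(d)]
    err = [v[i] - pred[i] for i in range(d)]
    return [[S[i][j] + beta * err[i] * k[j] for j in range(d)]
            for i in range(d)]

def read(S, q):
    """Readout S q — per the abstract, δ-mem uses such a readout to form
    low-rank corrections to the backbone's attention computation."""
    d = len(q)
    return [sum(S[i][j] * q[j] for j in range(d)) for i in range(d)]
```

The appeal is that the state stays fixed-size (the paper reports gains with only an 8×8 state) no matter how much history is compressed into it.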

ProgramBench: Can Language Models Rebuild Programs From Scratch?

Abstract:

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable’s behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
