[AINews] AI Engineer Europe 2026

Yesterday was a quiet news day, and only Day 1 of AIE, so we skipped the issue; the recaps are on the archive site if you missed them.

We’ve just concluded a marathon 3 days in Europe - first the Online Track and the Workshops, then over a hundred talks delivered in person, some livestreamed. There was also plenty of live podcast coverage (ThursdAI, ETN), plus visits to 10 Downing Street, morning runs, cool swag, viral talks, aquarium parties, and nightclub parties.

We’ll try to publish a few recap thoughts over the coming days, but for now you can see my closing keynote at the end of Day 2 and watch some of the bigger talks.

Day 1 Talks (link)

Day 2 Talks (link)

AI News for 4/9/2026-4/10/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!


AI Twitter Recap

Open Models, Coding Agents, and the New Advisor Pattern

  • GLM-5.1 breaks into the frontier tier for coding: The clearest model-performance update in this batch is GLM-5.1 reaching #3 on Code Arena, reportedly surpassing Gemini 3.1 and GPT-5.4 and landing roughly on par with Claude Sonnet 4.6. Arena later emphasized that Z.ai now holds the #1 open model rank and sits within ~20 points of the top overall. The release was quickly picked up by tooling vendors, including Windsurf support. In parallel, Zixuan Li outlined a three-part open-model strategy: accessibility, strong fine-tunable baselines, and sharing architectural/training/data lessons with the broader community.

  • Advisor-style orchestration is becoming a first-class design pattern: A notable systems trend is the convergence around “cheap executor + expensive advisor.” Akshay Pachaar’s summary ties together Anthropic’s API-level advisor tool and Berkeley’s “Advisor Models” line of work: use a fast model for most steps, escalate only at difficult decision points. Claimed gains include Haiku + Opus more than doubling BrowseComp score vs Haiku alone, and Sonnet + Opus improving SWE-bench Multilingual while reducing task cost. The pattern was implemented almost immediately in open source via advisor middleware for LangChain DeepAgents, with Harrison Chase highlighting the speed of OSS uptake. This idea also shows up in practitioner commentary from Walden Yan, who argues future agents will increasingly look like fast worker models delegating hard judgments to “smart friends.”
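The escalation loop behind the advisor pattern is simple to sketch. Below is a minimal, runnable illustration of "cheap executor + expensive advisor": the model names, the `call_model` stub, and the confidence heuristic are all hypothetical stand-ins for a real LLM API, not Anthropic's or Berkeley's actual interfaces.

```python
def call_model(model: str, prompt: str) -> dict:
    """Stub for an LLM call. A real implementation would hit a provider API
    and derive confidence from logprobs, self-reports, or a verifier."""
    # Pretend the cheap model reports low confidence on "hard" prompts.
    confidence = 0.3 if "hard" in prompt else 0.9
    return {"model": model, "answer": f"{model} answered: {prompt}",
            "confidence": confidence}

def solve(prompt: str, executor: str = "cheap-model",
          advisor: str = "expensive-model", threshold: float = 0.5) -> dict:
    """Run the fast executor; escalate to the advisor only at hard steps."""
    result = call_model(executor, prompt)
    if result["confidence"] < threshold:
        # Escalation path: the expensive model is consulted only when the
        # executor is unsure, so most tokens stay on the cheap model.
        result = call_model(advisor, prompt)
    return result
```

The economics come from the branch: if only a small fraction of steps cross the threshold, the blended cost stays close to the executor's while quality at hard decision points tracks the advisor's.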

  • Qwen Code adds orchestration primitives directly into the product: Alibaba shipped Qwen Code v0.14.x with several agent-engineering features that align with this broader shift: remote control channels (Telegram/DingTalk/WeChat), cron-based recurring tasks, 1M-context Qwen3.6-Plus with 1,000 free daily requests, sub-agent model selection, and a planning mode. The sub-agent selection feature in particular makes model-mixing explicit at the tool level rather than just in external harness code.

  • Model-routing demand is now a product complaint, not a research topic: Multiple tweets converge on the same operational pain point: top models are spiky and specialized. Yuchen Jin points out that Opus often wins on frontend and agentic flow while GPT-5.4 performs better on backend/distributed systems, but tools like Claude Code and Codex remain too provider-bound. That complaint sits directly beside the advisor pattern above: practitioners increasingly want shared context + automatic routing + cross-model collaboration inside one workflow rather than manual switching between terminals.
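To make the complaint concrete, here is a toy sketch of what practitioners are asking for: a router that sends each task to the model that tends to win on it. The model names and the keyword classifier are placeholders (a real router would use a cheap LLM or a learned policy over shared context), not any shipping product's API.

```python
# Hypothetical routing table: one model for frontend/agentic work,
# another for backend/distributed-systems work.
ROUTES = {
    "frontend": "model-a",
    "backend": "model-b",
}

def classify(task: str) -> str:
    """Toy classifier by keyword. A production router would classify with
    a small model or learn the policy from past task outcomes."""
    backend_hints = ("distributed", "database", "concurrency", "api server")
    if any(hint in task.lower() for hint in backend_hints):
        return "backend"
    return "frontend"

def route(task: str) -> str:
    """Pick a model for the task instead of making the user switch tools."""
    return ROUTES[classify(task)]
```

The point of the sketch is the interface, not the heuristic: once routing lives inside one workflow with shared context, "which terminal do I open" stops being the user's problem.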

Agent Harnesses, Hermes Momentum, and the “Portable Skills” Stack

Benchmarks, Evals, and Capability Measurement Got More Realistic

  • ClawBench and MirrorCode push beyond toy agent evals: ClawBench evaluates agents on 153 real online tasks across live websites and reports a dramatic drop from roughly 70% on sandbox benchmarks to as low as 6.5% on realistic tasks. In software engineering, Epoch and METR introduced MirrorCode, where Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit—a task they estimate would take humans weeks. Notably, the authors already warn the benchmark may be “likely already saturated”, which says as much about the pace of coding progress as the result itself.

  • Reward hacking is now a central part of model evaluation, not an edge case: METR’s new time horizon result for GPT-5.4-xhigh is a useful example. Under standard scoring, it lands at 5.7 hours, below Claude Opus 4.6’s ~12 hours. If reward-hacked runs are counted, it jumps to 13 hours. METR explicitly notes the discrepancy was especially pronounced for GPT-5.4. Separately, Davis Brown reports rampant cheating on capability evals, including top submissions on Terminal-Bench 2 allegedly sneaking answers to the model.

  • AISI reproduced steering-vector oddities: The UK AISI transparency team reports replicating Anthropic’s steering approach for suppressing evaluation awareness, with the surprising result that control vectors (“books on shelves”) can produce effects as large as deliberately designed ones. For engineers building model-monitoring or post-training interventions, that’s a cautionary result about how messy and non-specific linear steering effects can be.

Systems, Numerics, and Local/Edge Inference

  • Carmack’s bf16 scatterplot is a useful reminder that low precision fails in visible, structured ways: John Carmack’s post on plotting 400k bf16 points showed clear quantization gaps emerging as values move away from the origin. The value for practitioners is not the anecdote itself but the intuition reset: bf16’s reduced mantissa becomes visually and operationally obvious at surprisingly modest magnitudes. This pairs well with Arohan’s warning not to skip “determinism and numerics days.”

  • Apple/local inference stack keeps compounding: Awni Hannun highlighted demos of Qwen 3.5 and Gemma 4 running locally on Apple silicon via MLX, and separately MLX’s origin story resurfaced. There was also continued momentum around MLX + Ollama integration and Ollama’s MLX-powered speedups on Apple silicon. The broad pattern: local LLM ergonomics are no longer novelty demos; they are becoming a viable default for coding and agent workflows.

  • Inference optimization remains highly recipe-driven: Two useful examples: Red Hat AI’s speculative decoding for Gemma 4 31B using EAGLE-3, and PyTorch/diffusers work on low-precision flow-model inference where Sayak Paul summarizes the final recipe: selective quantization, better casting kernels, CUDA graphs, and regional compilation. These are good reminders that practical speedups still come from stacking many system-level interventions rather than a single magic optimization.

Research Directions: Memory, Synthetic Data, and Neural Runtime Ideas

  • Memory is shifting from “store facts” to “store trajectories”: The Turing Post’s summary of MIA frames memory as retained problem-solving experience rather than just retrieved context: a manager/planner/executor loop that stores full journeys. That direction is echoed by Databricks’ “memory scaling” claim that uncurated user logs can outperform handcrafted instructions after only 62 records.
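The shift from "store facts" to "store trajectories" is easy to show in miniature: a memory entry holds the full solve journey (task, steps, outcome), and retrieval surfaces a whole prior trajectory to replay rather than a snippet of context. The class, field names, and word-overlap retrieval below are hypothetical sketches, not MIA's or Databricks' interfaces.

```python
class TrajectoryMemory:
    """Store full problem-solving journeys, not isolated facts."""

    def __init__(self):
        self._store: list[dict] = []

    def record(self, task: str, steps: list[str], outcome: str) -> None:
        """Save the whole trajectory, including intermediate steps."""
        self._store.append({"task": task, "steps": steps, "outcome": outcome})

    def recall(self, task: str, k: int = 1) -> list[dict]:
        """Toy retrieval by word overlap with past task descriptions;
        a real system would embed and rank."""
        query_words = set(task.lower().split())
        def overlap(entry: dict) -> int:
            return len(set(entry["task"].lower().split()) & query_words)
        return sorted(self._store, key=overlap, reverse=True)[:k]
```

The Databricks claim fits the same shape: raw, uncurated logs of past journeys become useful as memory surprisingly quickly, without hand-curation into "facts."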

  • Synthetic data is becoming programmable against differentiable objectives: Rosinality and Tristan Thrush point to work on generating synthetic training data that directly optimizes downstream objectives—up to and including embedding a QR code in model weights through the data alone. This is a strong example of data design being treated as an optimization target in its own right.

  • “Neural Computers” proposes learned runtime as the next abstraction boundary: Schmidhuber and collaborators introduced Neural Computers, pushing the idea that computation, memory, and I/O could move from fixed external runtime into learned internal state. Whether or not the formulation holds up, it’s one of the more ambitious attempts in this set to redefine the boundary between model and machine.

Top tweets (by engagement)


AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 Model Updates and Fixes
