[AINews] not much happened today

When we made the AINews → Substack move, we committed to Matt Levine-style op-eds every day, but some days there just isn’t much going on, and we will simply say so. We are working on small essays around inference demand and multi-agent systems, but today is not that day.

Interesting model releases from NVIDIA (Nemotron), Poolside, and Alec Radford, but it’s unclear whether any of them will stand the test of time. GPT-6 hype is beginning.

AI News for 4/27/2026-4/28/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!


AI Twitter Recap

Inference Systems, vLLM 0.20, and the Hardware/Kernel Race Around DeepSeek V4

  • vLLM’s latest release focuses heavily on memory and MoE serving efficiency: vLLM v0.20.0 shipped with a TurboQuant 2-bit KV cache for 4× KV capacity, FA4 re-enabled for MLA prefill on SM90+, a new vLLM IR foundation, fused RMSNorm for a reported 2.1% end-to-end latency improvement, plus support updates spanning DeepSeek V4 MegaMoE on Blackwell, Jetson Thor, ROCm, Intel XPU, and easier GB200/Grace-Blackwell setup. In parallel, SemiAnalysis highlighted early DeepSeek V4 Pro serving results on disaggregated B200/B300/H200/GB200 setups, claiming B300 can be up to 8× faster than H200 for this workload, and pointed to upcoming vLLM 0.20 benchmarking with DeepGEMM MegaMoE, which fuses EP dispatch, EP combine, GEMMs, and SwiGLU into a single mega-kernel.
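
The claimed 4× KV-capacity gain is consistent with moving from an 8-bit to a 2-bit cache. A back-of-envelope sizing sketch (the layer count, KV-head count, and head dim below are illustrative placeholders, not DeepSeek’s actual config):

```python
# Rough KV-cache bytes per token: K and V each store
# num_kv_heads * head_dim values per layer, at `bits` precision.
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bits):
    return 2 * num_layers * num_kv_heads * head_dim * bits // 8

# Hypothetical dims chosen only to make the ratio concrete.
fp8 = kv_cache_bytes_per_token(61, 8, 128, 8)  # 8-bit baseline cache
q2 = kv_cache_bytes_per_token(61, 8, 128, 2)   # 2-bit quantized cache

print(fp8 // q2)  # capacity multiplier: 4
```

The multiplier is just the bit-width ratio (8/2), independent of the made-up model dimensions.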

  • DeepSeek serving tradeoffs: Jeremy Howard noted DeepSeek V4’s support for prefill as a capability many providers have dropped, while Maharshi pointed out the overhead of dynamic activation quantization, arguing that static quantization often wins on inference speed despite its calibration cost. There was also growing interest in alternate-stack portability: teortaxesTex argued that DeepSeek is structurally moving away from CUDA lock-in via TileKernels, suggesting model vendors may increasingly optimize for heterogeneous or domestic accelerator fleets rather than NVIDIA-only deployments.
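
To make the static-vs-dynamic tradeoff concrete, here is a toy int8 symmetric-quantization sketch (function names and the calibration setup are my own illustration, not any vendor’s API): dynamic quantization recomputes the scale from live activations on every call, while static quantization fixes it once from a calibration set.

```python
import numpy as np

def quantize(x, scale):
    # Symmetric int8 quantization at a given scale.
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dynamic_scale(x):
    # Computed per call at inference time -> an extra pass over activations.
    return np.abs(x).max() / 127.0

# Static: scale is fixed ahead of time from a calibration set.
calib = np.random.default_rng(0).normal(size=(1000, 64))
static_scale = np.abs(calib).max() / 127.0

x = np.random.default_rng(1).normal(size=(8, 64))
q_dyn = quantize(x, dynamic_scale(x))  # accurate range, runtime overhead
q_stat = quantize(x, static_scale)     # no runtime stats, may clip outliers
```

The argument in the post maps directly onto this: `dynamic_scale` adds latency on the hot path, while `static_scale` pays once at calibration time and risks clipping activations outside the calibrated range.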

Open Model Releases: Poolside Laguna XS.2, NVIDIA Nemotron 3 Nano Omni, and TRELLIS.2

  • Poolside made its first public model release with an unusually deployment-friendly open-weight coder: @poolsideai announced Laguna XS.2, a 33B-total / 3B-active MoE coding model trained fully in-house, released under Apache 2.0, and advertised as able to run on a single GPU. Poolside’s broader release also included Laguna M.1 and an agent harness, emphasizing that the company trained from scratch on its own data, training infra, RL, and inference stack. Community summaries added more color: Aymeric Roucher described two coder models (225B total / 23B active, and 33B total / 3B active) with hybrid attention, an FP8 KV cache, and claimed performance near Qwen-3.5; Ollama shipped it immediately.
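
The “33B total / 3B active” framing comes from top-k expert routing: each token activates only a few experts, so only a fraction of the expert parameters run per forward pass. A toy router sketch (expert count, k, and dimensions are made up for illustration, not Poolside’s config):

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, k, d = 64, 4, 32                 # only k of num_experts run per token
router_w = rng.normal(size=(d, num_experts))  # learned routing projection

def route(x):
    # Score every expert, keep the k highest-scoring ones.
    logits = x @ router_w
    return np.argsort(logits)[-k:]

x = rng.normal(size=d)
selected = route(x)                  # indices of the k active experts
active_fraction = k / num_experts    # share of expert params used per token
```

With these toy numbers only 1/16 of the expert parameters fire per token, which is the same mechanism that lets a 33B-parameter model do roughly 3B parameters’ worth of compute per token.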

  • NVIDIA’s Nemotron 3 Nano Omni was the day’s biggest infra-native model launch: @NVIDIAAI introduced Nemotron 3 Nano Omni, an open 30B-total / 3B-active multimodal MoE with 256K context, built for agentic workloads spanning text, image, video, audio, and documents. Distribution was immediate across the stack: OpenRouter, LM Studio, Ollama, Unsloth, fal, Fireworks, DeepInfra, Together, Baseten, Canonical, and others all announced same-day availability. Key specs surfaced in follow-on posts: Piotr Żelasko described it as NVIDIA’s first omni release with speech/audio understanding, backed by a Parakeet encoder, English-only for now, with a 5.95% WER on the Open ASR leaderboard. Several hosts cited ~9× throughput versus comparable open omni models.
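
For reference, the WER metric behind that leaderboard figure is the standard one: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal implementation of that definition (the example sentences are my own, not leaderboard data):

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = min edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion over 3 ref words
```

A 5.95% WER thus means roughly 6 word errors per 100 reference words, averaged over the benchmark’s test sets.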

  • Other notable model/paper releases: Microsoft’s TRELLIS.2 is an open-source 4B image-to-3D model producing up to 1536³ PBR-textured assets, built on native 3D VAEs with 16× spatial compression. On the world-model side, World-R1 claims that existing video models already encode 3D structure and can be “woken up” with RL, with no architecture changes, no extra video training data, and no added inference cost.
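
Quick arithmetic on the TRELLIS.2 numbers: if 16× spatial compression means 16× per axis (my reading, not stated explicitly above), a 1536³ output grid corresponds to a 96³ latent grid, i.e. a 4096× reduction in voxel count.

```python
resolution, compression = 1536, 16         # per-axis compression assumed
latent = resolution // compression         # latent grid edge length
voxel_reduction = compression ** 3         # total voxel-count reduction

print(latent, voxel_reduction)  # 96, 4096
```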

Agents, Local-First Tooling, and Production Orchestration

Benchmarks, Evals, and Research Findings Worth Watching

Platform Economics, API Pricing, and Closed-Model Reliability Concerns
