LAI #124: The More You Tell a VLM, the Less It Sees

Plus the US-China distillation accusations, KV cache at scale, and three generations of agent pipelines

Good morning, AI enthusiasts!

Big one this week: we recently did a 2-hour workshop at the AI Engineer Summit in London, and it went so well that the organizers put the full recording on YouTube. So now you all get it for free.

Paul Iusztin, Samridhi from Towards AI, and I walked through building an MCP-powered deep research agent from scratch: planning a research strategy, searching the web, analyzing YouTube videos, gathering grounded evidence, filtering for relevance, and synthesizing everything into a cited research artifact. If you’re an AI engineer looking to build end-to-end agentic systems (or want AI to handle 90% of your writing without sounding like AI), this one’s for you. Watch it here.

We also cover:

  • How a research pipeline evolved across three generations: from hallucinating Claude Code skills to adversarial subagents to deterministic control flow with the Claude Agent SDK.
  • A workaround for Snowflake Cortex’s model limitations by wiring it to Groq for sub-second Llama 3 and Mixtral inference.
  • What happens when you need to serve hundreds of concurrent LLM users and the KV cache no longer fits in memory?
  • Why feeding more structured data to VLMs like Gemini actually makes them perform worse: fabricated low-confidence detections override correct ones.
  • The math behind diffusion models, from DDPM’s forward process through CLIP, DDIM, unCLIP, and Stable Diffusion’s latent-space compression.

Let’s get into it!

What’s AI Weekly

Between February 12 and 23, three major US AI labs accused Chinese labs of stealing their models’ capabilities. Anthropic accused three Chinese AI labs of running what it calls “industrial-scale distillation attacks” on Claude: 24,000 fake accounts, 16 million exchanges, and coordinated proxy networks, all designed to extract Claude’s reasoning, coding, and agentic capabilities. OpenAI sent a memo to the U.S. House Select Committee on China accusing DeepSeek of “free-riding” on U.S. frontier-model capabilities. And even Google published a report documenting a 100,000-prompt campaign targeting Gemini’s reasoning traces.

So, this week, I will break down what’s actually being accused, how distillation works technically, the history that got us here, and the hypocrisy behind these stories. Read the complete article here or watch the video on YouTube.

AI Tip of the Day

Vector search is great at capturing meaning, but weak at exact matches. Things like product names, error codes, version numbers, or acronyms often get missed. For example, a query like “GPT-4o API error 429” might not rank well with pure vector search, even if that exact phrase exists in your data, because embeddings capture overall meaning, not precise tokens.

Hybrid search addresses this by combining vector search with BM25, a keyword-based method that scores exact-term matches. A common approach is to run both and merge the results using something like Reciprocal Rank Fusion. This gives you both semantic understanding and exact matching, without having to choose between them. Several modern databases, such as Weaviate, Qdrant, and Elasticsearch, support this natively.
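
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in Python, assuming each retriever already returns a ranked list of document IDs (the document IDs below are made up for illustration):

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores sum(1 / (k + rank)) across every list it
    # appears in; k = 60 is the constant from the original RRF paper.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_3", "doc_1", "doc_7"]  # semantic neighbors
bm25_hits = ["doc_7", "doc_3", "doc_9"]    # exact match on "error 429"
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# doc_3 and doc_7 come out on top because both retrievers agree on them

Documents that both retrievers rank highly win; documents found by only one method still survive, just lower down the merged list.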

If you’re building retrieval pipelines and want to go deeper into search strategies, evaluation, and the full production stack, check out our Full Stack AI Engineering course.

— Louis-François Bouchard, Towards AI Co-founder & Head of Community

Learn AI Together Community Section!

Featured Community post from the Discord

G023dev has built HarnessHarvester, a self-learning, self-correcting, LLM-powered harness creation and management system. It features FAISS-powered RAG, sandboxed execution, and two autonomous modes: autolearn (a continuous discovery loop) and autoimprove (iterative enhancement of existing harnesses). It is designed as an offline-first harness/scaffolding builder, so you get the harness itself rather than a dependency on some remote API. Check it out on GitHub and support a fellow community member. If you have any questions, connect with him in the thread!

AI poll of the week!

A lot of you feel Claude (or your current provider) has slipped a bit over time, while a smaller group says it’s either steady or improving. That split makes sense: these systems change under the hood, and you usually notice it first as “wait… why is this suddenly harder than last month?”, especially on the same prompts and tasks.

When you say it’s getting worse, what changed for you the most: more refusals/guardrails, weaker reasoning on hard questions, more hallucinations, lower coding accuracy, or a shift in tone (too verbose/too cautious)? Let’s do something about it in the thread!

Collaboration Opportunities

The Learn AI Together Discord community is flooded with collaboration opportunities. If you are excited to dive into applied AI, want a study partner, or even want to find a partner for your passion project, join the collaboration channel! Keep an eye on this section, too — we share cool opportunities every week!

1. Digvijay010606_44180 is looking for a study partner to learn data science and stay consistent. If you also struggle with consistency and want an accountability partner, connect with him in the thread!

2. Lyraluthuin is working on an AI system that creates 3D models from human-drawn lines and is looking to connect with people in computer graphics and animation technology to help scope out the remaining work. If this is your space, contact her in the thread!

3. Augmnt_sh is building an Agent Observability Protocol and is looking for a few people who are building autonomous agents and want to try AOP on their stack. If you want to contribute to the SDK and are interested in the observability/dev tooling space, connect with them in this thread!

Meme of the week!

Meme shared by bin4ry_d3struct0r

TAI Curated Section

Article of the week

From Claude Code Skills to Adversarial Subagent Orchestrators to the Claude Agent SDK: Three Generations of a Production Research Pipeline By Rick Hightower

Harness engineering separates toy agents from production systems, and this piece traces a research pipeline across three generations. Generation one chained Claude Code skills but hallucinated freely. Generation two added adversarial doer/judge subagents with retry prompts, though the orchestrator often ignored instructions. Generation three moved control flow into Python with the Claude Agent SDK, enforcing deterministic retries, Pydantic-validated structured output, budget caps, and human-in-the-loop escalation.
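
The SDK specifics are in the article, but the generation-three pattern of deterministic retries around Pydantic-validated output is easy to sketch. A minimal version, where call_agent is a hypothetical stand-in for whatever model call you use:

from pydantic import BaseModel, ValidationError

class ResearchFinding(BaseModel):
    claim: str
    source_url: str
    confidence: float

def call_agent(prompt: str) -> str:
    # Hypothetical stand-in for the real model call
    # (e.g. a wrapper around the Claude Agent SDK).
    raise NotImplementedError

def run_step(prompt: str, max_retries: int = 3) -> ResearchFinding:
    for attempt in range(max_retries):
        raw = call_agent(prompt)
        try:
            # Structured-output gate: invalid JSON never leaves this function
            return ResearchFinding.model_validate_json(raw)
        except ValidationError as err:
            # Python, not the orchestrator model, decides what happens next
            prompt += f"\n\nYour last output failed validation:\n{err}\nReturn valid JSON only."
    # Retries exhausted: escalate instead of looping forever
    raise RuntimeError("Step failed validation; escalating for human review")

Budget caps fit the same shape: count tokens inside the loop and break early. The point is that control flow lives in ordinary Python, where it cannot be talked out of its instructions.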

Our must-read articles

1. How I Built a Production-Grade Open-Source LLM Pipeline Using Groq and Snowflake By Satish Kumar

One limitation of Snowflake Cortex is that it supports only a handful of models and blocks External Access Integrations on trial accounts. This article shows how to wire Snowflake to Groq’s inference API for sub-second Llama 3 and Mixtral calls. It also covers 10 production use cases, including sentiment analysis, PII detection, SQL generation, and anomaly root cause analysis.
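
The article covers the Snowflake side (External Access Integration, secrets, UDF wrappers), but the heart of the wiring is just Groq’s OpenAI-compatible endpoint. A minimal sketch, assuming a valid GROQ_API_KEY in the environment and that the model ID below is still one Groq serves:

import os
import requests

def groq_complete(prompt: str, model: str = "llama3-8b-8192") -> str:
    # Groq exposes an OpenAI-compatible chat completions endpoint; inside
    # Snowflake this call would live in a UDF behind an External Access
    # Integration rather than in a plain script.
    resp = requests.post(
        "https://api.groq.com/openai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(groq_complete("Sentiment, one word: 'shipping was slow but support was great'"))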

2. Inside LLM Inference: When the KV Cache No Longer Fits By Aanchal Karamchandani

Serving hundreds of concurrent LLM users turns inference into a memory management problem, not a compute one. The piece walks through how production systems tackle KV cache pressure using techniques like PagedAttention’s non-contiguous block allocation, inspired by OS virtual memory; prefix sharing across users with identical system prompts; prefill caching that retains computed blocks over time; attention-aware eviction policies; and INT8/INT4 quantization. Each technique compounds the others, cutting memory waste from 60–80% to under 4%.
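
To see why memory, not compute, is the constraint, the back-of-the-envelope KV cache math is worth doing once. Illustrative numbers for a Llama-2-7B-shaped model (32 layers, 32 KV heads, head dim 128, FP16); real deployments vary, especially with grouped-query attention:

n_layers, n_kv_heads, head_dim = 32, 32, 128  # illustrative 7B-class shapes
bytes_fp16 = 2

# 2x for keys and values, cached at every layer and KV head
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_fp16
print(kv_bytes_per_token)                           # 524,288 B = 512 KB per token

per_user = kv_bytes_per_token * 4096                # one user at a 4k context
print(per_user / 2**30, "GiB per user")             # 2.0 GiB
print(200 * per_user / 2**30, "GiB for 200 users")  # 400 GiB: several 80 GB GPUs
# INT8 KV quantization halves this; PagedAttention and prefix sharing
# attack the waste inside and between these allocations.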

3. VLM: The More You Tell it, The Less it Sees By Mradul Dubey

In this article, the author ran controlled experiments showing that feeding structured detection data to VLMs like Gemini 3 Flash actively suppresses visual reasoning on a surveillance clip. The same bounding-box information, delivered as text, drawn overlays, or cross-modal references, produced vastly different degrees of anchoring bias. The results show that fabricated low-confidence detections override correct shoplifting identifications, and that every added metadata field monotonically degrades perception.
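
The experiments are the article’s, but the ablation shape is general and easy to picture. A hypothetical sketch, with ask_vlm standing in for whatever VLM client you use and the detections invented for illustration:

import json

def ask_vlm(image_path: str, prompt: str) -> str:
    # Hypothetical stand-in for a real VLM call (e.g. a Gemini client)
    return "(model answer goes here)"

question = "Describe any suspicious activity in this frame."
detections = [
    {"label": "person", "bbox": [412, 188, 530, 610], "confidence": 0.91},
    {"label": "bottle", "bbox": [455, 402, 470, 455], "confidence": 0.22},  # low confidence
]

# Same frame, progressively more detector metadata in the prompt
variants = {
    "image_only": question,
    "plus_detections": question + "\n\nDetector output:\n" + json.dumps(detections),
    "plus_track_ids": question + "\n\nDetector output:\n"
        + json.dumps([{**d, "track_id": i} for i, d in enumerate(detections)]),
}

for name, prompt in variants.items():
    print(name, "->", ask_vlm("frame_0412.jpg", prompt))

Comparing answers across variants is what surfaces the anchoring the article describes: each added field pulls the model toward the detector’s story and away from the pixels.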

4. The Physics of Imagination: Visualizing the Hidden Mathematics of Diffusion Models By Shreyansh Jain

This piece walks through diffusion models, tracing how controlled destruction becomes the foundation for modern machine imagination. The author covers DDPM’s Brownian-motion forward process and its noise-prediction training objective; CLIP’s 512-dimensional shared embedding space that gives generators a semantic compass; DDIM’s SDE-to-ODE shortcut that collapses 1,000 sampling steps into roughly 50; unCLIP’s two-stage prior-and-decoder architecture behind DALL-E 2; and Stable Diffusion’s latent-space compression using a lightweight VAE.
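
For readers who want the anchor equation before diving in, here is the standard DDPM forward process in textbook notation (the usual formulation, not notation lifted from the article):

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big)

q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big), \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \quad \alpha_s = 1 - \beta_s

The closed form for q(x_t \mid x_0) is what makes training cheap: any timestep can be sampled directly, and the model \epsilon_\theta(x_t, t) simply learns to predict the added noise, minimizing \mathbb{E}_{t, x_0, \epsilon}\big[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\big].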

If you are interested in publishing with Towards AI, check our guidelines and sign up. We will publish your work to our network if it meets our editorial policies and standards.

