The thesis: the model API call is about 2% of a production AI system's complexity. The surrounding infrastructure (model abstraction, session memory, RAG, tool integration, guardrails, observability, orchestration) is what determines whether a project ships or stays a demo.
We just published the first 4 chapters on Manning MEAP. It's a build-from-scratch Python book. You build a complete AI platform, service by service. Here's what's in the live chapters:
Platform SDK and API layer. platform = GenAIPlatform() lazily initializes service clients, and a workflow decorator packages each one as an independently deployable container. Internal communication uses gRPC with Protocol Buffers; external APIs support sync, async with job tracking, and streaming via SSE.
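For a rough feel of the lazy-init-plus-decorator pattern, here's a minimal sketch; everything beyond the GenAIPlatform and workflow names is an illustrative guess, not the book's actual code:

```python
# Hypothetical sketch of the SDK surface; internals are stand-ins.
class GenAIPlatform:
    def __init__(self):
        self._clients = {}  # service clients, created on first access

    def _client(self, name, factory):
        # Lazily construct and cache one client per service.
        if name not in self._clients:
            self._clients[name] = factory()
        return self._clients[name]

    @property
    def models(self):
        # In the real thing this would build a gRPC stub; a plain
        # object stands in here.
        return self._client("models", object)

def workflow(fn):
    # Marks a function as a workflow so packaging tooling can find it
    # and wrap it into its own deployable container.
    fn._is_workflow = True
    return fn

@workflow
def summarize(text: str) -> str:
    return text[:100]
```

The point of the lazy init is that importing the SDK stays cheap: no connections are opened until a workflow actually touches a service.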
Model Service. Provider adapters that translate a unified interface to OpenAI, Anthropic, Google, and self-hosted models (vLLM/TGI/Ollama). Retry with exponential backoff, ordered fallback chains, and various routing strategies. Rate limiting operates at request, token, and concurrency levels across per-user, per-workflow, and global scopes. Caching works at two levels: response caching (exact match + semantic similarity) and provider-level prompt caching for repeated system prompts in RAG applications.
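The retry-plus-fallback behavior is roughly this shape (a simplified sketch with made-up names; the book's version layers routing, rate limiting, and caching on top):

```python
import time

def call_with_fallback(providers, prompt, max_retries=3, base_delay=0.5):
    # Walk an ordered fallback chain; within each provider, retry with
    # exponential backoff (base_delay, doubling per attempt) before
    # falling through to the next provider in the chain.
    last_err = None
    for provider in providers:
        for attempt in range(max_retries):
            try:
                return provider(prompt)
            except Exception as err:
                last_err = err
                if attempt < max_retries - 1:
                    time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("all providers exhausted") from last_err
```

A real implementation would also distinguish retryable errors (429, 5xx) from permanent ones (invalid request) so it doesn't burn retries on calls that can never succeed.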
Session Service. Pluggable storage abstraction (PostgreSQL implemented in detail, interface supports Redis, Mongo, DynamoDB, etc.). The standout section is context window management: four strategies from simple truncation through hierarchical memory. Structured key-value memories at the top (inspired by MemGPT's "context as RAM, storage as disk" concept), compressed summaries of older conversation in the middle, and verbatim recent messages at the bottom, ordered to account for the "lost in the middle" attention phenomenon.
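A toy illustration of that three-tier ordering, using character counts where a real system would count tokens (the function name and structure are mine, not the book's):

```python
def build_context(memories, summaries, recent, max_chars):
    # Assemble the prompt top to bottom: structured key-value facts
    # first, compressed summaries in the middle, verbatim recent turns
    # last, so the highest-value content sits at the edges of the
    # window rather than in the weak "lost in the middle" region.
    header = "\n".join(f"{k}: {v}" for k, v in memories.items())
    tail = "\n".join(recent)
    # Spend the leftover budget on the middle summaries, preferring
    # the newest ones when space runs out.
    budget = max_chars - len(header) - len(tail)
    middle = []
    for s in reversed(summaries):
        if len(s) <= budget:
            middle.append(s)
            budget -= len(s)
    middle.reverse()
    return "\n".join([header, *middle, tail])
```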
Chapter 1 is free: https://www.manning.com/books/designing-ai-systems
Code: https://github.com/designing-ai-systems/designing_ai_systems_repo
Happy to discuss any of the design choices.
[link] [comments]