The 10 Questions That Decide Whether You’re an AI Engineer or Just an AI User

I failed my first AI engineer interview. Here’s the complete playbook I built to never fail again.

The email came at 11:43 PM on a Tuesday.

“Thank you for taking the time to interview with us. After careful consideration, we’ve decided to move forward with other candidates…”

I stared at my laptop screen for a long time. The blue light felt accusatory.

It had been a 90-minute technical interview for a Senior AI Engineer role at a company I’d spent three years wanting to work at. I had shipped LLM features. I had written RAG pipelines. I had production experience with vector databases, embeddings, and prompt engineering. On paper, I was qualified.

But somewhere between “Tell me how you handle hallucinations in LLMs?” and “Design a system that processes 10,000 documents daily,” I had unraveled.

Not because I didn’t know the concepts. I knew most of them, but only loosely: the way most engineers who’ve read a few blog posts and shipped a few demos “know” things.

But knowing about something and being able to architect a production solution to it under pressure, in real time, with someone who builds these systems for a living watching you — those are two completely different skills.

That night, I started building something: a document I now treat as my personal playbook. A brutal, comprehensive answer to every hard question I’d fumbled.

What follows is that playbook. It’s long. It’s meant to be. Each section gives you what the interviewer is actually looking for, the complete technical answer, and the one thing that separates the candidates who get hired from the ones who get polite rejection emails at midnight.

Before We Start: What This Article Is and Isn’t

This is not a cheat sheet.

If you read each section below and treat it as bullet points to recite back verbatim, you will fail your interview in exactly the same way I failed mine. Interviewers who ask these questions have heard hundreds of rehearsed answers. They know what a rehearsed answer sounds like. It sounds confident for about ninety seconds and then collapses the moment a follow-up question probes one level deeper.

What this article is: a framework for understanding the ten hard problems that define AI engineering in production. If you really understand these problems, the way you understand why a database index works or why stateless services are easier to scale, then you can answer any variant of these questions, handle any follow-up the interviewer has, and adapt your answer to the specific context they bring.

I’ve structured each section the same way:

  • The question and why interviewers ask it (what they’re actually testing for)
  • The complete technical answer, built from real production experience
  • The interview move → the one thing that separates the hired candidates from the ones who get rejection emails

Approximate reading time: 25–30 minutes. Make a coffee.

The Landscape: What It Means to Be an “AI Engineer” in 2025

Before we dive into the ten questions, let’s agree on something uncomfortable:

The title “AI Engineer” has been diluted to near meaninglessness.

There are AI engineers who can barely explain the difference between a transformer and a recurrent network. There are AI engineers who have shipped zero production systems. There are AI engineers whose entire skillset is: “write a prompt, call the OpenAI API, hope for the best.”

And there are AI engineers who think in systems and who can reason about cost, reliability, latency, safety, and quality simultaneously. Who’ve debugged a production failure at 2 AM because their embeddings model started behaving differently after a provider update. Who can design a document summarisation pipeline for 10,000 documents a day with cost math on the back of a napkin.

Interviewers are trying to find the second kind. These ten questions are their tool for doing it.

Let’s go through them one by one: not as bullet points to memorise, but as problems to understand deeply, rephrase in your own words, and extend with your own experience.

Question 01 · Embeddings & Vector Search

“Have you worked with embeddings before? How did you choose the right model, and what scaling issues did you run into?”

Embeddings are the backbone of modern AI systems: RAG, semantic search, recommendations, anomaly detection. If you’re building anything that requires understanding meaning (not just matching keywords), you need embeddings.

Keep the following structure in mind:

Why they ask this: Embeddings are the foundation of modern AI systems from search to recommendations to RAG. Deep understanding here signals strong fundamentals.

What they are:

Embeddings are dense vector representations of data (text, images, audio) in which semantically similar items are close together in vector space. The model doesn’t just encode tokens; it encodes meaning. “King” is closer to “Queen” than to “Bicycle” not because of spelling, but because of semantic relationship.

How they work:

A neural network maps input text to a fixed-size vector: typically 1,536 dimensions for text-embedding-3-small, 3,072 for the large variant. You then measure semantic similarity between two vectors using cosine similarity or dot product.
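To make the similarity math concrete, here’s a minimal sketch using toy 3-dimensional vectors (real embeddings have 1,536+ dimensions); the values are invented for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the
    vector magnitudes. Ranges from -1 (opposite) to 1 (identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative, not from a real model).
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
bicycle = [0.1, 0.2, 0.9]

# Semantically related words land closer together in vector space.
print(cosine_similarity(king, queen) > cosine_similarity(king, bicycle))  # True
```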

Production use cases:

  • Semantic search: Find documents by meaning, not keyword matching. A query for “cardiac arrest” surfaces results about “heart attack” without any keyword overlap.
  • RAG retrieval: The core retrieval mechanism in every RAG system.
  • Duplicate detection: Cluster near-identical content even if it’s phrased differently.
  • Recommendation engines: Surface similar products, articles, or content.
  • Clustering and topic modelling: Group documents by theme without predefined labels.
  • Anomaly detection: Identify outlier queries or documents that are semantically distant from the expected distribution.

Choosing a model:

Consider: dimensionality (higher = more expressive, slower, pricier), domain fit (general-purpose vs. domain-specific models fine-tuned on legal, medical, or technical text), multilingual support, and cost. Always benchmark on your data — MTEB leaderboard rankings don’t tell you how a model performs on your specific domain.

Scaling considerations:

At a few hundred thousand vectors, exact search is fine. Beyond that, exact nearest neighbour search becomes too slow and you need Approximate Nearest Neighbour (ANN) algorithms such as HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index). Choose the right index type for your recall/latency trade-off. Pinecone, Weaviate, Qdrant, and pgvector all offer configurable indexing.
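For scale intuition, here’s a sketch of the brute-force baseline you would use before reaching for an ANN index; the random vectors and the 10,000-row scale are illustrative only:

```python
import numpy as np

def exact_top_k(query: np.ndarray, index: np.ndarray, k: int = 5):
    """Brute-force cosine search. Fine up to a few hundred thousand
    vectors; beyond that you'd swap in an ANN index (HNSW, IVF)."""
    # Normalise so that a plain dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    idx = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = idx @ q
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64)).astype("float32")
hits = exact_top_k(vectors[42], vectors, k=3)
print(hits[0][0])  # 42 — the query vector is its own nearest neighbour
```

The ANN libraries replace the `idx @ q` scan with a graph or clustering structure, trading a little recall for orders-of-magnitude lower latency.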

The interview move: If it’s a whiteboard interview, sketch the pipeline: Query → Embed → ANN Search → Top-K → Re-rank → Generate. Visual communication of the pipeline is a genuine differentiator.

Question 02 · RAG

“Have you designed a RAG system? Can you walk us through the entire pipeline? What challenges did you run into while building it?”

Figure 1. Full RAG pipeline — offline index build (bottom) and online query processing (top), from document ingestion through to evaluation

Figure 2. The same pipeline as Figure 1, redrawn for clarity.

RAG is the architecture pattern of modern applied AI. If you claim to be an AI engineer and you can’t design a production RAG system end-to-end, you have a significant credibility problem.

Why they ask this: RAG is the most common LLM architecture pattern in production. If you can’t design one end-to-end, you’ll struggle in any applied AI role.

The first RAG system I built had six components. The production version we run now has fifteen. Here’s the complete path from naïve to production-grade:

Answer framework for Q2 — the six stages of a production RAG pipeline, with the step most candidates skip highlighted

Stage 1: Document Ingestion

Parse source documents such as PDFs, HTML, databases, Markdown. Clean the text: strip headers/footers, normalize whitespace, remove artefacts. Then chunk: split into semantically coherent pieces.

Chunk size matters enormously. Too small: you lose context. Too large: you dilute relevance. The empirically validated sweet spot is typically 200–500 tokens with overlap (so no information is cut off at chunk boundaries).
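A minimal chunking sketch, assuming the text is already tokenised; the 400-token size and 50-token overlap are illustrative defaults within the sweet spot above:

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into overlapping chunks so no thought is
    cut off without context at a chunk boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

tokens = list(range(1000))  # stand-in for tokenised text
chunks = chunk_tokens(tokens, chunk_size=400, overlap=50)
# Each chunk shares its last 50 tokens with the start of the next one.
print(len(chunks), chunks[1][0])  # 3 chunks; the second starts at token 350
```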

Stage 2: Embedding & Indexing

Embed each chunk using a model like text-embedding-3-small (OpenAI) or an open-source alternative. Store embeddings in a vector database with metadata (document ID, source, date, section title) for filtering at query time.

Model choice matters: consider dimensionality (higher = more expressive but slower), domain fit (general-purpose vs. domain-specific), multilingual support, and cost. Benchmark on your data, not MTEB leaderboard scores.

Stage 3: Retrieval

At query time, embed the user’s question and perform similarity search (cosine or dot product). Retrieve top-k chunks.

Here’s where most naïve implementations stop. Production systems add: hybrid search (combining vector similarity with BM25 keyword matching). This catches cases where exact terms (product names, model numbers, proper nouns) matter more than semantic similarity.
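One common way to fuse the two result lists is reciprocal rank fusion (RRF); this sketch assumes you already have ranked document IDs from the vector index and from BM25, and the document names are hypothetical:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked result lists (e.g. vector search + BM25) by
    summing 1 / (k + rank) for each document across the lists.
    Documents that rank well in both lists float to the top."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # ranked by embedding similarity
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # ranked by keyword match
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused[0])  # doc_b — ranked highly by both retrievers
```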

Stage 4: Re-Ranking (the step most candidates skip)

Raw retrieval isn’t enough. Use a cross-encoder re-ranker (Cohere Rerank, BGE reranker) to re-score your top-k chunks for actual relevance to the question. Cross-encoders attend to both the query and the document simultaneously, which dramatically improves precision.

This is the single step that most distinguishes a basic RAG from a production-quality one.

Stage 5: Generation

Feed the re-ranked context + user query into the LLM with a well-crafted system prompt: answer only from the provided context, cite sources, if the answer isn’t present, say so explicitly. Include source attribution in the output.

Stage 6: Evaluation Loop

Measure retrieval recall (are the right chunks being retrieved?), answer faithfulness (does the answer match the source?), and answer relevance. Use frameworks like RAGAS or custom eval suites.

Common RAG Failure Modes to Know

Beyond the happy path, production RAG systems fail in predictable ways that interviewers sometimes probe:

Retrieval failures: The right chunk exists in the database but isn’t retrieved — because the query embedding and document embedding don’t land close in vector space even though they’re semantically related. This is why hybrid search (vector + BM25) matters, and why re-ranking is not optional.

Context poisoning: In long conversations, the context window fills with retrieved chunks from earlier turns that are no longer relevant, crowding out the correct chunks for the current question. Context management — pruning stale retrieved content — is a real production concern.

Chunk boundary failures: A sentence starts in one chunk and ends in another. With no overlap strategy, the model sees incomplete thoughts. Overlap at chunk boundaries (50–100 token overlap) is the standard mitigation.

Metadata filtering bugs: If your vector store filters on metadata (date, category, user ID) and that metadata is wrong or inconsistently populated, you’ll silently retrieve irrelevant results with no obvious error signal.

The interview move: The re-ranking step is your differentiator. Most candidates skip it. Mentioning it as a dedicated stage immediately signals production maturity.

Question 03 · LLM Hallucinations

“Have you dealt with hallucinations in a live system? What layers of defense did you put in place?”

This is almost always the first question. And it’s where most engineers make their first mistake: they treat it as a question about models.

It isn’t.

It’s a question about architecture.

Why they ask this: Hallucination is the number one reason companies don’t trust AI systems in production. If you’re building anything customer-facing, you need to know how to stop a model from confidently making things up. Interviewers want to see if you understand that this is an engineering problem, not a model problem.

The summer I shipped my first customer-facing LLM feature, we had a support ticket that read: “Your AI told me the return window for my order was 60 days. It’s 30. I want my money back.”

The model wasn’t wrong in a stupid way. It was wrong in a confident, fluent, completely believable way. That’s what makes hallucination genuinely dangerous: the model delivers the answer so confidently that it never gives you a chance to doubt it.

Here’s how you build a system that defends against it:

Layer 1: Retrieval-Augmented Generation (RAG)

The most foundational defence: instead of asking the model to answer from its parametric memory (the weights baked in at training time), you force it to answer from a specific set of documents you control.

The model retrieves relevant chunks from your vector store (Pinecone, Weaviate, pgvector, Qdrant), and your system prompt instructs it: answer only from the provided context. If the answer isn’t here, say I don’t know.

This doesn’t eliminate hallucination entirely (models can still misread context, misattribute quotes, or confuse chunks), but it dramatically reduces fabrication.

Layer 2: Chain-of-Thought / Step-by-Step Reasoning

When you force the model to show its reasoning before arriving at a conclusion, you structurally reduce the probability of a confident but baseless answer. A model that reasons step-by-step is less likely to skip directly to a plausible-sounding falsehood.

You can enforce this via prompting (“think through this step by step before answering”) or via structured output schemas that require the model to populate a reasoning field before the answer field.

Layer 3: Output Validation & Guardrails

Post-generation checks. You can use:

  • Fact-checking against your source documents: does the answer’s claims appear in the retrieved context?
  • NLI classifiers that flag “unsupported” statements (models fine-tuned to detect entailment vs. contradiction)
  • A secondary model call specifically tasked with verification
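To show the control flow of the first check, here’s a crude lexical grounding sketch; a real system would replace the word-overlap heuristic with an NLI model, and the threshold here is an invented illustration:

```python
def is_grounded(answer: str, context: str, threshold: float = 0.7) -> bool:
    """Crude grounding check: what fraction of the answer's content
    words appear in the retrieved context? Production systems use an
    NLI or entailment model instead, but the control flow is the same."""
    stopwords = {"the", "a", "an", "is", "are", "was", "of", "to", "in", "for"}
    answer_words = {w for w in answer.lower().split() if w not in stopwords}
    context_words = set(context.lower().split())
    if not answer_words:
        return True
    overlap = len(answer_words & context_words) / len(answer_words)
    return overlap >= threshold

context = "the return window for standard orders is 30 days"
print(is_grounded("return window is 30 days", context))              # True
print(is_grounded("return window is 60 days for members", context))  # False
```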

Layer 4: Constrained Decoding

If your answer space is bounded (yes/no, one of five categories, a structured JSON object), don’t let the model free-generate. Use function calling or structured output schemas. The fewer degrees of freedom you give the model, the fewer ways it can go wrong.

Layer 5: Confidence Calibration

Use log-probabilities or self-consistency sampling. Generate multiple outputs, look for consensus. If the model’s outputs are highly inconsistent, that’s a signal the model is uncertain. Build an “abstain” pathway: “I don’t have enough information to answer this reliably.”
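A minimal self-consistency sketch, assuming you have already collected several sampled answers from the model; the agreement threshold is an illustrative choice:

```python
from collections import Counter

def self_consistency(sampled_answers, min_agreement=0.6):
    """Sample the model several times; if the answers don't reach
    consensus, abstain rather than return a confident guess."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    if votes / len(sampled_answers) >= min_agreement:
        return answer
    return "I don't have enough information to answer this reliably."

# Strong consensus -> return the majority answer.
print(self_consistency(["30 days", "30 days", "30 days", "60 days", "30 days"]))
# High variance -> the abstain pathway fires.
print(self_consistency(["30 days", "60 days", "45 days", "30 days", "90 days"]))
```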

The interview move: Mention that you combine RAG and output validation as a layered defense. The most sophisticated candidates say: “No single technique is enough. Production systems need defense in depth just like security.”

Keep this as your mental model:

Figure 3. Hallucination defense pipeline

Question 04 · Evaluation Strategy

“How did you measure whether your LLM feature was actually working well in production?”

If you can’t measure quality, you can’t improve it. This question is about whether you’ve built rigorous systems around your AI or whether you’re just hoping things are working.

Why they ask this: Companies need engineers who can set up rigorous evaluation before shipping. “It seems good” is not a quality standard.

Keep these points in mind before you start explaining:

Answer framework for Q4 — five evaluation methods, from automated metrics to human review, with the interview-ready one-liner at the bottom

Task-Specific Metrics

Match your metric to your task:

  • Classification: accuracy, F1, precision/recall
  • Summarization: ROUGE, BERT-Score
  • Translation: BLEU, COMET
  • Generation: perplexity, coherence scores

But always ask: does this metric correlate with what users actually care about? A high ROUGE score on summaries that users find useless is worthless.

LLM-as-Judge

Use a stronger model to evaluate a weaker model’s outputs on criteria like helpfulness, accuracy, safety, and style adherence. This scales better than human evaluation for fast iteration loops.

The key is structured rubrics. Vague prompts like “rate this answer out of 10” produce inconsistent, biased judgements. Instead: provide specific criteria, examples of good/bad outputs, and chain-of-thought instructions. This dramatically reduces judge variance.

Human Evaluation

Still the gold standard for subjective quality. A/B tests with real users. Thumbs up/down at scale. Time-on-task metrics. Task completion rates. The signal from 100 real users tells you more than any automated metric.

Adversarial Testing

Red-team your model deliberately. Test jailbreaks, edge cases, out-of-domain queries, adversarial prompts. Evaluate not just whether the model answers well but how gracefully it fails when it doesn’t know the answer.

Regression Testing

This is the discipline that separates production systems from demos: maintain a test suite of known-good input/output pairs. Every time you change the prompt, the model, or the pipeline — re-run the suite before shipping. Catch regressions before they reach users.
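A sketch of that discipline, with a stub standing in for the real pipeline; the helper names and test cases are hypothetical:

```python
def run_regression_suite(pipeline, suite):
    """Re-run known-good cases after any prompt/model/pipeline change
    and report which cases regressed, before anything ships."""
    failures = []
    for case in suite:
        got = pipeline(case["input"])
        if not case["check"](got):
            failures.append((case["input"], got))
    return failures

# Hypothetical stub standing in for a real LLM-backed pipeline.
def pipeline(text):
    return "30 days" if "return" in text else "unknown"

suite = [
    {"input": "what is the return window?", "check": lambda out: "30" in out},
    {"input": "gibberish query", "check": lambda out: out == "unknown"},
]
print(run_regression_suite(pipeline, suite))  # [] — no regressions
```

In CI, a non-empty failure list blocks the deploy, exactly like a failing unit test.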

The interview move: “I use LLM-as-Judge for fast iteration loops and human eval for final sign-off. This gives me the speed of automated evaluation with the accuracy of human judgement.”

Question 05 · Deployment Strategy/Production Debugging

“Walk us through a complete model deployment strategy.”

or

“Has a model ever behaved differently in production vs testing for you? How did you debug it?”

Follow-up → What’s your plan if something goes wrong in production after you deploy your model?

This question is a trap, in the best possible way.

Every engineer who has only ever run LLM experiments in a notebook will answer this question incorrectly. They’ll say something like “the model might need retraining” and stop there.

Engineers who’ve actually shipped to production will immediately start listing the real culprits.

Why they ask this: This shows whether you’ve shipped. Most junior engineers have only run models in notebooks. This question surfaces the gap between “I built a demo” and “I run production systems.”

Remember these steps before answering:

Answer framework for Q5 — Has your model ever behaved differently in production vs testing? How did you debug it?

The first time my model passed every evaluation metric and then immediately degraded in production, I thought I had a model problem. I didn’t. I had a data problem, a prompt problem, a latency problem, and a monitoring problem — all at once.

Here’s the complete diagnostic framework:

Data Drift

Production data is never identical to your test set. Real users are messier. They type with slang, typos, incomplete sentences. They ask about edge cases your training data never covered. They phrase things in ways that trigger totally different model behaviors.

Solution: Monitor input distributions continuously. Set up drift detection alerts. Track the statistical distance between your test distribution and your production distribution.
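One simple drift statistic is the Population Stability Index (PSI) over binned input features, for example query-length buckets; this is a minimal sketch with invented distributions:

```python
import math

def population_stability_index(expected, actual):
    """PSI between two binned distributions (lists of bin proportions).
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 act."""
    psi = 0.0
    for e, a in zip(expected, actual):
        e = max(e, 1e-6)  # avoid log(0) on empty bins
        a = max(a, 1e-6)
        psi += (a - e) * math.log(a / e)
    return psi

test_dist = [0.25, 0.25, 0.25, 0.25]  # e.g. query-length buckets offline
prod_dist = [0.10, 0.20, 0.30, 0.40]  # what production actually looks like
print(round(population_stability_index(test_dist, prod_dist), 3))  # 0.228 → drift alert
```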

Train-Test Leakage

If your test performance was suspiciously good (too good), you may have accidentally leaked training data into your evaluation set. Re-audit your data pipeline for temporal leakage or overlap. This is embarrassingly common, and almost no one admits it.

Evaluation Mismatch

Your offline metrics (F1, BLEU, perplexity) may not correlate with what users actually care about. A model that scores 0.94 on your eval suite can still feel terrible to use.

Build eval suites that use real user queries, real user feedback, A/B tests with real users. Track thumbs up/down, time-on-task, task completion rates. These are the metrics that actually matter.

Prompt Sensitivity

Tiny changes in how users phrase their requests can produce wildly different outputs. Your test prompts were probably carefully constructed. Real users are not. Production prompts get modified by real user input in ways that completely change the model’s response.

Solution: Adversarial prompt testing. Deliberately try to break your prompts. See how fragile they are before you ship.

Latency & Timeout Issues

A model that performs fine under test conditions may hit rate limits, timeout, or degrade under concurrent load. Profile your inference pipeline under realistic traffic patterns — including burst traffic.

Context Window Overflow

In production, conversations grow longer over time. If you’re stuffing context (chat history, retrieved documents, tool outputs) without a management strategy, you will silently lose information when the context window limit is hit. This leads to degraded responses that are very hard to diagnose because there’s no explicit error.
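A minimal token-budget trimming sketch; the whitespace token counter is a stand-in for a real tokenizer, and in practice dropped turns are often summarised rather than discarded:

```python
def trim_history(messages, max_tokens,
                 count_tokens=lambda m: len(m["content"].split())):
    """Keep the system prompt plus the most recent turns that fit the
    token budget; older turns are dropped (or, better, summarised)."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m) for m in system)
    kept = []
    for msg in reversed(rest):  # walk newest-first
        cost = count_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))

history = [{"role": "system", "content": "answer only from context"}] + [
    {"role": "user", "content": f"question number {i} with some words"}
    for i in range(20)
]
trimmed = trim_history(history, max_tokens=30)
print(len(trimmed), trimmed[-1]["content"])  # system prompt + 4 newest turns
```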

The interview move: Lead with the monitoring framework: “I’d set up three layers of monitoring — input distribution tracking, output quality scoring, and user feedback loops — then systematically narrow down the failure mode.”

Question 06 · Fine-Tuning vs. Prompting

“Have you fine-tuned a model before? How did you decide that was the right call over prompting?”

This is a pragmatic trade-off question. The wrong answer is: “I’d fine-tune for better performance.” The right answer is nuanced, cost-aware, and grounded in real decisions.

Why they ask this: This tests your ability to make pragmatic trade-off decisions, arguably the most important skill in applied AI engineering.

Here’s the honest decision framework I’ve built over years of making this call:

Start with prompt engineering. Always.

As usual, keep these steps in mind when explaining:

Answer framework for Q6 — prompting vs. fine-tuning

It’s faster. It’s cheaper. It’s reversible. Few-shot examples, chain-of-thought, structured output schemas, system prompts: these tools cover 80% of tasks at a fraction of the cost and time of fine-tuning. If you can get the behavior you need with a well-crafted prompt, that’s the right answer.

Fine-tune when:

  1. You need consistent output format or style that prompting can’t reliably enforce. If you need every response in a specific JSON schema, a specific writing style, or a specific persona, and prompting gives you 85% consistency when you need 99%, fine-tuning bakes the behavior into the weights.
  2. You have a narrow, well-defined task with abundant labelled examples (hundreds to thousands of high-quality input/output pairs). Classification, extraction, transformation: narrow tasks with clear right answers benefit most from fine-tuning.
  3. You need to reduce latency by removing long system prompts. A fine-tuned model can have its behavior encoded in its weights, so you don’t need a 2,000-token system prompt on every call.
  4. You want to distil a large model’s behavior into a smaller, cheaper model. Train a small model to mimic a large model’s outputs on your specific task.

Don’t fine-tune when:

  • You have fewer than a few hundred high-quality examples
  • Your task changes frequently (retraining is expensive and slow)
  • You need broad general knowledge (fine-tuning on narrow data causes catastrophic forgetting of everything else)
  • The base model already performs well with the right prompt

The Hybrid Approach

The most sophisticated pattern: fine-tune a base model on your domain (to give it domain-adapted language and formatting), then layer RAG on top for factual grounding. You get both: stylistic consistency from fine-tuning, factual accuracy from retrieval.

The interview move: Show cost-time thinking: “I’d spend one day on prompt engineering before committing two weeks to a fine-tuning pipeline. The iteration speed difference is enormous.”

Figure 4. Prompt engineering vs. fine-tuning decision making

Question 07 · Cost Reduction Strategy

“How do you evaluate cost, and what strategies do you use to reduce it?”

or

“Have you ever had to cut LLM API costs in production? What did you actually do?”

Figure 5: Cost optimization decision tree — routing incoming requests through complexity classification, model selection, caching, batching, and fine-tuning decisions to minimize API spend without sacrificing output quality
Explanation of Figure 5:
The flow starts with a semantic cache check, so if a nearly identical question was answered before, the system can return the response instantly at almost no cost.

If the query is new, it gets classified by complexity. Straightforward questions go to a small, inexpensive model, while harder reasoning tasks are routed to a larger model only when necessary.

Non‑urgent workloads are batched, letting you process many requests together and take advantage of bulk API discounts that cut costs roughly in half.

On the output side, prompt compression and output‑length control help eliminate unnecessary tokens, reducing how much you pay for generation without hurting quality.

Finally, a fine‑tuning decision gate evaluates whether the task is narrow and repetitive enough to justify training a small custom model. If it is, you can remove the system prompt entirely and unlock major long‑term savings for high‑volume use cases.

By the time this question comes up, you’ve probably been talking for ten minutes. Good interviewers will watch carefully here, because cost thinking is maturity thinking.

Junior engineers reach for the biggest model available. Senior engineers understand that every token costs money, and that money is a real engineering constraint, not a CFO problem.

Why they ask this: AI is expensive. Companies don’t just want engineers who can build; they want engineers who can build without burning through the compute budget in three months. Frame your answer around minimizing spend: that is what makes them hire you over the candidate who only has a good grasp of theory and Python.

Internalize these steps before answering cost-related questions:

Answer framework for Q7 — cost reduction strategy

You can open with a story. Ours went like this: six months after our hallucination incident, we hit a different kind of problem. Our monthly API bill had quietly grown from $4,000 to $31,000 without anyone noticing. That’s when I learned that cost optimization is not an afterthought; it’s an architectural decision you make at the beginning.

Here’s the complete toolkit:

Model Selection & Routing

This is the highest-leverage move. The principle: use the smallest model that can reliably handle the task.

Build a router that classifies incoming requests by complexity. Simple queries such as “What are your store hours?” go to a fast, cheap model (Haiku-class or a fine-tuned small model). Complex, multi-step reasoning tasks go to the large model. In our system, this alone cut costs by 60% while keeping user satisfaction virtually unchanged; quote that kind of number when you answer.
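A toy version of such a router; in production the heuristic would be a small classifier or a cheap LLM call, and the model names below are placeholders:

```python
def route(query: str) -> str:
    """Toy complexity router: long queries or reasoning-style phrasing
    go to the large model, everything else to the cheap one."""
    reasoning_markers = ("why", "compare", "analyze", "step by step", "design")
    if len(query.split()) > 40 or any(m in query.lower() for m in reasoning_markers):
        return "large-model"
    return "small-model"

print(route("What are your store hours?"))                 # small-model
print(route("Compare these two architectures for me"))     # large-model
```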

Prompt Optimization

Shorter prompts = fewer input tokens = lower cost. Review your system prompts ruthlessly. Remove redundant examples. Tighten instructions. Compress few-shot examples. Every token you trim at scale is money.

Also: use prompt caching. Most providers now offer caching on static prompt prefixes. If your system prompt is 2,000 tokens and every user query re-sends it, that’s 2,000 tokens of waste per call.

Semantic Caching

Instead of making a new API call for every request, store previous responses keyed by embedding similarity. When a new query comes in, check if a semantically near-identical question has already been answered. If yes, return the cached response.

This is especially powerful in domains with high query repetition: customer support, internal FAQs, product documentation.
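A minimal in-memory sketch of a semantic cache, using toy embeddings in place of a real embedding model; a linear scan is fine here, though a real cache would sit on a vector index:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

class SemanticCache:
    """Cache responses keyed by query embedding; a hit is any stored
    query whose embedding clears the similarity threshold."""
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, embedding):
        for cached_emb, response in self.entries:
            if cosine(embedding, cached_emb) >= self.threshold:
                return response
        return None  # miss: caller falls through to a real LLM call

    def put(self, embedding, response):
        self.entries.append((embedding, response))

# Toy embeddings; in production these come from an embedding model.
cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.1, 0.0], "Our return window is 30 days.")
print(cache.get([0.98, 0.12, 0.01]))  # near-identical query → cache hit
print(cache.get([0.0, 0.1, 1.0]))     # unrelated query → None
```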

Batching & Async Processing

Not everything needs to be real-time. Batch non-urgent requests and submit them through batch endpoints; most providers offer these at a 50% discount. Summarization jobs, analytics, classification pipelines, overnight processing: all of these can be batched.

Fine-Tuning for Efficiency

A fine-tuned small model can frequently outperform a general-purpose large model on a specific narrow task at a fraction of the inference cost. The upfront training cost pays itself back quickly at scale. Fine-tuning also lets you strip long system prompts because the model’s behavior is baked into the weights.

Output Length Control

Set max_tokens appropriately and use structured output schemas to get exactly what you need. Don't let the model generate 2,000 tokens of explanation when the answer is 200 tokens.

The interview move: Quantify. “In our system, model routing reduced API costs by 70% while maintaining 95% of output quality.” Interviewers remember numbers because numbers signal that you’ve actually shipped these things.

Question 08 · Agents and Tool Use

“Have you built an agent that uses external tools? Walk me through how you structured it and what broke first.”

Agentic AI is where the field is moving. The ability to design agents that don’t just generate text but take actions (call APIs, run code, query databases, send emails) is the most commercially valuable AI engineering skill right now.

Why they ask this: Agentic AI is the biggest trend in applied AI. Companies are building agents that can take real actions, not just generate text. This question tests whether you can architect them.

Answer framework for Q8 — agents and tool use

The Core Architecture: Observe → Think → Act

An agent is an LLM in a loop:

  • Observe: Read the current state — input, tool results, memory
  • Think: Reason about what to do next — which tool to call, with what parameters, or whether to respond
  • Act: Call a tool or generate the final response

This loop continues until the agent reaches a terminal state or a maximum iteration limit.
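The loop above can be sketched in a few lines; the stub model and weather tool are hypothetical stand-ins for a real LLM call and a real API:

```python
def run_agent(model, tools, user_input, max_iterations=8):
    """Observe -> Think -> Act loop. `model` stands in for an LLM call
    that returns either a tool request or a final answer."""
    state = [{"role": "user", "content": user_input}]
    for _ in range(max_iterations):
        decision = model(state)                                # Think
        if decision["type"] == "final":
            return decision["content"]                         # terminal state
        result = tools[decision["tool"]](**decision["args"])   # Act
        state.append({"role": "tool", "content": result})      # Observe
    return "Stopped: maximum iterations reached."

# Stub model: call the weather tool once, then answer from its result.
def stub_model(state):
    if state[-1]["role"] == "user":
        return {"type": "tool", "tool": "get_weather", "args": {"city": "Paris"}}
    return {"type": "final", "content": f"It is {state[-1]['content']} in Paris."}

tools = {"get_weather": lambda city: "sunny"}
print(run_agent(stub_model, tools, "Weather in Paris?"))  # It is sunny in Paris.
```

Note the `max_iterations` guard: it is what keeps a confused model from looping forever.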

Tool Definition

Tools are defined with clear names, descriptions, and JSON schemas for parameters. The quality of your tool descriptions directly determines how reliably the model uses them. Treat tool schemas like API documentation: they need to be precise, complete, and unambiguous.

Orchestration Patterns

Different tasks call for different patterns:

  • ReAct (Reason + Act): The model reasons step-by-step before each action. Good for exploratory, multi-step tasks.
  • Plan-then-Execute: The model generates a full plan first, then executes each step. Better for complex tasks with clear structure.
  • Map-Reduce: Parallelize subtasks across multiple agent instances. Good for large-scale processing.

Error Handling

Tools fail. APIs time out. Rate limits trigger. A production agent needs: retry logic with exponential backoff, fallback strategies, graceful degradation, maximum iteration limits (to prevent infinite loops), and comprehensive logging of every step for debugging.
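A minimal retry-with-backoff sketch around a flaky tool call; the flaky tool is simulated, and the delays are shortened for illustration:

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.5):
    """Retry a flaky tool call with exponential backoff plus jitter;
    re-raise the last error once the attempt budget is exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Simulate a tool that times out twice, then succeeds.
calls = {"n": 0}
def flaky_tool():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

print(call_with_retries(flaky_tool, base_delay=0.01))  # ok
```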

Safety & Guardrails

Agents with real-world tool access are dangerous. Implement: permission systems (the agent can only call pre-approved tools, can only read certain data), human-in-the-loop approval for high-stakes actions (sending emails, making purchases, modifying databases), sandboxed execution environments, and output validation before any tool call executes.

State Management

Agents need to track state across turns: what tools have been called, what results have been returned, what’s been attempted and failed. This state management is often underestimated in architecture. Two approaches: in-context state (append everything to the context window: simple but expensive) and external state (a structured object in a database, referenced by the agent: more complex but scalable and inspectable).

LangGraph’s explicit state graph approach is excellent here: the agent’s state is a typed Python object that flows through the graph, making it debuggable, resumable, and easy to test.
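Without committing to any framework’s API, the external-state idea can be sketched in plain Python: a typed object the agent reads and updates, which can be persisted and inspected between turns (field names here are illustrative).

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentState:
    """Externalized agent state: persistable, inspectable, easy to test."""
    question: str
    tool_calls: list = field(default_factory=list)  # records of tool invocations
    failed_attempts: int = 0
    final_answer: Optional[str] = None

def record_tool_call(state: AgentState, tool: str, args: dict, result) -> AgentState:
    """Append one tool invocation to the state instead of the context window."""
    state.tool_calls.append({"tool": tool, "args": args, "result": result})
    return state
```

Because the state is a plain object rather than a blob of chat history, you can assert on it in unit tests and resume a run from a serialized copy.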

A Real Pattern: The Research Agent

To make this concrete: imagine an agent that answers questions about your company’s internal data. The loop looks like this:

  1. User asks a question
  2. Agent calls a search_knowledge_base tool → gets 5 results
  3. Agent reasons: “I need more specific data” → calls query_database with a SQL query
  4. Agent reasons: “I have enough context” → generates final answer
  5. Output guardrail checks the answer for PII → strips employee names
  6. Response returned to user

The power and the danger is in step 3. The agent is generating and executing SQL dynamically. You need sandboxed execution, read-only database permissions, query validation, and result size limits to prevent the agent from doing something catastrophic.
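A minimal sketch of the query-validation layer from step 3. This is illustrative defense-in-depth only: string checks like these complement, and never replace, read-only database credentials and sandboxed execution.

```python
import re

# Statements an agent-generated query must never contain. A layer on top
# of read-only DB permissions, not a substitute for them.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b", re.I)

def validate_agent_sql(query, max_limit=1000):
    """Reject writes/DDL and enforce a result-size cap on agent SQL."""
    if FORBIDDEN.search(query):
        raise ValueError("write/DDL statement rejected")
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    # enforce a result size limit if the model forgot one
    if "limit" not in query.lower():
        query = f"{query.rstrip().rstrip(';')} LIMIT {max_limit}"
    return query
```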

The interview move: Mention a specific framework you’ve used (LangGraph, CrewAI, or custom orchestration) and explain why you chose it. “I chose LangGraph because the explicit state graph gave me better control over execution flow and made debugging significantly easier than implicit chain-based frameworks.” This signals real-world decision-making, not just theoretical knowledge.

Question 09 · System Design at Scale

“Have you built or been part of a large-scale document processing pipeline? How did you think about cost and reliability at that volume?”

This is the question that reveals whether you think in systems or in notebooks.

It’s not about the model. It’s about the infrastructure, the cost, and the reliability around the model. Anyone can write a Python script that calls an LLM and summarizes a document. Building a system that reliably processes 10,000 of them per day at reasonable cost, with quality guarantees, is a different order of magnitude of problem.

Why they ask this: System design questions test your ability to think at scale. It’s not about the model, it’s about everything around it.

Answer Framework for Q9 — System Design at Scale

Let me show you how to structure your answer with back-of-envelope math, because interviewers remember candidates who do the math.

The Math First

10,000 docs/day × 2,000 tokens each = 20M tokens/day

At batch API pricing: roughly $0.30–$1.00 per 1M tokens = $6–$20/day input cost, plus output.

That’s the budget constraint you’re designing around.
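The arithmetic above, written out (prices are the assumed $0.30–$1.00 per 1M input tokens; output tokens would add to this):

```python
# Back-of-envelope cost check for the pipeline.
docs_per_day = 10_000
tokens_per_doc = 2_000
input_tokens = docs_per_day * tokens_per_doc       # 20M tokens/day

price_low, price_high = 0.30, 1.00                 # $ per 1M input tokens
cost_low = input_tokens / 1_000_000 * price_low    # ~$6/day
cost_high = input_tokens / 1_000_000 * price_high  # ~$20/day
```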

Architecture

Message Queue (SQS/Kafka) 
→ Worker Pool
→ LLM API
→ Result Store (S3/DB)

Documents enter the queue. Workers pull documents, call the LLM for summarization, store results. Decouple ingestion from processing so neither can block the other.

Chunking Strategy

Documents over the context window limit need a hierarchical approach: split the document into chunks, summarize each chunk, then summarize the summaries. For very long documents, this map-reduce pattern is the only viable approach.
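The map-reduce pattern can be sketched as a recursive function, where `summarize` stands in for an LLM call. Chunking by character count is a simplification: real pipelines chunk by tokens and respect paragraph boundaries.

```python
def hierarchical_summary(text, summarize, chunk_chars=8000):
    """Map-reduce summarization for documents beyond the context window.

    `summarize` is a stand-in for an LLM call that condenses one chunk.
    """
    if len(text) <= chunk_chars:
        return summarize(text)
    # Map: summarize each chunk independently (parallelizable)
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partials = [summarize(c) for c in chunks]
    # Reduce: recurse on the concatenated partial summaries
    return hierarchical_summary("\n".join(partials), summarize, chunk_chars)
```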

Cost Engineering

  • Use batch APIs, typically 50% cheaper than real-time endpoints
  • Route simple documents (short, structured) to smaller, cheaper models; route complex documents to the large model
  • Cache summaries for duplicate or near-duplicate documents (more common than you’d think in document processing pipelines)
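A sketch of the caching idea: keying summaries by a content hash. Note that an exact hash only catches exact duplicates; near-duplicates need fuzzier matching (MinHash, embedding similarity) on top.

```python
import hashlib

_cache = {}

def cached_summarize(text, summarize):
    """Skip the LLM call for documents we've already summarized.

    `summarize` stands in for the real LLM call. Light normalization
    (strip + lowercase) catches trivial variants; anything fuzzier
    needs MinHash or embedding similarity.
    """
    key = hashlib.sha256(text.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = summarize(text)
    return _cache[key]
```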

Reliability

  • Dead-letter queues for failed jobs, so failures are captured without losing documents
  • Retry with exponential backoff
  • Idempotent processing, so re-runs don’t create duplicate summaries
  • Auto-scaling workers based on queue depth
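Idempotency can be as simple as checking the result store before doing any work, keyed by a stable document ID. A sketch assuming a dict-like store (in production, keyed S3 objects or a DB table with the doc ID as primary key):

```python
def process_document(doc_id, text, summarize, result_store):
    """Idempotent worker step: re-delivered queue messages become no-ops.

    `summarize` stands in for the LLM call; `result_store` is any
    dict-like store keyed by document ID.
    """
    if doc_id in result_store:        # already processed: skip the LLM call
        return result_store[doc_id]
    summary = summarize(text)
    result_store[doc_id] = summary    # write once, keyed by doc ID
    return summary
```

This is what makes at-least-once queue delivery safe: a duplicate delivery costs one store lookup instead of one LLM call.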

Quality Assurance

Sample 1–2% of outputs daily for human review. Track summary length, extractive density, and factual consistency over time. Alert on drift from quality baselines.

The interview move: Start with the math. Every time. “10K docs × 2K tokens = 20M tokens/day = roughly $X/day. That’s the constraint I’m designing around.” It immediately signals engineering maturity.

Question 10 · Guardrails

“What safety or guardrail measures have you actually shipped? Were there any cases where your system behaved in a way you didn’t expect?”

This is the last question in the interview. It’s also the one that most predicts whether you’ll be trusted to ship consequential systems.

Why they ask this: Safety isn’t optional anymore. Regulators, legal teams, and users all care. Companies need engineers who build responsibly — not just effectively.

The framework follows the same structure as before:

Answer Framework for Q10 — Guardrails

Safety is a defense-in-depth problem, exactly like security. No single mechanism is sufficient. You need overlapping layers.

Input Guardrails

Filter or flag harmful and adversarial inputs before they reach the model. Use classifiers for:

  • Prompt injection detection (malicious instructions hidden in user input or retrieved documents)
  • PII detection (filter or pseudonymize sensitive data before it enters the context)
  • Topic filtering (block off-topic or policy-violating requests at the API gateway layer)

Block or sanitize dangerous inputs at the edge; don’t let them reach the model at all.
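A toy sketch of the input-screening step. The regex and phrase list are illustrative placeholders; production systems use trained classifiers and dedicated PII detectors, not hand-rolled rules.

```python
import re

# Illustrative patterns only, not production-grade detection.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
INJECTION_PHRASES = ("ignore previous instructions",
                     "disregard the system prompt")

def screen_input(text):
    """Return (sanitized_text, flags) before anything reaches the model."""
    flags = []
    if any(p in text.lower() for p in INJECTION_PHRASES):
        flags.append("possible_injection")
    sanitized, n = EMAIL.subn("[EMAIL]", text)  # redact email-shaped PII
    if n:
        flags.append("pii_redacted")
    return sanitized, flags
```

Flagged inputs can be blocked outright or routed to a stricter policy path, depending on your risk tolerance.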

Output Guardrails

Validate model outputs before they reach users. Check for:

  • PII leakage in responses
  • Harmful or policy-violating content
  • Off-topic responses
  • Factually inconsistent outputs (for RAG systems)

Use a secondary classifier or rules engine as a safety net.

Red Teaming

Systematically try to break your own system. Test for:

  • Jailbreaks (attempts to bypass safety instructions)
  • Indirect prompt injection (malicious content hidden in retrieved documents that hijacks agent behavior)
  • Data extraction attacks (attempts to get the model to reveal training data or system prompt contents)
  • Social engineering patterns

This is not a one-time exercise. Red team on a schedule before major releases, after model updates, after prompt changes.

Monitoring & Observability

Log all inputs and outputs (with PII redaction). Track:

  • Refusal rates — are they too high (over-blocking) or too low (under-blocking)?
  • Flagged content rates
  • User-reported problems

Set up alerts for anomalous patterns that might indicate active adversarial use or system drift.
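A hedged sketch of the refusal-rate check, with illustrative thresholds. Both directions of drift are worth alerting on: too high suggests over-blocking, and near zero suggests the guardrails may have silently stopped firing.

```python
def refusal_rate_alert(refusals, total, baseline=0.05, tolerance=2.0):
    """Flag when the refusal rate drifts far from its baseline.

    Baseline and tolerance are illustrative; in practice you'd derive
    them from historical traffic.
    """
    if total == 0:
        return None
    rate = refusals / total
    if rate > baseline * tolerance:
        return f"ALERT: refusal rate {rate:.1%} (over-blocking?)"
    if rate < baseline / tolerance:
        return f"ALERT: refusal rate {rate:.1%} (under-blocking?)"
    return None
```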

Human-in-the-Loop

For high-stakes domains (medical, legal, financial, compliance), route uncertain or sensitive outputs to human reviewers before delivery. Design escalation paths so the system fails gracefully rather than dangerously. A system that says “I need a human to verify this” is better than one that confidently says the wrong thing.

The interview move: Reference real frameworks. “I follow OWASP’s LLM Top 10 for threat modelling and NIST’s AI RMF for risk management.” This signals that you think systematically about safety, not just reactively.

How to Actually Use This Guide

Here’s my suggested practice protocol:

Week 1: Conceptual Internalization

Read each section once. Then, without looking at the article, try to explain each concept to an imaginary colleague who is technically competent but not an AI specialist. The moment you can’t explain something clearly, that’s a gap, and gaps are where interviewers live.

Week 2: Build a Mini-System

Pick any one of the ten topics and build a working prototype. You don’t need a production system. But actually implementing RAG retrieval, building a simple agent loop, or setting up an LLM-as-Judge eval pipeline, even at toy scale, cements the knowledge in a way reading never does. Real systems reveal real questions.

Week 3: Adversarial Practice

Find a friend or colleague and have them play interviewer. Ask them to probe hard: “What happens if the re-ranker introduces latency?” “How do you handle a document that’s 100K tokens?” “What’s the cost of running this at 10x scale?” “What’s the failure mode if your queue backs up?” The answers that survive adversarial questioning are the answers you’ll actually give in interviews.

The Day Before

Review the “interview move” callouts. Not to recite them but to remind yourself of the differentiating signals. What does a production-experienced engineer sound like, versus someone who’s been studying?

The Questions They Didn’t List But Should Have

A quick note: the ten questions in this guide are the most common, but the conversation doesn’t end at ten. Here are five questions that increasingly appear in senior AI engineer interviews, along with one-line pointers to where to dig:

“How do you handle long-term memory in agents?” → Look into episodic memory architectures, compressed memory approaches, and the tradeoffs between in-context history, external memory stores, and retrieval-augmented memory.


“How would you prevent prompt injection in a production agent?” → Input sanitization, prompt partitioning, privilege separation (don’t trust user input with system-level instructions), and layered validation.

“Explain the architecture of a multi-agent system.” → Orchestrator-subagent patterns, message passing, shared state management, conflict resolution, and the tradeoffs of centralized versus decentralized coordination.

“How do you version-control prompts in production?” → Prompt registries, A/B testing infrastructure, rollback mechanisms, and integrating prompt changes into your CI/CD pipeline.

“What’s the difference between fine-tuning, RLHF, and DPO?” → Fine-tuning shapes format and style; RLHF (Reinforcement Learning from Human Feedback) shapes behavior alignment using reward models; DPO (Direct Preference Optimization) achieves similar alignment goals without an explicit reward model, which is typically simpler to implement and increasingly preferred.

The Thread That Runs Through All of It

Read those ten questions again and notice the pattern.

Every single one of them is a systems thinking question dressed up as an AI question.

  • Hallucinations → architectural defense-in-depth
  • Cost → routing, caching, batching, model selection
  • Production failures → monitoring, drift detection, eval mismatch
  • RAG design → document processing + retrieval + re-ranking + evaluation
  • Fine-tuning → iteration speed, cost, task specificity
  • Evaluation → metrics, regression testing, human feedback loops
  • Embeddings → vector indexing, ANN algorithms, scaling
  • Agents → observe/think/act loops, error handling, safety
  • Document summarization → queues, workers, cost math, quality sampling
  • Safety → input/output guardrails, red teaming, monitoring, human oversight

The AI engineer who gets hired is the one who can hold all of these concerns simultaneously. Who can talk about a language model the same way a backend engineer talks about a database — as a component in a larger system, with known failure modes, performance characteristics, cost profiles, and reliability requirements.

One More Thing

The night I got that rejection email, I was frustrated. But looking back, I’m grateful for it.

That interview taught me what I actually didn’t know. Not the surface stuff; I knew the vocabulary. But the depth. The production experience. The systems thinking.

The engineers who get hired for senior AI roles aren’t the ones who’ve read the most papers or built the most demos. They’re the ones who’ve shipped things, debugged things at 3 AM, thought carefully about cost and safety and reliability, and built mental models deep enough to reason on their feet under pressure.

These ten questions are a map to that depth.

Study them not as answers to memorize but as problems to internalize. Build small systems to test your understanding. Find the gaps between what you think you know and what you can actually explain clearly to another engineer.

That gap — closing it — is the work.

If you’ve stayed with me to the end, thank you! Your curiosity and attention mean more than you know. I share thoughtful updates on the fast‑moving world of AI, so if you’d like to keep exploring with me, feel free to 🔔 follow, clap 👏👏👏, and subscribe 📩, and I’ll keep the good stuff coming 🙌❤️.

I’d love to hear which question you found most challenging. Drop a comment below, and I’ll do my best to answer ❤️.


The 10 Questions That Decide Whether You’re an AI Engineer or Just an AI User was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
