Things Not to Miss: A High-Signal Refresher for ML Engineering Interviews in 2026

Most ML interview prep advice is outdated.

It over-indexes on theory, ignores implementation, and completely misses how top companies evaluate candidates today. Here is the reality in 2026:

You are not being tested on what you know. You are being tested on what you can build, explain, and optimize under constraints.

This guide is a high-signal refresher. No fluff. Only what moves the needle.

1. Self-Attention: The One Concept You Cannot Fake

If you only master one thing, make it self-attention.

Formula

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

What strong candidates understand (not memorize)

  • Why we divide by √d_k
    → Prevents large dot products from saturating the softmax:
    for query and key components with zero mean and unit variance, the dot product has variance that grows with dimension d_k. Dividing by √d_k keeps that variance near 1, so the softmax does not saturate and the gradients do not collapse.
  • Why softmax is essential
    → Converts similarity into a probability distribution:
    the raw dot-product scores become non-negative weights that sum to 1, so the output is a convex combination of the values.
  • What attention is doing conceptually
    → Dynamic, input-dependent routing of information:
    each query decides which keys to attend to, and the output is a weighted blend of the corresponding values.

What you should be able to implement (from scratch)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5
    weights = softmax(scores)
    return weights @ V

High-probability variations

  • Masked (causal) attention
  • Multi-head attention (split → project → concat)
  • Grouped Query Attention (shared K/V for efficiency)

If you can implement and explain this cleanly, you are already ahead of most candidates.

Single-head attention — from scratch

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable: subtract row-max before exp
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]                        # head dimension
    scores = Q @ K.T / np.sqrt(d_k)          # (seq, seq)
    if mask is not None:
        scores = scores + mask               # add -inf where masked
    weights = softmax(scores)                # (seq, seq)
    return weights @ V, weights              # output + attention weights

# Causal mask — tokens cannot attend to future positions
def causal_mask(seq_len):
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
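
A quick sanity check worth doing out loud in an interview: with the causal mask applied, every row of the attention matrix is zero above the diagonal. A minimal usage sketch (the toy shapes below are my own choice):

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))   # three (4, 8) matrices: 4 tokens, head dim 8

out, weights = attention(Q, K, V, mask=causal_mask(4))
print(weights.round(2))                # upper triangle is 0: no attention to future tokens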

Multi-head attention

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    d_model = X.shape[-1]
    d_head = d_model // n_heads

    Q = X @ W_q                              # (seq, d_model)
    K = X @ W_k
    V = X @ W_v

    # Split into heads: (n_heads, seq, d_head)
    def split(t):
        seq = t.shape[0]
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Qs, Ks, Vs = split(Q), split(K), split(V)

    heads = []
    for i in range(n_heads):
        out, _ = attention(Qs[i], Ks[i], Vs[i])
        heads.append(out)

    concat = np.concatenate(heads, axis=-1)  # (seq, d_model)
    return concat @ W_o
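
Grouped Query Attention, the third variation from the list above, is a small change to this function: query heads are split into groups that share a single K/V head. A hedged sketch (it assumes n_heads is divisible by n_kv_heads and that W_k, W_v project down to n_kv_heads · d_head columns):

def grouped_query_attention(X, W_q, W_k, W_v, W_o, n_heads, n_kv_heads):
    # Assumes n_heads % n_kv_heads == 0 and W_k, W_v map to n_kv_heads * d_head
    d_head = X.shape[-1] // n_heads
    group = n_heads // n_kv_heads

    def split(t, heads):
        return t.reshape(t.shape[0], heads, d_head).transpose(1, 0, 2)

    Qs = split(X @ W_q, n_heads)       # (n_heads, seq, d_head)
    Ks = split(X @ W_k, n_kv_heads)    # (n_kv_heads, seq, d_head)
    Vs = split(X @ W_v, n_kv_heads)

    heads = []
    for i in range(n_heads):
        kv = i // group                # query head i reuses K/V head kv
        out, _ = attention(Qs[i], Ks[kv], Vs[kv])
        heads.append(out)

    return np.concatenate(heads, axis=-1) @ W_o

The K and V tensors (and therefore the KV cache discussed in section 3) shrink by a factor of n_heads / n_kv_heads.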

What interviewers actually ask

• “What breaks if we remove softmax?”
— The weights no longer sum to 1 and can be negative or arbitrarily large, so the output is no longer a convex combination of the values and the model loses its ability to selectively focus on specific positions.

• “Why does attention scale poorly with sequence length?”
— The QKᵀ product is O(n²) in time and memory. For 10k tokens, that’s 100M scores per head per layer.

• “Self-attention vs cross-attention?”
 — In self-attention, Q, K, V all come from the same sequence. In cross-attention (encoder-decoder), Q comes from the decoder while K and V come from the encoder output. (A short sketch follows these questions.)

• “How would you optimize attention for long inputs?”
— Flash Attention (reorder computation to reduce HBM reads), sliding window attention (restrict each token to a local context window), or linear attention approximations.
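
For the self- vs cross-attention question, it helps to point back at the attention function above: the mechanics are identical, only the source of Q versus K and V changes. A sketch (decoder_X, encoder_out and the projection matrices are placeholder names):

# Self-attention: Q, K, V all projected from the same sequence X
out_self, _ = attention(X @ W_q, X @ W_k, X @ W_v)

# Cross-attention: queries from the decoder, keys/values from the encoder output
out_cross, _ = attention(decoder_X @ W_q, encoder_out @ W_k, encoder_out @ W_v)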

2. LLM Sampling: Where Theory Meets Product Reality

This is where ML meets user experience. You are controlling two dials:

  • Determinism → correctness
  • Randomness → creativity

You must understand:

  • Temperature scaling
  • Top-k sampling
  • Top-p (nucleus) sampling
  • Greedy vs stochastic decoding

Full implementation: temperature + top-k + top-p

import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """
    logits: raw model output (vocab_size,)
    temperature: >1 = more random, <1 = more peaked, 0 = greedy
    top_k: keep only k highest-probability tokens (0 = disabled)
    top_p: nucleus sampling — keep smallest set summing to p
    Returns: sampled token index
    """
    # 1. Greedy shortcut
    if temperature == 0:
        return int(np.argmax(logits))

    # 2. Temperature scaling
    logits = logits / temperature

    # 3. Top-k filtering
    if top_k > 0:
        kth_val = np.partition(logits, -top_k)[-top_k]
        logits[logits < kth_val] = -np.inf

    # 4. Softmax → probabilities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # 5. Top-p (nucleus) filtering
    if top_p < 1.0:
        sorted_idx = np.argsort(probs)[::-1]
        cum_probs = np.cumsum(probs[sorted_idx])
        cutoff = np.searchsorted(cum_probs, top_p) + 1
        probs[sorted_idx[cutoff:]] = 0
        probs /= probs.sum()              # renormalize

    return int(np.random.choice(len(probs), p=probs))

Implementation expectation

  • Sort logits
  • Apply cutoff (k or cumulative p)
  • Normalize
  • Sample

What interviewers ask:

  • “How would you reduce hallucinations?”
  • “When would you prefer top-p over top-k?”
  • “What happens when temperature → 0?”

When to use which method in production

• Temperature low (0.2–0.5): factual Q&A, code generation — you want near-deterministic outputs.

• Temperature high (0.8–1.2): creative writing, brainstorming — you want diversity.

• Top-p preferred over top-k when: the distribution varies widely across inputs (adaptive cutoff handles this better than a fixed k).

• Top-k + top-p together: a common production default — top-k=50, top-p=0.9, temperature=0.8.
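
As a usage sketch of the function above with those production-style defaults (the logits here are random placeholders, not real model output):

rng = np.random.default_rng(0)
logits = rng.normal(size=32_000)                                      # stand-in for a 32k vocab

token = sample_token(logits, temperature=0.8, top_k=50, top_p=0.9)    # balanced default
greedy = sample_token(logits, temperature=0)                          # factual / code tasks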

High-signal insight

Most candidates explain sampling.
Strong candidates explain when to use which method in production.

3. KV Cache: The Systems Design Signal

Without a KV cache, generating each new token recomputes K and V for all previous tokens — O(n²) cost as sequence grows. The KV cache stores those K and V tensors from past steps and reuses them, reducing per-step attention to O(n).

Interview question

  • “How would you reduce latency for long sequences?”

Answer:

  • Use KV cache
  • Batch requests
  • Possibly truncate context
  • Consider approximate attention

This is where you show systems thinking

For a model with L transformer layers, H attention heads, head dimension d, and sequence length T:

KV cache size per sequence = 2 · L · H · d · T · sizeof(dtype)   (the factor 2 covers both K and V)

For a 70B-parameter model at fp16 with a 32k-token context, this can exceed 100GB — which is why techniques like Grouped Query Attention (GQA) and sliding window attention exist. GQA shares K and V across multiple query heads, reducing the cache by a factor of (n_heads / n_kv_heads).
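
A quick back-of-the-envelope check makes this concrete. The layer and head counts below are assumptions for a generic 70B-class model, not the published configuration of any specific release:

# Assumed 70B-class dims: 80 layers, 64 query heads, head dim 128, fp16 (2 bytes)
L, H, d, T, bytes_per = 80, 64, 128, 32_768, 2

mha_cache = 2 * L * H * d * T * bytes_per      # full multi-head attention
gqa_cache = mha_cache * 8 // 64                # GQA with 8 KV heads serving 64 query heads

print(f'MHA: {mha_cache / 1e9:.0f} GB/sequence, GQA: {gqa_cache / 1e9:.0f} GB/sequence')
# roughly 86 GB vs 11 GB for a single 32k-token sequence, before batching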

Strong answer when asked about latency: cache K/V, batch requests, use approximate attention for long sequences, consider context truncation as a last resort.

Incremental decode with a KV cache

import numpy as np

class KVCache:
    def __init__(self):
        self.keys = []     # list of (d_k,) arrays, one per past token
        self.values = []

    def update(self, new_k, new_v):
        self.keys.append(new_k)
        self.values.append(new_v)

    def get(self):
        return (np.stack(self.keys),     # (seq_so_far, d_k)
                np.stack(self.values))   # (seq_so_far, d_v)


def decode_step(new_token_q, cache: KVCache):
    """
    Attend over ALL cached keys/values using only the ONE new query.
    This avoids recomputing K and V for past tokens.
    """
    K, V = cache.get()
    d_k = new_token_q.shape[-1]
    score = (new_token_q @ K.T) / np.sqrt(d_k)   # (1, seq_so_far)
    weight = softmax(score)
    return weight @ V                            # (1, d_v)
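
A sketch of how a decode loop would drive this (the projection matrices and token embeddings below are random stand-ins, not a real model):

rng = np.random.default_rng(0)
d_model = 16
W_q, W_k, W_v = rng.normal(size=(3, d_model, d_model))

cache = KVCache()
for step in range(5):
    x_new = rng.normal(size=(d_model,))       # embedding of the newest token
    cache.update(x_new @ W_k, x_new @ W_v)    # compute K/V once, reuse on every later step
    q_new = (x_new @ W_q).reshape(1, -1)      # a single query row
    context = decode_step(q_new, cache)       # (1, d_model), attends over all cached tokens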

4. Tokenization (BPE): The Underrated Filter

Tokenization questions are deceptively simple. They test whether you understand how models see text.

Q. Character-level tokenization vs BPE (Byte Pair Encoding)

Character-level tokenization

  • Splits text into individual characters
  • "startup" → s t a r t u p
  • Vocabulary size: very small (just characters)
  • Sequence length: very long
  • Pros: Handles any text (no unknown tokens), Simple
  • Cons: Inefficient for transformers (attention cost grows fast), Harder for the model to learn meaning from tiny units

BPE (Byte Pair Encoding)

  • Merges frequent character sequences into subwords
  • "startup" → start + up or even startup
  • Vocabulary size: medium (subwords)
  • Sequence length: much shorter
  • Pros: More efficient (fewer tokens); Captures meaningful chunks (prefixes, suffixes, common words) and still handles rare words by breaking them into subwords
  • Cons: Slightly more complex; vocabulary must be learned.

Key difference (the essence)

  • Character-level → small vocab, long sequences
  • BPE → larger vocab, shorter sequences

Q. Why BPE wins in practice

Transformers care a lot about sequence length.
Since attention scales roughly with length²:

  • Character-level → expensive
  • BPE → much cheaper + better semantic units

One-line intuition

Character-level reads letter by letter
BPE reads chunks of meaning

Q. Byte Pair Encoding (BPE) Steps

The process starts with characters and iteratively merges the most frequent adjacent pair until the vocabulary reaches the target size.

  1. Start with characters
  2. Count most frequent pairs
  3. Merge them
  4. Repeat

Q. Why BPE works

  • Handles unseen words: Word-level tokenization fails on rare words and produces huge vocabularies.
  • Compresses representation: Character-level produces very long sequences, which is expensive for attention.
  • Balances vocabulary size and flexibility: BPE finds a middle ground: common words become single tokens, rare words decompose into subword pieces.

Other Interview questions on tokenization

• “Why not word-level?” — Unbounded vocabulary, no handling of unseen words, poor multilingual support.

• “What happens with rare words?” — They decompose into subword pieces, potentially all the way to individual characters in extreme cases.

• “How would you scale BPE to billions of tokens?” — Batch pair counting, parallel processing, approximate frequency estimates using sampling.

Implementation expectation

  • Pair frequency counting
  • Merge rules
  • Iterative updates
from collections import Counter
import re

def get_pair_freqs(vocab):
    """Count adjacent symbol pairs across all words."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of a pair with its merged token."""
    # Match only at whole-symbol boundaries, so merging e.g. ('e', 's')
    # cannot eat the 's' at the start of a separate token like 'st'
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    return {pattern.sub(replacement, w): f for w, f in vocab.items()}

def train_bpe(corpus, num_merges=10):
    # Initialise: each word as a space-separated char sequence + </w>
    vocab = Counter()
    for word in corpus.lower().split():
        vocab[' '.join(word) + ' </w>'] += 1

    merges = []
    for step in range(num_merges):
        pairs = get_pair_freqs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
        print(f'Step {step + 1}: merged {best}')

    return vocab, merges

# Example
corpus = 'low lower lowest lower low new newer'
vocab, merges = train_bpe(corpus, num_merges=5)
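
A natural follow-up is applying the learned merges to unseen text. Here is a minimal sketch (my own helper, greedy and replaying merges in training order; real tokenizers also map the resulting subwords to integer IDs):

def apply_bpe(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]    # merge the adjacent pair in place
            else:
                i += 1
    return symbols

print(apply_bpe('lowest', merges))   # a rarer word decomposes into learned subwords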

5. Classical ML: K-Means & KNN

It is still important to refresh core ML fundamentals.

These concepts remain a key part of interviews, especially at more generalist companies or for ML roles where a significant portion of the work involves maintaining or improving legacy systems.

Even in modern AI stacks, classical algorithms are still used. Strong fundamentals here signal that you understand ML beyond just frameworks and LLMs.

Beyond K-means and KNN, it is also worth reviewing algorithms like decision trees, random forests, and SVMs, as they frequently come up in interviews and practical scenarios.

K-Means (You must know):

  • Objective: minimize intra-cluster variance (WCSS)
  • Iterative process:
    1. Assign each point to its nearest centroid
    2. Update each centroid to the mean of its assigned points

Edge case (often tested)

  • Empty clusters

K-Means interview questions

• “Why does K-Means fail on non-spherical data?” — The WCSS objective uses Euclidean distance to centroids, which implicitly favors spherical, similarly sized clusters. Use DBSCAN or Gaussian Mixture Models for arbitrary cluster shapes.

  • “How does initialization affect convergence?” — Random init can land in poor local minima. K-Means++ selects initial centroids spread out across the data, consistently reaching better solutions.
# K-Means with K-Means++ init and edge case handling
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=42):
    """
    X: (n_samples, n_features)
    k: number of clusters
    Returns: centroids (k, n_features), labels (n_samples,)
    Objective: minimize WCSS = sum_j sum_{x in C_j} ||x - mu_j||^2
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # --- K-Means++ initialisation (better than random) ---
    idx = [rng.integers(n)]
    for _ in range(k - 1):
        # Distance from each point to nearest chosen centroid
        dists = np.min(
            [np.sum((X - X[i]) ** 2, axis=1) for i in idx], axis=0
        )
        # Sample proportional to distance squared
        probs = dists / dists.sum()
        idx.append(rng.choice(n, p=probs))
    centroids = X[idx].copy()

    for iteration in range(max_iter):
        # Assign each point to nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids, axis=2)   # (n, k)
        labels = dists.argmin(axis=1)                            # (n,)

        new_centroids = np.zeros_like(centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                # Edge case: empty cluster → reinit to a random data point
                new_centroids[j] = X[rng.integers(n)]
            else:
                new_centroids[j] = members.mean(axis=0)

        # Convergence: centroid shift below tolerance
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    return centroids, labels
K-Nearest Neighbors (KNN) core ideas:

  • Distance metric
  • No training phase
  • Expensive inference
# KNN classifier

import numpy as np
from collections import Counter

class KNN:
    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # No training phase — KNN is a lazy learner
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)

    def predict(self, X):
        preds = []
        for x in X:
            # Brute-force O(n*d) distance computation
            dists = np.linalg.norm(self.X_train - x, axis=1)
            nn_idx = np.argsort(dists)[:self.k]
            votes = Counter(self.y_train[nn_idx])
            preds.append(votes.most_common(1)[0][0])
        return np.array(preds)

# Production note: for large n, replace brute-force with FAISS/HNSW
# which gives approximate nearest neighbors in O(log n)

KNN interview questions

• “Why does KNN not scale?” — O(n·d) per query at inference time, and it stores the entire training set in memory. Contrast with parametric models that compress knowledge into weights.

  • “How would you optimize nearest neighbor search?” — Approximate methods: FAISS (Facebook AI Similarity Search), HNSW (Hierarchical Navigable Small World graphs), or Annoy. Trade exact accuracy for orders-of-magnitude speedup.
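
To make that answer concrete, a sketch with FAISS (this assumes the faiss-cpu package is installed; the index type and parameters are illustrative, not a tuned setup):

import numpy as np
import faiss   # assumes faiss-cpu is installed

d = 128
xb = np.random.rand(100_000, d).astype('float32')   # database vectors
xq = np.random.rand(10, d).astype('float32')        # query vectors

# Exact brute-force baseline
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 5)

# Approximate HNSW graph index: much faster queries at large n, small recall loss
hnsw = faiss.IndexHNSWFlat(d, 32)    # 32 = neighbors per node in the graph
hnsw.add(xb)
D_approx, I_approx = hnsw.search(xq, 5)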

6. Backpropagation: Do You Actually Understand Neural Networks?

This is the “no shortcuts” test. You should be able to:

  • Implement a simple neural network
  • Compute gradients manually
  • Use chain rule correctly

Expect:

  • ReLU / sigmoid derivatives
  • Loss gradients
  • Weight updates

Two-layer MLP with manual backprop

import numpy as np

# --- Activations and their gradients ---
def relu(x): return np.maximum(0, x)
def relu_grad(x): return (x > 0).astype(float)
def sigmoid(x): return 1 / (1 + np.exp(-x))

# --- Binary cross-entropy loss ---
def bce_loss(y_pred, y_true):
    eps = 1e-8
    return -np.mean(
        y_true * np.log(y_pred + eps) +
        (1 - y_true) * np.log(1 - y_pred + eps)
    )

# --- Forward pass ---
def forward(X, W1, b1, W2, b2):
    z1 = X @ W1 + b1            # (batch, hidden)
    a1 = relu(z1)               # non-linearity
    z2 = a1 @ W2 + b2           # (batch, out)
    a2 = sigmoid(z2)            # output probabilities
    return z1, a1, z2, a2

# --- Backward pass (chain rule) ---
def backward(X, y, z1, a1, a2, W2):
    n = X.shape[0]

    # Output layer: dL/dz2 (BCE loss + sigmoid simplifies to this)
    dz2 = (a2 - y) / n          # (batch, out)
    dW2 = a1.T @ dz2            # (hidden, out)
    db2 = dz2.sum(axis=0)       # (out,)

    # Hidden layer: chain rule through ReLU
    da1 = dz2 @ W2.T            # (batch, hidden)
    dz1 = da1 * relu_grad(z1)   # zero where z1 <= 0
    dW1 = X.T @ dz1             # (features, hidden)
    db1 = dz1.sum(axis=0)       # (hidden,)

    return dW1, db1, dW2, db2

# --- Single training step ---
def train_step(X, y, W1, b1, W2, b2, lr=0.01):
    z1, a1, z2, a2 = forward(X, W1, b1, W2, b2)
    loss = bce_loss(a2, y)
    dW1, db1, dW2, db2 = backward(X, y, z1, a1, a2, W2)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return loss, W1, b1, W2, b2
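
A small end-to-end sanity check (the toy data and layer sizes are arbitrary choices) showing the loss actually decreasing:

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
y = (X[:, :1] > 0).astype(float)                   # (64, 1) labels from one feature

W1 = rng.normal(scale=0.1, size=(10, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 1));  b2 = np.zeros(1)

for epoch in range(500):
    loss, W1, b1, W2, b2 = train_step(X, y, W1, b1, W2, b2, lr=0.5)
print(f'final loss: {loss:.3f}')                   # should fall well below ~0.69 (chance level)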

What they are really testing

• Do you know that dL/dz2 = (a2 - y)/n for BCE + sigmoid — or do you have to rederive it?

• Can you explain why ReLU gradients are 0 for negative inputs (the ‘dying ReLU’ problem)?

• Can you generalize to arbitrary depth using the chain rule?
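
For the first question, the derivation is short enough to do on a whiteboard. Per example, with a = σ(z) and binary cross-entropy L:

L      = -[ y·log(a) + (1 - y)·log(1 - a) ]
dL/da  = -y/a + (1 - y)/(1 - a)
da/dz  = a·(1 - a)                       (sigmoid derivative)
dL/dz  = dL/da · da/dz = -y·(1 - a) + (1 - y)·a = a - y

Averaging the loss over the n examples in the batch contributes the extra 1/n, which gives dz2 = (a2 - y)/n in the code above.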

7. If You Only Have Limited Time

Prioritize in this order:

1. Self-attention and multi-head attention — implement it, explain the math, discuss complexity.

2. LLM sampling (temperature, top-k, top-p) — implement all three in one function.

3. K-Means with edge cases — empty clusters, K-Means++ init, convergence criterion.

4. Backpropagation — two-layer MLP, manual gradients, chain rule.

5. BPE tokenizer — pair counting, merge rules, iterative updates.

6. KV cache and systems reasoning — memory formula, GQA, batching strategies.

8. How You Are Actually Evaluated

Most candidates fail for predictable reasons: they jump into code without structure, skip tradeoff discussion, and treat problems as academic exercises rather than real engineering decisions.

What strong candidates do differently

• They structure their thinking first. “I’ll start with a simple correct version, then optimize for the constraints you mentioned.”

• They talk about tradeoffs constantly. “This approach improves latency but increases memory — is that acceptable in your setting?”

• They think in systems. Scale, latency, reliability, cost — not just accuracy.

• They know their complexity. Every algorithm should come with a time and space complexity analysis without prompting.

The ML interview in 2026 is not about memorizing models. It is about building primitives from scratch, explaining decisions clearly, and thinking like a production engineer.

If you can implement attention, explain the tradeoffs of KV caching, and walk through backpropagation without notes — you are already ahead of most candidates.

This is not about knowing everything. It is about mastering the few things that actually matter.
