Things Not to Miss: A High-Signal Refresher for ML Engineering Interviews in 2026

Most ML interview prep advice is outdated.

It over-indexes on theory, ignores implementation, and completely misses how top companies evaluate candidates today. Here is the reality in 2026:

You are not being tested on what you know. You are being tested on what you can build, explain, and optimize under constraints.

This guide is a high-signal refresher. No fluff. Only what moves the needle.

1. Self-Attention: The One Concept You Cannot Fake

If you only master one thing, make it self-attention.

Formula

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

What strong candidates understand (not memorize)

  • Why we divide by √d_k
    → Prevents large dot products from saturating the softmax:
    for query and key components with zero mean and unit variance, the dot product has variance that grows with dimension d_k. Dividing by √d_k keeps that variance near 1, so the softmax does not saturate and the gradients do not collapse.
  • Why softmax is essential
    → Converts similarity into a probability distribution:
    the raw dot-product scores become non-negative weights that sum to 1, so the output is a convex combination of the values.
  • What attention is doing conceptually
    → Dynamic, input-dependent routing of information:
    each query decides which keys to attend to, and the output is a weighted blend of the corresponding values.

What you should be able to implement (from scratch)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5
    weights = softmax(scores)
    return weights @ V

High-probability variations

  • Masked (causal) attention
  • Multi-head attention (split → project → concat)
  • Grouped Query Attention (shared K/V for efficiency)

If you can implement and explain this cleanly, you are already ahead of most candidates.

Single-head attention — from scratch

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable: subtract row-max before exp
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]                        # head dimension
    scores = Q @ K.T / np.sqrt(d_k)          # (seq, seq)
    if mask is not None:
        scores = scores + mask               # add -inf where masked
    weights = softmax(scores)                # (seq, seq)
    return weights @ V, weights              # output + attention weights

# Causal mask — tokens cannot attend to future positions
def causal_mask(seq_len):
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
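
A quick sanity check worth doing out loud in an interview: with the causal mask applied, every row of the attention matrix is zero above the diagonal. A minimal usage sketch (the toy shapes below are my own choice):

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))   # three (4, 8) matrices: 4 tokens, head dim 8

out, weights = attention(Q, K, V, mask=causal_mask(4))
print(weights.round(2))                # upper triangle is 0: no attention to future tokens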

Multi-head attention

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    d_model = X.shape[-1]
    d_head = d_model // n_heads

    Q = X @ W_q                              # (seq, d_model)
    K = X @ W_k
    V = X @ W_v

    # Split into heads: (n_heads, seq, d_head)
    def split(t):
        seq = t.shape[0]
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Qs, Ks, Vs = split(Q), split(K), split(V)

    heads = []
    for i in range(n_heads):
        out, _ = attention(Qs[i], Ks[i], Vs[i])
        heads.append(out)

    concat = np.concatenate(heads, axis=-1)  # (seq, d_model)
    return concat @ W_o
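
Grouped Query Attention, the third variation from the list above, is a small change to this function: query heads are split into groups that share a single K/V head. A hedged sketch (it assumes n_heads is divisible by n_kv_heads and that W_k, W_v project down to n_kv_heads · d_head columns):

def grouped_query_attention(X, W_q, W_k, W_v, W_o, n_heads, n_kv_heads):
    # Assumes n_heads % n_kv_heads == 0 and W_k, W_v map to n_kv_heads * d_head
    d_head = X.shape[-1] // n_heads
    group = n_heads // n_kv_heads

    def split(t, heads):
        return t.reshape(t.shape[0], heads, d_head).transpose(1, 0, 2)

    Qs = split(X @ W_q, n_heads)       # (n_heads, seq, d_head)
    Ks = split(X @ W_k, n_kv_heads)    # (n_kv_heads, seq, d_head)
    Vs = split(X @ W_v, n_kv_heads)

    heads = []
    for i in range(n_heads):
        kv = i // group                # query head i reuses K/V head kv
        out, _ = attention(Qs[i], Ks[kv], Vs[kv])
        heads.append(out)

    return np.concatenate(heads, axis=-1) @ W_o

The K and V tensors (and therefore the KV cache discussed in section 3) shrink by a factor of n_heads / n_kv_heads.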

What interviewers actually ask

• “What breaks if we remove softmax?”
— The weights no longer sum to 1 and can be negative or arbitrarily large, so the output is no longer a convex combination of the values and the model loses its ability to selectively focus on specific positions.

• “Why does attention scale poorly with sequence length?”
— The QKᵀ product is O(n²) in time and memory. For 10k tokens, that’s 100M scores per head per layer.

• “Self-attention vs cross-attention?”
 — In self-attention, Q, K, V all come from the same sequence. In cross-attention (encoder-decoder), Q comes from the decoder while K and V come from the encoder output. (A short sketch follows these questions.)

• “How would you optimize attention for long inputs?”
— Flash Attention (reorder computation to reduce HBM reads), sliding window attention (restrict each token to a local context window), or linear attention approximations.
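
For the self- vs cross-attention question, it helps to point back at the attention function above: the mechanics are identical, only the source of Q versus K and V changes. A sketch (decoder_X, encoder_out and the projection matrices are placeholder names):

# Self-attention: Q, K, V all projected from the same sequence X
out_self, _ = attention(X @ W_q, X @ W_k, X @ W_v)

# Cross-attention: queries from the decoder, keys/values from the encoder output
out_cross, _ = attention(decoder_X @ W_q, encoder_out @ W_k, encoder_out @ W_v)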

2. LLM Sampling: Where Theory Meets Product Reality

This is where ML meets user experience. You are controlling two dials:

  • Determinism → correctness
  • Randomness → creativity

You must understand:

  • Temperature scaling
  • Top-k sampling
  • Top-p (nucleus) sampling
  • Greedy vs stochastic decoding

Full implementation: temperature + top-k + top-p

import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """
    logits: raw model output (vocab_size,)
    temperature: >1 = more random, <1 = more peaked, 0 = greedy
    top_k: keep only k highest-probability tokens (0 = disabled)
    top_p: nucleus sampling — keep smallest set summing to p
    Returns: sampled token index
    """
    # 1. Greedy shortcut
    if temperature == 0:
        return int(np.argmax(logits))

    # 2. Temperature scaling
    logits = logits / temperature

    # 3. Top-k filtering
    if top_k > 0:
        kth_val = np.partition(logits, -top_k)[-top_k]
        logits[logits < kth_val] = -np.inf

    # 4. Softmax → probabilities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # 5. Top-p (nucleus) filtering
    if top_p < 1.0:
        sorted_idx = np.argsort(probs)[::-1]
        cum_probs = np.cumsum(probs[sorted_idx])
        cutoff = np.searchsorted(cum_probs, top_p) + 1
        probs[sorted_idx[cutoff:]] = 0
        probs /= probs.sum()              # renormalize

    return int(np.random.choice(len(probs), p=probs))

Implementation expectation

  • Sort logits
  • Apply cutoff (k or cumulative p)
  • Normalize
  • Sample

What interviewers ask:

  • “How would you reduce hallucinations?”
  • “When would you prefer top-p over top-k?”
  • “What happens when temperature → 0?”

When to use which method in production

• Temperature low (0.2–0.5): factual Q&A, code generation — you want near-deterministic outputs.

• Temperature high (0.8–1.2): creative writing, brainstorming — you want diversity.

• Top-p preferred over top-k when: the distribution varies widely across inputs (adaptive cutoff handles this better than a fixed k).

• Top-k + top-p together: a common production default — top-k=50, top-p=0.9, temperature=0.8.
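
As a usage sketch of the function above with those production-style defaults (the logits here are random placeholders, not real model output):

rng = np.random.default_rng(0)
logits = rng.normal(size=32_000)                                      # stand-in for a 32k vocab

token = sample_token(logits, temperature=0.8, top_k=50, top_p=0.9)    # balanced default
greedy = sample_token(logits, temperature=0)                          # factual / code tasks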

High-signal insight

Most candidates explain sampling.
Strong candidates explain when to use which method in production.

3. KV Cache: The Systems Design Signal

Without a KV cache, generating each new token recomputes K and V for all previous tokens — O(n²) cost as sequence grows. The KV cache stores those K and V tensors from past steps and reuses them, reducing per-step attention to O(n).

Interview question

  • “How would you reduce latency for long sequences?”

Answer:

  • Use KV cache
  • Batch requests
  • Possibly truncate context
  • Consider approximate attention

This is where you show systems thinking

For a model with L transformer layers, H attention heads, head dimension d, and sequence length T:

KV cache size per sequence = 2 · L · H · d · T · sizeof(dtype)   (the factor 2 covers both K and V)

For a 70B-parameter model at fp16 with a 32k-token context, this can exceed 100GB — which is why techniques like Grouped Query Attention (GQA) and sliding window attention exist. GQA shares K and V across multiple query heads, reducing the cache by a factor of (n_heads / n_kv_heads).
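
A quick back-of-the-envelope check makes this concrete. The layer and head counts below are assumptions for a generic 70B-class model, not the published configuration of any specific release:

# Assumed 70B-class dims: 80 layers, 64 query heads, head dim 128, fp16 (2 bytes)
L, H, d, T, bytes_per = 80, 64, 128, 32_768, 2

mha_cache = 2 * L * H * d * T * bytes_per      # full multi-head attention
gqa_cache = mha_cache * 8 // 64                # GQA with 8 KV heads serving 64 query heads

print(f'MHA: {mha_cache / 1e9:.0f} GB/sequence, GQA: {gqa_cache / 1e9:.0f} GB/sequence')
# roughly 86 GB vs 11 GB for a single 32k-token sequence, before batching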

Strong answer when asked about latency: cache K/V, batch requests, use approximate attention for long sequences, consider context truncation as a last resort.

Incremental decode with a KV cache

import numpy as np

class KVCache:
    def __init__(self):
        self.keys = []     # list of (d_k,) arrays, one per past token
        self.values = []

    def update(self, new_k, new_v):
        self.keys.append(new_k)
        self.values.append(new_v)

    def get(self):
        return (np.stack(self.keys),     # (seq_so_far, d_k)
                np.stack(self.values))   # (seq_so_far, d_v)


def decode_step(new_token_q, cache: KVCache):
    """
    Attend over ALL cached keys/values using only the ONE new query.
    This avoids recomputing K and V for past tokens.
    """
    K, V = cache.get()
    d_k = new_token_q.shape[-1]
    score = (new_token_q @ K.T) / np.sqrt(d_k)   # (1, seq_so_far)
    weight = softmax(score)
    return weight @ V                            # (1, d_v)
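
A sketch of how a decode loop would drive this (the projection matrices and token embeddings below are random stand-ins, not a real model):

rng = np.random.default_rng(0)
d_model = 16
W_q, W_k, W_v = rng.normal(size=(3, d_model, d_model))

cache = KVCache()
for step in range(5):
    x_new = rng.normal(size=(d_model,))       # embedding of the newest token
    cache.update(x_new @ W_k, x_new @ W_v)    # compute K/V once, reuse on every later step
    q_new = (x_new @ W_q).reshape(1, -1)      # a single query row
    context = decode_step(q_new, cache)       # (1, d_model), attends over all cached tokens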

4. Tokenization (BPE): The Underrated Filter

Tokenization questions are deceptively simple. They test whether you understand how models see text.

Q. Character-level tokenization vs BPE (Byte Pair Encoding)

Character-level tokenization

  • Splits text into individual characters
  • "startup" → s t a r t u p
  • Vocabulary size: very small (just characters)
  • Sequence length: very long
  • Pros: Handles any text (no unknown tokens), Simple
  • Cons: Inefficient for transformers (attention cost grows fast), Harder for the model to learn meaning from tiny units

BPE (Byte Pair Encoding)

  • Merges frequent character sequences into subwords
  • "startup" → start + up or even startup
  • Vocabulary size: medium (subwords)
  • Sequence length: much shorter
  • Pros: More efficient (fewer tokens); Captures meaningful chunks (prefixes, suffixes, common words) and still handles rare words by breaking them into subwords
  • Cons: Slightly more complex; vocabulary must be learned.

Key difference (the essence)

  • Character-level → small vocab, long sequences
  • BPE → larger vocab, shorter sequences

Q. Why BPE wins in practice

Transformers care a lot about sequence length.
Since attention scales roughly with length²:

  • Character-level → expensive
  • BPE → much cheaper + better semantic units

One-line intuition

Character-level reads letter by letter
BPE reads chunks of meaning

Q. Byte Pair Encoding (BPE) Steps

The process starts with characters and iteratively merges the most frequent adjacent pair until the vocabulary reaches the target size.

  1. Start with characters
  2. Count most frequent pairs
  3. Merge them
  4. Repeat

Q. Why BPE works

  • Handles unseen words: Word-level tokenization fails on rare words and produces huge vocabularies.
  • Compresses representation: Character-level produces very long sequences, which is expensive for attention.
  • Balances vocabulary size and flexibility: BPE finds a middle ground: common words become single tokens, rare words decompose into subword pieces.

Other Interview questions on tokenization

• “Why not word-level?” — Unbounded vocabulary, no handling of unseen words, poor multilingual support.

• “What happens with rare words?” — They decompose into subword pieces, potentially all the way to individual characters in extreme cases.

• “How would you scale BPE to billions of tokens?” — Batch pair counting, parallel processing, approximate frequency estimates using sampling.

Implementation expectation

  • Pair frequency counting
  • Merge rules
  • Iterative updates
from collections import Counter
import re

def get_pair_freqs(vocab):
    """Count adjacent symbol pairs across all words."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of a pair with its merged token."""
    # Match only at whole-symbol boundaries, so merging e.g. ('e', 's')
    # cannot eat the 's' at the start of a separate token like 'st'
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    return {pattern.sub(replacement, w): f for w, f in vocab.items()}

def train_bpe(corpus, num_merges=10):
    # Initialise: each word as a space-separated char sequence + </w>
    vocab = Counter()
    for word in corpus.lower().split():
        vocab[' '.join(word) + ' </w>'] += 1

    merges = []
    for step in range(num_merges):
        pairs = get_pair_freqs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
        print(f'Step {step + 1}: merged {best}')

    return vocab, merges

# Example
corpus = 'low lower lowest lower low new newer'
vocab, merges = train_bpe(corpus, num_merges=5)
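
A natural follow-up is applying the learned merges to unseen text. Here is a minimal sketch (my own helper, greedy and replaying merges in training order; real tokenizers also map the resulting subwords to integer IDs):

def apply_bpe(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]    # merge the adjacent pair in place
            else:
                i += 1
    return symbols

print(apply_bpe('lowest', merges))   # a rarer word decomposes into learned subwords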

5. Classical ML: K-Means & KNN

It is still important to refresh core ML fundamentals.

These concepts remain a key part of interviews, especially at more generalist companies or for ML roles where a significant portion of the work involves maintaining or improving legacy systems.

Even in modern AI stacks, classical algorithms are still used. Strong fundamentals here signal that you understand ML beyond just frameworks and LLMs.

Beyond K-means and KNN, it is also worth reviewing algorithms like decision trees, random forests, and SVMs, as they frequently come up in interviews and practical scenarios.

K-Means (You must know):

  • Objective: minimize intra-cluster variance (WCSS)
  • Iterative process:
    1. Assign each point to its nearest centroid
    2. Update each centroid to the mean of its assigned points

Edge case (often tested)

  • Empty clusters

K-Means interview questions

• “Why does K-Means fail on non-spherical data?” — The WCSS objective uses Euclidean distance to centroids, which implicitly favors spherical, similarly sized clusters. Use DBSCAN or Gaussian Mixture Models for arbitrary cluster shapes.

  • “How does initialization affect convergence?” — Random init can land in poor local minima. K-Means++ selects initial centroids spread out across the data, consistently reaching better solutions.
# K-Means with K-Means++ init and edge case handling
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=42):
    """
    X: (n_samples, n_features)
    k: number of clusters
    Returns: centroids (k, n_features), labels (n_samples,)
    Objective: minimize WCSS = sum_j sum_{x in C_j} ||x - mu_j||^2
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # --- K-Means++ initialisation (better than random) ---
    idx = [rng.integers(n)]
    for _ in range(k - 1):
        # Distance from each point to nearest chosen centroid
        dists = np.min(
            [np.sum((X - X[i]) ** 2, axis=1) for i in idx], axis=0
        )
        # Sample proportional to distance squared
        probs = dists / dists.sum()
        idx.append(rng.choice(n, p=probs))
    centroids = X[idx].copy()

    for iteration in range(max_iter):
        # Assign each point to nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids, axis=2)   # (n, k)
        labels = dists.argmin(axis=1)                            # (n,)

        new_centroids = np.zeros_like(centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                # Edge case: empty cluster → reinit to a random data point
                new_centroids[j] = X[rng.integers(n)]
            else:
                new_centroids[j] = members.mean(axis=0)

        # Convergence: centroid shift below tolerance
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    return centroids, labels
K-Nearest Neighbors (KNN) core ideas:

  • Distance metric
  • No training phase
  • Expensive inference
# KNN classifier

import numpy as np
from collections import Counter

class KNN:
    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # No training phase — KNN is a lazy learner
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)

    def predict(self, X):
        preds = []
        for x in X:
            # Brute-force O(n*d) distance computation
            dists = np.linalg.norm(self.X_train - x, axis=1)
            nn_idx = np.argsort(dists)[:self.k]
            votes = Counter(self.y_train[nn_idx])
            preds.append(votes.most_common(1)[0][0])
        return np.array(preds)

# Production note: for large n, replace brute-force with FAISS/HNSW
# which gives approximate nearest neighbors in O(log n)

KNN interview questions

• “Why does KNN not scale?” — O(n·d) per query at inference time, and it stores the entire training set in memory. Contrast with parametric models that compress knowledge into weights.

  • “How would you optimize nearest neighbor search?” — Approximate methods: FAISS (Facebook AI Similarity Search), HNSW (Hierarchical Navigable Small World graphs), or Annoy. Trade exact accuracy for orders-of-magnitude speedup.
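
To make that answer concrete, a sketch with FAISS (this assumes the faiss-cpu package is installed; the index type and parameters are illustrative, not a tuned setup):

import numpy as np
import faiss   # assumes faiss-cpu is installed

d = 128
xb = np.random.rand(100_000, d).astype('float32')   # database vectors
xq = np.random.rand(10, d).astype('float32')        # query vectors

# Exact brute-force baseline
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 5)

# Approximate HNSW graph index: much faster queries at large n, small recall loss
hnsw = faiss.IndexHNSWFlat(d, 32)    # 32 = neighbors per node in the graph
hnsw.add(xb)
D_approx, I_approx = hnsw.search(xq, 5)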

6. Backpropagation: Do You Actually Understand Neural Networks?

This is the “no shortcuts” test. You should be able to:

  • Implement a simple neural network
  • Compute gradients manually
  • Use chain rule correctly

Expect:

  • ReLU / sigmoid derivatives
  • Loss gradients
  • Weight updates

Two-layer MLP with manual backprop

import numpy as np

# --- Activations and their gradients ---
def relu(x): return np.maximum(0, x)
def relu_grad(x): return (x > 0).astype(float)
def sigmoid(x): return 1 / (1 + np.exp(-x))

# --- Binary cross-entropy loss ---
def bce_loss(y_pred, y_true):
    eps = 1e-8
    return -np.mean(
        y_true * np.log(y_pred + eps) +
        (1 - y_true) * np.log(1 - y_pred + eps)
    )

# --- Forward pass ---
def forward(X, W1, b1, W2, b2):
    z1 = X @ W1 + b1            # (batch, hidden)
    a1 = relu(z1)               # non-linearity
    z2 = a1 @ W2 + b2           # (batch, out)
    a2 = sigmoid(z2)            # output probabilities
    return z1, a1, z2, a2

# --- Backward pass (chain rule) ---
def backward(X, y, z1, a1, a2, W2):
    n = X.shape[0]

    # Output layer: dL/dz2 (BCE loss + sigmoid simplifies to this)
    dz2 = (a2 - y) / n          # (batch, out)
    dW2 = a1.T @ dz2            # (hidden, out)
    db2 = dz2.sum(axis=0)       # (out,)

    # Hidden layer: chain rule through ReLU
    da1 = dz2 @ W2.T            # (batch, hidden)
    dz1 = da1 * relu_grad(z1)   # zero where z1 <= 0
    dW1 = X.T @ dz1             # (features, hidden)
    db1 = dz1.sum(axis=0)       # (hidden,)

    return dW1, db1, dW2, db2

# --- Single training step ---
def train_step(X, y, W1, b1, W2, b2, lr=0.01):
    z1, a1, z2, a2 = forward(X, W1, b1, W2, b2)
    loss = bce_loss(a2, y)
    dW1, db1, dW2, db2 = backward(X, y, z1, a1, a2, W2)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return loss, W1, b1, W2, b2
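
A small end-to-end sanity check (the toy data and layer sizes are arbitrary choices) showing the loss actually decreasing:

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
y = (X[:, :1] > 0).astype(float)                   # (64, 1) labels from one feature

W1 = rng.normal(scale=0.1, size=(10, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 1));  b2 = np.zeros(1)

for epoch in range(500):
    loss, W1, b1, W2, b2 = train_step(X, y, W1, b1, W2, b2, lr=0.5)
print(f'final loss: {loss:.3f}')                   # should fall well below ~0.69 (chance level)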

What they are really testing

• Do you know that dL/dz2 = (a2 - y)/n for BCE + sigmoid — or do you have to rederive it?

• Can you explain why ReLU gradients are 0 for negative inputs (the ‘dying ReLU’ problem)?

• Can you generalize to arbitrary depth using the chain rule?
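
For the first question, the derivation is short enough to do on a whiteboard. Per example, with a = σ(z) and binary cross-entropy L:

L      = -[ y·log(a) + (1 - y)·log(1 - a) ]
dL/da  = -y/a + (1 - y)/(1 - a)
da/dz  = a·(1 - a)                       (sigmoid derivative)
dL/dz  = dL/da · da/dz = -y·(1 - a) + (1 - y)·a = a - y

Averaging the loss over the n examples in the batch contributes the extra 1/n, which gives dz2 = (a2 - y)/n in the code above.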

7. If You Only Have Limited Time

Prioritize in this order:

1. Self-attention and multi-head attention — implement it, explain the math, discuss complexity.

2. LLM sampling (temperature, top-k, top-p) — implement all three in one function.

3. K-Means with edge cases — empty clusters, K-Means++ init, convergence criterion.

4. Backpropagation — two-layer MLP, manual gradients, chain rule.

5. BPE tokenizer — pair counting, merge rules, iterative updates.

6. KV cache and systems reasoning — memory formula, GQA, batching strategies.

8. How You Are Actually Evaluated

Most candidates fail for predictable reasons: they jump into code without structure, skip tradeoff discussion, and treat problems as academic exercises rather than real engineering decisions.

What strong candidates do differently

• They structure their thinking first. “I’ll start with a simple correct version, then optimize for the constraints you mentioned.”

• They talk about tradeoffs constantly. “This approach improves latency but increases memory — is that acceptable in your setting?”

• They think in systems. Scale, latency, reliability, cost — not just accuracy.

• They know their complexity. Every algorithm should come with a time and space complexity analysis without prompting.

The ML interview in 2026 is not about memorizing models. It is about building primitives from scratch, explaining decisions clearly, and thinking like a production engineer.

If you can implement attention, explain the tradeoffs of KV caching, and walk through backpropagation without notes — you are already ahead of most candidates.

This is not about knowing everything. It is about mastering the few things that actually matter.
