K-Nearest Neighbors Is the Cockroach of Machine Learning. It Will Outlive Us All.
The algorithm behind the For You page, Spotify Discover Weekly, every reverse image search you’ve ever run, and the retrieval inside ChatGPT is older than color television.
K-Nearest Neighbors. Proposed by Evelyn Fix and Joseph Hodges in a US Air Force technical report in 1951. Never really replaced. Just renamed, sped up, and quietly stuffed into every vector database you’ve heard of. Pinecone. Weaviate. Qdrant. Chroma. Milvus. All of them. Same idea, faster index.
The strange part is that nobody teaches it like the foundation it is. KNN gets handed to first-year ML students as a “starter algorithm.” Then those same students graduate into companies running KNN, in production, at the bottom of a billion-dollar AI stack.

This is the KNN guide I wish someone had handed me before the textbook version.
Let’s go.
You wake up in Hanoi. You’ve never been there. You don’t speak Vietnamese. You’re standing in a busy noodle shop and you’re starving. Five people in line near you are eating five different bowls of soup. Three of them are eating the same yellow-broth thing. Two are eating something red. The waiter is staring at you.
You point at the yellow soup.
You’re alive. You ate something good. You did not waste a meal in Hanoi.
That was K-Nearest Neighbors. K equals five. The distance was “people standing physically near me.” The vote was “what most of them are eating.” Your prediction was “I should also eat that.” The hyperparameter was your willingness to look at strangers’ bowls without making it weird.

That’s the entire concept. Everything else is engineering.
When KNN does classification, it pulls the labels of the K nearest stored points and takes a vote. When KNN does regression, it pulls the values and takes the mean. There is no training. There is no neural net. There is no “learning” in any way your gym teacher would recognize. There’s just memory, distance, and democracy.
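If you like seeing the idea as code, here's the whole algorithm as a from-scratch sketch. Plain numpy, and the names and data are invented for the occasion:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5):
    # Distance: plain Euclidean from the query to every stored point.
    dists = np.linalg.norm(X_train - query, axis=1)
    # Memory: the indices of the k closest stored points.
    nearest = np.argsort(dists)[:k]
    # Democracy: majority vote among their labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array(["yellow", "yellow", "red", "red", "red"])
print(knn_predict(X, y, np.array([4, 4]), k=3))  # "red", the local majority

That's the noodle shop, formalized.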

WHY THIS THING IS CALLED A “LAZY LEARNER” AND WHY THAT’S A COMPLIMENT
Most ML models work hard during training and rest during prediction.
Linear regression spends time during fit doing matrix math. Random forests build hundreds of trees in advance. Neural networks consume electricity for hours, sometimes days, sometimes the GDP of a small country. Then at inference, they multiply some numbers and answer in milliseconds.
KNN flips the deal.
Training is just storing the data. The algorithm goes “got it, those are my points.” That part takes the time it takes your hard drive to remember things. Fast. The expensive part is prediction, when KNN has to compute distances to potentially every stored point.
People call it a “lazy learner.” That’s an inside joke. KNN is not lazy. KNN is patient. It refuses to do work it doesn’t yet know is needed. It is the project manager of algorithms.
Why this matters. If you have 50 million training points, every prediction without help requires 50 million distance computations. That works for prototyping. That does not work for production. Later in this article we’ll get into the speed tricks (KD trees, ball trees, HNSW, IVF) that make this practical at scale. For now, just remember that KNN’s bill arrives at predict time, not fit time.
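You can watch the bill arrive yourself. A rough timing sketch, brute force on purpose so fit is pure storage. Exact numbers depend on your machine:

import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((500_000, 20))
y = rng.integers(0, 2, 500_000)

knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute")
t0 = time.perf_counter()
knn.fit(X, y)          # "training" is storage, effectively instant
t1 = time.perf_counter()
knn.predict(X[:200])   # the distance math happens here
t2 = time.perf_counter()
print(f"fit: {t1 - t0:.3f}s, predict 200 points: {t2 - t1:.3f}s")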

THE QUICK TOUR OF KNN HIDING IN PLAIN SIGHT
You’ve used KNN this week. Probably this morning. Possibly within the last ten minutes.
When you opened TikTok and the For You page somehow knew you were into woodworking videos this month, that was approximate KNN over user embeddings. When you found a song on Spotify Discover Weekly that hit harder than anything in your library, that was KNN over song embeddings (Spotify open-sourced their library Annoy years ago). When you scanned a leaf with PlantNet and it told you the species in two seconds, the model first encoded the photo with a CNN and then asked KNN to find the nearest known plant.
When Hinge slid that one suspiciously perfect profile to the top, KNN was filtering candidates by similarity to people you’ve already liked. When your bank flagged a 4 AM gas station charge in a state you’ve never been to, the fraud system used KNN-style anomaly detection to notice the transaction was nowhere near your usual neighbors in feature space. When Shazam identified that song playing in the laundromat on the third try, it was matching audio fingerprints with a nearest-neighbor lookup. When Google reverse image search spat back the exact stock photo you stole from Pinterest, that was ANN over image embeddings.
Notice anything?
Almost everything you’d describe as “smart” in a consumer app is actually a similarity lookup wearing a confident jacket.

The interesting twist is that none of these production systems use vanilla KNN. They use approximate nearest neighbor methods (HNSW, IVF, LSH, ScaNN) because exact KNN over a billion points would take forever. But the heart is the same. The math your noodle shop self did in Hanoi is the math behind the recommendation that just made you laugh.
THE FIVE DISTANCE METRICS YOU’LL ACTUALLY USE
Here’s where most beginners face-plant. KNN’s quality is decided by your distance metric. Pick the wrong one, the model is wrong. Pick the right one, the model is suspiciously good. The metric is not a detail. The metric is the whole conversation.
Five metrics worth knowing. The half-thawed butter on my counter says we should keep this section moving.
- EUCLIDEAN (L2). The straight line.
distance = square_root( sum over each feature of (x_i - y_i) squared )
This is the Pythagoras you remember from school. Crow flies. Two points, one diagonal line. It’s the default in scikit-learn and it’s almost always fine for clean numeric features on similar scales.
When it shines. Continuous numeric data, low to moderate dimensions, features that are actually comparable.
When it betrays you. High dimensions. Mixed scales without standardization. Anything where direction matters more than magnitude.
There’s a story. The first time I used Euclidean on 768-dimensional sentence embeddings, the “nearest neighbor” of the query “I love dogs” turned out to be a tweet about parking tickets. I sat there blinking at my monitor for a full minute. The plastic bag stuck in the tree outside my window blew sideways. Then I switched to cosine. The next nearest neighbor was a tweet about golden retrievers. Lesson received.
- MANHATTAN (L1). The taxicab.
distance = sum over each feature of absolute_value(x_i - y_i)
You can’t fly through buildings in Manhattan. The cab follows the grid. Your distance is the sum of horizontal blocks plus vertical blocks. No shortcuts.
When it shines. Outlier-prone data (Manhattan punishes outliers less than Euclidean because squaring magnifies big differences). Grid data. High-dimensional spaces sometimes (more on that later).
When it betrays you. When your features represent something curved or directional and the L-shaped path is geometrically silly.

- MINKOWSKI. The parent.
distance = ( sum over each feature of absolute_value(x_i - y_i) raised to p ) raised to (1 divided by p)
Minkowski is the umbrella formula. With p=1 you get Manhattan. With p=2 you get Euclidean. With p=infinity you get Chebyshev (the max difference, used in chess for king moves and warehouse routing).
Why this matters. Scikit-learn’s KNN uses Minkowski with p=2 by default. So when you see metric=’minkowski’ in the docs, that’s just the parametric general form. Sometimes a fractional p (like p=1.5) helps on weird datasets, but I’ve only seen this matter twice in my career and both times the engineer was a research-paper kind of person.
- COSINE. The angle.
cosine_similarity = ( A dot B ) divided by ( magnitude_of_A times magnitude_of_B )
cosine_distance = 1 - cosine_similarity
Cosine measures the angle between two vectors. It throws away magnitude. Two vectors pointing the same way are similar even if one is much longer than the other.
When it shines. Text embeddings. Sentence transformers. Word2Vec. OpenAI embeddings. Image embeddings. Any embedding from any modern transformer model. Anything where the magnitude is a side effect of the model and not a real signal.
When it betrays you. Real-world numeric features where magnitude carries meaning, like prices, counts, or “how many cookies are left.”
The reason cosine became the default for embeddings is partly historical (the original word2vec paper used it) and partly practical (embedding magnitudes drift as models get fine-tuned, but directions stay stable). If you remember nothing else about distance metrics, remember this one. It’s the metric powering retrieval inside every RAG pipeline you’ll ever build.
- HAMMING. The string-counter.
distance = number of positions where two equal-length strings differ
Hamming counts mismatches between two equal-length sequences. The strings “1010” and “1001” have a Hamming distance of 2. So do “ACTG” and “AGTC.” So do two product SKUs that differ at the third and fifth positions.

When it shines. Binary features. DNA sequences (real bioinformatics tools use it). Hash similarity (used in image deduplication and plagiarism detection). One-hot encoded categoricals.
When it betrays you. Continuous data. Unequal-length strings (use Levenshtein for that). Mixed feature types.
There are more (Mahalanobis, Jaccard, Hellinger, edit distance, earth mover’s). Most of them are special-purpose. The five above will get you through 95 percent of the work.
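For completeness, here are all five in raw numpy. A sketch where b is just a doubled, so you can watch cosine shrug at magnitude:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a * 2  # same direction, twice the magnitude

euclidean = np.sqrt(np.sum((a - b) ** 2))                 # ~3.74
manhattan = np.sum(np.abs(a - b))                          # 6.0
p = 1.5
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)          # between L1 and L2
cosine = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # ~0.0, angle only

hamming = sum(c1 != c2 for c1, c2 in zip("1010", "1001"))  # 2
print(euclidean, manhattan, minkowski, cosine, hamming)

Euclidean and Manhattan both call a and b far apart. Cosine calls them identical. That single disagreement is the whole embeddings story.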

PICKING K. THE PART THAT WILL SECRETLY BREAK YOUR MODEL.
K is one number. K is also the most dangerous number you’ll touch this month.
Set K too small. Your model becomes a paranoid librarian. K=1 means the predicted label is whatever the closest single point says. One misplaced training example, one mislabeled row, one weird outlier, all of it becomes its own little kingdom of bad predictions in the surrounding region. The decision boundary looks like the readout from a heart-rate monitor.
Set K too big. Your model becomes the regional manager who only trusts averages. K equals the size of the dataset means every prediction is the majority class. You’ve built a constant function with extra steps and a license fee.

How to pick K, ranked by how seriously you take your job.
The napkin rule. K equals the square root of the number of training points. 100 points, try K=10. 10000 points, try K=100. This is fine as a starting guess and bad as a final answer.
The odd-K rule for binary classification. Use odd K so votes can’t tie. Otherwise scikit-learn breaks ties by picking the class with the lower index, which is the algorithmic equivalent of flipping a coin you’ve already glued. You don’t want this.
The bias-variance way. Small K means low bias, high variance. The model wiggles. Big K means high bias, low variance. The model gets sleepy. Pick K in the middle and tune.
The grown-up method. K-fold cross-validation on the training set. Loop over K values from 1 to some upper bound (I usually stop at sqrt(N) plus 20). Plot validation error against K. Look for the elbow. Pick the K just past the bend.

A note that took me three years to learn. The right K depends on your data, your features, your noise level, and your distance metric. There is no universal K. Anyone who tells you K=5 is best for everything has either gotten lucky a lot or hasn’t run enough experiments.
I once shipped K=2 to production by accident. K=2 in a binary classifier means two votes, which in my dataset tied roughly 14 percent of the time. The system defaulted those ties to the negative class, which silently inflated my false negative rate by exactly the amount that triggered the next quarter’s customer churn analysis. The product manager never figured out why churn spiked. I have not told that product manager. They moved to a different company. The receipts in my wallet from that month are still there, faded.

THE CURSE OF DIMENSIONALITY. THE PART THE TEXTBOOKS GET WEIRD ABOUT.
Here’s something nobody tells you politely.
In low dimensions, distance behaves the way you’d expect. Some points are close, some are far, your algorithm has something to chew on. Easy mode.
In high dimensions, every point is roughly the same distance from every other point. The math is genuinely strange. The volume of a unit hypercube concentrates near its corners. The mass of a high-dimensional ball lives almost entirely in a thin shell near the surface. Your intuition, which was trained on three dimensions, files for unemployment.
The practical consequence is that “nearest neighbor” stops meaning anything useful. The ratio between the maximum and minimum distances among your points approaches 1 as dimensions grow. Everyone is equally far. KNN, which exists to ask “who’s closest,” loses the entire question.
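You don’t have to take that on faith. A ten-line simulation with uniform random points makes the collapse visible:

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    points = rng.random((2000, d))   # uniform points in the unit cube
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"dim {d:5d}: max/min distance ratio {dists.max() / dists.min():.1f}")

The ratio is huge in two dimensions and crawls toward 1 by a thousand. “Nearest” stops being a meaningful word.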

Three things that save you.
One. Use cosine in high dim. Direction stays meaningful even when magnitudes flatten out. This is why every embedding-based search system uses cosine. Not because cosine is magic. Because Euclidean is broken there.
Two. Reduce dimensions. PCA, UMAP, t-SNE for visualization, autoencoders for production. Get your 2048-dimensional ResNet vector down to a 128-dim representation that still preserves structure. KNN can breathe again.
Three. Pick features. KNN treats every feature equally. A useless feature is not free. It’s a small leak in your distance function. Drop the dead weight.
The cat next door has been staring at an empty corner for ten minutes. I think she sees the curse of dimensionality and is judging it. Hard to say.
STANDARDIZATION. THE STEP NOBODY DOES UNTIL THEY DO IT WRONG.
KNN is allergic to feature scales. This is the single most-skipped step in beginner KNN code, and it’s why beginner KNN code is usually bad.
Imagine two features. Salary in dollars. Age in years. Salary ranges from 30000 to 200000. Age ranges from 18 to 80. The Euclidean distance between two people is dominated by salary differences. Age might as well not be there. Your “nearest neighbor” search has quietly become “nearest salary” search and you didn’t notice because the code ran fine.
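Here’s the failure in miniature, with made-up people:

import numpy as np

# Features: [salary in dollars, age in years].
alice = np.array([50_000, 25])
bob   = np.array([51_000, 65])   # salary twin, different generation
carol = np.array([48_000, 26])   # actual peer in both features

print(np.linalg.norm(alice - bob))    # ~1000.8: the 40-year age gap barely registers
print(np.linalg.norm(alice - carol))  # ~2000.0: "farther" purely on salary

Bob, forty years older, counts as the nearer neighbor. Nothing in the code warns you.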
The fix is one line. Standardize first.
StandardScaler transforms each feature to mean 0 and standard deviation 1. Now salary and age fight on the same field.
If your features have outliers, RobustScaler is better (it uses median and IQR instead of mean and std). If your features are bounded, MinMaxScaler is fine. The choice matters less than the act of doing it.
The rule that gets people fired. Fit the scaler on training data only. Apply it to test data with transform, never fit_transform. Otherwise you’ve leaked information from your test set into your model and your reported accuracy is fiction.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
That second line. Memorize the difference.

THE CODE. PROPER, WITH HONEST COMMENTS.
Time for actual code. We’ll build a KNN classifier on the Iris dataset (the “Hello World” of classification, 150 rows, 4 features, 3 species), then we’ll do K selection properly with cross-validation.
If you don’t have scikit-learn installed, run pip install scikit-learn first. The dishwasher will run while it installs. Both will finish around the same time.
numpy is the foundation of every numerical thing in Python.
pandas is for when I want to look at data in a table not a nested list.
matplotlib is for the part where I have to convince a stakeholder of anything.
# Foundations. Numpy for math, pandas for tables, matplotlib for plots.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Scikit-learn pieces. Dataset, splitter, scaler, model, metrics.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Random seed so results repeat. Future me reruns this and the numbers match.
RNG = 42
# Load Iris. 150 rows, 4 features, 3 species.
iris = load_iris()
X = iris.data
y = iris.target
# Sanity check. Worse than a wrong model is a wrong dataset.
print("X shape", X.shape)
print("y shape", y.shape)
print("class names", iris.target_names)
# Split 80 / 20. Stratify keeps class proportions balanced across the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RNG, stratify=y
)
# Standardize. KNN cares about scale. Always.
# Fit on train only. Apply to test with transform. Never fit on test.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Baseline KNN. K=5, equal-weight votes, Minkowski p=2 (Euclidean).
baseline = KNeighborsClassifier(
    n_neighbors=5,
    weights="uniform",
    metric="minkowski",
    p=2,
)
# "Training" for KNN is just memorization. Stores the data, maybe builds a tree.
baseline.fit(X_train_scaled, y_train)
# Predict. Distances actually get computed here.
y_pred = baseline.predict(X_test_scaled)
# Accuracy is fine because Iris is balanced. Use F1 / recall for imbalanced data.
print("Baseline accuracy", round(accuracy_score(y_test, y_pred), 4))
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))
# Tune K with 5-fold cross-validation. The test set stays out of this loop.
ks = list(range(1, 31))
cv_scores = []
cv_stds = []
for k in ks:
    candidate = KNeighborsClassifier(n_neighbors=k)
    fold_scores = cross_val_score(
        candidate, X_train_scaled, y_train, cv=5, scoring="accuracy"
    )
    cv_scores.append(fold_scores.mean())
    cv_stds.append(fold_scores.std())
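# Plot validation accuracy against K and eyeball the elbow
# (this is why matplotlib was imported; skip in headless environments).
plt.errorbar(ks, cv_scores, yerr=cv_stds)
plt.xlabel("K")
plt.ylabel("CV accuracy")
plt.show()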
# Pick the K with the best mean CV accuracy.
best_k = ks[int(np.argmax(cv_scores))]
print("Best K via CV", best_k)
print("Best CV accuracy", round(max(cv_scores), 4))
# Refit with distance-weighted votes (explained in the next section).
final_model = KNeighborsClassifier(n_neighbors=best_k, weights="distance")
final_model.fit(X_train_scaled, y_train)
# One honest measurement on the held-out test set.
final_acc = accuracy_score(y_test, final_model.predict(X_test_scaled))
print("Final test accuracy", round(final_acc, 4))
A few things to notice in this script.
The model selection loop never sees the test set. The test set is held back for one final honest measurement. If you tune on the test set, you’re not measuring generalization. You’re measuring how well you fit your specific test set, which is not the same thing.
I picked accuracy because Iris is balanced. If you’re doing fraud detection where 99.9 percent of transactions are legit, accuracy is meaningless. Predicting “legit” always gets you 99.9 percent. Use F1 or recall in that case. The number that matters is the number that punishes the failure mode you actually care about.
The final model uses weights=’distance’. Closer neighbors get more vote. We’ll explain that next.

WEIGHTED KNN. WHEN VOTES SHOULDN’T BE EQUAL.
Default KNN gives every one of the K nearest neighbors a single equal vote. That’s democratic. It’s also a bit dumb if your closest neighbor is sitting on top of you and your fifth-closest is in a different time zone.
Weighted KNN says closer neighbors should count more. The standard scheme is inverse distance. A neighbor at distance 1 gets weight 1. A neighbor at distance 4 gets weight 0.25. The vote sum then favors local information.
In scikit-learn this is one parameter.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
Why this matters. Weighted KNN often improves accuracy when your decision boundary is wiggly and local. It also helps with mild class imbalance because a small but very close cluster of minority-class neighbors can outvote a far but large group of majority-class neighbors.
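The arithmetic, by hand, with invented distances:

# Three neighbors: one very close "red", two farther "blue".
neighbors = [(1.0, "red"), (4.0, "blue"), (4.0, "blue")]

votes = {}
for dist, label in neighbors:
    votes[label] = votes.get(label, 0.0) + 1.0 / dist  # inverse-distance weight

print(votes)                      # {'red': 1.0, 'blue': 0.5}
print(max(votes, key=votes.get))  # red wins despite being outnumbered two to one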
When NOT to use it. When your distance metric is noisy or your features are high-variance. Inverse distance amplifies the influence of any neighbor that ends up close, including outlier-close ones. Garbage near, garbage out.
There’s a fancier version called distance-weighted with a kernel function. Gaussian kernels, Epanechnikov, tricube. They smooth the weighting. In practice, ‘uniform’ and ‘distance’ cover 95 percent of needs. The exotic kernels are for the academic conferences and the people whose research grants depend on them.

KD TREES, BALL TREES, AND WHY YOUR PREDICT CALL IS SLOW
Brute-force KNN computes the distance from your query point to every training point. For 1000 training points, fine. For 10 million, you’ll go grab a sandwich and come back to the same loading bar.
Two data structures cut the cost dramatically.
KD-Tree. Splits the feature space along axes recursively. The query traverses the tree and prunes branches that can’t possibly contain a closer point. Search complexity drops from O(N) to roughly O(log N) in low dimensions. Works well up to about 20 dimensions. After that, the bounding boxes become so loose they can’t prune effectively.
Ball Tree. Splits the space into nested hyperspheres. Better than KD-Tree in higher dimensions because hyperspheres handle skewed distributions better than axis-aligned boxes. Useful up to roughly 30–40 dimensions before performance degrades.
In scikit-learn
knn = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree")
or ‘kd_tree’ or ‘brute’ or ‘auto’. The ‘auto’ option lets sklearn pick based on your data, which is usually right.
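If you want to see the pruning pay off, here’s a rough benchmark sketch. Low dimensions on purpose, because that’s where trees win, and timings vary by machine:

import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((200_000, 5))     # 5 dimensions: tree-friendly territory
queries = rng.random((1_000, 5))

for algo in ["brute", "kd_tree", "ball_tree"]:
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    t0 = time.perf_counter()
    nn.kneighbors(queries)
    print(f"{algo:10s} {time.perf_counter() - t0:.2f}s")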
For high-dimensional data (think 100 plus dimensions, the regime modern AI lives in), neither tree saves you. Bounding-box pruning fails because everyone is roughly the same distance away in high dim. You’re back to either brute force or approximate methods.
Which brings us to the reason vector databases exist.
APPROXIMATE NEAREST NEIGHBORS. THE TRICK BEHIND MODERN AI.
When you have a billion vectors (welcome to OpenAI, Anthropic, Pinecone, every big AI product), exact KNN is dead. Even with trees. You need approximate KNN, also called ANN.
Three big algorithms power most of the vector database industry.
HNSW (Hierarchical Navigable Small World). The current default. It builds a multi-layer graph where each node is connected to nearby nodes, with sparser connections at higher layers and denser connections at lower layers. A query starts at the top, hops greedily toward the target, and zooms down through layers. Fast, accurate, memory-hungry. Used by Pinecone, Qdrant, Weaviate, and basically everyone who launched after 2018. The original HNSW paper by Yury Malkov is one of the most cited papers in vector search.
IVF (Inverted File Index). Partitions the vector space into clusters using k-means. At query time you search only the closest few clusters (FAISS calls this knob nprobe) instead of the whole index. Faster than HNSW for some workloads, less accurate. The technique is the workhorse inside Facebook’s FAISS library.
LSH (Locality-Sensitive Hashing). Hashes vectors so that similar vectors end up in the same bucket with high probability. Older, faster to build, less accurate. Spotify’s Annoy library uses a related tree-based approach. Google’s ScaNN is another well-known production-grade ANN system.
These all answer the same question. “Find me the K nearest neighbors of this query.” They just trade some accuracy for a lot of speed.
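To make it concrete, a minimal FAISS sketch, assuming pip install faiss-cpu: an exact flat index first, then HNSW over the same vectors:

import numpy as np
import faiss

d = 128
rng = np.random.default_rng(0)
xb = rng.random((100_000, d)).astype("float32")  # the stored vectors
xq = rng.random((5, d)).astype("float32")        # the queries

exact = faiss.IndexFlatL2(d)   # brute-force, exact KNN
exact.add(xb)
D, I = exact.search(xq, 5)     # distances and neighbor ids

hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 graph links per node
hnsw.add(xb)
D2, I2 = hnsw.search(xq, 5)        # approximate, far faster at scale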

Why this matters for you. The next time you read a tech blog about vector databases or watch a webinar full of people saying “embeddings” forty times, you’ll know. The fundamental operation under the abstraction is KNN. Everything else is the engineering it took to make KNN fast enough to be useful at scale.
KNN IS THE BACKBONE OF RAG
Retrieval Augmented Generation. RAG. The acronym you’ve heard 200 times this year if you’re online at all.
Here’s the actual flow.
You upload your documents to a system. The system splits them into chunks (usually 200 to 800 tokens each, depending on the chunker). An embedding model (OpenAI’s text-embedding-3, Cohere’s embed-v3, BGE, sentence-transformers, take your pick) converts each chunk into a vector. The vectors live in a vector database.
You ask a question. The question is embedded the same way. The vector database finds the K nearest chunks to your question’s embedding using approximate KNN with cosine distance. Those K chunks are stuffed into the LLM’s prompt as context. The LLM answers using that context.
The “retrieval” in retrieval-augmented generation is K-nearest neighbors. It is, structurally and mathematically, exactly what we’ve been discussing.
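Strip the plumbing away and the retrieval step looks like this. A sketch, with random vectors standing in for real embeddings:

import numpy as np

rng = np.random.default_rng(0)
chunks = ["chunk about refunds", "chunk about shipping", "chunk about returns"]
chunk_vecs = rng.random((len(chunks), 384))  # pretend embeddings
query_vec = rng.random(384)                  # pretend embedded question

# Normalize, so nearest-by-dot-product is nearest-by-cosine.
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)

top_k = np.argsort(chunk_vecs @ query_vec)[::-1][:2]  # K=2 nearest chunks
context = "\n".join(chunks[i] for i in top_k)         # this goes into the prompt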
I had a junior engineer last year tell me they “built RAG from scratch.” I asked them what algorithm the retrieval step used. They said “vector search.” I asked what vector search is. They said “you know, vector search.” We sat in silence. The streetlight outside flickered, which was timely. We Googled it together. KNN. It was always KNN.

KNN FOR THINGS YOU PROBABLY DIDN’T KNOW IT DOES
While we’re on the subject of where KNN hides, two underappreciated uses.
Imputation. KNNImputer in scikit-learn fills in missing values by averaging the values of the K nearest neighbors. It often beats simple mean imputation, especially when missingness is correlated with feature values. I’ve used it on customer data when whole columns were patchy. Worked better than I expected.
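In case you haven’t met it, a sketch:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))  # the NaN becomes the mean of its 2 nearest rows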
Anomaly detection. Algorithms like Local Outlier Factor (LOF) compare the local density around a point with the local densities of its K nearest neighbors. Points with much lower density than their neighbors are flagged. Used heavily in fraud detection, network intrusion detection, and quality control on manufacturing lines.
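LOF is just as compact. A sketch with one point that’s obviously not like the others:

from sklearn.neighbors import LocalOutlierFactor

X = [[0.0], [0.1], [0.2], [0.1], [10.0]]  # one obvious outlier
lof = LocalOutlierFactor(n_neighbors=2)
print(lof.fit_predict(X))                 # -1 flags outliers, 1 means inlier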
The pattern here. KNN is a primitive. Like sort, or hash, or for-loop. Once you see it, you start noticing it inside other algorithms.
WHEN NOT TO USE KNN
I’m going to be honest because too many tutorials worship the algorithm.

Skip KNN when your dataset is huge and your latency budget is tight. Brute force collapses, trees only help so much in high dim, and you’ll need an ANN library like FAISS or hnswlib. At that point you’re not running KNN. You’re running approximate KNN with engineering on top.
Skip it when your data is super high-dimensional and you can’t reduce it. The curse of dimensionality wins.
Skip it when you have a lot of irrelevant features. KNN treats every feature equally. Ten useful features plus one nonsense feature means 10 percent noise in every distance calculation. Random forests don’t care. KNN cares a lot.
Skip it when your dataset is wildly imbalanced. KNN gravitates toward majority classes. Use weighted KNN, oversample, undersample, or pick a different model.
Skip it when you need calibrated probability estimates. KNN gives you vote ratios, which are not real probabilities. They’re discrete and noisy. Logistic regression or a calibrated tree-based model is what you want.
Use KNN when you’re prototyping fast and want a baseline that’s hard to beat without thinking. Use it for embedding-based search, recommender systems, anomaly detection, image retrieval, semantic search, and RAG. Use it when interpretability matters and you can point at the actual neighbors that drove a prediction.

THE LESSONS I KEEP RELEARNING
A short field journal because I think you’ll save a weekend if you read this.
Always scale your features. I have made this mistake at least eleven times. Once it cost me a Sunday. The wind chime at the neighbor’s place sounded like a tiny bell choir for that entire Sunday. I have not forgotten.
Pick K with cross-validation. The square root rule is for the napkin, not the model card. Tune.
Use cosine for embeddings. Use Euclidean for clean numeric features. Don’t mix them up. This is a one-line difference that doubles or halves your accuracy.
KNN’s predict step is your bottleneck, not your fit step. Profile that.
When you scale to millions of points, switch from sklearn’s KNN to FAISS or hnswlib. The added code complexity is small. The drop in latency is enormous.
Distance metrics aren’t the only knob. The features you feed the model matter more than the metric. A bad feature is a permanent leak in your distance function. Standard scaler does not fix bad features.
KNN is amazing as a baseline. If your fancy model can’t beat KNN by a margin worth the complexity, ship KNN. I have killed two neural network projects this way. Nobody noticed. The bonus arrived on time.
Always look at the actual nearest neighbors. Not just the predicted label. Pull up the K nearest training points for some test queries and read them. Half the bugs in KNN-based systems are visible the moment you look at the neighbors. The other half are visible the moment you look at the features.

WHY KNN STILL MATTERS IN 2026
A practical truth about modern AI.
The biggest, loudest AI products you use are wrapping LLMs around retrieval pipelines. The retrieval pipelines are KNN-shaped. Vector databases are a multi-billion-dollar industry built around making “find K nearest” go fast at planet scale. Everyone selling you a fancy AI product has, somewhere in their stack, a slightly upgraded version of an algorithm Evelyn Fix and Joseph Hodges sketched out in 1951 inside a US Air Force technical report titled “Discriminatory Analysis.” Which, depending on how you read it, is either a punchline or an appropriately humble origin story.
Why is this the case?
Because intelligence at the application layer is mostly retrieval and ranking. Models that look smart are usually retrieving relevant things and ranking them by how much they match. KNN is the simplest, most direct expression of “give me the relevant things.” Everything else is plumbing, scaling, and branding.
If you only learn one classical machine learning algorithm in your life, learn this one. The moment you understand KNN over embeddings, you understand RAG. The moment you understand RAG, you understand how most modern AI products actually work. The moment you understand how most modern AI products actually work, the field stops feeling magical and starts feeling buildable. That’s a one-way door. You can’t unsee it.

A FEW QUESTIONS PEOPLE ASK ME ALL THE TIME
Is KNN supervised or unsupervised? Supervised. You need labels. There’s an unsupervised cousin called k-means, which is a clustering algorithm and a completely different beast. The K letter is a coincidence.
Can KNN do regression? Yes. KNeighborsRegressor in scikit-learn predicts the mean of the K nearest training values instead of voting on a class.
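The whole thing, in case you want to see it:

from sklearn.neighbors import KNeighborsRegressor

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit([[1], [2], [3], [4]], [10.0, 20.0, 30.0, 40.0])
print(reg.predict([[2.5]]))  # [25.], the mean of the two nearest targets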
What’s the difference between K in KNN and K in k-means? KNN’s K is the number of neighbors. K-means’ K is the number of clusters. Same letter, different algorithm, different mental model. Naming is hard. Computer science is full of bad names.
Can I use KNN for time series? Sort of. You can use it on lagged features but it ignores temporal order. Use a model that respects time when time matters.
Do I need a GPU for KNN? Almost never. KNN is CPU-bound for most workloads. FAISS has a GPU mode for huge indices. A laptop runs KNN great for normal datasets.
Why is scikit-learn so good at KNN? Because the algorithm is genuinely that simple. The complexity is in the choices, not the math. Scikit-learn is the friend who already wrote the boring parts.
Is the Cover-Hart bound real? Yes. It’s a classic 1967 result. As your dataset grows toward infinity, the asymptotic error rate of 1-NN is at most twice the Bayes error rate (the theoretical best possible error for the problem). Translation. KNN is shockingly close to optimal in the limit. The smartest model you’ll ever build will be at most twice as good as KNN given enough data.
PARTING SHOT
KNN survived seven decades, four AI winters, three hype cycles, and a global obsession with deep learning. It survived because it works, it’s understandable, it’s modular, and the math is honest. There’s no hidden state. No black box. You can explain KNN to your grandmother in three minutes. You can explain it to your CTO in two.
