The Geometric Interpretation Of Linear Regression

When we first learn linear regression, we think of it like this: we have data points scattered on a plane, we draw a line through them, and we adjust the line until the total squared error is as small as possible. We take a derivative, set it to zero, and out pops the formula for the best-fit line.

This picture works. But it hides something beautiful. Let me explain.

Suppose we want to predict a quantity y based on several input quantities x₁, x₂, …, xₚ. We have data from n past observations: for each observation i, we know the inputs

xᵢ₁, xᵢ₂, …, xᵢₚ

and the actual outcome yᵢ.

We assume the relationship is linear in the parameters:

yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + ⋯ + βₚxᵢₚ + εᵢ

where εᵢ is a random error term: noise, unobserved variables, model misspecification, the universe’s refusal to be perfectly linear. The goal is to estimate β₀, β₁, …, βₚ from the data.

For any choice of parameters, the predicted value for observation i is:

ŷᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + ⋯ + βₚxᵢₚ

and the residual is:

eᵢ = yᵢ − ŷᵢ
We want parameters that make all these residuals small simultaneously, so we minimize the sum of squared residuals:

S(β₀, β₁, …, βₚ) = ∑ᵢ₌₁ⁿ eᵢ² = ∑ᵢ₌₁ⁿ (yᵢ − β₀ − β₁xᵢ₁ − ⋯ − βₚxᵢₚ)²
Now, you could take p + 1 partial derivatives, set them all to zero, and solve. But the resulting system of equations is notationally painful. Matrices clean this up. We stack the observations into a vector y ∈ Rⁿ, the parameters into β ∈ R⁽ᵖ⁺¹⁾, and build the design matrix:

X = ⎡ 1  x₁₁  x₁₂  …  x₁ₚ ⎤
    ⎢ 1  x₂₁  x₂₂  …  x₂ₚ ⎥
    ⎢ ⋮   ⋮    ⋮   ⋱   ⋮  ⎥
    ⎣ 1  xₙ₁  xₙ₂  …  xₙₚ ⎦
The column of ones handles the intercept. You can verify that the i-th component of Xβ is ŷᵢ, so the entire model collapses to y = Xβ + ε, and the sum of squared residuals becomes || y − Xβ ||².
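To make this concrete, here is a minimal NumPy sketch on made-up toy data (the array names, sizes, and parameter values are illustrative, not from the article): it builds the design matrix with its column of ones and confirms that the sum of squared residuals equals the squared norm || y − Xβ ||².

```python
import numpy as np

# Hypothetical toy data: n = 5 observations, p = 2 features.
rng = np.random.default_rng(0)
n, p = 5, 2
features = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Design matrix: a leading column of ones for the intercept, then the features.
X = np.column_stack([np.ones(n), features])  # shape (n, p + 1)

# For an arbitrary parameter vector beta, predictions are X @ beta.
beta = np.array([0.5, 1.0, -2.0])
residuals = y - X @ beta

# Sum of squared residuals equals the squared Euclidean norm of y - X beta.
ssr_as_sum = float(np.sum(residuals**2))
ssr_as_norm = float(np.linalg.norm(y - X @ beta) ** 2)
```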

This is where most treatments take the derivative of || y − Xβ ||² with respect to β, set it to zero, and arrive at the normal equations. It works. But there is a far more revealing path, one that requires no calculus at all.

We adopt a completely different viewpoint: Think of each observation as a dimension.

If we have n observations, we work in Rⁿ. The response vector y = (y₁, y₂, … , yₙ)ᵀ is one geometric object, encoding all our observed responses simultaneously.

This seems strange at first. We’re treating observations as dimensions rather than as points. But this viewpoint is what makes everything click.

We’ve written X as a matrix of rows, one row per observation. But now read it column by column:

X = [ X₍₀₎  X₍₁₎  …  X₍ₚ₎ ]

Here X₍₀₎ is the column of ones, and for j ≥ 1 each column X₍ⱼ₎ is a vector in Rⁿ containing all n observations of the j-th feature.

So where do our predictions live?

The column space of X, denoted C(X), is the set of all linear combinations of these column vectors:

C(X) = { β₀X₍₀₎ + β₁X₍₁₎ + ⋯ + βₚX₍ₚ₎ : β₀, β₁, …, βₚ ∈ R }

or equivalently,

C(X) = { Xβ : β ∈ R⁽ᵖ⁺¹⁾ }
Now here’s the crucial observation. For any choice of β, the i-th component of Xβ is:

(Xβ)ᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + ⋯ + βₚxᵢₚ
This is exactly the predicted value for observation i, so the vector Xβ is precisely the vector of all predictions ŷ. Every possible prediction vector that our linear model can produce therefore lies in C(X). This is a linear subspace of Rⁿ: it contains the zero vector, and it’s closed under addition and scalar multiplication. If X has full column rank, then dim(C(X)) = p + 1. Also, we typically have n ≫ p + 1, many more observations than parameters, so we can visualize C(X) as a “small” subspace sitting inside the much “larger” space Rⁿ.
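A quick numerical illustration (with a hypothetical design matrix chosen only for this example): Xβ really is a linear combination of X’s columns, and C(X) is only (p + 1)-dimensional inside Rⁿ when X has full column rank.

```python
import numpy as np

# Hypothetical design matrix: ones, a feature, and its square, as vectors in R^5.
X = np.column_stack([np.ones(5), np.arange(5.0), np.arange(5.0) ** 2])
beta = np.array([2.0, -1.0, 0.5])

# X @ beta is literally beta_0 * column_0 + beta_1 * column_1 + beta_2 * column_2.
by_columns = beta[0] * X[:, 0] + beta[1] * X[:, 1] + beta[2] * X[:, 2]
same_vector = np.allclose(X @ beta, by_columns)

# The column space is a 3-dimensional subspace of R^5 (full column rank here).
dim_col_space = int(np.linalg.matrix_rank(X))
```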

Now let’s look at the original optimization problem. It said: find β minimizing

∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

But look at what this sum actually is:

∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = || y − ŷ ||²
The sum of squared residuals is the squared Euclidean distance from y to ŷ in Rⁿ. So minimizing the sum of squared residuals is the same thing as finding the point ŷ ∈ C(X) that is closest to y.
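This closest-point claim can be sanity-checked empirically. The sketch below (toy data; NumPy’s least-squares solver is used here as a black box, ahead of the derivation) perturbs the optimal parameters in random directions and confirms that no other point of C(X) is closer to y.

```python
import numpy as np

# Toy data, made up for illustration.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])
y = rng.normal(size=6)

# Least-squares fit and its distance from y.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
best_dist = float(np.linalg.norm(y - X @ beta_hat))

# Any other point X @ beta in the column space is at least as far from y.
other_dists = [
    float(np.linalg.norm(y - X @ (beta_hat + rng.normal(size=3))))
    for _ in range(200)
]
never_closer = min(other_dists) >= best_dist - 1e-12
```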

So we have a point and a subspace, and we want the nearest point on the subspace. And this problem has a clean, classical answer from linear algebra.

Think of it in 3D first: if you have a plane through the origin and a point above the plane, the closest point on the plane is found by dropping a perpendicular from the point straight down to the plane. The connecting line (the residual) hits the plane at a right angle.

This is exactly what happens in Rⁿ. By the Projection Theorem, the unique closest point ŷ ∈ C(X) to y is characterized by a single elegant condition:

(y − ŷ) ⊥ C(X)

The residual vector e = y − ŷ must be orthogonal to the entire column space. ŷ is called the orthogonal projection of y onto C(X).

This is the geometric heart of linear regression.

Now comes the payoff. We can derive the famous normal equations purely from the orthogonality condition, with no derivatives at all.

We need e ⊥ C(X), meaning e is perpendicular to every vector in the column space. But C(X) is spanned by the columns of X, so it suffices to check perpendicularity against each column: perpendicularity to the basis implies perpendicularity to the whole space.

For each column j,

X₍ⱼ₎ᵀ e = 0,   j = 0, 1, …, p
Stacking these p + 1 equations:

⎡ X₍₀₎ᵀ ⎤
⎢ X₍₁₎ᵀ ⎥
⎢   ⋮   ⎥ e = 0
⎣ X₍ₚ₎ᵀ ⎦

But the left matrix, where each row is a transposed column of X, is exactly Xᵀ. So:

Xᵀe = 0

Now substitute e = y − ŷ = y − Xβ̂ :

Xᵀ(y − Xβ̂) = 0,   i.e.,   XᵀXβ̂ = Xᵀy
These are the normal equations: the same equations you’d get by differentiating the loss function and setting the gradient to zero. But we derived them from a single geometric principle: the residual is perpendicular to the column space.
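The normal equations and the orthogonality condition they encode can be checked numerically. A minimal sketch on toy data (np.linalg.solve is one of several ways to solve the system; in practice a QR-based solver is preferred for stability):

```python
import numpy as np

# Toy data, made up for illustration.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
y = rng.normal(size=8)

# Solve the normal equations X^T X beta_hat = X^T y directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residual e = y - X beta_hat is orthogonal to every column of X: X^T e = 0.
e = y - X @ beta_hat
orthogonal_to_columns = np.allclose(X.T @ e, 0.0, atol=1e-10)
```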

If X has full column rank, XᵀX is invertible and the normal equations give

β̂ = (XᵀX)⁻¹Xᵀy

So:

ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy

We define the hat matrix:

H = X(XᵀX)⁻¹Xᵀ
Then ŷ = Hy. The matrix H maps any vector in Rⁿ to its orthogonal projection onto C(X).
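Here is a short sketch of the hat matrix in action (toy data; the explicit inverse is used for clarity, not numerical robustness): applying H to y reproduces the fitted values from a standard least-squares solve.

```python
import numpy as np

# Toy data, made up for illustration.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(7), rng.normal(size=(7, 2))])
y = rng.normal(size=7)

# Hat matrix H = X (X^T X)^{-1} X^T, an n-by-n projector onto C(X).
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

# Same fitted values as solving the least-squares problem directly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
matches_lstsq = np.allclose(y_hat, X @ beta_hat)
```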

H is a projection matrix, meaning it satisfies two properties:
i) Idempotence, i.e., H² = H. Why? Because projecting twice is the same as projecting once: if a vector is already in the subspace, projecting it again changes nothing.
ii) Symmetry, i.e., Hᵀ = H. This is what makes the projection orthogonal rather than oblique.
Both properties can be verified directly from the formula H = X(XᵀX)⁻¹Xᵀ.
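Both properties are easy to confirm numerically; a minimal check on a made-up design matrix:

```python
import numpy as np

# Toy design matrix, made up for illustration.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

idempotent = np.allclose(H @ H, H)  # projecting twice == projecting once
symmetric = np.allclose(H.T, H)     # orthogonal, not oblique, projection
```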

This geometric perspective is profound because it reveals that linear regression is fundamentally a problem in linear algebra: orthogonal projection onto a subspace.

References:
Gilbert Strang, Introduction to Linear Algebra, Chapter 4 on orthogonality and projections.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, Chapter 3 on linear methods for regression.


The Geometric Interpretation Of Linear Regression was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
