The 1/√d_k scaling in attention isn’t about numerical stability: here’s the actual math and why it breaks without it [D]
Every resource says "We scale by 1/√d_k to prevent softmax saturation." Almost none of them explain why saturation happens or why that specific scaling constant appears. When you compute Q·Kᵀ without scaling, each element is a dot product of …
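As a rough sanity check on how unscaled dot products behave as d_k grows, here is a minimal sketch in plain Python. It assumes the entries of q and k are independent draws from N(0, 1), an assumption made here purely for illustration. The function names (`dot_product_variance`, `mean_max_attention_weight`) and the sample sizes are hypothetical, not taken from any library.

```python
import math
import random


def dot_product_variance(d_k, n_samples=2000, seed=0):
    """Empirical variance of q.k when q and k have i.i.d. N(0, 1) entries (assumed)."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / n_samples
    return sum((x - mean) ** 2 for x in dots) / n_samples


def softmax(xs):
    """Softmax with the usual max-subtraction, so overflow is not the issue here."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_max_attention_weight(d_k, scale, n_keys=8, n_trials=200, seed=0):
    """Average of the largest softmax weight over random queries and keys.

    Values near 1.0 mean one key takes almost all the weight (saturation).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        logits = []
        for _ in range(n_keys):
            k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
            score = sum(a * b for a, b in zip(q, k))
            logits.append(score / math.sqrt(d_k) if scale else score)
        total += max(softmax(logits))
    return total / n_trials


if __name__ == "__main__":
    for d_k in (4, 16, 64):
        print(
            d_k,
            round(dot_product_variance(d_k), 2),
            round(mean_max_attention_weight(d_k, scale=False), 3),
            round(mean_max_attention_weight(d_k, scale=True), 3),
        )
```

Under these assumptions the empirical variance should land close to d_k itself, and the largest unscaled softmax weight should creep toward 1 as d_k grows, while the scaled version stays spread out. The softmax here already subtracts the max before exponentiating, so whatever goes wrong in the unscaled case is not an overflow problem.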