
Softmax — A One-Page Primer

Darrell Thomas



What It Is

Softmax takes a list of numbers — any numbers, positive or negative, big or small — and converts them into probabilities that add up to 1.

Say you have three scores: [2, 1, 0.5]

Softmax turns them into: [0.63, 0.23, 0.14]

Note that these sum to 1.0: softmax turns a list of scores into a probability distribution.

The biggest number gets the biggest share. The smallest gets the smallest share. But nothing becomes zero — every option keeps at least a sliver of probability.


How It Works

For each number in the list:

  1. Raise e (approximately 2.718) to the power of that number.
  2. Divide by the sum of all those e-to-the-power values.

For our example [2, 1, 0.5]:

  • e^2 ≈ 7.39, e^1 ≈ 2.72, e^0.5 ≈ 1.65
  • Sum = 11.76
  • Result: [7.39/11.76, 2.72/11.76, 1.65/11.76] ≈ [0.63, 0.23, 0.14]
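The two steps above can be sketched in a few lines of Python (a minimal illustration, not code from the original article; the max-subtraction is a standard numerical-stability trick and does not change the result):

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtract the max score before exponentiating: this avoids
    # overflow for large inputs and leaves the result unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2, 1, 0.5])
print([round(p, 2) for p in probs])  # [0.63, 0.23, 0.14]
```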

That’s the entire operation. The exponential (e-to-the-power) is what makes softmax “soft” — it exaggerates the differences between scores without completely ignoring the smaller ones.
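To see that exaggeration concretely, compare softmax against a plain proportional split of the same scores (an illustrative sketch, not from the original article):

```python
import math

scores = [2, 1, 0.5]

# Plain proportional split: divide each score by the total.
linear = [s / sum(scores) for s in scores]

# Softmax: exponentiate first, then divide by the total.
exps = [math.exp(s) for s in scores]
soft = [e / sum(exps) for e in exps]

print([round(x, 2) for x in linear])  # [0.57, 0.29, 0.14]
print([round(x, 2) for x in soft])    # [0.63, 0.23, 0.14]
```

Softmax hands the top score a larger share (0.63 versus 0.57) while the smallest option still keeps its sliver.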


Why It Matters in AI

In a Transformer, the attention mechanism computes a dot product between every pair of words, producing raw similarity scores. But raw scores aren’t useful on their own — you need to know how to weight the results. That’s where softmax comes in.

After computing dot products, the system runs softmax to convert those raw scores into a set of weights that sum to 1. A high score might become 0.7, meaning “pay a lot of attention to this word.” A low score might become 0.02, meaning “mostly ignore it.”

This shows up directly in the attention formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

The QKᵀ part computes the raw dot-product scores (divided by √d_k to keep them in a stable range). The softmax converts them into attention weights. Then those weights are applied to the Values (V) to produce the final output.
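As a rough sketch, the formula can be traced in plain Python. The toy 2-D vectors below are invented for illustration; real models use learned projection matrices and many more dimensions:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over rows of small nested lists."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # QK^T: raw dot-product score of this query against every key,
        # scaled by sqrt(d_k).
        raw = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
               for k in K]
        weights = softmax(raw)  # attention weights, sum to 1
        # Apply the weights to the value vectors (a weighted average).
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# One query that matches the first key better than the second.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
print(out)
```

The output leans toward the first value vector because the first key matched the query better, but the second still contributes a little: softmax never assigns exactly zero.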

Without softmax, the model would have no principled way to decide how much to care about each word relative to the others.


Want the Deep Dive?

For a hands-on walkthrough of softmax and the functions that power neural networks, check out Artificial Intelligence By Example by Denis Rothman.


Primer by Darrell Thomas, 2026