Attention in transformers is computed from three vectors per token: Query (Q), Key (K), and Value (V), each produced by a learned linear projection of the token's embedding. The query vector represents what the token is looking for, the key vector encodes what it offers, and the value vector carries the information it passes along.
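As a rough illustration, here is a minimal NumPy sketch of these projections. The sizes and the weight matrices (`W_q`, `W_k`, `W_v`) are arbitrary placeholders standing in for learned parameters, not values from any particular model:

```python
import numpy as np

# Toy sizes for illustration only.
seq_len, d_model, d_k = 4, 8, 8

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))   # token embeddings, one row per token

# Learned projection matrices (random stand-ins for trained weights).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # what each token is looking for
K = X @ W_k   # what each token offers
V = X @ W_v   # the information each token passes along
```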
The attention scores for a token are obtained by taking the dot product of its query vector with the key vectors of every token in the sequence (including itself). These scores are scaled by the square root of the key dimension and passed through a softmax to normalize them into a probability distribution. The probabilities then weight a sum of the value vectors, producing the final attention output for that token.
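Continuing the sketch above (reusing `Q`, `K`, `V`, and `d_k`), the scaled dot-product step looks roughly like this:

```python
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Dot product of every query with every key, scaled by sqrt(d_k).
scores = Q @ K.T / np.sqrt(d_k)               # shape: (seq_len, seq_len)
weights = softmax(scores, axis=-1)            # each row sums to 1
output = weights @ V                          # weighted sum of value vectors
```

The scaling by the square root of the key dimension keeps the dot products from growing too large as dimensionality increases, which would otherwise push the softmax into a saturated region with very small gradients.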
Multi-head attention extends this by running several attention heads in parallel, each with its own learned Q, K, and V projections so it can attend to different aspects of the sequence. The outputs of all heads are concatenated and passed through a final linear layer. This mechanism lets transformers capture complex relationships across tokens and is a key reason for their success in LLMs.
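A sketch of multi-head attention, continuing the same toy setup (`X`, `d_model`, `rng`, and `softmax` from above); the head count and per-head weights are again illustrative placeholders rather than a definitive implementation:

```python
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

num_heads = 2
d_head = d_model // num_heads   # each head works in a smaller subspace

# Per-head projections and the final output projection (random stand-ins).
W_q_h = rng.normal(size=(num_heads, d_model, d_head))
W_k_h = rng.normal(size=(num_heads, d_model, d_head))
W_v_h = rng.normal(size=(num_heads, d_model, d_head))
W_o = rng.normal(size=(num_heads * d_head, d_model))

# Run attention independently in each head, then concatenate and project.
heads = [attention(X @ W_q_h[h], X @ W_k_h[h], X @ W_v_h[h]) for h in range(num_heads)]
multi_head_output = np.concatenate(heads, axis=-1) @ W_o   # shape: (seq_len, d_model)
```

Splitting the model dimension across heads keeps the total computation comparable to single-head attention while letting each head specialize in a different pattern of relationships between tokens.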