To implement embedding pooling strategies like mean, max, and CLS, you need to aggregate token-level embeddings from a transformer model into a single fixed-length vector. Each method has distinct mechanics and use cases. Mean pooling averages all token embeddings, max pooling extracts the highest value per dimension, and CLS pooling uses a dedicated token’s embedding. Below is a step-by-step breakdown of each approach, including code considerations and practical examples.
Mean Pooling computes the average of all token embeddings in a sequence. This works well for capturing general context but may dilute strong local features. To implement it, sum the embeddings along the sequence dimension and divide by the number of tokens. For example, in PyTorch, if `embeddings` is a tensor of shape `[batch_size, sequence_length, hidden_size]`, use `torch.mean(embeddings, dim=1)`. However, when sequences are padded (e.g., with zeros), exclude padding tokens by first multiplying the embeddings by the attention mask and then dividing by the sum of mask values. For instance:
```python
# Expand the attention mask to the embedding shape and cast to float
mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
# Sum embeddings over the sequence dimension, ignoring padding positions
sum_embeddings = torch.sum(embeddings * mask_expanded, dim=1)
# Count real tokens per sequence; clamp to avoid division by zero
sum_mask = torch.clamp(mask_expanded.sum(1), min=1e-9)
mean_pooled = sum_embeddings / sum_mask
```
This ensures padding tokens don’t skew the average.
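Putting the pieces together, here is a minimal end-to-end sketch of mean pooling with the Hugging Face `transformers` library. The checkpoint name and the example sentences are illustrative choices, not requirements; any encoder that returns `last_hidden_state` works the same way.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; swap in whatever encoder you are using
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["The movie was amazing.", "I fell asleep halfway through."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state            # [batch_size, sequence_length, hidden_size]
attention_mask = inputs["attention_mask"]

mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
sum_embeddings = torch.sum(embeddings * mask_expanded, dim=1)
sum_mask = torch.clamp(mask_expanded.sum(1), min=1e-9)
mean_pooled = sum_embeddings / sum_mask           # [batch_size, hidden_size]
```

Because both sentences are padded to the same length in one batch, the mask-aware average is what keeps the shorter sentence's embedding from being pulled toward zero.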
Max Pooling extracts the maximum value for each feature dimension across tokens, emphasizing the most prominent signals. For example, in text, this might highlight keywords like “amazing” in a sentiment analysis task. In PyTorch, use `torch.max(embeddings, dim=1).values`. To handle padding, set padding token embeddings to a large negative value before applying the max:
```python
# Replace padding positions with a large negative value so they never win the max
embeddings = embeddings.masked_fill(~mask_expanded.bool(), -1e9)
max_pooled = torch.max(embeddings, dim=1).values
```
This ensures padding values don’t accidentally become the maximum. Max pooling is useful when specific tokens dominate the meaning, but it can ignore contextual relationships between words.
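To see why this masking step matters, consider a small toy example (the tensor values below are made up purely for illustration): the zeros at a padded position can otherwise win the max for any dimension whose real values are all negative.

```python
import torch

# One sequence of 3 positions (the last is padding) with 2 feature dimensions
embeddings = torch.tensor([[[-0.5, 2.0],
                            [-0.2, 1.0],
                            [ 0.0, 0.0]]])        # padding position is all zeros
attention_mask = torch.tensor([[1, 1, 0]])

naive_max = torch.max(embeddings, dim=1).values   # tensor([[0.0, 2.0]]) -- padding wins dim 0

mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size())
masked = embeddings.masked_fill(~mask_expanded.bool(), -1e9)
max_pooled = torch.max(masked, dim=1).values      # tensor([[-0.2, 2.0]]) -- only real tokens count
```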
CLS Pooling relies on a special `[CLS]` token added by models like BERT, which is trained to capture sentence-level features. Here, you simply take the embedding of the first token (index 0). In code:
```python
cls_pooled = embeddings[:, 0, :]   # [CLS] is the first token in BERT-style inputs
```
This method is efficient and model-specific. However, its effectiveness depends on whether the model was pretrained to use the CLS token for downstream tasks. For example, BERT’s CLS token is trained for classification, but if you’re using a model without this design, CLS pooling might underperform compared to mean or max. Always check the model’s documentation to confirm its CLS token’s intended use.
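As a concrete sketch, reusing the `outputs` object from the Hugging Face example above: BERT-style models in `transformers` also expose a `pooler_output`, which is the [CLS] embedding passed through an extra dense layer and tanh learned during pretraining, so it is distinct from the raw first-token embedding.

```python
# Assumes `outputs` comes from a BERT-style encoder, as in the earlier example
embeddings = outputs.last_hidden_state

# Raw [CLS] token embedding
cls_pooled = embeddings[:, 0, :]

# Some models also return pooler_output: the [CLS] embedding passed through
# a Linear + tanh head; not every architecture provides it, hence the getattr
pooler_pooled = getattr(outputs, "pooler_output", None)
```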
In practice, the choice depends on your task and model. Mean pooling is a safe default for general-purpose use, max pooling suits tasks where specific tokens matter most, and CLS pooling is ideal for models explicitly trained with it. Experimentation and validation on your specific data will determine the best approach.
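If you want to compare strategies empirically, one convenient pattern is a single helper that switches between them; the `pool` function below and its argument names are just a sketch, not a standard library API.

```python
import torch

def pool(embeddings: torch.Tensor, attention_mask: torch.Tensor, strategy: str = "mean") -> torch.Tensor:
    """Reduce [batch_size, sequence_length, hidden_size] token embeddings to [batch_size, hidden_size]."""
    mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
    if strategy == "mean":
        return (embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    if strategy == "max":
        return embeddings.masked_fill(mask == 0, -1e9).max(dim=1).values
    if strategy == "cls":
        return embeddings[:, 0, :]
    raise ValueError(f"Unknown pooling strategy: {strategy}")
```

For example, `pool(outputs.last_hidden_state, inputs["attention_mask"], "max")` drops in anywhere the earlier snippets were used, which makes it easy to A/B the three options on your own validation data.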