To encode long documents with Sentence Transformers, the primary approach is to break the text into smaller segments that fit within the model’s token limit (commonly 256 or 512 tokens, depending on the model). This is necessary because transformer-based models cannot process sequences longer than their maximum input size; Sentence Transformers simply truncates anything beyond it. The two common strategies are fixed-size chunking and sliding-window overlap. Fixed-size chunking splits the document into non-overlapping segments, ideally along sentence or paragraph boundaries detected with libraries like spaCy or NLTK, to keep semantic units intact. Sliding-window approaches create overlapping chunks (e.g., 500 tokens per chunk with a 50-token overlap) to preserve context between segments. The choice depends on the task: fixed chunks are computationally cheaper, while overlapping windows reduce information loss at chunk boundaries.
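For illustration, here is a minimal sketch of the sliding-window strategy using the Hugging Face tokenizer behind a Sentence Transformers model. The model name and the 500/50 chunk and overlap sizes are assumptions matching the example above, not recommendations:

```python
# Minimal sliding-window chunking sketch. The model name and the default
# chunk/overlap sizes are illustrative; keep chunk_tokens below your
# model's max_seq_length (minus special tokens).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text, chunk_tokens=500, overlap=50):
    """Split text into chunks of at most chunk_tokens model tokens,
    with consecutive chunks sharing `overlap` tokens of context."""
    assert 0 <= overlap < chunk_tokens
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    stride = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        window = token_ids[start : start + chunk_tokens]
        # Note: decode() may normalize whitespace or casing relative
        # to the original text (e.g., with uncased tokenizers).
        chunks.append(tokenizer.decode(window))
        if start + chunk_tokens >= len(token_ids):
            break  # last window already reaches the end of the document
    return chunks
```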
After splitting, each chunk is encoded independently with Sentence Transformers, producing one embedding per chunk. To create a single document representation, these embeddings can be aggregated using methods like average pooling (the mean of all chunk embeddings) or max pooling (the element-wise maximum across chunks). For tasks requiring per-segment analysis (e.g., search or question answering), the individual chunk embeddings are retained instead: in retrieval, each chunk is typically indexed separately so that queries can match the relevant text span. In classification, averaging embeddings often works well. Some underlying models expose a [CLS] token embedding, but most Sentence Transformers models apply mean pooling over token embeddings by default, which is a safe starting point.
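A sketch of encoding the chunks and pooling them into one vector, reusing the chunk_text() helper from the previous sketch; the document text is a stand-in, and the pooling choice depends on your task:

```python
# Sketch: encode chunks independently, then aggregate. Reuses the
# chunk_text() helper from the previous sketch.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

long_document = " ".join(["Lorem ipsum dolor sit amet."] * 500)  # stand-in text
chunks = chunk_text(long_document,
                    chunk_tokens=model.max_seq_length - 2,  # room for [CLS]/[SEP]
                    overlap=50)

chunk_embs = model.encode(chunks)        # ndarray of shape (n_chunks, dim)

doc_emb_mean = chunk_embs.mean(axis=0)   # average pooling: one document vector
doc_emb_max = chunk_embs.max(axis=0)     # max pooling: element-wise max per dim

# For retrieval or QA, skip the pooling and index chunk_embs row by row,
# keeping a map from each row back to its source span.
```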
Key considerations include context preservation and computational cost: overlapping chunks improve context retention but increase processing time. For very long documents, splitting with a tokenizer aligned to the model (e.g., the same tokenizer loaded via from_pretrained()) ensures accurate token counting; counting words or characters instead will over- or under-fill chunks. Note also that compact models like all-MiniLM-L12-v2 are optimized for shorter texts, while alternatives like Longformer (not natively supported in Sentence Transformers) handle longer sequences but require custom integration. Testing chunk sizes (e.g., 256 vs. 512 tokens) and overlap ratios on validation data helps balance performance and efficiency; a rough template for such a loop is sketched below. Always preprocess text to remove noise (e.g., HTML tags) before splitting so that token space is not wasted on irrelevant content.
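The following sketch of that tuning loop reuses the model, chunk_text(), and long_document objects from the earlier sketches; the grid values are illustrative, and the printed diagnostics stand in for whatever validation metric fits your task:

```python
# Sketch of a small grid over chunk sizes and overlaps, reusing model,
# chunk_text(), and long_document from the sketches above. Replace the
# printed diagnostics with your task's validation metric.
limit = model.max_seq_length                      # model-dependent, e.g. 256 or 512
for chunk_tokens in (limit // 2 - 2, limit - 2):  # leave room for special tokens
    for overlap in (0, chunk_tokens // 10):
        chunks = chunk_text(long_document, chunk_tokens=chunk_tokens, overlap=overlap)
        # Sanity-check lengths with the model's own tokenizer (specials included).
        longest = max(len(model.tokenizer.encode(c)) for c in chunks)
        print(f"chunk={chunk_tokens} overlap={overlap}: "
              f"{len(chunks)} chunks, longest={longest} tokens")
```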