To improve throughput when generating embeddings with Sentence Transformers, batching multiple sentences together is the most effective approach. Instead of processing sentences one at a time, you group them into batches, which reduces overhead and maximizes hardware utilization. Here’s how to implement this effectively:
1. Use the `encode` Method with Batch Support
Sentence Transformers’ `model.encode()` accepts a list of sentences and processes them in batches automatically. Specify the `batch_size` parameter to control how many sentences are processed per batch. For example, `model.encode(sentences, batch_size=32)` processes 32 sentences at a time. Larger batches leverage GPU parallelism better but require more memory. Experiment with values like 64 or 128 (if your GPU allows it) to find the optimal balance, and if you encounter out-of-memory errors, reduce the batch size incrementally.
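For concreteness, here is a minimal sketch of batched encoding. The model name "all-MiniLM-L6-v2" is only an example; substitute whichever checkpoint you actually use:

```python
from sentence_transformers import SentenceTransformer

# Example model; any Sentence Transformers checkpoint works the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A short example sentence.", "Another sentence to embed."] * 1000

# encode() splits the list into batches internally; batch_size trades memory for throughput.
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)
print(embeddings.shape)  # (2000, embedding_dim)
```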
2. Optimize Inputs for Efficient Batching
Sentence lengths affect performance because the model pads shorter sentences to match the longest one in each batch. To minimize wasted computation, sort sentences by length before batching; this reduces padding within each batch. For example:
# Sort by approximate token count so each batch contains similarly long sentences
sentences_sorted = sorted(sentences, key=lambda x: len(x.split()))
embeddings = model.encode(sentences_sorted, batch_size=64)
# Note: the returned embeddings follow the sorted order, not the original input order
If sentences vary wildly in length, consider splitting them into groups (e.g., short, medium, long) and processing each group separately with tailored batch sizes.
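If you sort manually, keep track of each sentence’s original position so you can restore the input order after encoding. A minimal sketch, reusing `model` and `sentences` from above and assuming `encode()` returns a NumPy array (its default):

```python
import numpy as np

# Remember the original position of each sentence before sorting by length.
order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
sorted_sentences = [sentences[i] for i in order]

sorted_embeddings = model.encode(sorted_sentences, batch_size=64)

# Undo the sort so embeddings[i] corresponds to sentences[i] again.
embeddings = np.empty_like(sorted_embeddings)
embeddings[order] = sorted_embeddings
```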
3. Leverage Hardware and Framework Optimizations
- Mixed Precision (FP16): Enable FP16 inference if your GPU supports it, for example by casting the model to half precision with `model.half()`; this reduces memory usage and speeds up computation. (Note that `convert_to_tensor=True` only changes the output format of `encode()`; it does not enable FP16.) See the FP16 sketch after this list.
- Async/Pipelining: Use PyTorch’s `DataLoader` to iterate over the sentences in fixed-size chunks (optionally with `num_workers > 0` to prefetch the next chunk while the GPU processes the current one). For example:
from torch.utils.data import DataLoader
import numpy as np

# Yields lists of 64 raw sentences at a time, preserving the original order
loader = DataLoader(sentences, batch_size=64, shuffle=False)
embeddings = np.vstack([model.encode(list(batch), batch_size=64) for batch in loader])
- Multi-GPU: Distribute batches across GPUs using `model.start_multi_process_pool()` and `model.encode_multi_process()` for parallel processing (see the multi-GPU sketch after this list).
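For mixed precision, a minimal sketch, assuming a CUDA GPU and a model that runs correctly in half precision (`model` and `sentences` as above):

```python
import torch

# Cast all model weights to float16 and move the model to the GPU.
# This is a generic PyTorch cast; verify embedding quality for your model.
if torch.cuda.is_available():
    model = model.half().to("cuda")

embeddings_fp16 = model.encode(sentences, batch_size=128)
```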
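For the multi-GPU path, a minimal sketch using the pool-based API; by default the pool spans all visible CUDA devices:

```python
# Multiprocessing requires the __main__ guard on platforms that use "spawn".
if __name__ == "__main__":
    pool = model.start_multi_process_pool()
    try:
        # Chunks of sentences are distributed across the worker processes/GPUs.
        embeddings = model.encode_multi_process(sentences, pool, batch_size=64)
    finally:
        model.stop_multi_process_pool(pool)
```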
Practical Considerations
- Start with a batch size that fits your GPU memory (e.g., 32–128 for typical GPUs).
- For CPU-only setups, smaller batches (8–16) often work best.
- Profile memory usage with tools like `nvidia-smi` (GPU) or system memory monitors (CPU) to avoid bottlenecks.
By combining batched processing, input optimization, and hardware-specific tweaks, you can achieve significant throughput improvements—often 5–10x faster than single-sentence inference.