To speed up embedding generation, three primary techniques are precision reduction, model quantization, and graph optimization via formats like ONNX. Each method balances speed, memory efficiency, and accuracy, and their effectiveness depends on hardware support and model architecture.
1. FP16/Mixed Precision: Using 16-bit floating-point (FP16) instead of 32-bit (FP32) reduces memory usage and computation time. Modern GPUs with NVIDIA Tensor Cores execute FP16 operations faster and can fit larger batches in memory. For example, PyTorch’s autocast enables mixed precision by automatically casting parts of the model to FP16 while keeping numerically sensitive layers in FP32 to avoid precision loss; this can roughly double throughput with minimal accuracy impact. However, FP16 may cause underflow or overflow in models not designed for lower precision. Libraries like NVIDIA’s Apex or TensorFlow’s mixed_precision mitigate this by scaling gradients during training, but for inference, enabling FP16 mode in frameworks like Hugging Face’s Transformers (via torch_dtype=torch.float16) often suffices, as in the sketch below.
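A minimal sketch of FP16 embedding inference, assuming a CUDA GPU and the Hugging Face encoder sentence-transformers/all-MiniLM-L6-v2 (an illustrative choice; any BERT-style encoder works the same way):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # assumed, illustrative encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Load weights directly in FP16 to halve memory; requires a GPU with FP16 support.
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.float16).to("cuda").eval()

texts = ["What is mixed precision?", "FP16 halves memory use."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to("cuda")

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**batch)
    # Mean-pool token embeddings into one vector per sentence.
    mask = batch["attention_mask"].unsqueeze(-1)
    embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

print(embeddings.shape, embeddings.dtype)  # e.g. torch.Size([2, 384]) torch.float16
```

On hardware that supports it, bfloat16 (torch.bfloat16) is a common drop-in alternative when FP16 runs into numerical issues, since it keeps FP32’s dynamic range.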
2. Model Quantization: Quantization reduces the numerical precision of weights and activations, often from FP32 to 8-bit integers (INT8). Post-training quantization (PTQ) applies this after training with minimal calibration data, while quantization-aware training (QAT) simulates lower precision during training for better accuracy. For instance, using TensorFlow Lite’s TFLiteConverter or PyTorch’s quantize_dynamic (see the sketch below) can shrink model size by about 4x and accelerate inference on CPUs or NPUs with INT8 support. However, embeddings from quantized models may lose subtle semantic nuances. Tools like ONNX Runtime’s quantization toolkit allow per-layer adjustments to balance speed and quality, which is critical for tasks like retrieval-augmented generation (RAG) where embedding quality affects downstream results.
3. ONNX and Graph Optimizations: Converting models to ONNX standardizes the computation graph, enabling framework-agnostic optimizations. ONNX Runtime applies techniques like operator fusion (combining adjacent layers to reduce overhead) and hardware-specific kernel optimizations. For example, a BERT model exported to ONNX can see a 20-30% speedup on CPUs via graph simplifications and parallelization. Additionally, ONNX models can feed hardware-specific toolkits like TensorRT or OpenVINO for further optimization. However, converting models with custom operations (e.g., bespoke PyTorch nn.Module subclasses) may require rewriting layers in ONNX-compatible ops. The torch.onnx exporter handles most standard layers (see the sketch below), but complex architectures may need manual adjustments.
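A rough sketch of the export-and-run path for the illustrative MiniLM encoder. The exact export arguments vary by architecture, and Hugging Face’s Optimum library automates much of this, but the raw torch.onnx route looks roughly like:

```python
import onnxruntime as ort
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # assumed, illustrative encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

# Export with dynamic batch/sequence axes so the ONNX graph accepts any input shape.
dummy = tokenizer("dummy input", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "last_hidden_state": {0: "batch", 1: "seq"},
    },
    opset_version=17,
)

# ONNX Runtime applies graph optimizations (operator fusion, constant folding) by default.
session = ort.InferenceSession("encoder.onnx", providers=["CPUExecutionProvider"])
enc = tokenizer(["ONNX Runtime fuses operators for faster CPU inference."],
                padding=True, return_tensors="np")
hidden = session.run(None, {"input_ids": enc["input_ids"],
                            "attention_mask": enc["attention_mask"]})[0]
mask = enc["attention_mask"][..., None]
embedding = (hidden * mask).sum(axis=1) / mask.sum(axis=1)
print(embedding.shape)
```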
Other methods include pruning (removing redundant weights), distillation (training smaller models to mimic larger ones), and batching (processing multiple inputs simultaneously); a simple batching sketch follows below. For instance, combining FP16 with dynamic batching in NVIDIA Triton Inference Server helps maximize GPU utilization. Each technique involves trade-offs: quantization and pruning may require retraining or calibration, while ONNX conversion demands compatibility testing. Choosing the right approach depends on the target hardware (e.g., TPUs favor bfloat16), latency requirements, and acceptable accuracy thresholds.
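As a final sketch, client-side batching combined with FP16 (falling back to FP32 on CPU) keeps the accelerator saturated. Triton’s dynamic batching achieves a similar effect server-side through its model configuration, which is not shown here; the model name remains the illustrative MiniLM encoder:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # assumed, illustrative encoder
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=dtype).to(device).eval()

def embed(texts, batch_size=64):
    """Embed texts in fixed-size batches to keep the GPU busy without exhausting memory."""
    out = []
    for i in range(0, len(texts), batch_size):
        chunk = texts[i:i + batch_size]
        batch = tokenizer(chunk, padding=True, truncation=True,
                          return_tensors="pt").to(device)
        with torch.inference_mode():
            hidden = model(**batch).last_hidden_state
            mask = batch["attention_mask"].unsqueeze(-1)
            # Mean-pool, then move results back to FP32 on the CPU for storage/indexing.
            out.append(((hidden * mask).sum(1) / mask.sum(1)).float().cpu())
    return torch.cat(out)

vectors = embed([f"document {i}" for i in range(1000)])
print(vectors.shape)  # (1000, hidden_size)
```

Tuning batch_size against available memory and latency targets is usually the quickest win before reaching for quantization or graph-level changes.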