If you encounter NaN or infinite values in the loss during Sentence Transformer training, start by checking learning rates and optimizer settings. A high learning rate can cause gradient updates to overshoot stable minima, leading to exploding gradients and numerical instability; for example, using a learning rate above 1e-4 without a warmup schedule can destabilize training. Reduce the learning rate (e.g., to 1e-5) and enable gradient clipping (e.g., max_grad_norm=1.0) to limit large updates. Additionally, verify that your optimizer (e.g., Adam) isn't accumulating numerical errors in its momentum terms; this can happen with mixed-precision training if it isn't configured correctly.
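As a concrete starting point, here is a minimal sketch of those settings using the Sentence Transformers v3+ trainer arguments; the output directory and exact values are placeholders, not recommendations for your specific setup:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Conservative settings to rule out optimizer-driven instability
args = SentenceTransformerTrainingArguments(
    output_dir="checkpoints",  # placeholder path
    learning_rate=1e-5,        # lower learning rate so updates don't overshoot
    warmup_ratio=0.1,          # warm up over the first 10% of training steps
    max_grad_norm=1.0,         # clip exploding gradients
    fp16=False,                # keep full precision until the NaNs are resolved
    bf16=False,
)
```

If the loss becomes finite again with these settings, reintroduce mixed precision and a higher learning rate one change at a time to find the culprit.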
Next, inspect your input data and preprocessing. NaN values often arise from invalid or poorly scaled inputs. Check for empty strings, malformed text, or extreme values in tokenized inputs (e.g., due to rare tokens or incorrect tokenizer settings). For triplet or contrastive loss, ensure that positive/negative pairs are correctly formatted and that embeddings aren't collapsing (e.g., all outputs becoming nearly identical). If all sentence embeddings are near-zero, for example, cosine similarity divides by a near-zero norm and can produce NaN, which then propagates into loss functions like Softmax or CrossEntropy. Normalize embeddings (e.g., with L2 normalization) or rescale similarity scores to keep these operations numerically stable.
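The snippet below is a rough sketch of that kind of sanity check; the example triplets and the model name are stand-ins for your own data and checkpoint:

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for your checkpoint

# Catch empty or non-string texts before they reach the tokenizer
triplets = [
    ("a query", "a matching passage", ""),
    ("another query", "positive text", "negative text"),
]
for i, (anchor, positive, negative) in enumerate(triplets):
    for role, text in (("anchor", anchor), ("positive", positive), ("negative", negative)):
        if not isinstance(text, str) or not text.strip():
            print(f"row {i}: empty or invalid {role}")

# Near-zero embedding norms are a warning sign of collapsed representations
emb = model.encode([t[0] for t in triplets], convert_to_tensor=True)
print(emb.norm(dim=1))

# L2-normalize so cosine similarity never divides by a near-zero norm
emb = torch.nn.functional.normalize(emb, p=2, dim=1)
```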
Finally, review the model architecture and numerical operations. Debug the loss function implementation: for instance, using log operations without clamping (e.g., log(0)) can produce NaNs. Check for numerical precision issues in mixed-precision training; disable AMP (Automatic Mixed Precision) temporarily to isolate the problem. If using custom layers, validate that operations like attention scores or pooling aren't producing invalid values. For example, an uninitialized projection layer might output extreme values. Test with a smaller subset of data and a simplified model (e.g., fewer layers) to identify the root cause. Tools like gradient checking or logging intermediate tensor values (e.g., embeddings, loss components) can pinpoint where NaNs first appear.
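To illustrate those last points, here is a sketch of a clamped log plus two generic PyTorch ways to localize where non-finite values first appear; the model name is a placeholder and nothing here relies on Sentence Transformers internals:

```python
import torch
from sentence_transformers import SentenceTransformer

# Clamp probabilities away from zero before taking the log in a custom loss
probs = torch.tensor([0.0, 0.3, 0.7])
loss = -torch.log(probs.clamp(min=1e-7)).mean()  # finite even though probs contains 0

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

# Anomaly detection makes the backward pass raise at the op that produced NaN/Inf
torch.autograd.set_detect_anomaly(True)

# Forward hooks flag the first module whose output is non-finite
def nan_hook(module, inputs, output):
    if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
        print(f"non-finite output in {module.__class__.__name__}")

for module in model.modules():
    module.register_forward_hook(nan_hook)

_ = model.encode(["a probe sentence"])  # hooks fire during this forward pass
```

If the hook fires inside pooling or a custom projection module, the search narrows to a single layer rather than the whole training loop.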