To prepare training data for fine-tuning a Sentence Transformer, you need structured pairs or triples of sentences that indicate similarity or dissimilarity. The format depends on the loss function used (e.g., contrastive loss for pairs, triplet loss for triples). For example, pairs consist of two sentences with a binary label (1 for similar, 0 for dissimilar), while triples include an anchor sentence, a positive (similar) sentence, and a negative (dissimilar) sentence. The data must be organized to reflect these relationships clearly and consistently.
For pairwise training, each example typically includes two sentences and a label. A common format is a CSV or TSV file with columns such as sentence1, sentence2, and label. For instance, in a semantic textual similarity task, a row might contain "The cat sits on the mat", "A feline is on the rug", 1 to denote similarity. For triplet loss, data is structured as anchor-positive-negative groups, such as (anchor="How to reset a password", positive="Steps to recover your login credentials", negative="Installing a new graphics driver"). These triples can be stored in a CSV with columns anchor, positive, and negative, or in a JSON array with corresponding keys.
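As a minimal sketch of the two CSV layouts described above, the following uses only Python's standard csv module; the example sentences and the helper names (to_csv, from_csv) are illustrative, not part of any library API:

```python
import csv
import io

# Hypothetical example rows in the two formats described above; the column
# names follow the sentence1/sentence2/label and anchor/positive/negative
# conventions from the text.
PAIR_HEADER = ["sentence1", "sentence2", "label"]
TRIPLET_HEADER = ["anchor", "positive", "negative"]

pair_rows = [
    ("The cat sits on the mat", "A feline is on the rug", "1"),
    ("The cat sits on the mat", "Installing a new graphics driver", "0"),
]
triplet_rows = [
    ("How to reset a password",
     "Steps to recover your login credentials",
     "Installing a new graphics driver"),
]

def to_csv(header, rows):
    """Serialize rows to a CSV string with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

def from_csv(text):
    """Parse a CSV string back into a list of per-row dicts keyed by column."""
    return list(csv.DictReader(io.StringIO(text)))

pairs = from_csv(to_csv(PAIR_HEADER, pair_rows))
triplets = from_csv(to_csv(TRIPLET_HEADER, triplet_rows))
```

Reading rows back as dicts keyed by column name makes it easy to feed them into whatever training structure the chosen loss function expects.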
When creating custom datasets, ensure the examples are relevant to your domain. For example, in a FAQ retrieval system, anchors could be user queries, positives the correct answers, and negatives plausible but incorrect answers. Use techniques like hard negative mining to select challenging negatives (e.g., answers from related but distinct topics) rather than random ones, and apply data augmentation (e.g., paraphrasing, back-translation) to increase diversity. The InputExample class in the Sentence Transformers library helps structure these pairs or triples during preprocessing, ensuring compatibility with the model's dataloaders and loss functions. The formatting and quality of examples directly affect the model's ability to learn meaningful embeddings.