A Sentence Transformer trained with a siamese (or twin) network structure processes pairs of input sentences through two identical subnetworks that share all parameters. The goal is to learn embeddings in which semantically similar sentences lie close together in vector space and dissimilar ones lie far apart. The architecture is called "siamese" because the twin networks mirror each other, guaranteeing that both inputs are processed identically. Shared weights prevent the model from treating the two inputs differently, which is essential for tasks such as similarity comparison and contrastive learning.
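The weight-sharing idea can be illustrated without a real transformer. In the sketch below, a hashed bag-of-words function (a toy stand-in of my own, not an actual Sentence Transformer) plays the role of the shared subnetwork: the same `encode` function, with the same "parameters", maps both sentences of a pair into one embedding space, where cosine similarity compares them.

```python
import math
from collections import Counter

DIM = 16  # embedding dimensionality of the toy encoder

def encode(sentence: str) -> list[float]:
    """Toy shared-weight 'subnetwork': hashed bag-of-words, L2-normalized.

    Every input goes through this one function, mirroring how both
    branches of a siamese network apply identical parameters.
    """
    vec = [0.0] * DIM
    for token, count in Counter(sentence.lower().split()).items():
        vec[hash(token) % DIM] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(u: list[float], v: list[float]) -> float:
    # Vectors are already unit-length, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

# Siamese usage: the *same* encoder processes both sides of each pair.
a = encode("the cat sat on the mat")
b = encode("the cat sat on the mat")
c = encode("quarterly revenue grew sharply")
print(cosine(a, b), cosine(a, c))  # identical sentences score highest
```

Because the encoder is shared, the two embeddings of the identical pair coincide, while the unrelated sentence lands elsewhere in the space.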
The twin structure is typically paired with a loss function such as contrastive loss or triplet loss. In triplet loss, for example, the model takes an anchor sentence, a positive (similar) example, and a negative (dissimilar) example; all three are encoded by the same shared-weight network, and the loss penalizes the model whenever the anchor's embedding is not closer to the positive than to the negative by at least a margin. This setup forces the model to learn fine-grained distinctions. Without shared weights, the networks could diverge, making their embeddings incomparable; by tying parameters, the model guarantees that the same transformation applies to every input, creating a unified embedding space.
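The triplet objective itself is a one-line formula, max(0, d(a, p) − d(a, n) + margin). A minimal sketch, using hand-picked 2-D vectors in place of real model outputs (the embeddings and the 0.5 margin are illustrative assumptions):

```python
def euclidean(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def triplet_loss(anchor, positive, negative, margin: float = 0.5) -> float:
    """Zero when the anchor is at least `margin` closer to the positive
    than to the negative; otherwise the size of the violation."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical 2-D embeddings, not real model outputs:
anchor   = [1.0, 0.0]
positive = [0.9, 0.1]   # near the anchor  -> small d(a, p)
negative = [0.0, 1.0]   # far from anchor  -> large d(a, n)

loss_ok  = triplet_loss(anchor, positive, negative)   # triplet satisfied -> 0.0
loss_bad = triplet_loss(anchor, positive, [0.8, 0.2]) # negative too close -> positive loss
```

A well-separated triplet contributes no gradient, so training effort concentrates on the hard cases where the negative intrudes on the anchor's neighborhood.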
Practically, this design reduces computational overhead. Instead of training separate models for each input, weight sharing allows efficient processing of pairs or triplets. During inference, only one copy of the network is needed to encode any sentence. For example, in semantic search, a query and document are encoded using the same model, and their embeddings are compared. The siamese structure is foundational for tasks like paraphrase detection, where the model must distinguish subtle differences in meaning, and its efficiency makes it scalable for real-world applications like retrieval-augmented generation (RAG) systems.
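The single-encoder inference pattern can be sketched as a tiny semantic search: the corpus is embedded once, and each query is embedded with the same function and ranked by cosine similarity. The hashed bag-of-words `encode` below is again a toy stand-in for the trained model, and the three-document corpus is invented for illustration.

```python
import math
from collections import Counter

DIM = 32

def encode(sentence: str) -> list[float]:
    """Toy stand-in for the trained encoder (hashed bag-of-words, unit-norm)."""
    vec = [0.0] * DIM
    for token, count in Counter(sentence.lower().split()).items():
        vec[hash(token) % DIM] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Index the corpus once with the single shared encoder.
corpus = [
    "the cat sat on the mat",
    "stock markets rallied today",
    "a kitten rested on the rug",
]
index = [(doc, encode(doc)) for doc in corpus]

def search(query: str) -> list[str]:
    """Encode the query with the *same* model, rank documents by cosine."""
    q = encode(query)
    scored = [(sum(a * b for a, b in zip(q, d)), doc) for doc, d in index]
    return [doc for _, doc in sorted(scored, reverse=True)]

results = search("cat on the mat")
```

Because query and documents live in one embedding space, retrieval reduces to a nearest-neighbor lookup, which is what makes the pattern scale to RAG-style pipelines.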