Dual encoders are neural network architectures that generate embeddings for pairs of inputs, such as text queries and documents, by processing each input through separate encoder networks. The goal is to map semantically similar items close to each other in a shared vector space. For example, in a search system, one encoder might process a user's query while the other processes a document. Both encoders output embeddings (dense vectors) of the same dimensionality, and similarity between the query and document is measured using operations like cosine similarity or dot product. The key idea is that related pairs (e.g., a query and its relevant document) should have higher similarity scores than unrelated pairs.
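As a concrete illustration, the sketch below wires up a minimal dual encoder in PyTorch. The encoder tower here is a toy stand-in (mean-pooled token embeddings plus a linear projection) rather than a real pretrained model such as BERT, and the vocabulary size, embedding dimension, and random token IDs are assumptions made for the example; only the overall structure, two towers mapping into a shared space scored by cosine similarity, reflects the idea described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy text encoder: mean-pooled token embeddings plus a projection.
    In practice this would be a pretrained transformer such as BERT."""
    def __init__(self, vocab_size=30522, dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, dim)
        return self.proj(self.embed(token_ids))

class DualEncoder(nn.Module):
    """Separate query and document towers sharing one output space."""
    def __init__(self, dim=128):
        super().__init__()
        self.query_encoder = Encoder(dim=dim)
        self.doc_encoder = Encoder(dim=dim)

    def score(self, query_ids, doc_ids):
        # L2-normalize both embeddings so the dot product is cosine similarity.
        q = F.normalize(self.query_encoder(query_ids), dim=-1)
        d = F.normalize(self.doc_encoder(doc_ids), dim=-1)
        return (q * d).sum(dim=-1)  # one similarity score per (query, doc) pair

# Toy usage with random token IDs standing in for tokenized text.
model = DualEncoder()
queries = torch.randint(0, 30522, (4, 16))   # 4 queries, 16 tokens each
docs = torch.randint(0, 30522, (4, 64))      # 4 documents, 64 tokens each
print(model.score(queries, docs))            # 4 similarity scores in [-1, 1]
```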
The two encoders can be identical in architecture (e.g., both using BERT for text) or different, depending on the data types involved. During training, the model learns by contrasting positive pairs (e.g., a question and its correct answer) with negative pairs (e.g., the same question paired with random answers). A common approach uses a contrastive objective such as triplet loss, which enforces that the anchor is closer to its positive than to a negative example by a predefined margin. For instance, in a recommendation system, a user's interaction with an item (e.g., clicking a product) forms a positive pair, while non-interacted items serve as negatives. Alternatively, the InfoNCE loss (used in models like DPR) scores one positive against multiple negatives in a batch and applies a softmax over the similarity scores, so the model learns to assign high probability to the correct pairing.
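The snippet below sketches the two objectives mentioned above, assuming the encoders have already produced a batch of query and document embeddings: an InfoNCE-style loss with in-batch negatives (each query's positive document sits on the diagonal of the batch similarity matrix) and a margin-based triplet loss. The temperature of 0.05, margin of 0.2, and use of cosine similarity are illustrative choices, not values prescribed by any particular system.

```python
import torch
import torch.nn.functional as F

def in_batch_infonce(query_emb, doc_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: query i's positive is document i,
    and every other document in the batch serves as a negative.
    query_emb, doc_emb: (batch, dim) outputs of the two encoders."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0))        # diagonal entries are the positives
    return F.cross_entropy(logits, targets)  # softmax over each row

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: the anchor's similarity to the positive must exceed
    its similarity to the negative by at least `margin`."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(neg_sim - pos_sim + margin).mean()

# Toy check with random embeddings in place of encoder outputs.
q, d = torch.randn(8, 128), torch.randn(8, 128)
print(in_batch_infonce(q, d).item())
print(triplet_loss(q, d, torch.randn(8, 128)).item())
```

In-batch negatives are attractive because every other example in the batch is reused at no extra encoding cost; harder negatives, if available, can simply be appended as additional columns of the logits matrix.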
In practice, dual encoders excel in efficiency because embeddings for items like documents or products can be precomputed and stored. At inference time, only the query needs encoding, and similarity is calculated via fast vector operations. For example, a search engine might precompute embeddings for every document in its index, enabling real-time retrieval by comparing a user's query embedding against the stored vectors. Challenges include ensuring the encoders capture meaningful semantics and selecting effective negative samples; using hard negatives (plausible but incorrect examples) instead of random ones often improves performance. Additionally, techniques such as normalizing embeddings and applying a temperature parameter in the loss function can help stabilize training. Overall, dual encoders balance accuracy and scalability, making them widely used in search, recommendation, and question-answering systems.
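The following sketch shows this precompute-then-retrieve pattern, using plain `nn.EmbeddingBag` modules as stand-ins for trained query and document towers: the corpus is encoded once offline, and at query time a single matrix multiply plus a top-k selection ranks all stored vectors. The corpus size, dimensions, and random token IDs are toy assumptions; a production system would typically hand the stored vectors to an approximate nearest-neighbor index rather than scan them exhaustively.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in encoders; in a real system these are the trained query/document towers.
query_encoder = nn.EmbeddingBag(30522, 128, mode="mean")
doc_encoder = nn.EmbeddingBag(30522, 128, mode="mean")

@torch.no_grad()
def build_index(encoder, doc_token_ids):
    """Offline step: encode and L2-normalize the whole corpus once."""
    return F.normalize(encoder(doc_token_ids), dim=-1)

@torch.no_grad()
def retrieve(encoder, query_token_ids, doc_index, k=5):
    """Online step: encode only the query, then rank all precomputed
    document vectors with a single matrix multiply."""
    q = F.normalize(encoder(query_token_ids), dim=-1)
    scores = q @ doc_index.T          # (num_queries, num_docs) cosine scores
    return scores.topk(k, dim=-1)     # top-k scores and document indices

corpus = torch.randint(0, 30522, (1000, 64))   # 1,000 toy "documents"
index = build_index(doc_encoder, corpus)       # precomputed and stored offline
query = torch.randint(0, 30522, (1, 16))       # a single incoming query
scores, doc_ids = retrieve(query_encoder, query, index)
print(doc_ids)                                 # indices of the top-5 documents
```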