A cross-encoder processes input pairs (e.g., a query and a document) jointly by concatenating them and feeding them into a single neural network. This allows the model to directly capture interactions between the two inputs through mechanisms like self-attention, which evaluates relationships between all tokens in the combined sequence. For example, in a question-answering task, a cross-encoder would analyze how each word in the question relates to each word in the answer candidate. In contrast, a bi-encoder processes each input independently, generating separate embeddings (vector representations) for each. Similarity is then computed using metrics like cosine similarity between these embeddings. For instance, a bi-encoder might encode a search query and a document separately and compare their precomputed vectors.
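To make the architectural difference concrete, here is a minimal sketch of both approaches using the sentence-transformers library. The model names are common pretrained checkpoints chosen for illustration; any compatible bi-encoder and cross-encoder checkpoints would work the same way.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "What is the capital of France?"
document = "Paris is the capital and largest city of France."

# Bi-encoder: encode each input independently, then compare the vectors.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
doc_emb = bi_encoder.encode(document, convert_to_tensor=True)
bi_score = util.cos_sim(query_emb, doc_emb).item()  # cosine similarity

# Cross-encoder: feed the concatenated pair through a single network,
# letting self-attention relate every query token to every document token.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_score = cross_encoder.predict([(query, document)])[0]  # relevance score

print(f"bi-encoder cosine similarity:  {bi_score:.3f}")
print(f"cross-encoder relevance score: {cross_score:.3f}")
```

Note that the bi-encoder never sees the query and document together: each side is reduced to a fixed vector before any comparison happens, while the cross-encoder scores the pair in one joint forward pass.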
The key trade-off is efficiency versus accuracy. Bi-encoders are faster because embeddings can be precomputed for large datasets (e.g., millions of documents), enabling rapid similarity searches. This makes them suitable for the initial retrieval stage in search systems, though they may miss nuanced interactions between inputs. Cross-encoders, while slower because every pair must be processed at query time, achieve higher accuracy by modeling direct interactions. For example, because it sees both inputs together, a cross-encoder can recognize that "bank" in "river bank" means something different from "bank" in a financial document for a given query. That same property makes cross-encoders impractical for large-scale retrieval, since scoring every possible pair is computationally expensive.
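The sketch below illustrates why precomputation makes bi-encoders scale: the document embeddings are built once offline, so each incoming query costs only one encode plus a single matrix-vector product, whereas a cross-encoder would need a full forward pass per (query, document) pair. The toy corpus and model name are placeholders for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "The river bank eroded after the flood.",
    "The bank approved the mortgage application.",
    "Hiking trails wind along the valley floor.",
]

# Offline: precompute and L2-normalize all document embeddings once.
doc_embs = bi_encoder.encode(corpus, normalize_embeddings=True)

# Online: one cheap encode per query, then a single dot product against
# the whole corpus yields cosine similarities for every document at once.
query_emb = bi_encoder.encode("erosion along a river bank", normalize_embeddings=True)
scores = doc_embs @ query_emb

for i in np.argsort(-scores)[:2]:  # top-2 documents
    print(f"{scores[i]:.3f}  {corpus[i]}")
```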
Use a bi-encoder when latency and scalability are critical, such as in real-time search engines or recommendation systems with vast candidate pools. For example, a product search feature might use a bi-encoder to quickly retrieve top candidates from a catalog. Choose a cross-encoder for tasks requiring high precision on smaller candidate sets, such as reranking the top 100 results from a bi-encoder or verifying entailment in NLP tasks. For instance, a QA system might first use a bi-encoder to find 100 answer candidates, then apply a cross-encoder to select the most relevant one. This hybrid approach balances speed and accuracy effectively.
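A hedged sketch of that retrieve-then-rerank pipeline follows, again using sentence-transformers. The `search` function, the candidate counts, and the tiny corpus are all illustrative choices, not prescribed values; in practice the corpus embeddings would be precomputed over a much larger collection.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
    "Berlin is the capital of Germany.",
    "Mount Everest is the highest mountain on Earth.",
]
# Precomputed offline in a real system.
doc_embs = bi_encoder.encode(corpus, convert_to_tensor=True)

def search(query, retrieve_k=100, final_k=3):
    # Stage 1: fast bi-encoder retrieval over the full corpus.
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, doc_embs, top_k=retrieve_k)[0]

    # Stage 2: accurate cross-encoder reranking of the small candidate set.
    pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
    scores = cross_encoder.predict(pairs)
    reranked = sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True)
    return [(float(score), doc) for score, (_, doc) in reranked[:final_k]]

print(search("capital city of France"))
```

The design point is that the expensive model only ever sees `retrieve_k` candidates per query rather than the whole corpus, so the cross-encoder's accuracy is bought at a fixed, small cost on top of the cheap retrieval stage.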