To effectively integrate retrieved documents into an LLM's input, modifications to input formatting are often necessary. A common approach is adding special tokens to demarcate sections of the input. For example, tokens like [CONTEXT] and [/CONTEXT] can signal the start and end of retrieved content, helping the model distinguish between the original query and external information. Separators like [DOC1] or [SOURCE] can also differentiate multiple documents, enabling the model to process them as distinct units. Additionally, positional encodings or segment embeddings can be adjusted to account for the extended input length caused by adding documents. For instance, using separate embeddings for the query, retrieved context, and document boundaries (e.g., BERT-style [SEP] tokens) allows the model to better track relationships between sections. Without such markers, the model may struggle to prioritize relevant context or parse long inputs cohesively.
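The demarcation scheme above can be sketched as a small formatting helper. This is a minimal illustration, not a standard: the token names ([CONTEXT], [DOC1], etc.) follow the conventions described in the text, but the exact names and layout are design choices for your model.

```python
# Sketch: wrap retrieved documents in boundary tokens so the model
# can distinguish external context from the user's query.
# Token names here ([CONTEXT], [DOC{i}]) are illustrative choices.

def format_rag_input(query: str, documents: list[str]) -> str:
    """Demarcate each retrieved document and the overall context span."""
    doc_sections = [
        f"[DOC{i + 1}] {doc.strip()}" for i, doc in enumerate(documents)
    ]
    context = " ".join(doc_sections)
    return f"[CONTEXT] {context} [/CONTEXT] Question: {query}"

prompt = format_rag_input(
    "Who wrote The Origin of Species?",
    ["Charles Darwin published On the Origin of Species in 1859.",
     "Darwin developed his theory during the voyage of the Beagle."],
)
```

In practice, any new special tokens must also be registered with the tokenizer and embedding table (for Hugging Face models, via `tokenizer.add_special_tokens` followed by `model.resize_token_embeddings`), otherwise they are split into subwords and lose their boundary-marking role.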
Architectural adjustments are often needed to handle retrieved data efficiently. Sparse attention mechanisms, like those in Longformer or Sparse Transformers, reduce computational overhead when processing lengthy documents. Alternatively, a Fusion-in-Decoder (FiD) architecture processes each retrieved document independently with an encoder, then aggregates their outputs in the decoder, which improves scalability. Another approach involves adding cross-attention layers between the retrieved documents and the main input, enabling the model to dynamically focus on relevant passages. For example, models like RETRO use a separate encoder for retrieved content and integrate it via cross-attention during generation. Fine-tuning the model on tasks that require combining query and context (e.g., question answering) is also critical, as it teaches the model to leverage the added structure. Without architectural changes, the model may fail to effectively utilize the retrieved information due to input length limits or insufficient attention mechanisms.
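The Fusion-in-Decoder data flow described above can be sketched with stand-in functions. This is a toy illustration of the control flow only: `encode` is a stub producing one fake embedding per token, not a real encoder, and a real FiD decoder would cross-attend over the fused sequence rather than return it.

```python
# Toy sketch of Fusion-in-Decoder (FiD) data flow: each (query, doc)
# pair is encoded independently, then the encoder outputs are
# concatenated so the decoder can attend over all documents at once.

def encode(query: str, document: str) -> list[list[float]]:
    # Stand-in encoder: one fake "embedding" per whitespace token.
    tokens = f"{query} {document}".split()
    return [[float(len(tok))] for tok in tokens]

def fid_forward(query: str, documents: list[str]) -> list[list[float]]:
    # Encode each document independently (cost grows linearly with
    # the number of documents, instead of quadratically as it would
    # if all documents were concatenated before self-attention) ...
    per_doc = [encode(query, doc) for doc in documents]
    # ... then fuse by concatenating along the sequence axis.
    fused = [state for states in per_doc for state in states]
    return fused  # the decoder would cross-attend over this sequence

states = fid_forward(
    "capital of France?",
    ["Paris is the capital.", "France is in Europe."],
)
```

The key property this sketch shows is that self-attention never spans document boundaries; only the decoder's cross-attention sees the full fused sequence, which is what makes FiD scale to many retrieved passages.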
Practical implementation requires balancing input design and model capabilities. For instance, if retrieved documents are numerous or lengthy, truncating or selecting top-ranked snippets prevents exceeding token limits. Weighting mechanisms—such as attention biases toward [CONTEXT] sections or using retriever confidence scores as input features—can help prioritize high-quality documents. Extending the embedding layer to include new tokens (e.g., [CTX]) and adjusting positional encoding ranges are necessary steps during fine-tuning. Testing formats like “Context: {documents} Question: {query}” (used in T5) or prepending metadata (e.g., document titles) can further enhance performance. Developers should also evaluate trade-offs: while adding tokens and segments improves clarity, it increases input length and may require model retraining. Similarly, architectural changes like FiD improve document handling but add complexity. Iterative experimentation with formatting and architecture is key to optimizing context utilization without overcomplicating the system.
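The snippet-selection and formatting steps above can be combined into one helper. This is a minimal sketch under two stated simplifications: whitespace splitting stands in for a real tokenizer's token count, and the retriever scores are assumed to be provided alongside each snippet.

```python
# Sketch: greedily keep the highest-scored snippets until a token
# budget is reached, then lay them out in the "Context: ...
# Question: ..." format discussed above. Whitespace tokenization is
# a rough proxy for a real tokenizer's count.

def select_and_format(query: str,
                      snippets: list[tuple[float, str]],
                      max_tokens: int = 512) -> str:
    """snippets: (retriever_score, text) pairs, any order."""
    kept, used = [], 0
    for score, text in sorted(snippets, key=lambda s: -s[0]):
        cost = len(text.split())
        if used + cost > max_tokens:
            continue  # skip snippets that would blow the budget
        kept.append(text)
        used += cost
    return f"Context: {' '.join(kept)} Question: {query}"

out = select_and_format(
    "q?",
    [(0.9, "alpha beta"), (0.5, "gamma delta epsilon"), (0.1, "zeta")],
    max_tokens=4,
)
```

Note the greedy skip-and-continue policy: a lower-scored but shorter snippet can still be admitted after a longer one is rejected, which tends to use the token budget more fully than stopping at the first overflow.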