The size and type of a model directly influence RAG pipeline design by dictating how much context it can process, how precise retrieval must be, and what computational trade-offs are involved. Larger models such as GPT-3 have broader contextual understanding and longer token limits, enabling them to synthesize information from multiple documents even when some of them are noisy. Smaller models, such as 7B-parameter open-source LLMs, often have tighter context windows and weaker reasoning capabilities, so retrieval must be more targeted to provide sufficient signal. For example, a GPT-3.5-class model with a 16k-token context might accommodate 20 retrieved documents, while a smaller model with a 4k window might need stricter filtering down to 5 high-quality documents. This in turn affects how documents are chunked, indexed, and reranked, as the sketch below illustrates.
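
To make the context-budget point concrete, the sketch below estimates how many retrieved chunks fit in a model's window after reserving room for the prompt and the answer. The window sizes, chunk sizes, and reserves are illustrative assumptions; a real implementation would count tokens with the target model's tokenizer rather than a flat per-chunk estimate.

```python
# Minimal sketch: sizing the retrieval budget from a model's context window.
# All numbers are illustrative assumptions, not measured values.

def retrieval_budget(context_window: int,
                     prompt_tokens: int,
                     answer_reserve: int,
                     avg_chunk_tokens: int) -> int:
    """Return how many retrieved chunks fit alongside the prompt and answer."""
    available = context_window - prompt_tokens - answer_reserve
    return max(available // avg_chunk_tokens, 0)

# A 16k-context model can absorb far more chunks than a 4k-context model,
# which is why the smaller model needs more aggressive filtering.
print(retrieval_budget(16_000, prompt_tokens=500, answer_reserve=1_000,
                       avg_chunk_tokens=700))   # ~20 chunks
print(retrieval_budget(4_000, prompt_tokens=500, answer_reserve=1_000,
                       avg_chunk_tokens=500))   # ~5 chunks
```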
Design choices for smaller models prioritize retrieval precision and context efficiency. Because they lack the capacity to "fill gaps" in noisy data, the retrieval stage needs stronger relevance signals (e.g., a learned sparse retriever like SPLADE or a fine-tuned dense bi-encoder rather than off-the-shelf BERT embeddings) plus aggressive reranking to surface only the most relevant snippets. You might also preprocess documents into smaller chunks or summaries so they fit tighter context limits. In contrast, larger models tolerate broader retrieval (e.g., BM25 with minimal reranking) because they can extract insights from less curated context. That tolerance comes with higher latency and cost: a GPT-3-scale model might require roughly 10x more GPU memory, making it impractical for real-time applications without optimizations such as dynamic batching.
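
The snippet below sketches that precision-first pattern for a small-context model: retrieve broadly with whatever first-stage retriever you use, then keep only the few passages a cross-encoder scores highest. The cross-encoder checkpoint, the example passages, and the top_n cutoff are assumptions chosen for illustration.

```python
# Minimal sketch of aggressive reranking before prompting a small-context model.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score (query, passage) pairs with the cross-encoder and keep the top_n."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]

# Passages returned by any first-stage retriever (BM25, dense, hybrid).
candidate_passages = [
    "Chunking long documents into ~500-token passages can improve recall.",
    "Cross-encoder reranking filters noise out of first-stage results.",
    "Smaller models have tighter context windows than larger ones.",
]

# A 4k-context model gets only the reranked top 5; a 16k-context model could keep far more.
context_passages = rerank("How does chunk size affect recall?", candidate_passages)
```

The same skeleton works for the larger model: you simply raise top_n (or skip reranking entirely) and let the model sort through the extra context.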
Key metrics for evaluating these differences include recall@k (how many of the ground-truth relevant documents appear in the top-k results), answer accuracy or F1 (generation quality), and latency per query. A smaller model may match a larger one's accuracy only when retrieval is evaluated at a larger k (e.g., needing the top-20 documents versus the top-5 for a GPT-3-class model), which signals its dependence on precise retrieval. Latency metrics reveal the trade-offs: a smaller model with optimized retrieval might match the larger model's end-to-end speed despite weaker generation, while the larger model may bottleneck on token generation. Context utilization efficiency (e.g., the percentage of retrieved tokens actually used in the answer) can also highlight how effectively each model type leverages the provided documents, with smaller models typically requiring tighter alignment between context and query.
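
As a rough illustration of how these metrics might be computed, the functions below implement recall@k and a token-overlap proxy for context utilization. The overlap heuristic and the example document IDs are simplifying assumptions; a real evaluation would use a labeled relevance set and a more careful utilization measure (e.g., attributed answer spans).

```python
# Sketch of recall@k and a crude context-utilization proxy (assumptions noted above).

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of ground-truth relevant documents found in the top-k results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def context_utilization(answer: str, retrieved_text: str) -> float:
    """Rough proxy: share of retrieved tokens that also appear in the answer."""
    answer_tokens = set(answer.lower().split())
    retrieved_tokens = retrieved_text.lower().split()
    if not retrieved_tokens:
        return 0.0
    used = sum(1 for token in retrieved_tokens if token in answer_tokens)
    return used / len(retrieved_tokens)

# Example: the same query evaluated at a cutoff of k=4.
print(recall_at_k(["d3", "d7", "d1", "d9"], {"d1", "d2"}, k=4))  # 0.5
```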
