To simulate realistic RAG latency, you must account for the entire end-to-end workflow, not just the model's inference time. This includes document retrieval, preprocessing, model loading, and infrastructure constraints. For example, retrieval latency depends on network delays, database query efficiency, and caching strategies. If documents are stored in a remote vector database, simulate network variability using a tool like Linux's tc (traffic control)
to add artificial delays. Test both cached (warm) and uncached (cold) retrieval paths to account for real-world cache misses. Preprocessing steps like tokenization, embedding generation, or prompt formatting also add overhead—measure these by instrumenting each step in the pipeline.
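One way to instrument the pipeline is to time each stage separately and exercise both the cold and warm retrieval paths. The sketch below is a minimal, self-contained illustration: `retrieve` and `preprocess` are hypothetical stand-ins (the sleeps simulate a remote round trip and tokenization overhead), not a real RAG stack.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Hypothetical stand-ins for real pipeline steps.
def retrieve(query, cache):
    if query in cache:           # warm path: cache hit
        return cache[query]
    time.sleep(0.05)             # simulate remote vector-DB round trip
    cache[query] = ["doc1", "doc2"]
    return cache[query]

def preprocess(docs):
    time.sleep(0.01)             # simulate tokenization / prompt formatting
    return " ".join(docs)

cache = {}
with timed("retrieval_cold"):
    docs = retrieve("q1", cache)
with timed("retrieval_warm"):
    docs = retrieve("q1", cache)
with timed("preprocess"):
    prompt = preprocess(docs)

for stage, secs in timings.items():
    print(f"{stage}: {secs * 1000:.1f} ms")
```

In a real benchmark, you would replace the sleeps with actual calls and log the per-stage timings so cache-miss penalties show up as a separate distribution rather than being averaged away.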
Model loading and initialization are critical in serverless or autoscaling environments. Measure cold-start latency by forcing the system to load the model from disk or initialize a new inference session. For example, in AWS Lambda, cold starts can add several seconds to latency. Hardware constraints (e.g., GPU memory limits) also impact performance: simulate this by throttling resources (using docker run with --memory and --cpus,
or Kubernetes resource limits) to see how latency degrades under CPU/RAM pressure. Concurrency is another factor: use load-testing tools like Locust to simulate multiple simultaneous requests and observe how retrieval and inference times scale under load.
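Cold-start cost can be captured by timing the first request (load plus inference) against a subsequent warm request. This is a toy sketch with hypothetical `load_model` and `infer` functions; the sleeps stand in for disk I/O and a forward pass.

```python
import time

def load_model(path):
    """Hypothetical stand-in for loading weights from disk."""
    time.sleep(0.2)              # simulate disk I/O + initialization
    return object()

def infer(model, prompt):
    time.sleep(0.02)             # simulate a forward pass
    return "answer"

# Cold start: model load + first inference.
t0 = time.perf_counter()
model = load_model("weights.bin")
infer(model, "hello")
cold = time.perf_counter() - t0

# Warm path: model already resident in memory.
t0 = time.perf_counter()
infer(model, "hello")
warm = time.perf_counter() - t0

print(f"cold: {cold * 1000:.0f} ms, warm: {warm * 1000:.0f} ms")
```

Reporting cold and warm latencies as separate percentiles, rather than one blended number, makes autoscaling penalties visible.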
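Locust is the full-featured option for load testing; the underlying idea, firing concurrent requests and measuring per-request latency percentiles, can be sketched with the standard library alone. `rag_request` here is a hypothetical stand-in for one retrieval-plus-inference round trip.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def rag_request(query):
    """Hypothetical stand-in for one retrieval + inference round trip."""
    start = time.perf_counter()
    time.sleep(0.05)             # simulate retrieval + inference work
    return time.perf_counter() - start

queries = [f"q{i}" for i in range(20)]
with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(rag_request, queries))

latencies.sort()
p50 = latencies[len(latencies) // 2]
p95 = latencies[int(len(latencies) * 0.95)]
print(f"p50={p50 * 1000:.0f} ms  p95={p95 * 1000:.0f} ms")
```

Sweeping `max_workers` upward while watching p95 is a quick way to find the concurrency level at which retrieval or inference starts to queue.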
Finally, include error handling and retries. For instance, if a document fetch fails, the system might retry the query or fall back to a secondary data source, increasing latency. Use fault injection tools (e.g., Toxiproxy for network faults and timeouts, or Chaos Monkey for instance failures) to simulate these conditions. Also, test with varying document sizes and counts—retrieving 10KB snippets versus 1MB PDFs will affect transfer and processing times. By combining these elements—infrastructure variability, workload diversity, and failure scenarios—you’ll create a benchmark that reflects real-world conditions, helping identify bottlenecks like slow database queries or insufficient GPU capacity.
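The retry-then-fallback path adds latency that a benchmark should capture explicitly. A minimal sketch, assuming a primary store that times out and a hypothetical secondary source:

```python
import time

def fetch_with_fallback(fetch_primary, fetch_fallback,
                        retries=2, backoff=0.1):
    """Retry the primary source, then fall back; return (docs, elapsed)."""
    start = time.perf_counter()
    for attempt in range(retries + 1):
        try:
            docs = fetch_primary()
            return docs, time.perf_counter() - start
        except TimeoutError:
            time.sleep(backoff * (2 ** attempt))   # exponential backoff
    docs = fetch_fallback()
    return docs, time.perf_counter() - start

# Simulate a primary vector DB that always times out.
def flaky_primary():
    raise TimeoutError("vector DB timeout")

def secondary():
    return ["cached snippet"]

docs, elapsed = fetch_with_fallback(flaky_primary, secondary)
print(f"served from fallback after {elapsed * 1000:.0f} ms")
```

Note how three failed attempts with exponential backoff (0.1 s + 0.2 s + 0.4 s here) already add roughly 700 ms before the fallback even runs; this is exactly the tail-latency contribution that fault injection should surface.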