Evaluate embeddings for agentic RAG by testing retrieval recall, semantic consistency, and agent loop efficiency on domain-specific benchmarks.
Evaluation framework:
1. Retrieval recall@k:
- Create ground-truth pairs: (query, relevant_documents)
- For each query, count how many relevant docs appear in top-k retrieved results
- Calculate recall@k = relevant_docs_found_in_top_k / total_relevant_docs, then average across all queries
- Target: >80% recall@5 for your domain
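The recall@k computation above can be sketched in a few lines. This is a minimal illustration, not a full harness; `retrieve_fn` is a hypothetical stand-in for your retrieval client and is assumed to return a ranked list of document IDs for a query.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of a query's relevant docs that appear in the top-k results."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

def mean_recall_at_k(ground_truth, retrieve_fn, k=5):
    """Average recall@k over (query, relevant_docs) ground-truth pairs."""
    scores = [recall_at_k(retrieve_fn(query), docs, k)
              for query, docs in ground_truth]
    return sum(scores) / len(scores)
```

Compare the averaged score against the >80% target for each embedding model under test.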
2. Semantic consistency test:
- Query variations should retrieve the same document
- Example: "What happened in Q4?" and "What was the outcome in October–December?" should both retrieve Q4 reports
- Measure: percentage of query variations retrieving the same top result
- Target: >90% consistency
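One way to score consistency is to treat the first phrasing in each paraphrase group as canonical and check whether every variation returns the same top-1 document. A minimal sketch, again assuming a hypothetical `retrieve_fn` that returns ranked document IDs:

```python
def consistency_rate(variation_groups, retrieve_fn):
    """variation_groups: list of query groups, each a list of paraphrases
    of the same intent. Returns the fraction of non-canonical variations
    whose top-1 result matches the group's canonical (first) query."""
    hits, total = 0, 0
    for queries in variation_groups:
        canonical_top = retrieve_fn(queries[0])[0]
        for query in queries[1:]:
            total += 1
            if retrieve_fn(query)[0] == canonical_top:
                hits += 1
    return hits / total
```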
3. Agent loop efficiency:
- Run agents on test queries
- Count average loops needed to answer
- Measure context tokens consumed
- Target: median of 2–3 loops, <500 tokens per query
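A small helper can aggregate these two metrics. Here `run_agent` is a hypothetical hook into your agent harness, assumed to return the loop count and context tokens consumed for one query:

```python
import statistics

def loop_efficiency(run_agent, queries):
    """run_agent(query) -> (n_loops, context_tokens).
    Summarizes median loops and mean token usage over a test set."""
    loops, tokens = zip(*(run_agent(query) for query in queries))
    return {
        "median_loops": statistics.median(loops),
        "mean_tokens": sum(tokens) / len(tokens),
    }
```

Check the summary against the targets: a median of 2-3 loops and token usage within your budget.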
4. Domain adaptation:
- Test on supply chain queries, legal queries, customer support queries separately
- Some embedding models excel at general semantic similarity but fail on domain-specific terminology
- Choose embeddings with domain-specific fine-tuning if available
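Breaking recall out per domain makes terminology gaps visible that an aggregate score hides. A sketch, reusing the same hypothetical `retrieve_fn` convention as above:

```python
def per_domain_recall(domain_sets, retrieve_fn, k=5):
    """domain_sets: {domain_name: [(query, relevant_docs), ...]}.
    Returns average recall@k computed separately for each domain,
    so a model strong on support queries but weak on legal jargon
    shows up immediately."""
    report = {}
    for domain, pairs in domain_sets.items():
        scores = [len(set(retrieve_fn(q)[:k]) & set(rel)) / len(rel)
                  for q, rel in pairs]
        report[domain] = sum(scores) / len(scores)
    return report
```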
5. Latency at scale:
- Index 1M+ embeddings in Zilliz Cloud
- Measure p95 query latency
- Target: <100ms for single query, <500ms for agent loop (3 queries)
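Measuring p95 latency client-side needs only a timer around the search call. A minimal sketch using the standard library; `search_fn` is a hypothetical wrapper around your vector-store client's query method:

```python
import statistics
import time

def p95_latency_ms(search_fn, queries):
    """Time each query and return the 95th-percentile latency in milliseconds.
    Requires at least two queries (statistics.quantiles needs >= 2 points)."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        latencies.append((time.perf_counter() - start) * 1000)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
    return statistics.quantiles(latencies, n=100)[94]
```

Run this against a warm index at production scale; cold-cache and small-index numbers will understate real latency.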
Recommended test set: Use MTEB benchmarks + your own domain queries. Include edge cases (misspellings, abbreviations, acronyms).
Poor embeddings are the #1 cause of agent loop failures. Invest time in evaluation with Zilliz Cloud.