Combine Vertex AI with Milvus for semantic search by splitting responsibilities: use Vertex AI to generate embeddings and handle model inference, and use Milvus to store, index, and query vectors at scale. Start by chunking your corpus (documents, product descriptions, tickets) into passages with stable IDs and metadata (source, timestamp, permissions). Call a Vertex AI embedding endpoint (or a custom model you deployed) to produce vectors for each chunk, then upsert them into a Milvus collection with a chosen distance metric (cosine or L2) and an index such as IVF or HNSW. Persist metadata alongside vectors so you can filter by tenant, language, or tags at query time.
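Here is a minimal sketch of that ingestion path, assuming pymilvus's `MilvusClient` (quick-setup mode) and the Vertex AI SDK's `TextEmbeddingModel`. The project ID, Milvus URI, embedding model name (`text-embedding-004`), collection name, and metadata fields are placeholders, not requirements; adapt them to your deployment.

```python
# Ingestion sketch: embed chunks with Vertex AI, upsert vectors + metadata into Milvus.
import vertexai
from vertexai.language_models import TextEmbeddingModel
from pymilvus import MilvusClient

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholder project
embed_model = TextEmbeddingModel.from_pretrained("text-embedding-004")  # 768-dim embeddings

client = MilvusClient(uri="http://localhost:19530")  # placeholder Milvus endpoint
client.create_collection(
    collection_name="docs",
    dimension=768,
    metric_type="COSINE",  # or "L2"; chosen once per collection
)
# Quick setup uses Milvus's default AUTOINDEX and dynamic fields; for explicit
# IVF/HNSW parameters, define a full schema and index_params instead.

chunks = [
    {"id": 1, "text": "Return policy: items may be returned within 30 days.",
     "tenant": "acme", "lang": "en"},
    {"id": 2, "text": "Shipping typically takes 3-5 business days.",
     "tenant": "acme", "lang": "en"},
]

# Embed each chunk (batch larger corpora to respect per-request limits),
# then insert vector plus metadata; extra keys land in Milvus dynamic fields.
embeddings = embed_model.get_embeddings([c["text"] for c in chunks])
rows = [
    {"id": c["id"], "vector": e.values, "text": c["text"],
     "tenant": c["tenant"], "lang": c["lang"]}
    for c, e in zip(chunks, embeddings)
]
client.insert(collection_name="docs", data=rows)
```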
At query time, convert the user query into an embedding using the same Vertex AI model. Run a top-k ANN search in Milvus with optional filters, then assemble the retrieved passages into your application’s response. For a RAG flow, pass the retrieved text as context to a generation endpoint (e.g., a Gemini model) to produce grounded answers. To improve ranking precision, add a re-ranker: either a lightweight cross-encoder you host on Vertex AI or a short re-ranking pass using a model call. Log candidate sets and final choices to evaluate recall@k, MRR, and answer accuracy against a validation suite.
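The query side can stay equally small. The sketch below continues the ingestion example above; the model names (`text-embedding-004`, `gemini-1.5-flash`), collection name, and filter fields are illustrative assumptions, and the re-ranking and logging steps described above are omitted for brevity.

```python
# Query-time sketch: embed the query, run filtered top-k ANN search, ground a Gemini answer.
import vertexai
from vertexai.language_models import TextEmbeddingModel
from vertexai.generative_models import GenerativeModel
from pymilvus import MilvusClient

vertexai.init(project="your-gcp-project", location="us-central1")
embed_model = TextEmbeddingModel.from_pretrained("text-embedding-004")  # same model as ingestion
gen_model = GenerativeModel("gemini-1.5-flash")
client = MilvusClient(uri="http://localhost:19530")

query = "How long do I have to return an item?"
qvec = embed_model.get_embeddings([query])[0].values

# Top-k ANN search with a metadata filter on tenant and language.
hits = client.search(
    collection_name="docs",
    data=[qvec],
    limit=5,
    filter='tenant == "acme" and lang == "en"',
    output_fields=["text"],
)[0]

# Assemble the retrieved passages into a grounded prompt for generation.
context = "\n\n".join(h["entity"]["text"] for h in hits)
prompt = (
    "Answer using only the context below. If the answer is not in the "
    f"context, say so.\n\nContext:\n{context}\n\nQuestion: {query}"
)
print(gen_model.generate_content(prompt).text)
```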
Operationally, treat embeddings as a versioned artifact. When you update the embedding model, build a parallel Milvus collection and dual-write during backfill. Run shadow queries against both collections to compare quality and latency; cut over only when metrics improve. Tune index parameters (e.g., IVF nlist/nprobe, HNSW M/efSearch) to balance recall and latency, and consider Product Quantization if memory pressure is high. Monitor end-to-end with dashboards: p95 query latency, ANN recall, drift between old/new embeddings, and failure rates. This split keeps the architecture predictable: Vertex AI owns model quality and throughput, while Milvus delivers fast, filtered semantic retrieval.
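A shadow-query pass can be as simple as the sketch below, which assumes two parallel collections (`docs_v1`, `docs_v2`) and a query vector from each embedding model version; the names and the HNSW `ef` value are placeholders, and in practice you would log the per-query overlap and latency to your dashboards rather than return them.

```python
# Shadow-query sketch: run the same logical query against old and new collections,
# then compare result overlap and latency before cutting over.
import time
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

def shadow_search(query_vec_old, query_vec_new, k=10):
    results = {}
    for name, vec in [("docs_v1", query_vec_old), ("docs_v2", query_vec_new)]:
        start = time.perf_counter()
        hits = client.search(
            collection_name=name,
            data=[vec],
            limit=k,
            search_params={"params": {"ef": 64}},  # HNSW search-time knob
        )[0]
        results[name] = {
            "ids": [h["id"] for h in hits],
            "latency_ms": (time.perf_counter() - start) * 1000,
        }
    # Fraction of top-k results shared by the two collections.
    overlap = len(set(results["docs_v1"]["ids"]) & set(results["docs_v2"]["ids"])) / k
    return results, overlap
```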
