The main latency tradeoff when deploying voyage-2 is deciding how much work you do at query time versus ahead of time. The best practice for low-latency systems is to embed documents offline (batch or async) and only embed the user query online. That keeps the online path small: one embedding call for the query plus one vector search call. If you try to embed lots of content on the fly (for example, embedding every document at request time), latency will spike and you’ll burn unnecessary compute. So the first tradeoff is architectural: precompute embeddings whenever possible, and treat online embedding as a lightweight step.
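Below is a minimal sketch of that split, assuming the official voyageai Python client and pymilvus; the collection name, field names, and the assumption that the collection already exists are illustrative, not prescriptive.

```python
# Offline/online split: embed documents ahead of time, embed only the query at request time.
# Assumes VOYAGE_API_KEY is set and a Milvus collection "kb_articles" with an "id",
# "vector", and "text" field already exists (illustrative schema).
import voyageai
from pymilvus import MilvusClient

vo = voyageai.Client()                                   # Voyage AI embedding client
milvus = MilvusClient(uri="http://localhost:19530")      # local Milvus instance (example URI)

# --- Offline / batch path: precompute document embeddings and store them ---
docs = ["How to reset your password ...", "Pricing tiers explained ..."]
doc_vectors = vo.embed(docs, model="voyage-2", input_type="document").embeddings
milvus.insert(
    collection_name="kb_articles",
    data=[{"id": i, "vector": v, "text": t} for i, (v, t) in enumerate(zip(doc_vectors, docs))],
)

# --- Online path: one embedding call for the query plus one vector search call ---
query_vector = vo.embed(["reset password"], model="voyage-2", input_type="query").embeddings[0]
hits = milvus.search(collection_name="kb_articles", data=[query_vector], limit=10)
```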
The second tradeoff is request shaping and network overhead. Query-time embedding latency depends on input length (short queries are faster than long passages), batching behavior (batching helps offline jobs but usually doesn't apply to a single user query), and the cost of the network round trip. If your application is sensitive to P99 latency, design with timeouts, retries (used carefully, since aggressive retries can worsen tail latency), and possibly caching for repeated queries (e.g., caching embeddings for frequent searches like “pricing” or “reset password”). Another real-world factor is throughput: if you run many concurrent queries, you may need to control concurrency so you don’t overload the embedding endpoint or your own service, which can create queueing delays that look like “mysterious latency.”
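Here is a rough sketch of those query-time patterns, again assuming the voyageai client; the cache size, one-second timeout, and cap of eight concurrent embedding calls are illustrative assumptions rather than recommended values.

```python
# Query-time request shaping: cache embeddings for repeated queries, cap concurrency,
# and bound each call with a timeout. Retries are omitted on purpose, since naive
# retries can inflate tail latency.
import asyncio
from functools import lru_cache

import voyageai

vo = voyageai.Client()
embed_semaphore = asyncio.Semaphore(8)          # cap concurrent embedding calls (example value)

@lru_cache(maxsize=10_000)
def embed_query_cached(query: str) -> tuple[float, ...]:
    # Frequent queries like "pricing" or "reset password" hit the cache and skip
    # the network round trip entirely on repeat requests.
    return tuple(vo.embed([query], model="voyage-2", input_type="query").embeddings[0])

async def embed_query(query: str, timeout_s: float = 1.0) -> tuple[float, ...]:
    async with embed_semaphore:                 # avoid self-inflicted queueing delays
        # Run the blocking client call in a worker thread and enforce a deadline.
        return await asyncio.wait_for(
            asyncio.to_thread(embed_query_cached, query), timeout=timeout_s
        )

# Example usage inside an async request handler:
#   vector = await embed_query("reset password")
```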
The third tradeoff is in the vector database layer, because retrieval latency is strongly influenced by index type and parameters. A vector database such as Milvus or Zilliz Cloud can execute nearest-neighbor queries quickly, but faster settings often reduce recall (you might miss some relevant neighbors), while higher-recall settings cost more CPU and time per query. You’ll also see latency tradeoffs when using metadata filters: filtering can narrow the search space (sometimes faster) but can also complicate execution depending on how data is partitioned and indexed. The practical approach is to profile end-to-end latency (embedding + search + post-processing), choose a target SLA (e.g., “top 10 results under 200 ms at P95”), and tune both the vector index and your query-time embedding behavior to hit it. Done well, voyage-2 can fit into tight latency budgets, provided you engineer for it rather than treating it as a black box.
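As a sketch of that profiling loop, the snippet below times embedding plus search end to end while sweeping an HNSW search parameter; it assumes an HNSW-indexed Milvus collection named "kb_articles", and the ef values and the 200 ms / P95 budget simply mirror the example SLA above rather than being tuned numbers.

```python
# End-to-end latency profiling: measure embedding + vector search together while
# varying the HNSW "ef" search parameter (higher ef: better recall, more latency).
import time

import voyageai
from pymilvus import MilvusClient

vo = voyageai.Client()
milvus = MilvusClient(uri="http://localhost:19530")

def search_once(query: str, ef: int) -> float:
    start = time.perf_counter()
    qv = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    milvus.search(
        collection_name="kb_articles",
        data=[qv],
        limit=10,
        search_params={"params": {"ef": ef}},   # assumes an HNSW index on the collection
    )
    return (time.perf_counter() - start) * 1000  # embedding + search, in milliseconds

for ef in (32, 64, 128):
    samples = sorted(search_once("reset password", ef) for _ in range(50))
    p95 = samples[int(0.95 * (len(samples) - 1))]  # rough P95 over 50 samples
    print(f"ef={ef}: P95={p95:.0f} ms (target: <200 ms)")
```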
For more information, click here: https://zilliz.com/ai-models/voyage-2
