Turbopuffer vs. Zilliz Cloud: A Performance and Cost Benchmark for Multi-Tenant Vector Search

Pure serverless vector databases promise a compelling deal: no infrastructure to manage, pay only for what you use, scale to zero when idle. For teams building multi-tenant AI applications — RAG pipelines, semantic search, AI assistants — the pitch is hard to ignore. Lower cost, less ops burden, faster time to production.
But what actually happens when you push a serverless vector database to production scale? When you have 128,000 tenants, real filter conditions, real delete workflows, and real users waiting for results?
We spent $500 and two weeks running a full evaluation of Turbopuffer, a serverless, S3-backed vector search solution, against Zilliz Cloud, a purpose-built, enterprise-grade vector database with tiered storage, built on top of the open-source Milvus.
Same data. Same region. Same client hardware. 160 million vectors. In this article, we focus on the performance and cost dimensions: search accuracy, query latency, write stability, rate limiting, and total cost of ownership.
What we found should matter to anyone evaluating a vector database for production.
Test Design
We modeled a multi-tenant retrieval SaaS — the architecture behind most production RAG, semantic search, and AI assistant applications — with three tenant tiers designed to mirror real-world distribution:
| Tenant Type | Count | Vectors per Tenant |
|---|---|---|
| Large (core enterprise customer) | 1 | 16,000,000 |
| Medium (standard customer) | 16 | 1,000,000 each |
| Small (long-tail / free-tier user) | 128,000 | 1,000 each |
Total data: 160M vectors, 768 dimensions, ~250 GB.
Environment: Both products were tested on AWS us-west-2 (Oregon). Test client: m4.xlarge (4 vCPU, 16 GB). Additional tests used m6i.xlarge (8c) and 16c clients. All ANN queries: top-k=10, nq=1. Full-text and hybrid search tests used a 20M-row Wikipedia dataset.
No warm-up runs before measuring cold queries. No cherry-picked configurations.
Finding 1: Search Accuracy Collapses Under Multi-Tenant Filtering
This was the most serious finding in our entire evaluation, and it has nothing to do with performance.
In a multi-tenant deployment, every query includes a filter — typically a tenant ID — to isolate results to a specific customer's data. This is not an edge case. It's how every multi-tenant vector search system works.
We ran 1,000 queries at top-100 against 10M vectors, then applied tenant filters at varying selectivity levels. The results:
| Filter Selectivity | Turbopuffer Recall | Zilliz Cloud Recall |
|---|---|---|
| Broad (id > 50%) | 0.78 | 0.99+ |
| Moderate (id > 90%) | 0.69 | 0.99+ |
| Narrow (id > 99%) — typical small tenant | 0.54 | 0.99+ |
At 0.54 recall, Turbopuffer is missing nearly half the relevant results. For every two documents that should appear in search results, one is silently absent.
The architectural reason is fundamental: Turbopuffer applies filters as post-processing on ANN results, rather than building filter-aware indexes. When the filter is selective — which it almost always is in a multi-tenant system — the ANN candidate pool contains too few matching documents, and recall collapses.
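The post-filtering failure mode is easy to reproduce in a few lines. The sketch below is a brute-force simulation under our own assumptions, not Turbopuffer's actual code: it fixes a 100-candidate ANN pool, filters it down to one of 100 synthetic tenants (~1% selectivity), and compares the survivors against the true filtered top-10.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, POOL, K = 100_000, 32, 100, 10
vectors = rng.standard_normal((N, D)).astype(np.float32)
tenant = rng.integers(0, 100, N)                 # 100 tenants, ~1% selectivity each
query = rng.standard_normal(D).astype(np.float32)

dists = np.linalg.norm(vectors - query, axis=1)  # brute force stands in for ANN
order = np.argsort(dists)

# Post-filtering: take a global top-POOL candidate set, THEN drop other tenants.
candidates = order[:POOL]
post_filtered = [i for i in candidates if tenant[i] == 7][:K]

# Ground truth: filter first, then take the true top-K for tenant 7.
members = np.flatnonzero(tenant == 7)
truth = members[np.argsort(dists[members])[:K]]

recall = len(set(post_filtered) & set(truth)) / K
print(f"post-filter recall at ~1% selectivity: {recall:.2f}")
```

With a ~1% filter, only about one of the 100 candidates belongs to the target tenant, so recall collapses regardless of how good the underlying ANN index is. Filter-aware indexing avoids this by searching only within the filtered subset.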
What makes this worse:
- No tuning parameters. There is no ef_search equivalent, no way to expand the candidate pool, no configuration to trade latency for accuracy.
- Top-k cap. We tried increasing top_k to retrieve more candidates and filter down. At the maximum allowed value of 1,200, the system returned only ~500 results. The workaround itself is broken.
- Silent failure mode. Turbopuffer doesn't flag low recall. Your users see fewer results or your RAG pipeline silently operates with half the context it should have. A recommendation engine might survive this — users don't see the results you missed. A search box or an AI assistant where the user is asking a specific question? They see an incomplete answer, or worse, a wrong one.
This is not a performance issue. It's a correctness issue. If your vector database returns wrong answers, nothing else in this benchmark matters.
Finding 2: Cold Query Latency — What Your Users Actually Experience
In any tiered-storage vector database, the cold query — the first query to a tenant whose data is not in cache — is the latency your user actually sees. In a system with 128,000+ tenants, the majority of queries at any given moment are cold or lukewarm. This is not an edge case; it's a steady state.
| Tenant Size | Turbopuffer | Zilliz Cloud | Difference |
|---|---|---|---|
| Small (1K vectors) | 206 ms | 161 ms | Zilliz 22% faster |
| Medium (1M vectors) | 1,127 ms | 181 ms | Zilliz 6.2x faster |
| Large (16M vectors) | 2,089 ms | 1,021 ms | Zilliz 2x faster |
Turbopuffer's cold latency increases 5.5x going from a small to a medium tenant (206 ms → 1,127 ms). Zilliz Cloud's moves from 161 ms to 181 ms — barely noticeable.
For your largest, most important customer — the enterprise account paying the most, expecting the best experience — Turbopuffer delivers a 2-second first-query latency. In applications like real-time RAG, customer support copilots, or search-powered product features, users notice anything above 300 ms. Two seconds is too long.
This isn't a tuning issue but a consequence of the S3-backed architecture. Serving a cold query requires multiple S3 GET requests (10-100 ms first-byte latency each), index deserialization, and reconstruction before the first search can run. In a separate evaluation at a different scale (10M vectors, code assistant workload), we observed cold p99 latencies reaching up to 4 seconds, with wide, unpredictable distributions.
The Random Tenant Penalty
We also uncovered an additional concern: querying random small tenants was consistently slower than querying the same small tenant repeatedly, with tail latency more than doubling.
Random small tenant: 232 ms average (P99: 447 ms), vs. 206 ms average for the fixed small tenant.
We confirmed that the overhead was not due to SDK namespace creation (measured at 0.002 ms). The most likely cause is cache eviction pressure under multi-tenant workloads — which is precisely the scenario Turbopuffer is marketed for. When 128K tenants compete for cache space, the tenant your user queries next is likely the one that was just evicted.
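Our best guess at the mechanism can be illustrated with a toy LRU simulation (an assumption about cache behavior, not Turbopuffer's actual internals): when 128K tenants share a cache that holds under 1% of them, a random tenant's data is almost never resident.

```python
import random
from collections import OrderedDict

random.seed(0)
TENANTS, CACHE_SLOTS, QUERIES = 128_000, 1_000, 50_000
cache = OrderedDict()

def lru_hit(tenant_id):
    """Return True on cache hit; otherwise admit the tenant, evicting the LRU entry."""
    if tenant_id in cache:
        cache.move_to_end(tenant_id)
        return True
    cache[tenant_id] = True
    if len(cache) > CACHE_SLOTS:
        cache.popitem(last=False)
    return False

fixed_hits = sum(lru_hit(7) for _ in range(QUERIES))   # same small tenant repeatedly
cache.clear()
random_hits = sum(lru_hit(random.randrange(TENANTS)) for _ in range(QUERIES))
print(f"fixed-tenant hit rate:  {fixed_hits / QUERIES:.3f}")
print(f"random-tenant hit rate: {random_hits / QUERIES:.3f}")
```

The fixed workload is effectively always warm, while the random workload's hit rate sits near zero, which is consistent with the penalty we measured.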
Finding 3: Write Ingestion Stalls Under Load
Both systems achieved comparable write throughput under normal conditions: 5-10 MB/s across tenant sizes, with a 100% success rate across all batches.
The difference showed up under sustained load on the large tenant.
During our 16M-vector large tenant ingestion, Turbopuffer blocked writes 3 times by returning HTTP 429, with the longest interruption lasting 7 minutes:
```
12:08:01 - POST id_1600 → 429 Too Many Requests → Retry in 60s
12:09:02 - POST id_1600 → 429 Too Many Requests → Retry in 120s
12:11:03 - POST id_1600 → 429 Too Many Requests → Retry in 240s
12:15:04 - POST id_1600 → 200 OK   ← 7 minutes later
```
Turbopuffer applies backpressure when unindexed data reaches 2 GB. The SDK handles retry automatically, so it's invisible in your code — but your write pipeline silently stalls. In a production data sync, real-time event pipeline, or message queue consumer, a 7-minute stall cascades into backlogs, timeout errors, and stale data serving to users.
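The SDK hides this retry loop, but if you drive writes with your own HTTP client you will need equivalent backoff, and ideally you want the stall surfaced rather than swallowed. A sketch modeled on the log above (`send_batch` is a hypothetical callable returning an HTTP status; the 60-second base delay mirrors the observed retry schedule):

```python
import time

def upsert_with_backoff(send_batch, batch, base_delay_s=60, max_stall_s=600):
    """Retry a write on HTTP 429 with doubling backoff.

    Returns total seconds spent waiting so callers can alert on long stalls
    instead of letting the pipeline stall silently.
    """
    waited, delay = 0.0, base_delay_s
    while True:
        status = send_batch(batch)      # your client call; returns HTTP status
        if status == 200:
            return waited
        if status != 429 or waited >= max_stall_s:
            raise RuntimeError(f"write failed: status={status} after {waited:.0f}s of backoff")
        time.sleep(delay)               # emit a metric/log line here in production
        waited += delay
        delay *= 2
```

Returning the accumulated wait (and capping it) turns a silent 7-minute stall into an observable, alertable event.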
We observed no write blocks or throttling on Zilliz Cloud throughout the test.
Finding 4: The Billing Trap — The Cost You Calculate Is Not the Bill You Pay
Turbopuffer's pricing looks straightforward: pay per GB written, per GB queried, per GB stored. No cluster fees. The calculator is simple and the estimates are attractive.
We plugged a realistic workload into the calculator before testing: 10M vectors, 768 dimensions, 1,000 tenants, ~40 QPS. Estimate: $800-$1,200/month.
The actual bill at production scale, however, came to $1,000+/month.
The reason is a billing mechanic that's easy to miss: queried_bytes is charged based on the total size of the namespace being queried — not the data your query actually touches. Every query against a large tenant is billed against the full namespace size, regardless of whether you're retrieving top-10 from a 50 GB dataset. A query against a large tenant costs 10x more than the same query against a small tenant — not because it does 10x more work, but because of how billing is structured.
In real multi-tenant systems, tenant sizes follow a power law. Your largest tenants generate disproportionate query volume. Every one of those queries is billed at the full namespace rate. The calculator assumes uniform tenant sizes and can't model this.
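You can sanity-check the billing mechanic with a back-of-the-envelope model. Everything below is our own assumption for illustration: 10,000 tenants with Pareto-distributed sizes normalized to 600 GB, query volume proportional to tenant size, and the $0.005 per GB queried rate from the pricing table in this section.

```python
import numpy as np

rng = np.random.default_rng(1)
PRICE_PER_GB_QUERIED = 0.005            # rate from the pricing table in this section

# Assumed tenant population: 10,000 tenants, Pareto (power-law) sizes, 600 GB total.
sizes_gb = rng.pareto(1.2, 10_000) + 0.01
sizes_gb *= 600 / sizes_gb.sum()

queries_per_month = 1_000_000
q_share = sizes_gb / sizes_gb.sum()     # assumption: query volume scales with tenant size

# queried_bytes billing: every query is billed at the FULL namespace size.
actual = (queries_per_month * q_share * sizes_gb * PRICE_PER_GB_QUERIED).sum()

# What a calculator using the average tenant size predicts.
naive = queries_per_month * sizes_gb.mean() * PRICE_PER_GB_QUERIED

print(f"uniform-average estimate: ${naive:,.0f}/month, power-law actual: ${actual:,.0f}/month")
```

Because the heavy tail concentrates both data and query volume in the largest namespaces, the size-weighted bill always exceeds the uniform-average estimate, and the gap widens as the distribution gets more skewed.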
What the Numbers Look Like at Scale
We also benchmarked the per-unit cost differences directly:
| Cost Component | Turbopuffer | Zilliz Cloud Tiered |
|---|---|---|
| Write cost | $1/GB | Free |
| Query cost | $0.005 per 1 GB queried | Free |
| Storage cost | ~$0.40/GB/month | ~$0.04/GB/month |
For a production workload at 200M documents (~600 GB):
- Monthly storage alone: ~$240/month on Turbopuffer vs. ~$24/month on Zilliz Cloud at the listed per-GB rates, an 8x or greater difference.
- Annual write cost at 1 TB/year: ~$1,000 on Turbopuffer vs. $0 on Zilliz Cloud.
- Our 250 GB test write cost: ~$250 on Turbopuffer vs. $0 on Zilliz Cloud.
Zilliz Cloud charges a base cluster fee, so at very small scale Turbopuffer's pay-per-use model wins: it offers a genuine $0 entry point. But the moment you have production data and production query volume, Turbopuffer's marginal costs overtake it. The more successful your product becomes — the more data you ingest, the more queries you serve — the faster the Turbopuffer bill climbs. Zilliz Cloud's marginal cost for additional writes and queries is zero.
Finding 5: Rate Limiting Under Concurrency
Turbopuffer enforces per-namespace concurrency limits. We hit them repeatedly:
- ANN queries: At 60 concurrent connections to a single small tenant, HTTP 429 errors began. At 200 concurrency on medium tenants, rate limiting was severe with dozens of failures.
- Hybrid search: Large tenant concurrent queries triggered 429 at 30 concurrency.
- Writes: Large tenant ingestion triggered 429 as described above.
The error:
```json
{
  "error": "Too many concurrent queries to a single namespace.",
  "status": "error"
}
```
The implication: your highest-value tenants — the ones generating the most traffic — are the first to hit rate limits. There is no way to configure or raise these limits through the console. For any application where a single tenant can generate bursty traffic (a team of users searching simultaneously, a batch job, an integration partner), this is a hard ceiling you cannot engineer around.
We also observed connection reset by peer errors and broken pipe errors under high concurrency, suggesting infrastructure-level saturation beyond just rate limiting.
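If you are evaluating any system with per-namespace limits, it is worth measuring the ceiling empirically rather than discovering it in production. A generic probe sketch, assuming you supply a `query_once` callable that issues one request to a single namespace and returns the HTTP status code (a hypothetical signature, not either product's SDK):

```python
from concurrent.futures import ThreadPoolExecutor

def find_429_ceiling(query_once, max_workers=256, step=10, probes=50):
    """Ramp concurrency against a single namespace until the first HTTP 429.

    Returns the worker count at which 429s first appeared, or None if no
    ceiling was found up to max_workers.
    """
    for workers in range(step, max_workers + 1, step):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            statuses = list(pool.map(lambda _: query_once(), range(probes)))
        if 429 in statuses:
            return workers
    return None
```

Run this against a copy of your busiest tenant's data; the returned number is the hard per-tenant ceiling you would need to architect around.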
Finding 6: Full-Text and Hybrid Search Stability
We tested Turbopuffer's full-text and hybrid search against a 20M-row Wikipedia dataset:
Full-text search cold latency ranged from 640 ms (large tenant, best run) to 1,229 ms (medium tenant), with the large tenant averaging 1,001 ms across multiple runs. Hot QPS was solid: 868-1,688 depending on tenant size.
Hybrid search revealed significant instability:
- Large tenant cold latency: 718 ms to 2,446 ms across runs — a 3.4x variance.
- Large tenant concurrent queries triggered 429 rate limiting at just 30 workers.
- Both small and large tenant concurrent hybrid search had query failures.
The 3.4x variance in cold latency for the same query on the same data means you cannot set reliable SLAs for hybrid search. Combined with the recall issues described in Finding 1, this raises questions about hybrid search readiness for production workloads.
Summary
Here is a side-by-side summary of our performance and cost findings:
| Dimension | Turbopuffer | Zilliz Cloud |
|---|---|---|
| Search recall (narrow filter) | 0.54 — misses ~half of relevant results | 0.99+ |
| Cold query, medium tenant | 1,127 ms | 181 ms |
| Cold query, large tenant | 2,089 ms (up to 4s at p99) | 1,021 ms |
| Write stability | Blocked 3x, up to 7 min | No interruptions |
| Cost at 600 GB | Storage 8x more expensive; writes/queries charged | Free writes/queries |
| Rate limiting | Hard per-namespace cap, not configurable | Configurable |
| Hybrid search latency variance | 3.4x (718 ms – 2,446 ms) | Stable |
What to Test Before You Commit
If you're evaluating a vector database for production use, here are the performance and cost tests we'd recommend running before making a decision:
- Test recall with your actual filter conditions. Run your real multi-tenant filter (tenant ID, user group, permission scope) and measure recall at your actual selectivity levels. If recall drops below 0.95, your search results are unreliable.
- Measure cold query latency, not warm. Stop all queries for 30+ minutes, then measure the first query. This is what your users experience after any period of inactivity. Do this at your actual data scale per tenant.
- Run your cost model with real tenant distribution. Don't use the calculator with averages. Model your actual tenant size distribution (it's probably a power law) and calculate queried_bytes cost for your largest tenants at your actual query rate.
- Test concurrent queries against a single tenant. Simulate your peak traffic for your busiest tenant. Note when you first see 429 errors — that's your per-tenant ceiling.
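For the recall check in particular, exact ground truth is cheap at evaluation scale: brute-force the true filtered top-k with NumPy and compare it against the ids your database returned (`db_ids` below is a stand-in for your system's output, not a real API):

```python
import numpy as np

def filtered_ground_truth(vectors, query, mask, k=10):
    """Exact top-k row ids among rows where mask is True (L2 distance)."""
    idx = np.flatnonzero(mask)
    d = np.linalg.norm(vectors[idx] - query, axis=1)
    return idx[np.argsort(d)[:k]]

def recall_at_k(db_ids, truth_ids, k=10):
    """Fraction of the true top-k ids that the system actually returned."""
    return len(set(db_ids[:k]) & set(truth_ids[:k])) / k
```

Sample a few hundred real queries per tenant tier, compute ground truth once, and track recall per selectivity bucket; averaging across tiers hides exactly the narrow-filter collapse described in Finding 1.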
All benchmark data was collected in December 2025 on AWS us-west-2. Both products were tested on their latest publicly available versions. The 160M-vector multi-tenant benchmark used identical data, configurations, and client hardware for both systems. Raw data, scripts, and detailed logs are available upon request.
In our next companion article, we'll examine the compliance and enterprise readiness dimensions of this evaluation — including delete consistency, GDPR implications, security certifications, and operational tooling. Stay tuned.