Turbopuffer vs. Zilliz Cloud: A Performance and Cost Benchmark for Multi-Tenant Vector Search

Pure serverless vector databases promise a compelling deal: no infrastructure to manage, pay only for what you use, scale to zero when idle. For teams building multi-tenant AI applications — RAG pipelines, semantic search, AI assistants — the pitch is hard to ignore. Lower cost, less ops burden, faster time to production.
But what actually happens when you push a serverless vector database to production scale? When you have 128,000 tenants, real filter conditions, real delete workflows, and real users waiting for results?
We spent $500 and two weeks running a full evaluation of Turbopuffer, a serverless, S3-backed vector search solution, against Zilliz Cloud, a purpose-built, enterprise-grade vector database with tiered storage, built on top of the open-source Milvus.
Same data. Same region. Same client hardware. 160 million vectors. In this article, we focus on the performance and cost dimensions: search accuracy, query latency, write stability, rate limiting, and total cost of ownership.
What we found should matter to anyone evaluating a vector database for production.
Test Design
We modeled a multi-tenant retrieval SaaS — the architecture behind most production RAG, semantic search, and AI assistant applications — with three tenant tiers designed to mirror real-world distribution:
| Tenant Type | Count | Vectors per Tenant |
|---|---|---|
| Large (core enterprise customer) | 1 | 16,000,000 |
| Medium (standard customer) | 16 | 1,000,000 each |
| Small (long-tail / free-tier user) | 128,000 | 1,000 each |
Total data: 160M vectors, 768 dimensions, ~250 GB.
Environment: Both products were tested on AWS us-west-2 (Oregon). Test client: m4.xlarge (4 vCPU, 16 GB). Additional tests used m6i.xlarge (8c) and 16c clients. All ANN queries: top-k=10, nq=1. Full-text and hybrid search tests used a 20M-row Wikipedia dataset.
No warm-up runs before measuring cold queries. No cherry-picked configurations.
Finding 1: Search Accuracy Collapses Under Multi-Tenant Filtering
This was the most serious finding in our entire evaluation, and it has nothing to do with performance.
In a multi-tenant deployment, every query includes a filter — typically a tenant ID — to isolate results to a specific customer's data. This is not an edge case. It's how every multi-tenant vector search system works.
We ran 1,000 queries at top-100 against 10M vectors, then applied tenant filters at varying selectivity levels. The results:
| Filter Selectivity | Turbopuffer Recall | Zilliz Cloud Recall |
|---|---|---|
| Broad (id > 50%) | 0.78 | 0.99+ |
| Moderate (id > 90%) | 0.69 | 0.99+ |
| Narrow (id > 99%) — typical small tenant | 0.54 | 0.99+ |
At 0.54 recall, Turbopuffer is missing nearly half the relevant results. For every two documents that should appear in search results, one is silently absent.
The architectural reason is fundamental: Turbopuffer applies filters as post-processing on ANN results, rather than building filter-aware indexes. When the filter is selective — which it almost always is in a multi-tenant system — the ANN candidate pool contains too few matching documents, and recall collapses.
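The post-filtering failure mode is easy to reproduce in a few lines. The sketch below is a brute-force simulation under our own assumptions, not Turbopuffer's actual code: it fixes a 100-candidate ANN pool, filters it down to one of 100 synthetic tenants (~1% selectivity), and compares the survivors against the true filtered top-10.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, POOL, K = 100_000, 32, 100, 10
vectors = rng.standard_normal((N, D)).astype(np.float32)
tenant = rng.integers(0, 100, N)                 # 100 tenants, ~1% selectivity each
query = rng.standard_normal(D).astype(np.float32)

dists = np.linalg.norm(vectors - query, axis=1)  # brute force stands in for ANN
order = np.argsort(dists)

# Post-filtering: take a global top-POOL candidate set, THEN drop other tenants.
candidates = order[:POOL]
post_filtered = [i for i in candidates if tenant[i] == 7][:K]

# Ground truth: filter first, then take the true top-K for tenant 7.
members = np.flatnonzero(tenant == 7)
truth = members[np.argsort(dists[members])[:K]]

recall = len(set(post_filtered) & set(truth)) / K
print(f"post-filter recall at ~1% selectivity: {recall:.2f}")
```

With a ~1% filter, only about one of the 100 candidates belongs to the target tenant, so recall collapses regardless of how good the underlying ANN index is. Filter-aware indexing avoids this by searching only within the filtered subset.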
What makes this worse:
- No tuning parameters. There is no ef_search equivalent, no way to expand the candidate pool, no configuration to trade latency for accuracy.
- Top-k cap. We tried increasing top_k to retrieve more candidates and filter down. At the maximum allowed value of 1,200, the system returned only ~500 results. The workaround itself is broken.
- Silent failure mode. Turbopuffer doesn't flag low recall. Your users see fewer results or your RAG pipeline silently operates with half the context it should have. A recommendation engine might survive this — users don't see the results you missed. A search box or an AI assistant where the user is asking a specific question? They see an incomplete answer, or worse, a wrong one.
This is not a performance issue. It's a correctness issue. If your vector database returns wrong answers, nothing else in this benchmark matters.
Finding 2: Cold Query Latency — What Your Users Actually Experience
In any tiered-storage vector database, the cold query — the first query to a tenant whose data is not in cache — is the latency your user actually sees. In a system with 128,000+ tenants, the majority of queries at any given moment are cold or lukewarm. This is not an edge case; it's a steady state.
| Tenant Size | Turbopuffer | Zilliz Cloud | Difference |
|---|---|---|---|
| Small (1K vectors) | 206 ms | 161 ms | Zilliz 22% faster |
| Medium (1M vectors) | 1,127 ms | 181 ms | Zilliz 6.2x faster |
| Large (16M vectors) | 2,089 ms | 1,021 ms | Zilliz 2x faster |
Turbopuffer's cold latency increases 5.5x going from a small to a medium tenant (206 ms → 1,127 ms). Zilliz Cloud's moves from 161 ms to 181 ms — barely noticeable.
For your largest, most important customer — the enterprise account paying the most, expecting the best experience — Turbopuffer delivers a 2-second first-query latency. In applications like real-time RAG, customer support copilots, or search-powered product features, users notice anything above 300 ms. Two seconds is too long.
This isn't a tuning issue but a consequence of the S3-backed architecture. Serving a cold query requires multiple S3 GET requests (10-100 ms first-byte latency each), index deserialization, and reconstruction before the first search can run. In a separate evaluation at a different scale (10M vectors, code assistant workload), we observed cold p99 latencies reaching up to 4 seconds, with wide, unpredictable distributions.
The Random Tenant Penalty
We also uncovered an additional concern: querying random small tenants was consistently slower than querying the same small tenant repeatedly, with tail latency more than doubling.
Random small tenant: 232 ms average (P99: 447 ms), vs. 206 ms average for the fixed small tenant.
We confirmed that the overhead was not due to SDK namespace creation (measured at 0.002 ms). The most likely cause is cache eviction pressure under multi-tenant workloads — which is precisely the scenario Turbopuffer is marketed for. When 128K tenants compete for cache space, the tenant your user queries next is likely the one that was just evicted.
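Our best guess at the mechanism can be illustrated with a toy LRU simulation (an assumption about cache behavior, not Turbopuffer's actual internals): when 128K tenants share a cache that holds under 1% of them, a random tenant's data is almost never resident.

```python
import random
from collections import OrderedDict

random.seed(0)
TENANTS, CACHE_SLOTS, QUERIES = 128_000, 1_000, 50_000
cache = OrderedDict()

def lru_hit(tenant_id):
    """Return True on cache hit; otherwise admit the tenant, evicting the LRU entry."""
    if tenant_id in cache:
        cache.move_to_end(tenant_id)
        return True
    cache[tenant_id] = True
    if len(cache) > CACHE_SLOTS:
        cache.popitem(last=False)
    return False

fixed_hits = sum(lru_hit(7) for _ in range(QUERIES))   # same small tenant repeatedly
cache.clear()
random_hits = sum(lru_hit(random.randrange(TENANTS)) for _ in range(QUERIES))
print(f"fixed-tenant hit rate:  {fixed_hits / QUERIES:.3f}")
print(f"random-tenant hit rate: {random_hits / QUERIES:.3f}")
```

The fixed workload is effectively always warm, while the random workload's hit rate sits near zero, which is consistent with the penalty we measured.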
Finding 3: Write Ingestion Stalls Under Load
Both systems achieved comparable write throughput under normal conditions: 5-10 MB/s across tenant sizes, with a 100% success rate across all batches.
The difference showed up under sustained load on the large tenant.
During our 16M-vector large tenant ingestion, Turbopuffer blocked writes 3 times by returning HTTP 429, with the longest interruption lasting 7 minutes:
```
12:08:01 - POST id_1600 → 429 Too Many Requests → Retry in 60s
12:09:02 - POST id_1600 → 429 Too Many Requests → Retry in 120s
12:11:03 - POST id_1600 → 429 Too Many Requests → Retry in 240s
12:15:04 - POST id_1600 → 200 OK   ← 7 minutes later
```
Turbopuffer applies backpressure when unindexed data reaches 2 GB. The SDK handles retry automatically, so it's invisible in your code — but your write pipeline silently stalls. In a production data sync, real-time event pipeline, or message queue consumer, a 7-minute stall cascades into backlogs, timeout errors, and stale data serving to users.
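The SDK hides this retry loop, but if you drive writes with your own HTTP client you will need equivalent backoff, and ideally you want the stall surfaced rather than swallowed. A sketch modeled on the log above (`send_batch` is a hypothetical callable returning an HTTP status; the 60-second base delay mirrors the observed retry schedule):

```python
import time

def upsert_with_backoff(send_batch, batch, base_delay_s=60, max_stall_s=600):
    """Retry a write on HTTP 429 with doubling backoff.

    Returns total seconds spent waiting so callers can alert on long stalls
    instead of letting the pipeline stall silently.
    """
    waited, delay = 0.0, base_delay_s
    while True:
        status = send_batch(batch)      # your client call; returns HTTP status
        if status == 200:
            return waited
        if status != 429 or waited >= max_stall_s:
            raise RuntimeError(f"write failed: status={status} after {waited:.0f}s of backoff")
        time.sleep(delay)               # emit a metric/log line here in production
        waited += delay
        delay *= 2
```

Returning the accumulated wait (and capping it) turns a silent 7-minute stall into an observable, alertable event.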
We observed no write blocks or throttling on Zilliz Cloud throughout the test.
Finding 4: The Billing Trap — The Cost You Calculate Is Not the Bill You Pay
Turbopuffer's pricing looks straightforward: pay per GB written, per GB queried, per GB stored. No cluster fees. The calculator is simple and the estimates are attractive.
We plugged a realistic workload into the calculator before testing: 10M vectors, 768 dimensions, 1,000 tenants, ~40 QPS. Estimate: $800-$1,200/month.
The actual bill at production scale, however, came to $1,000+/month.
The reason is a billing mechanic that's easy to miss: queried_bytes is charged based on the total size of the namespace being queried — not the data your query actually touches. Every query against a large tenant is billed against the full namespace size, regardless of whether you're retrieving top-10 from a 50 GB dataset. A query against a large tenant costs 10x more than the same query against a small tenant — not because it does 10x more work, but because of how billing is structured.
In real multi-tenant systems, tenant sizes follow a power law. Your largest tenants generate disproportionate query volume. Every one of those queries is billed at the full namespace rate. The calculator assumes uniform tenant sizes and can't model this.
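You can sanity-check the billing mechanic with a back-of-the-envelope model. Everything below is our own assumption for illustration: 10,000 tenants with Pareto-distributed sizes normalized to 600 GB, query volume proportional to tenant size, and the $0.005 per GB queried rate from the pricing table in this section.

```python
import numpy as np

rng = np.random.default_rng(1)
PRICE_PER_GB_QUERIED = 0.005            # rate from the pricing table in this section

# Assumed tenant population: 10,000 tenants, Pareto (power-law) sizes, 600 GB total.
sizes_gb = rng.pareto(1.2, 10_000) + 0.01
sizes_gb *= 600 / sizes_gb.sum()

queries_per_month = 1_000_000
q_share = sizes_gb / sizes_gb.sum()     # assumption: query volume scales with tenant size

# queried_bytes billing: every query is billed at the FULL namespace size.
actual = (queries_per_month * q_share * sizes_gb * PRICE_PER_GB_QUERIED).sum()

# What a calculator using the average tenant size predicts.
naive = queries_per_month * sizes_gb.mean() * PRICE_PER_GB_QUERIED

print(f"uniform-average estimate: ${naive:,.0f}/month, power-law actual: ${actual:,.0f}/month")
```

Because the heavy tail concentrates both data and query volume in the largest namespaces, the size-weighted bill always exceeds the uniform-average estimate, and the gap widens as the distribution gets more skewed.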
What the Numbers Look Like at Scale
We also benchmarked the per-unit cost differences directly:
| Cost Component | Turbopuffer | Zilliz Cloud Tiered |
|---|---|---|
| Write cost | $1/GB | Free |
| Query cost | $0.005 per 1 GB queried | Free |
| Storage cost | ~$0.40/GB/month | ~$0.04/GB/month |
For a production workload at 200M documents (~600 GB):
- Monthly storage alone: ~$240/month on Turbopuffer vs. ~$24/month on Zilliz Cloud at the listed per-GB rates, an 8x or greater difference.
- Annual write cost at 1 TB/year: ~$1,000 on Turbopuffer vs. $0 on Zilliz Cloud.
- Our 250 GB test write cost: ~$250 on Turbopuffer vs. $0 on Zilliz Cloud.
Zilliz Cloud charges a base cluster fee, so at very small scale Turbopuffer's pay-per-use model wins: it offers a genuine $0 entry point. But the moment you have production data and production query volume, Turbopuffer's marginal costs overtake it. The more successful your product becomes — the more data you ingest, the more queries you serve — the faster the Turbopuffer bill climbs. Zilliz Cloud's marginal cost for additional writes and queries is zero.
Finding 5: Rate Limiting Under Concurrency
Turbopuffer enforces per-namespace concurrency limits. We hit them repeatedly:
- ANN queries: At 60 concurrent connections to a single small tenant, HTTP 429 errors began. At 200 concurrency on medium tenants, rate limiting was severe with dozens of failures.
- Hybrid search: Large tenant concurrent queries triggered 429 at 30 concurrency.
- Writes: Large tenant ingestion triggered 429 as described above.
The error:
```json
{
  "error": "Too many concurrent queries to a single namespace.",
  "status": "error"
}
```
The implication: your highest-value tenants — the ones generating the most traffic — are the first to hit rate limits. There is no way to configure or raise these limits through the console. For any application where a single tenant can generate bursty traffic (a team of users searching simultaneously, a batch job, an integration partner), this is a hard ceiling you cannot engineer around.
We also observed connection reset by peer errors and broken pipe errors under high concurrency, suggesting infrastructure-level saturation beyond just rate limiting.
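If you are evaluating any system with per-namespace limits, it is worth measuring the ceiling empirically rather than discovering it in production. A generic probe sketch, assuming you supply a `query_once` callable that issues one request to a single namespace and returns the HTTP status code (a hypothetical signature, not either product's SDK):

```python
from concurrent.futures import ThreadPoolExecutor

def find_429_ceiling(query_once, max_workers=256, step=10, probes=50):
    """Ramp concurrency against a single namespace until the first HTTP 429.

    Returns the worker count at which 429s first appeared, or None if no
    ceiling was found up to max_workers.
    """
    for workers in range(step, max_workers + 1, step):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            statuses = list(pool.map(lambda _: query_once(), range(probes)))
        if 429 in statuses:
            return workers
    return None
```

Run this against a copy of your busiest tenant's data; the returned number is the hard per-tenant ceiling you would need to architect around.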
Finding 6: Full-Text and Hybrid Search Stability
We tested Turbopuffer's full-text and hybrid search against a 20M-row Wikipedia dataset:
Full-text search cold latency ranged from 640 ms (large tenant, best run) to 1,229 ms (medium tenant), with the large tenant averaging 1,001 ms across multiple runs. Hot QPS was solid: 868-1,688 depending on tenant size.
Hybrid search revealed significant instability:
- Large tenant cold latency: 718 ms to 2,446 ms across runs — a 3.4x variance.
- Large tenant concurrent queries triggered 429 rate limiting at just 30 workers.
- Both small and large tenant concurrent hybrid search had query failures.
The 3.4x variance in cold latency for the same query on the same data means you cannot set reliable SLAs for hybrid search. Combined with the recall issues described in Finding 1, this raises questions about hybrid search readiness for production workloads.
Summary
Here is a side-by-side summary of our performance and cost findings:
| Dimension | Turbopuffer | Zilliz Cloud |
|---|---|---|
| Search recall (narrow filter) | 0.54 — misses ~half of relevant results | 0.99+ |
| Cold query, medium tenant | 1,127 ms | 181 ms |
| Cold query, large tenant | 2,089 ms (up to 4s at p99) | 1,021 ms |
| Write stability | Blocked 3x, up to 7 min | No interruptions |
| Cost at 600 GB | Storage 8x more expensive; writes/queries charged | Free writes/queries |
| Rate limiting | Hard per-namespace cap, not configurable | Configurable |
| Hybrid search latency variance | 3.4x (718 ms – 2,446 ms) | Stable |
What to Test Before You Commit
If you're evaluating a vector database for production use, here are the performance and cost tests we'd recommend running before making a decision:
- Test recall with your actual filter conditions. Run your real multi-tenant filter (tenant ID, user group, permission scope) and measure recall at your actual selectivity levels. If recall drops below 0.95, your search results are unreliable.
- Measure cold query latency, not warm. Stop all queries for 30+ minutes, then measure the first query. This is what your users experience after any period of inactivity. Do this at your actual data scale per tenant.
- Run your cost model with real tenant distribution. Don't use the calculator with averages. Model your actual tenant size distribution (it's probably a power law) and calculate queried_bytes cost for your largest tenants at your actual query rate.
- Test concurrent queries against a single tenant. Simulate your peak traffic for your busiest tenant. Note when you first see 429 errors — that's your per-tenant ceiling.
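For the recall check in particular, exact ground truth is cheap at evaluation scale: brute-force the true filtered top-k with NumPy and compare it against the ids your database returned (`db_ids` below is a stand-in for your system's output, not a real API):

```python
import numpy as np

def filtered_ground_truth(vectors, query, mask, k=10):
    """Exact top-k row ids among rows where mask is True (L2 distance)."""
    idx = np.flatnonzero(mask)
    d = np.linalg.norm(vectors[idx] - query, axis=1)
    return idx[np.argsort(d)[:k]]

def recall_at_k(db_ids, truth_ids, k=10):
    """Fraction of the true top-k ids that the system actually returned."""
    return len(set(db_ids[:k]) & set(truth_ids[:k])) / k
```

Sample a few hundred real queries per tenant tier, compute ground truth once, and track recall per selectivity bucket; averaging across tiers hides exactly the narrow-filter collapse described in Finding 1.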
All benchmark data was collected in December 2025 on AWS us-west-2. Both products were tested on their latest publicly available versions. The 160M-vector multi-tenant benchmark used identical data, configurations, and client hardware for both systems. Raw data, scripts, and detailed logs are available upon request.
In our next companion article, we'll examine the compliance and enterprise readiness dimensions of this evaluation — including delete consistency, GDPR implications, security certifications, and operational tooling. Stay tuned.