The Cost of Consequence: What No One Tells You About Serverless Vector Databases

In 2023, DHH published a blog post titled “Why We’re Leaving the Cloud,” and it sent shockwaves through the engineering community. 37signals announced that they were moving their entire infrastructure off AWS and back to self-hosted servers, expecting to save more than $2 million per year.
His core argument was simple: “Renting computers is mostly a bad deal for medium-sized companies with stable growth.”
The debate that followed wasn’t really about whether cloud computing is good or bad. What the story actually exposed was a deeper economic trap: pay-as-you-go pricing often looks cheap at first, but becomes surprisingly expensive once systems reach scale.
Today, something very similar is starting to play out in the vector database world.
Testing Turbopuffer, a Serverless Vector Database
Serverless vector databases have a compelling pitch. Pay only for what you use. No infrastructure to manage. Zero upfront commitment. Spin up in minutes, scale to zero when idle.
We wanted to believe it. So we tested it.
- The context: we were building a code assistant product, roughly following Cursor's approach — indexing a large codebase, serving semantic search queries, and supporting multi-tenant access patterns.
- Target scale: ~10 million vectors, 768 dimensions, 1,000 tenants, ~40 QPS average.
We ran a full evaluation of Turbopuffer, one of the most talked-about serverless vector databases, because it promised exactly what we needed at a price we liked.
Here's what we found.
The Pricing Calculator Is Optimistic. Very Optimistic.
We started with Turbopuffer's pricing calculator. Plugged in our numbers. Got back an estimate of $220/month — storage at ~$90, queries at ~$130. For 10M vectors and our query load, this was genuinely exciting. A dedicated cluster on comparable hardware would run $800–$1,200/month. We'd save 75%.
Then we ran it with real data.
Actual bill: $1,000+/month. Ten times higher than estimated.
The anomaly was queried_bytes — it consumed 70% of the total bill and was completely out of proportion to what we expected.
Turbopuffer's queried_bytes isn't charged based on the data your query actually touches. It's charged based on the total size of the namespace being searched — all vectors plus all scalar attributes, every time you issue a query. It doesn't matter if you're doing a top-10 query that would logically scan a fraction of the dataset. The billing unit is the whole namespace.
The calculator assumes your tenants are roughly uniform in size. Ours weren't.
In a real multi-tenant deployment, tenant size follows a power law. One large tenant holding 10% of the total data was generating a disproportionate share of queries — and each query was billed against the full namespace size. A query against a large tenant costs 10 times that of a small tenant, not because it's doing 10 times the work, but because of how billing is structured.
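To see how namespace-size billing interacts with a power-law tenant mix, here's a toy model in Python. The rate constant and tenant shares are illustrative assumptions, not Turbopuffer's actual pricing; the point is the ratio between tenants, not the dollar amounts.

```python
# Toy model: per-query cost proportional to the tenant namespace size,
# not to the data a top-k query actually touches.
# PRICE_PER_TB_QUERIED is a made-up rate for illustration only.
PRICE_PER_TB_QUERIED = 1.0

VECTOR_BYTES = 768 * 4          # 768-dim float32 vectors
TOTAL_VECTORS = 10_000_000

def per_query_cost(tenant_vectors):
    """Cost of one query when billing is based on full namespace size."""
    tenant_tb = tenant_vectors * VECTOR_BYTES / 1e12
    return tenant_tb * PRICE_PER_TB_QUERIED

big = per_query_cost(int(TOTAL_VECTORS * 0.10))    # tenant holding 10% of data
small = per_query_cost(int(TOTAL_VECTORS * 0.01))  # tenant holding 1% of data

# Both tenants run the same top-10 search, but the big tenant's
# queries are billed at 10x the small tenant's rate.
print(big / small)
```

Under work-based billing the two queries would cost roughly the same; under namespace-size billing, the ratio tracks tenant size exactly.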
The practical result: a bill estimated in the low hundreds became $1,000+. That puts Turbopuffer in the same price range as Zilliz Cloud's 8CU dedicated instance, Pinecone P2.x8, or a managed 16-core/128 GB OpenSearch cluster — all of which comfortably handle 1,000 QPS with predictable billing.
The tipping point is lower than you think. At roughly 1 sustained QPS against a large dataset, the per-request serverless model can cross the cost of a dedicated cluster. Once you have real customers with uneven usage, you're likely past that threshold.
And here's the irony: at small scale — a few hundred thousand vectors, low traffic, prototyping — you don't need Turbopuffer's architecture. Zilliz Cloud's free tier handles that without the tradeoffs we're about to describe.
The S3 Architecture Has a Performance Ceiling You Can't Engineer Around
We want to be fair: Turbopuffer's baseline performance is impressive for what it is. In a warm state — recently accessed namespaces cached in memory — we measured p99 latencies around 30ms. For a serverless product, that's genuinely good.
The problem is the cold state.
Cold p99 in our tests: up to 4 seconds. The distribution was wide and the tail was unpredictable. Sometimes 800ms. Sometimes over 3 seconds. We couldn't reliably characterize what drove the variance.
This isn't a bug. It's physics.
Turbopuffer stores vector indexes in S3. Object storage is designed for throughput and durability, not for the random-access, low-latency reads that vector search requires. Serving a cold query means:
- Multiple S3 GET requests with first-byte latencies of 10–100ms each
- Deserializing and reconstructing the index before the first search can run
- No way to predict when a namespace was last accessed
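These steps are largely serial, which makes the cold tail easy to underestimate. A back-of-envelope sketch, where the GET count and deserialization time are assumptions and the first-byte latencies come from the 10–100ms range above:

```python
# Back-of-envelope cold-start estimate: a cold query must issue several
# serial S3 GETs and reconstruct the index before the first search runs.
def cold_start_ms(num_gets, first_byte_ms, deserialize_ms):
    """Total time before the first search can execute, in milliseconds."""
    return num_gets * first_byte_ms + deserialize_ms

# Optimistic case: few requests, fast first byte, small index.
best = cold_start_ms(num_gets=5, first_byte_ms=10, deserialize_ms=50)

# Pessimistic case: more requests, slow first byte, larger index.
worst = cold_start_ms(num_gets=8, first_byte_ms=100, deserialize_ms=400)

print(best, worst)  # the spread alone explains a wide, unpredictable tail
```

Even this simple model spans an order of magnitude, before accounting for S3 retries or throttling.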
The only mitigation is cache. If your traffic is consistent enough to keep namespaces warm, cold starts become rare. But if your application has:
- Thousands of small tenants accessing their data infrequently
- Bursty traffic that scales to zero between sessions
- Any user-facing SLA on query latency
...then cold starts aren't an edge case. They're the steady state for a meaningful fraction of your users.
In a code assistant context, this means a developer opens their IDE after a few hours away, triggers a semantic search, and waits 3 seconds before anything happens. Not because the model is slow. Because the vector index is cold.
This is the Cost of Consequence: the bill for this architectural choice doesn't appear in the pricing calculator. It appears in your product's p99 latency, in support tickets, and in the churn of users who tried it once and found it slow.
Zilliz Cloud takes a different approach: a three-tier storage architecture with S3, local NVMe disk, and memory. S3 serves as the durable persistence layer, while active index files are cached on compute nodes' local NVMe drives and memory. NVMe random-read latency is on the order of 0.1ms — 100–1,000× faster than S3. Queries that hit the disk cache exhibit latency characteristics close to pure in-memory search, not object storage. Scaling capacity doesn't evict your cache. Cold starts become exceptional rather than routine — and when they occur, the impact is measured in milliseconds, not seconds.
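A minimal sketch of such a tiered read path, using the approximate per-tier latencies above. The class and method names are illustrative, not Zilliz Cloud's actual API:

```python
# Approximate per-read latencies for each tier, in milliseconds.
LATENCY_MS = {"memory": 0.001, "nvme": 0.1, "s3": 50.0}

class TieredReader:
    """Read path: memory -> NVMe cache -> S3, promoting on miss."""
    def __init__(self):
        self.memory = {}
        self.nvme = {}

    def read(self, key, fetch_from_s3):
        """Return (value, latency_ms) for the tier that served the read."""
        if key in self.memory:
            return self.memory[key], LATENCY_MS["memory"]
        if key in self.nvme:
            value = self.nvme[key]
            self.memory[key] = value           # promote to memory
            return value, LATENCY_MS["nvme"]
        value = fetch_from_s3(key)             # cold read from object storage
        self.nvme[key] = value                 # populate the disk cache
        self.memory[key] = value
        return value, LATENCY_MS["s3"]

reader = TieredReader()
_, cold = reader.read("seg-1", lambda k: b"index-bytes")
_, warm = reader.read("seg-1", lambda k: b"index-bytes")
# The S3 latency is paid once; subsequent reads hit memory.
```

The key property is that the expensive tier is touched once per segment, not once per query.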
| Metric | Turbopuffer | Zilliz Cloud |
|---|---|---|
| Cold start p99 latency | ~4s | <100ms |
| Warm p99 latency | ~30ms | ~20ms |
| Cold start control | No control over cold starts | Disk cache ensures predictable startup |
The SPFresh Index: A Scalability Ceiling Imposed by S3
This brings us to a deeper architectural constraint—and one of the hardest limitations to work around in Turbopuffer’s current design.
The choice of vector index is, to a large extent, dictated by the storage layer. Graph-based indexes like HNSW deliver exceptional performance, but they require the entire graph to reside in memory with microsecond-level random access to graph nodes. When your index lives in S3, that prerequisite disappears — S3's random-read latency is orders of magnitude higher than memory, making graph traversal prohibitively slow.
This is why Turbopuffer uses the SPFresh index. SPFresh is purpose-built for SSD and object-storage environments: it partitions the vector space into segments and loads only a handful of partitions per query, minimizing random I/O. In a storage-constrained setting, this is a reasonable engineering choice.
But it introduces two structural tradeoffs.
Recall degrades as data changes
SPFresh optimizes its index structure based on the data distribution at build time. Over time, as vectors are inserted or deleted, the actual data distribution gradually diverges from what the index expects. As this divergence grows, the search process begins to miss relevant vectors.
In other words, recall is not a fixed property of the system—it degrades silently over time until the index is rebuilt. For workloads with frequent updates—such as RAG knowledge bases or real-time recommendation systems—this creates an ongoing maintenance burden. Periodic full index rebuilds become necessary just to maintain baseline search quality.
IVF-style indexes have a hard throughput ceiling
The second limitation is throughput. IVF-style indexes—including SPFresh—have a fundamental throughput ceiling that is significantly lower than graph-based approaches.
This is a number that benchmarks often overlook but that matters enormously in production. HNSW can sustain tens of thousands of QPS on a single node under typical production workloads. IVF-family indexes like SPFresh, constrained by partition loading and scanning costs, typically top out in the hundreds to low thousands of QPS.
The difference is roughly an order of magnitude.
This gap becomes critical as systems scale. If an application grows from 100 QPS to 1,000 QPS, an HNSW-based system can often scale linearly. An SPFresh-based system, however, may require 10× more nodes to maintain similar latency. The per-node QPS ceiling becomes a hard constraint that additional hardware cannot overcome efficiently.
As a result, both compute cost and operational complexity grow much faster than expected.
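The capacity math is simple to sketch. The per-node ceilings below are order-of-magnitude assumptions consistent with the figures above, not measured limits:

```python
import math

def nodes_needed(target_qps, per_node_qps_ceiling):
    """Minimum node count to serve target_qps given a per-node ceiling."""
    return math.ceil(target_qps / per_node_qps_ceiling)

# Assumed ceilings: tens of thousands of QPS for a graph index,
# low thousands for an IVF-style index on the same hardware.
hnsw_nodes = nodes_needed(10_000, per_node_qps_ceiling=20_000)
ivf_nodes = nodes_needed(10_000, per_node_qps_ceiling=1_000)

print(hnsw_nodes, ivf_nodes)  # the gap compounds as traffic grows
```

At the same target load, the IVF-style cluster is roughly an order of magnitude larger, and every added node brings its own cost and operational surface.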
How Zilliz Cloud approaches the problem
Zilliz Cloud takes a different approach by combining HNSW with Cardinal quantization.
Cardinal is a proprietary vector search engine written in modern, multi-threaded, template-based C++. It addresses the traditional memory footprint challenges of graph indexes. At the same bit width, it achieves higher recall than conventional quantization methods, allowing more aggressive compression while preserving search quality. This enables significantly larger vector datasets to remain memory-resident.
On the storage side, Zilliz Cloud uses an S3 + NVMe + memory caching architecture. Frequently accessed index data stays on local NVMe storage and in memory—delivering latency far below object storage—while colder data is still safely persisted in S3.
The write path is also designed for continuous updates. New data first enters a growing segment, where it becomes immediately searchable. In the background, the system performs compaction and incrementally builds HNSW indexes as segments are sealed.
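The write path above can be sketched as a growing segment that is searched directly, plus sealed segments with built indexes. Class and method names are illustrative, and sorting stands in for real index construction:

```python
class Collection:
    """Toy write path: growing segment is searchable immediately;
    full segments are 'sealed' with a background-built index."""
    def __init__(self, seal_threshold=3):
        self.growing = []        # brute-force searched, visible on write
        self.sealed = []         # segments with a built index
        self.seal_threshold = seal_threshold

    def insert(self, doc):
        self.growing.append(doc)
        if len(self.growing) >= self.seal_threshold:
            # Stand-in for compaction + HNSW index build on a sealed segment.
            self.sealed.append(sorted(self.growing))
            self.growing = []

    def search(self, predicate):
        """Search both the growing segment and all sealed segments."""
        hits = [d for d in self.growing if predicate(d)]
        for seg in self.sealed:
            hits += [d for d in seg if predicate(d)]
        return hits

c = Collection()
c.insert("a")
c.insert("b")
# Writes are visible before any index has been built.
assert "b" in c.search(lambda d: True)
```

The property this buys is that index construction never blocks write visibility, so there is no window in which fresh data is unsearchable.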
The result is a system that supports real-time writes, predictable recall, and high throughput—without requiring periodic full index rebuilds.
Turbopuffer Is a Search Index, Not a Database
This is the hardest section to write fairly, because Turbopuffer is built to be a fast search layer, not a general-purpose database. The problem is that the distinction matters enormously in production, and it's easy to miss until you've already built on top of it.
Recall Collapses Under Filtering
We ran a benchmark: 1,000 queries, top-100 results, against 10M vectors. Standard scenario, baseline numbers looked fine.
Then we added a simple range filter — filtering by tenant ID, returning only results belonging to a specific customer. Multi-tenant filtering is not an exotic requirement. It's a baseline expectation for any shared-deployment AI product.
Results:
| Filter selectivity | Turbopuffer recall | Comparable VDBs |
|---|---|---|
| id > 50% (broad) | 0.78 | 0.99+ |
| id > 90% (moderate) | 0.69 | 0.99+ |
| id > 99% (narrow) | 0.54 | 0.99+ |
At the narrowest filter — querying a small tenant — recall dropped to 0.54. For every two relevant results, Turbopuffer was missing one.
The architectural reason: Turbopuffer applies filters as post-processing on ANN search results rather than building filter-aware indexes. When the filter is selective, the ANN candidate pool doesn't contain enough matching documents, and recall falls sharply. Most production vector databases address this through hybrid filtering or pre-filtering strategies that maintain recall across selectivity ranges. It's a known hard problem with known solutions.
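The mechanism is easy to demonstrate with a toy simulation: draw a fixed ANN candidate pool, apply the filter afterward, and count how many of the requested top-k slots can still be filled. The pool size and selectivities here are illustrative, not a reproduction of the benchmark above:

```python
import random

random.seed(0)

TOP_K = 100    # results the caller asked for
POOL = 1_200   # fixed ANN candidate pool retrieved before filtering

def post_filter_recall(selectivity):
    """Fraction of top-k slots fillable when the filter passes
    `selectivity` of documents, applied after ANN retrieval."""
    survivors = sum(random.random() < selectivity for _ in range(POOL))
    return min(survivors, TOP_K) / TOP_K

broad = post_filter_recall(0.50)   # ~600 survivors: all 100 slots filled
narrow = post_filter_recall(0.01)  # ~12 survivors: most slots empty

print(broad, narrow)
```

With post-filtering, result quality is hostage to the candidate pool size; filter-aware approaches instead steer the search toward matching documents before candidates are collected.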
What made this worse: Turbopuffer provides no parameters to tune recall. No ef_search equivalent, no way to expand the candidate pool, no configuration to trade latency for accuracy. We tried increasing top_k to retrieve more candidates and filter down — but at the maximum allowed value of 1,200, the system returned only ~500 results.
Recall and result count both failed simultaneously. A recommendation system might survive low recall — users don't see the results you missed. But a search box or RAG pipeline where the user sees an empty page, or an assistant that silently has half the context it should — that fails users visibly.
How Zilliz Cloud addresses this problem
First: Cardinal, a proprietary quantization algorithm. Traditional vector quantization forces a fixed tradeoff between compression ratio and recall. As compression becomes more aggressive, the precision of vector representations decreases, and recall inevitably drops. Cardinal significantly improves this tradeoff. By applying more precise quantization techniques, it achieves higher recall at the same bit width compared to conventional approaches. In practice, this allows far more vectors to fit into memory while maintaining high search quality.
Second: filter-aware indexing. Filtered queries are notoriously difficult to optimize in vector search systems. Many engines treat filtering as a post-processing step: first retrieve candidate vectors, then apply filters afterward. This approach works reasonably well for high-cardinality datasets but quickly breaks down for multi-tenant workloads where filter selectivity varies dramatically.
Zilliz Cloud takes a different approach. Filter-condition distributions are incorporated during index construction, allowing the system to adapt search strategies based on how data is actually partitioned. Combined with dynamic query routing, this ensures that recall remains above 90% across a wide range of filter cardinalities—whether querying a tenant with millions of vectors or one with only a few hundred.
Consistency Isn't What You'd Expect
We also tested delete_by_filter, which removes a subset of documents.
The API returned success immediately. A subsequent query confirmed the deleted documents weren't in results. Promising.
Then we noticed the result counts were wrong. We'd requested top-100. We were getting 48 results. Then 64. Then 69. The count crept upward over the next 60 minutes before stabilizing at 100.
What happened: the delete removed documents from results (soft delete) but didn't trigger an index rebuild. The underlying index still reflected the pre-delete state. The index converged over time, with no notification, no progress indicator, no way to force a rebuild.
For a recommendation system, eventual consistency with an unknown convergence window might be acceptable. For a search product or RAG pipeline, it isn't. Your users see incomplete result sets. Your downstream systems read wrong counts. And you have no way to know when the system has caught up.
A real database makes consistency guarantees explicit. You know when a write is committed and when it's visible. What we observed was eventual consistency with an indeterminate convergence time — a design choice, but one with real consequences for applications that depend on accurate counts and complete result sets.
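Until such guarantees exist, a defensive client-side pattern is to poll until the result count stabilizes across consecutive reads. A minimal sketch, where `query_count` is a stand-in for your actual search call:

```python
import time

def wait_for_stable_count(query_count, stable_reads=3,
                          interval_s=0.0, timeout_s=10.0):
    """Poll until the same count is observed `stable_reads` times in a row."""
    deadline = time.monotonic() + timeout_s
    last, streak = None, 0
    while time.monotonic() < deadline:
        count = query_count()
        streak = streak + 1 if count == last else 1
        last = count
        if streak >= stable_reads:
            return count
        time.sleep(interval_s)
    raise TimeoutError("result count never stabilized")

# Simulated backend converging from 48 -> 100 over successive reads,
# mirroring the behavior we observed after delete_by_filter.
counts = iter([48, 64, 69, 100, 100, 100])
final = wait_for_stable_count(lambda: next(counts))
```

This is a workaround, not a fix: it adds latency to every read and still cannot distinguish "converged" from "coincidentally stable."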
Dedicated or Serverless Databases? It's a Choice, Not an Answer
Turbopuffer offers a single deployment model: serverless. In some situations, that’s a strength. In others, it becomes a limitation.
Zilliz Cloud supports both serverless and dedicated deployments—not to add product complexity, but because these two scenarios represent fundamentally different operational needs.
- Serverless works well for early exploration, unpredictable workloads, and applications that can tolerate higher p99 latency. It removes the need for infrastructure management, offers pay-per-use billing, and allows teams to get started in minutes. If a product is still in validation or queries are infrequent, the low operational overhead of serverless can provide real value.
- Dedicated deployments, on the other hand, are designed for production systems: multi-tenant workloads, sustained query traffic, and environments with SLA commitments. Dedicated compute resources provide predictable p99 latency and capacity planning. Costs scale with actual infrastructure usage rather than indirect factors such as namespace size, which makes long-term cost behavior easier to reason about.
Both deployment modes share the same underlying query and indexing engine. Features such as Cardinal quantization and filter-aware indexing are available in both environments, ensuring consistent search quality and performance regardless of deployment model.
In practice, serverless and dedicated are not competing answers to the same problem. They serve different phases of a system’s lifecycle—from early experimentation to large-scale production workloads.
The Cost of Consequence, Quantified
Let’s return to the question we started with: why does pay-per-use pricing so often end up costing more than expected?
Because when engineers evaluate infrastructure, we tend to focus on the unit price on the pricing page, rather than the actual bill determined by usage patterns, architectural constraints, and operational side effects.
DHH reached this conclusion after years of paying $180K per month in AWS bills. By the time the realization came, the cost had already been paid.
Our goal here is simpler: help teams evaluating vector databases reach that realization before they scale into it.
Below is a simplified snapshot from our evaluation.
| Dimension | Turbopuffer (our test) | Zilliz Cloud |
|---|---|---|
| Estimated monthly cost | $220 | ~$400 (8CU dedicated) |
| Actual cost at scale | $1,000+ | Predictable |
| Cold query p99 | Up to 4s | <50ms (global cache) |
| Warm query p99 | ~30ms | ~20ms |
| Recall (narrow filter) | 0.54 | 0.99+ |
| Post-delete consistency | ~1 hour to propagate | Immediate |
| Recall tunability | None | Full parameter control |
But infrastructure cost isn’t just about money. It also includes the operational consequences that appear once a system is running in production:
- Cold-start p99 spikes, and on-call incidents that are hard to explain from dashboards
- Write visibility delays, with users reporting that documents they just uploaded don’t appear in search
- Index degradation over time, and RAG answer quality mysteriously declining as data changes
- SPFresh’s QPS ceiling, and discovering you need 10× more nodes once traffic grows
- Recall collapse under narrow filters, producing worse results in multi-tenant workloads
- Eventual consistency windows, and the debugging time spent reconciling inconsistent result counts
- Tenant-size billing multipliers, and explaining to finance why the infrastructure bill is 10× higher than the estimate
All of these costs are real.
All of them show up in production.
And none of them appear on a pricing page.
Choosing Infrastructure Like It's a Production Decision
The pricing calculator shows what the infrastructure costs. It doesn't show what you're implicitly agreeing to: how cache works, how consistency is handled, what happens to recall when filters get selective, and what your users experience when a namespace goes cold.
Test infrastructure at production scale before you commit to it. Measure p99, not just p50. Test with realistic access patterns — uneven tenant distribution, real filter conditions, real QPS variance. Measure the things your users will notice.
For teams building multi-tenant AI applications that need predictable costs, consistent recall across filter conditions, and latency guarantees — Zilliz Cloud is worth evaluating. The free tier covers prototyping. Dedicated clusters give you the guarantees that serverless architectures, by design, can't provide.
The cheap option isn't always the expensive one. But the one that surprises you in production usually is.
*We'll publish our full benchmark methodology and raw data separately. If you've run similar tests with different results, we'd like to hear about it.*