Data Deduplication at Trillion Scale: How to Solve the Biggest Bottleneck of LLM Training

The LLM Scaling Race—and Its Unseen Cost
LLMs have transformed nearly every facet of modern AI, unlocking new frontiers in content generation, software development, reasoning, and autonomous tool use. And their capabilities show no signs of slowing.
Take the most recent announcements: xAI's Grok 4 and Moonshot AI's Kimi K2 introduce stronger reasoning capabilities, better tool use, and more coherent generation, all fueled by significantly larger and more diverse training corpora.
The trend is clear: the frontier of capability is being pushed forward by training at an unprecedented scale. Consider the data footprints of recent models:
| Model | Release | Parameters | Training Data |
|---|---|---|---|
| Kimi K2 | 2025 | 1T | 15.5 trillion tokens |
| Grok 4 | 2025 | ~175B | 100x more than Grok 2, probably trillion-scale |
| GPT-4 | 2023 | 1.8T (est.) | 13 trillion tokens |
| LLaMA 3.1 | 2024 | 405B | 15 trillion tokens |
Kimi K2, for instance, tripled its dataset size in just six months, a growth rate that rivals the expansion of the web itself. At 15.5 trillion tokens, its corpus exceeds the combined text holdings of the world's largest libraries many times over.
But buried inside this growth is an assumption that more data always equals better performance. In practice, the marginal gains from simply adding more tokens are diminishing and increasingly constrained by data quality.
That brings us to the first—and perhaps most underappreciated—bottleneck in modern LLM training: data duplication.
Data Deduplication: Why It Matters for LLM Training
Modern LLM pretraining datasets are primarily sourced from large-scale web crawls, open repositories, public corpora, and domain-specific documents scraped from the web. As these pipelines grow, redundancy becomes not just common but systemic.
Documents with minor variations (e.g., formatting changes, footers, boilerplate headers) reappear across different domains. Popular pages are mirrored, translated, or reposted on other websites. Codebases and knowledge articles are duplicated across forums, wikis, and archived snapshots. Even structured sources, such as Wikipedia, exhibit repetition in link trails and mirrors.
When this duplicated content flows unchecked into training sets, the consequences are significant:
Compute Inefficiency: Repeated examples provide no new information but consume the same compute resources.
Overfitting Risk: LLMs exposed to repeated phrasing, structure, or content patterns can become overly reliant on these patterns, thereby reducing their generalization capabilities.
Verbatim Memorization: High duplication increases the risk of models memorizing specific sequences, raising safety, privacy, and IP concerns.
Evaluation Leakage: If duplicates exist across the training and validation/test sets, benchmark scores can be artificially inflated, providing a misleading impression of model quality.
In short, duplication isn’t just a nuisance. It is an existential problem for high-scale, high-cost training runs.
One of our enterprise customers—a top-tier LLM provider—encountered exactly this issue. They needed to deduplicate tens of billions of documents before they were ingested. Exact hash matching tools missed near-duplicates. Semantic models were too expensive to run at scale. And traditional data-cleaning stacks simply couldn't meet their time and resource constraints.
This is the context in which deduplication has become mission-critical infrastructure, not a preprocessing afterthought.
An Overview of Deduplication Techniques
There are three dominant strategies for deduplication at scale, each with trade-offs in terms of precision, cost, and feasibility.
Exact Matching: Uses cryptographic hashing to find identical documents. Fast and precise, but misses near-duplicates with minor formatting differences.
Semantic Matching: Leverages vector embedding models to find conceptually similar content. Highly accurate but computationally expensive at scale.
Approximate Matching: Finds near-duplicates using probabilistic algorithms like MinHash LSH and Jaccard similarity. Balances accuracy with computational efficiency—perfect for trillion-token datasets.
With pretraining corpora reaching terabytes or even petabytes, traditional exact matching methods, such as pairwise comparisons, are computationally infeasible. Semantic deduplication adds significant overhead by using embedding models to generate vectors.
We need more innovative approximate methods—like MinHash LSH—that balance recall and precision while keeping costs manageable, making large-scale deduplication practical.
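To make these trade-offs concrete before diving into MinHash LSH, here is a minimal, self-contained Python sketch of the two baselines it improves on: exact matching via a content hash (fast but brittle) and true Jaccard similarity over shingle sets (robust but requiring a set intersection for every document pair). The normalization step and 5-character shingles are illustrative choices, not a prescribed recipe.

```python
import hashlib

def normalize(text: str) -> str:
    # Minimal normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def exact_fingerprint(text: str) -> str:
    # Exact matching: only byte-identical (normalized) documents share a fingerprint.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 5) -> set:
    # Overlapping character k-grams ("shingles") of the normalized document.
    t = normalize(text)
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    # True Jaccard similarity |A ∩ B| / |A ∪ B|: exact, but needs one set intersection per pair.
    return len(a & b) / len(a | b) if (a or b) else 1.0

doc1 = "The quick brown fox jumps over the lazy dog."
doc2 = "The quick brown fox jumps over the lazy dog!"  # a near-duplicate

print(exact_fingerprint(doc1) == exact_fingerprint(doc2))  # False: exact matching misses it
print(round(jaccard(shingles(doc1), shingles(doc2)), 3))   # high: the shingle sets overlap heavily
```

The exact hash catches perfect copies in one pass, but the near-duplicate above slips through; the Jaccard score catches it, but computing it for every pair of documents is exactly the quadratic cost that MinHash LSH is designed to avoid.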
MinHash LSH: Detecting Near-Duplicates in Trillion-Scale Datasets
In the context of large-scale LLM training, efficient deduplication requires a matching algorithm that is not only accurate but computationally feasible at the scale of tens of billions of documents. MinHash LSH (Locality Sensitive Hashing) is purpose-built for exactly this kind of scenario.
MinHash: Scalable Similarity Estimation
MinHash is a probabilistic technique designed to estimate the Jaccard similarity between sets, without computing explicit pairwise intersections. In the context of document deduplication, it acts as a lossy compression mechanism that preserves the similarity structure across massive corpora.
The process works as follows:
Each document is decomposed into a set of shingles, typically fixed-length character or word n-grams.
A series of independent hash functions is applied to these sets.
For each hash function, the minimum resulting value across the shingle set is retained.
This produces a fixed-length MinHash signature for each document. The critical property is this: for any two documents, the probability that their signatures agree at a given position equals their Jaccard similarity, so the fraction of matching positions is an unbiased estimate of it.
This dramatically reduces the computational burden for large-scale similarity detection. Instead of comparing full documents, we compare short signature vectors. But there's a scaling problem. Even with this optimization, comparing every document pair remains computationally infeasible at web scale.
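Here is a minimal sketch of that process in plain Python. The 5-character shingles and the salted blake2b hash family are illustrative assumptions; production pipelines typically use faster non-cryptographic hashes, but the structure is the same.

```python
import hashlib
import struct

NUM_HASHES = 128  # signature length; more hash functions -> lower estimation variance

def shingles(text: str, k: int = 5) -> set:
    # Normalize, then take overlapping character k-grams ("shingles").
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def make_hash(seed: int):
    # One simple way to get independent hash functions: salt a base hash with a seed.
    salt = struct.pack("<Q", seed)  # 8-byte salt (blake2b allows up to 16)
    def h(shingle: str) -> int:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=4, salt=salt).digest()
        return int.from_bytes(digest, "little")  # a uint32 hash value
    return h

HASH_FUNCS = [make_hash(seed) for seed in range(NUM_HASHES)]

def minhash_signature(text: str) -> list:
    # For each hash function, keep only the minimum value over the document's shingle set.
    sh = shingles(text)
    return [min(h(s) for s in sh) for h in HASH_FUNCS]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    # The fraction of positions where two signatures agree estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig1 = minhash_signature("The quick brown fox jumps over the lazy dog.")
sig2 = minhash_signature("The quick brown fox jumps over the lazy dog!")
print(estimated_jaccard(sig1, sig2))  # close to the true Jaccard similarity of the shingle sets
```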
Locality Sensitive Hashing: Accelerating Similarity Search
To make MinHash practical for billion-scale corpora, we apply Locality Sensitive Hashing (LSH) on top of the signature vectors. The core idea of LSH is to increase the likelihood of similar documents colliding in at least one hash bucket, without requiring exhaustive comparison.
Here's how it works:
Each MinHash signature is divided into multiple bands, each containing a subset of the signature dimensions.
Each band is independently hashed into a bucket.
If two documents share at least one band that hashes to the same bucket, they are considered candidates for potential duplication.
This banding strategy ensures that documents with high similarity (i.e., many shared MinHash values) are much more likely to collide. By adjusting the number of bands and rows per band, we can trade off between recall (the fraction of true duplicates that are caught), precision (the fraction of candidate pairs that are actually duplicates), and performance.
The result is a scalable, approximate deduplication system that remains tractable even when applied to corpora containing tens of billions of documents.
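A minimal sketch of the banding step, assuming 128-value signatures like those generated above (the 32x4 band/row split and the in-memory bucketing are illustrative; a production system shards the buckets across workers). The candidate-probability helper shows how bands and rows shape the recall/precision trade-off.

```python
from collections import defaultdict
from itertools import combinations

BANDS, ROWS = 32, 4  # 32 bands x 4 rows per band = 128-value signatures

def lsh_candidate_pairs(signatures: dict) -> set:
    """signatures: {doc_id: [128 MinHash values]} -> candidate near-duplicate pairs."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            # Hash each band (a tuple of ROWS consecutive values) into its own bucket space.
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].append(doc_id)
    candidates = set()
    for ids in buckets.values():
        # Documents sharing a bucket in at least one band become candidate pairs.
        for a, b in combinations(sorted(ids), 2):
            candidates.add((a, b))
    return candidates

def candidate_probability(s: float, bands: int = BANDS, rows: int = ROWS) -> float:
    # Probability that two documents with Jaccard similarity s collide in at least one band:
    # 1 - (1 - s**rows)**bands. Tuning bands/rows moves the threshold of this S-curve.
    return 1 - (1 - s ** rows) ** bands

print(round(candidate_probability(0.9), 3))  # nearly 1.0: true near-duplicates get flagged
print(round(candidate_probability(0.2), 3))  # small: dissimilar pairs are rarely compared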
Integrating MinHash LSH with Milvus and Zilliz Cloud
Traditionally, deduplication is handled by standalone preprocessing pipelines disconnected from the primary retrieval or storage infrastructure. This introduces a range of inefficiencies:
Costly data transfer between the deduplication and vector indexing components.
Duplicated logic for data normalization and shingling.
Difficulty scaling deduplication and retrieval pipelines together.
We approached the problem differently. Recognizing Milvus’s strength as a high-throughput vector database, we asked: What if MinHash LSH were a first-class, natively integrated indexing primitive?
This led to the native integration of MinHash LSH into Milvus 2.6 and Zilliz Cloud (managed Milvus), turning approximate deduplication into a core part of the vector indexing and retrieval workflow.
What This Integration Enables
End-to-end workflow: From ingestion and MinHash signature generation to approximate duplicate detection and downstream semantic retrieval—all within Milvus.
Distributed scale: Built atop Milvus’s cloud-native architecture, LSH indexing scales horizontally across terabytes or even petabytes of data.
Unified APIs: The same API used for semantic embedding search can now also support MinHash-based deduplication queries, making MLOps workflows cleaner and more maintainable.
In our current implementation:
Users generate MinHash signatures externally (e.g., using their preferred shingling and hash strategies).
These signature vectors (typically uint32 arrays) are inserted into Milvus.
LSH indexing narrows the candidate space for approximate duplicate detection using the banding strategy described above.
This design empowers teams to deduplicate training corpora at a massive scale without introducing additional storage layers or disconnected preprocessing logic.
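As a rough sketch, that workflow can look like the following with the pymilvus client. The collection layout is illustrative, and the index type, metric name, and parameters ("MINHASH_LSH", "MHJACCARD", "mh_element_bit_width") are assumptions based on the Milvus 2.6 MinHash LSH feature; check the Milvus 2.6 documentation for the exact names before relying on them.

```python
import numpy as np
from pymilvus import MilvusClient, DataType

NUM_HASHES = 128                 # MinHash signature length (uint32 values per document)
DIM_BITS = NUM_HASHES * 32       # binary-vector dimension in Milvus is expressed in bits

client = MilvusClient(uri="http://localhost:19530")  # or a Zilliz Cloud endpoint + token

# Schema: a primary key plus the MinHash signature stored as a binary vector.
schema = MilvusClient.create_schema(auto_id=False)
schema.add_field(field_name="doc_id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="minhash", datatype=DataType.BINARY_VECTOR, dim=DIM_BITS)

# Index parameters: names below follow the Milvus 2.6 MinHash LSH feature,
# but treat them as assumptions and verify against the current docs.
index_params = MilvusClient.prepare_index_params()
index_params.add_index(
    field_name="minhash",
    index_type="MINHASH_LSH",
    metric_type="MHJACCARD",
    params={"mh_element_bit_width": 32},  # each signature element is a 32-bit hash
)
client.create_collection("dedup_corpus", schema=schema, index_params=index_params)

def pack_signature(sig) -> bytes:
    # Serialize uint32 MinHash values into the raw bytes a binary-vector field expects.
    return np.asarray(sig, dtype=np.uint32).tobytes()

# Placeholder signature; in practice this comes from your own MinHash pipeline.
sig = [0] * NUM_HASHES

client.insert("dedup_corpus", data=[{"doc_id": 1, "minhash": pack_signature(sig)}])
hits = client.search(
    collection_name="dedup_corpus",
    data=[pack_signature(sig)],
    anns_field="minhash",
    limit=5,  # top near-duplicate candidates for downstream filtering
)
```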
We’ve also extended the underlying API to support workflows like hybrid insertion (semantic and MinHash vectors), dynamic index construction, and batch deduplication queries. These capabilities are still evolving, and we welcome feedback from teams deploying this in production.
The Engineering Challenges of Deduplicating Tens of Billions of Documents with MinHash LSH
Getting MinHash LSH to work in production has been the industry's white whale for years.
The challenge boils down to two brutal requirements:
You need deep expertise in both MinHash and LSH algorithms, as well as the engineering skills to integrate them seamlessly.
Any real-world use case for MinHash LSH involves deduplicating tens of billions, hundreds of billions, or even trillions of data points. This places crushing demands on performance and engineering capabilities that most teams cannot meet.
Here's a perfect example: About a year ago, a leading AI company approached us with a seemingly straightforward request. They needed to deduplicate tens of billions of data points (in a 780-dimensional int32 format), with the ability to quickly spin up services and rapidly process data for deduplication and insertion.
We immediately hit a showstopper: most vector databases default to float32 data formats, but MinHash vectors are collections of uint32 hash values.
At first glance, this seems like a non-issue—float32 can represent uint32 values in most cases, right?
Wrong.
Here's the gotcha: float32 can represent every integer exactly only up to 16,777,216 (2^24), while uint32 covers 0 to 4,294,967,295. Any hash value above 16,777,216 gets rounded, silently corrupting the least significant bits of the signature.
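A two-line check (NumPy used purely for illustration) makes the failure mode concrete:

```python
import numpy as np

# 16,777,216 (2**24) is the limit below which float32 represents every integer exactly.
print(np.float32(16_777_216) == np.float32(16_777_217))  # True: two distinct uint32 values collide

# A typical 32-bit hash value silently changes when round-tripped through float32.
h = np.uint32(0xDEADBEEF)     # 3735928559
print(int(np.float32(h)))     # 3735928576 -- the low bits are gone
```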
Fortunately, Milvus and Zilliz Cloud's binary vector support elegantly solves this problem.
This might seem like a minor technical detail, but it highlights a crucial point: you need a database designed from the outset to handle diverse data formats, massive scales, and varied enterprise requirements. If you're not building for enterprise-grade scenarios from the start, even tiny compatibility issues like this can turn into customer experience disasters down the road.
But the data format challenge was just the beginning—our customers also demanded extreme performance. During the integration process, the client put it bluntly: "I need to quickly spin up Zilliz Cloud services that can immediately perform high-precision vector deduplication. Every import involves 30GB files with 780-dimensional int32 signature data, and the entire import process must complete in under 15 minutes."
This looks like mission impossible at first glance, but we quickly delivered our answer: forget 15 minutes—we'll get it done in 4.
This performance breakthrough came from two key Milvus optimizations:
First, we implemented multi-file parallel processing that shattered the traditional serial import bottleneck. The system can now simultaneously handle multiple data files, dramatically boosting overall throughput and import speeds.
Second, we integrated dynamic resource allocation that intelligently schedules computational resources based on task complexity and volume. This eliminates resource waste and contention while maximizing utilization. Combined, these optimizations enable Milvus to fully leverage modern hardware capabilities and the concurrent read-write characteristics of cloud storage, delivering near real-time data import experiences.
Solving the import challenge was just step one—how do you handle rapid deployment and computation at massive scale?
Large AI model training scenarios create a perfect storm of demanding requirements. You're dealing with enormous incoming data volumes, massive existing databases, and peak loads that can reach 44,000 vector retrievals per second—the kind of extreme concurrency that crushes most systems. As data continues to flow in and your database grows exponentially, computational demands escalate accordingly, putting relentless pressure on system performance.
The solution requires serious distributed computing muscle. Zilliz Cloud's cloud-native architecture was specifically designed to address these challenges through intelligent workload distribution and elastic scaling.
The Secret Weapon: Cardinal Engine Integration
Looking ahead, MinHash LSH represents just the beginning. We're integrating this capability into Zilliz Cloud's proprietary Cardinal engine, which will further accelerate unstructured data processing across the board.
Cardinal is our next-generation AI-powered vector search engine, built from the ground up with modern C++ and state-of-the-art approximate nearest neighbor search (ANNS) algorithms. The goal is simple: handle more user requests with the same hardware resources.
Algorithm-Level Optimizations: Cardinal delivers extensive performance tuning for core algorithms like IVF and graph indexing, striking the optimal balance between speed and memory efficiency.
Engineering-Level Innovations: The engine features custom memory allocators and intelligent memory pooling, along with a modular component architecture that enables flexible composition of the search pipeline. Each pipeline can be fine-tuned for specific, mission-critical use cases.
Hardware-Specific Optimization: Cardinal includes multiple specialized compute kernels, each hand-optimized for particular hardware platforms and workload patterns.
These comprehensive optimizations enable Cardinal to operate at maximum efficiency around the clock, delivering industry-leading vector search performance. With Cardinal powering Zilliz Cloud, we've achieved 10x performance improvements over open-source Milvus, combined with ultra-fast query speeds and high recall rates. Whether you're processing massive datasets or building applications that demand lightning-fast response times, Cardinal provides the performance foundation for superior user experiences and competitive AI applications.
The Future Is Unstructured—And We're Ready for It
LLM training data deduplication is just the opening act in a much bigger transformation story. IDC predicts that by 2027, unstructured data will explode to nearly 250ZB globally—representing 86.8% of all data in existence. While this data costs significantly more to process and store than structured data, the value locked inside text, images, audio, video, sensor logs, social media content, PDFs, web pages, code repositories, medical imaging, and satellite photos is impossible to ignore.
This creates the defining challenge of our era: how do we efficiently extract value from exponentially growing unstructured data without breaking the bank?
The deduplication capabilities we've built for AI training represent just one piece of this larger puzzle. As unstructured data continues its explosive growth, the same principles—intelligent algorithms, enterprise-scale engineering, and cloud-native performance—will become essential infrastructure for every data-driven organization.
The future belongs to companies that can turn unstructured data chaos into a structured competitive advantage. We're building that future, one algorithm at a time. Ready to join us?
Ready to explore large-scale deduplication for your AI training pipeline? Learn more about Milvus 2.6's MinHash LSH capabilities in our comprehensive documentation, try Zilliz Cloud (managed Milvus) for production workloads, or connect with our engineering team on Discord to discuss your specific use case.