PII Masker and Zilliz Cloud Integration
PII Masker and Zilliz Cloud combine to power privacy-compliant RAG applications: PII Masker, an open-source tool, detects and masks Personally Identifiable Information before ingestion, while Zilliz Cloud's high-performance vector database provides secure, scalable retrieval without exposing sensitive data.
What is PII Masker
PII Masker, developed by HydroX AI, is an advanced open-source tool designed to protect sensitive data by leveraging cutting-edge AI models. It detects and masks or replaces Personally Identifiable Information (PII) including names, addresses, phone numbers, email addresses, and ID numbers. Using the DeBERTa-v3 NLP model with support for up to 1,024 tokens, PII Masker efficiently processes large datasets while safeguarding sensitive information — particularly valuable for AI applications that handle customer service logs, medical records, and financial documents.
By integrating with Zilliz Cloud (fully managed Milvus), PII Masker adds an additional layer of security, filtering out or anonymizing PII before data is ingested into the vector database. This enables organizations to build high-performance, scalable RAG systems that handle complex, large-scale datasets without compromising privacy or regulatory compliance.
Benefits of the PII Masker + Zilliz Cloud Integration
- Privacy-compliant RAG: PII Masker automatically detects and masks sensitive information before data is stored in Zilliz Cloud, ensuring RAG applications never expose personal data in LLM responses.
- AI-powered PII detection: Using the DeBERTa-v3 model, PII Masker identifies sensitive data with high precision across names, addresses, phone numbers, emails, and ID numbers — more reliable than rule-based approaches.
- Layered security architecture: The integration creates a two-layer approach where PII Masker filters sensitive data before ingestion and Zilliz Cloud provides secure vector storage and retrieval, ensuring end-to-end privacy protection.
- Compliance support: Organizations handling customer data can meet privacy regulations while still building powerful AI applications, as PII is masked before it enters the vector database and LLM pipeline.
How the Integration Works
PII Masker serves as the data privacy layer, processing text data before it enters the RAG pipeline. It uses the DeBERTa-v3 NLP model to detect PII entities (names, addresses, phone numbers, etc.) and replaces them with mask tokens (e.g., `[B-NAME]`, `[B-PHONE_NUM]`, `[I-STREET_ADDRESS]`), ensuring sensitive information is removed before embedding and storage.
Zilliz Cloud serves as the vector database layer, storing and indexing the masked text embeddings for fast similarity search. Since data is already anonymized before ingestion, the vector database contains only privacy-safe content that can be retrieved without risk of PII exposure.
Together, PII Masker and Zilliz Cloud create a secure RAG pipeline: raw text containing PII is processed by PII Masker to replace sensitive information with mask tokens, the masked text is then embedded and stored in Zilliz Cloud, and when users query the system, the LLM generates responses based on the anonymized context — effectively preventing any PII from being exposed in the output.
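To illustrate the shape of the masking transformation, here is a toy stand-in that replaces a few easily recognizable PII patterns with the same style of mask tokens that PII Masker emits. This is a simple regex sketch for illustration only, not the actual DeBERTa-based model:

```python
import re

# Toy illustration (not the real PIIMasker): replace a couple of
# easily recognizable PII patterns with PII Masker-style mask tokens.
def toy_mask_pii(text: str) -> str:
    text = re.sub(r"\b\d{3}-\d{3}-\d{4}\b", "[B-PHONE_NUM]", text)
    text = re.sub(r"\b[\w.]+@[\w.]+\.\w+\b", "[B-EMAIL]", text)
    return text

raw = "Contact Alice at alice@example.com or 555-123-4567."
masked = toy_mask_pii(raw)
print(masked)  # the email and phone number are replaced with mask tokens
```

Only the masked text is embedded and stored, so the original identifiers never reach the vector database or the LLM.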
Step-by-Step Guide
1. Get Started with PII Masker
Clone the repository and download the model:
```shell
$ git clone https://github.com/HydroXai/pii-masker-v1.git
$ cd pii-masker-v1/pii-masker
```

Download the model weights from https://huggingface.co/hydroxai/pii_model_weight and place the downloaded files in:

```
pii-masker/output_model/deberta3base_1024/
```

2. Install Dependencies
```shell
$ pip install --upgrade pymilvus openai requests tqdm dataset
```

Set up the OpenAI API key:

```shell
$ export OPENAI_API_KEY=sk-***********
```

3. Prepare and Mask the Data
Create sample text lines containing PII, then mask them using PII Masker:
```python
from model import PIIMasker

masker = PIIMasker()

text_lines = [
    "Alice Johnson, a resident of Dublin, Ireland, attended a flower festival at Hyde Park on May 15, 2023. She entered the park at noon using her digital passport, number 23456789...",
    "Hiroshi Tanaka, a businessman from Tokyo, Japan, went to attend a tech expo at the Berlin Convention Center on November 10, 2023...",
]

masked_results = []
for full_text in text_lines:
    masked_text, _ = masker.mask_pii(full_text)
    masked_results.append(masked_text)

for res in masked_results:
    print(res + "\n")
```

4. Prepare the Embedding Model
Initialize the OpenAI client and define the embedding function:
```python
from openai import OpenAI

openai_client = OpenAI()


def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )


test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
```

5. Load Masked Data into Milvus
Create a collection and insert the masked, embedded data:
```python
from pymilvus import MilvusClient
from tqdm import tqdm

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",
    consistency_level="Strong",
)

data = []
for i, line in enumerate(tqdm(masked_results, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
```

As for the argument of `MilvusClient`:

- Setting the `uri` as a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically uses Milvus Lite to store all data in this file.
- If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the `uri` and `token` to match the Public Endpoint and API Key of your Zilliz Cloud cluster.

6. Build RAG and Verify PII Protection
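For the Zilliz Cloud case, only the client construction changes. The values below are placeholders (assumptions): substitute your cluster's actual Public Endpoint and API Key from the Zilliz Cloud console:

```python
from pymilvus import MilvusClient

# Placeholder credentials (assumptions): copy the real Public Endpoint
# and API Key from your Zilliz Cloud console.
milvus_client = MilvusClient(
    uri="https://<your-public-endpoint>.zillizcloud.com",
    token="<your-api-key>",
)
```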
Search for relevant context and generate a response:
```python
question = "What was the office address of Hiroshi's partner from Munich?"

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],
    limit=1,
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],
)

context = "\n".join([res["entity"]["text"] for res in search_res[0]])

SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided. If there are no useful information in the snippets, just say "I don't know".
AI:
"""

USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)
```

Since the PII has been replaced with mask tokens, the LLM cannot access the sensitive information in the context and answers: "I don't know." This effectively protects the privacy of users.
Learn More
- Build RAG with Milvus + PII Masker — Official Milvus tutorial for building RAG with PII Masker
- Building Safe RAG with PII Masker and Milvus — Zilliz blog on safe RAG with PII masking
- PII Masker GitHub Repository — PII Masker source code
- HydroX AI Website — HydroX AI official website
- Milvus GitHub Repository — Milvus source code and community resources