Safe RAG with HydroX AI and Zilliz: PII Masking for Responsible GenAI

Nov 07, 2024 · 3 min read

With the rapid growth of artificial intelligence, a huge amount of unstructured data—like web content and private information—has been used to train AI models and empower GenAI applications such as Retrieval Augmented Generation (RAG). Protecting Personally Identifiable Information (PII) has become essential for responsibly using this data, particularly during model training and inference.

To meet this critical need, Zilliz, the creator of the open-source Milvus vector database, has partnered with HydroX AI to introduce the PII Masker, an advanced tool designed to enhance data privacy in AI applications.

The Importance of PII Safety and GenAI Security

Generative AI (GenAI) models have opened new possibilities in content creation, question answering, and information analysis, but they also bring specific security challenges. Since GenAI models are trained on massive, diverse datasets, they can unintentionally learn and reproduce sensitive PII embedded within this data. This risk becomes especially concerning when private data could be unintentionally revealed in the model’s output.

Ensuring data safety in GenAI workflows is essential—not only to help organizations stay compliant but also to improve model performance by reducing data leaks and minimizing hallucinations, where models produce incorrect or misleading information.

PII Masker adds an important layer of security to GenAI models by filtering out PII before data is stored in vector databases like Milvus or Zilliz Cloud (the managed version of Milvus). This step significantly reduces the risk of exposing sensitive information, particularly when using vector databases to store unstructured data and their high-dimensional vector representations for similarity searches and semantic understanding in GenAI applications.

Vector Databases and GenAI: A Perfect Match with a Need for Safety

Vector databases like Milvus are the backbone of many GenAI applications, efficiently storing, indexing, and retrieving vector embeddings. In scenarios like image, text, and video search, Milvus enables GenAI models to operate with grounded information to generate high-quality answers, offering a scalable solution for AI-driven applications across industries, from healthcare to finance. However, vector embeddings can often contain traces of PII, which are challenging to detect with traditional methods, making innovative solutions for data privacy essential for downstream applications.

PII Masker plays a pivotal role here. By anonymizing or masking PII with PII Masker before data reaches the vector database, organizations can protect privacy at every layer of their data pipeline. PII Masker integrates with both Milvus and Zilliz Cloud, so users can build GenAI applications with confidence while protecting user data and keeping their knowledge bases and RAG applications compliant with privacy regulations.
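To make the ordering concrete, here is a minimal sketch of the mask-then-store pattern. It is not PII Masker itself: the regular expressions are a deliberately simple stand-in for the model-based detection, and `ToyVectorStore` is a hypothetical in-memory placeholder for a Milvus or Zilliz Cloud collection. The only point it illustrates is that masking runs before anything is written to the store.

```python
import re
from dataclasses import dataclass, field

# Stand-in patterns for illustration only; the real PII Masker uses a
# DeBERTa-v3 model, not regular expressions.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def mask_pii(text: str) -> tuple[str, dict[str, list[str]]]:
    """Replace matches with [LABEL] placeholders and report what was found."""
    found: dict[str, list[str]] = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
            text = pattern.sub(f"[{label}]", text)
    return text, found


@dataclass
class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector database collection."""

    rows: list[str] = field(default_factory=list)

    def insert(self, text: str) -> None:
        # A real pipeline would embed `text` and store the vector alongside it.
        self.rows.append(text)


def ingest(docs: list[str], store: ToyVectorStore) -> None:
    """Mask every document first, so raw PII never reaches the store."""
    for doc in docs:
        masked, _found = mask_pii(doc)
        store.insert(masked)


store = ToyVectorStore()
ingest(["Contact Ana at ana@example.com or +1 415-555-0100."], store)
print(store.rows[0])
```

Because the embedding step would run on the masked text, neither the stored text nor its vector is derived from the raw identifiers.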

Key Features of PII Masker for AI Model Safety

Developed by HydroX AI in collaboration with Zilliz, PII Masker automatically detects and masks PII with high precision. Using the DeBERTa-v3 NLP model, PII Masker identifies sensitive information and provides structured output for easy handling. It supports inputs of up to 1,024 tokens, which lets it work through large datasets efficiently while safeguarding PII. This capability helps prevent RAG and other GenAI applications from accidentally exposing sensitive information in responses, reducing data leakage risks and keeping queries private.
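If the 1,024-token limit applies to each input (an assumption here, not something stated above), longer documents need to be split before masking. The sketch below shows one simple way to do that. The function name `split_into_windows` is hypothetical, and whitespace-separated words stand in for model tokens, so treat the counts as approximate.

```python
def split_into_windows(text: str, max_tokens: int = 1024) -> list[str]:
    """Split text into consecutive windows of at most `max_tokens` tokens.

    Whitespace-separated words stand in for model tokens here; a real
    tokenizer usually produces more tokens than words, so a production
    version should count tokens with the model's own tokenizer.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


windows = split_into_windows("word " * 2500)
print([len(w.split()) for w in windows])  # [1024, 1024, 452]
```

Splitting on fixed boundaries can cut a name or address in half, so an overlapping window or sentence-aware splitter may be a better fit for real data.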

The Future of PII Masker

While PII Masker already delivers substantial benefits, HydroX AI is committed to advancing its capabilities. Here are two areas of evolution on the horizon:

Expanded Language Support: As AI applications grow globally, ensuring PII safety across multiple languages is essential. Future versions of PII Masker will broaden its language capabilities to serve diverse data pools, making it a more versatile tool for international organizations.
Improved Detection of Contextual PII: Currently, PII Masker detects explicit PII such as names, addresses, and phone numbers. Future iterations aim to identify and mask contextually implied PII as well: information that is not sensitive on its own but could reveal an identity when combined with other data.

Getting Started with PII Masker

For developers interested in implementing RAG applications that protect PII, PII Masker offers a straightforward API designed for seamless integration into existing workflows. By cloning the repository, installing dependencies, and executing a few lines of code, developers can begin masking sensitive data efficiently. This collaboration between Zilliz and HydroX AI facilitates the creation of AI applications that respect user privacy and adhere to global regulations.
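As a rough picture of what those few lines might look like, the sketch below wraps a masker behind a small interface. The `mask_pii(text)` method returning a `(masked_text, detected_pii)` pair is an assumption about the call shape, so check the repository's README for the actual API. `FakeMasker` is a hypothetical stand-in that lets the example run without the model installed.

```python
from typing import Protocol


class Masker(Protocol):
    """Assumed interface: mask_pii(text) returns (masked_text, detected_pii)."""

    def mask_pii(self, text: str) -> tuple[str, dict]: ...


def mask_documents(masker: Masker, docs: list[str]) -> list[str]:
    """Return the masked version of each document, discarding the PII details."""
    masked_docs = []
    for doc in docs:
        masked_text, _detected = masker.mask_pii(doc)
        masked_docs.append(masked_text)
    return masked_docs


class FakeMasker:
    """Hypothetical test double that masks one hard-coded name."""

    def mask_pii(self, text: str) -> tuple[str, dict]:
        if "Alice" in text:
            return text.replace("Alice", "[NAME]"), {"NAME": ["Alice"]}
        return text, {}


print(mask_documents(FakeMasker(), ["Alice joined in 2021.", "No names here."]))
```

Keeping the masker behind an interface like this means the rest of the ingestion code does not change when the stand-in is swapped for the real model.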

Zhuo Li, Founder and CEO of HydroX AI, highlights the significance of this initiative: "Incorporating PII Masker into AI workflows ensures that sensitive information is protected, enabling organizations to innovate confidently while upholding the highest standards of data privacy."

To learn more about how PII Masker can enhance data protection while advancing AI capabilities, visit the PII Masker GitHub repository or check out our step-by-step guide on building RAG with PII Masker and Milvus.

Updated on Mar 31, 2025

Jiang Chen
Jiang is currently Head of Ecosystem and Developer Relations at Zilliz. He has years of experience in data infrastructure and cloud security. Before joining Zilliz, he served as a tech lead and product manager at Google, where he led the development of web-scale semantic understanding and search indexing that powers innovative search products such as short video search. He has extensive industry experience handling massive unstructured data and multimedia content retrieval. He has also worked on cloud authorization systems and research on data privacy technologies. Jiang holds a Master's degree in Computer Science from the University of Michigan.
Victor Bian
Chief of Staff, HydroX AI
