Blog
Insights into LLM Security from the World’s Largest Red Team

Insights into LLM Security from the World’s Largest Red Team

Jan 17, 20258 min read

What if the biggest threat to AI wasn't malicious code? It may be cleverly crafted text. The Gandalf project, a gamified approach to AI security, exposed how easily large language models (LLMs) can be manipulated through prompt injection. Presented by Max Mathys, an ML engineer from Lakera AI, at a recent Zilliz Unstructured Data meetup, this unique experiment in AI security transformed a simple game into the world's largest red team. It attracted hundreds of users and generated over 40 million prompts.

The core challenge was to extract a secret password from an AI named Gandalf using text-based interactions. This seemingly simple game provided a practical, real-world example of adversarial machine learning, where attackers intentionally try to deceive or manipulate AI models.

The project's findings showed the critical need for powerful AI security measures. Many user prompts were successful attacks that bypassed Gandalf's defenses, highlighting how vulnerable AI systems can be. As LLMs become increasingly integrated into our daily lives, the importance of building resilient defenses against prompt injection cannot be overstated.

In this blog, we’ll recap the main points from Max’s talk and discuss the significance of prompt engineering and its substantial effect on LLMs. We will also go over how the Gandalf project revealed LLMs' vulnerabilities to adversarial attacks. Additionally, we will address the role of vector databases in AI security.

The Gandalf Game: A Playground for Prompt Injection

The Gandalf game was designed as a text-based challenge where users were tasked with extracting a secret password from an AI. The game consisted of eight levels of increasing difficulty, each introducing new defenses against attacks. Users interacted with the AI using text prompts, attempting to bypass security measures and reveal the hidden password.

Figure: The interface of the Gandalf project

At its core, the Gandalf game was designed to test prompt injection attacks. These attacks involve users crafting specific prompts to manipulate the behavior of the LLM. Instead of directly exploiting coding vulnerabilities, users tried to "trick" the AI into disclosing the password by carefully wording their queries. This approach allowed the game to explore various adversarial techniques that take advantage of the LLM's language understanding capabilities.

Figure: The Gandalf game

The game attracted hundreds of thousands of players who submitted over 40 million prompts. This volume of real-world data provided a unique opportunity to study AI vulnerabilities in practice. Some users built AI agents to play the game, showing how advanced attack methods have become. About 10% of the prompts successfully tricked the AI, highlighting real-world security risks.

Data Collection and the Scale of the Challenge

The Gandalf project amassed a large dataset, collecting over 40 million prompts and guesses from over 1 million users. The data reflected a high level of diversity, with users using over 68 languages. This means prompt injection is not limited to English and can occur across different languages.

Figure: Gandalf data stats

This project faced significant technical challenges, such as server overloads and rate-limiting by the OpenAI API. The team had to implement load balancing across different OpenAI accounts to manage these issues. These technical difficulties highlight the demands of deploying a real-world AI security test.

The dataset included numerous attacks, such as:

Unicode Attacks: Use emojis and special characters to confuse the LLM.
Semantic Change Attacks: Shift the focus to another task, indirectly tricking the LLM into revealing the password.
Leet Speak: Replace letters with numbers and symbols to obscure intent.
Programming Techniques: Use technical jargon or code to mislead the LLM.
Indirect Attacks: Manipulate the LLM judge rather than targeting the LLM directly.
Direct Attacks: Simply request the password outright.

Figure: Attack examples

Analyzing Attack Methods Using Vector Embeddings

Embedding-based approaches were used to categorize and analyze the numerous attacks. Instead of manually reviewing each prompt, vector embeddings were used to group similar attacks and identify patterns.

Figure: Attack groups

Vector embeddings are numerical representations of data points in a high-dimensional space. In this space, similar data points are positioned close to one another, while dissimilar data points are located farther apart. Essentially, text prompts are converted into vector representations, where each number corresponds to a specific feature of the text. Several specific types of attacks, such as unicode attacks, leet speak, and semantic change attacks, can be identified using vector embeddings.

The use of vector embeddings allows for efficient similarity searches. A vector database, like Milvus, is built to manage large datasets of unstructured data, like text, via vector representations. This process facilitates quick and accurate retrieval of relevant information through efficient similarity search and analysis.

There are three main types of embeddings, each with advantages in understanding relationships and managing computational resources.

Dense embeddings: Represent data points with most non-zero elements, capturing finer details, but they are less storage efficient. Models like CLIP and BERT generate dense vector embeddings.
Sparse embeddings: There are high-dimensional vectors with mostly zero elements. The non-zero values indicate the importance of specific data points. This makes them memory efficient and suitable for high-dimensional sparse data, such as word frequencies.
Binary embeddings: These embeddings store information using only 1s and 0s. This makes storage and retrieval efficient, though with some loss of precision.

Defense Strategies: From Prompt Engineering to Text Classifiers

Various defense strategies are used against prompt injection attacks, ranging from simple prompt engineering techniques to complex methods like text classifiers. These strategies can be seen as different "levels" in the game, each designed to make it progressively harder for users to extract the secret password.

These strategies can be broadly categorized into three types: prompt engineering-based defenses, LLM-based defenses, and text classifier defenses.

Prompt Engineering-Based Defenses

Early game levels used basic prompt engineering techniques. These defenses were relatively simple and designed to be easily bypassed. For instance, the first level had no defenses, while later levels used simple filtering to block direct questions such as "What is the password?" However, these prompt-based defenses remain susceptible to prompt injection attacks.

LLM-Based Defense (LLM Judge)

The intermediate level introduced a more advanced approach using an LLM judge. In this setup, both the user's prompt and the model's response were passed to another LLM, which acted as a judge. The judge assessed the entire interaction and determined whether the user was attempting a prompt injection attack.

The LLM judge was meant to catch subtle manipulations but was still vulnerable to prompt injection attacks. Bypassing both the original model and the LLM judge showed that the defense system had weaknesses and wasn’t completely reliable.

Figure: LLM-based defense

Text Classifier Defense

The last levels used a different technology, a binary text classifier, to address the shortcomings of the LLM-based defense. This classifier acted as a “secure layer” between the user and the LLM.

The text classifier was not susceptible to prompt injection attacks. It was designed to analyze the user's prompt and the model's output. This determines whether an attack was being attempted without being influenced by the semantic content of the text.

Defense using text classifier

The Role of Vector Databases in AI Security

Vector databases play an important role in improving AI security by providing efficient storage, indexing, and retrieval of vector embeddings. These embeddings enable various security applications, such as analyzing attack patterns, detecting anomalies, and improving the performance of security models.

Efficient Storage and Retrieval of Embeddings: Vector databases such as Milvus and Zilliz Cloud (the managed version of Milvus) are designed to handle large volumes of vector embeddings. These databases use advanced indexing algorithms to enable fast similarity searches, which are essential for real-time security applications.

Quickly retrieving similar embeddings is important in security tasks such as identifying known attack patterns. Vector databases use techniques like Approximate Nearest Neighbor (ANN) search to perform these retrievals with low latency.
These databases also support different types of indexes, like FLAT, IVF_FLAT, HNSW, and SCANN, which optimize the search for different kinds of datasets and use cases.

Improving Security Model Performance: Vector databases can improve the performance of AI security models. Using vector embeddings of known attacks helps security systems identify and respond to threats more effectively. This is especially important when the nature of attacks is constantly changing.

The Gandalf project showed that using a text classifier as a defense mechanism can be a more robust approach. This method can be enhanced using vector embeddings of attack prompts rather than relying only on prompt engineering techniques.
Vector databases enable the use of hybrid models, combining different types of features or techniques to improve overall security.
Security models can incorporate a broader range of information by storing different types of embeddings (dense, sparse, binary) alongside other data.

Detecting Anomalies: Vector databases can also be used to identify anomalous behavior. Security teams can detect deviations from these patterns by creating embeddings of normal system activities. Anomalous behavior will often result in embeddings that are distant from the norm.

Detecting unusual or unexpected behavior is critical in AI systems for identifying security breaches or system malfunctions. Vector databases make it possible to monitor system activity and quickly identify potentially malicious actions continuously.
Milvus enhances APK security and enables real-time threat detection. Other companies use Milvus or Zilliz to achieve faster data sourcing and labeling and to improve query precision with hybrid search.

Conclusion

The Gandalf experiment, which challenged users to trick an AI into revealing a secret password, showed weaknesses in language models (LLMs). The experiment showed how easily these models can be fooled with carefully written prompts.

It became clear that basic security measures, like simple prompt engineering, aren’t enough to stop these attacks. Even more advanced defenses, like using an LLM judge to monitor interactions, proved vulnerable.

The experiment also showed how useful vector databases can be for AI security. Vector embeddings help analyze attack patterns, detect unusual activity, and improve defenses. Overall, the experiment underlined the need for more intelligent, layered approaches to securing AI systems.

Further Resources

Updated on Mar 24, 2025

Haziqa Sajid
Digital Storytelling for Data, AI, B2B & SaaS

Content

Start Free, Scale Easily

Try the fully-managed vector database built for your GenAI applications.

Try Zilliz Cloud for Free

Share this article

Keep Reading

Zilliz Cloud Now Available in Azure North Europe: Bringing AI-Powered Vector Search Closer to European Customers

The addition of the Azure North Europe (Ireland) region further expands our global footprint to better serve our European customers.

Milvus WebUI: A Visual Management Tool for Your Vector Database

Milvus WebUI is a built-in GUI introduced in Milvus v2.5 for system observability. WebUI comes pre-installed with your Milvus instance and offers immediate access to critical system metrics and management features.

Top 10 AI Agents to Watch in 2025 🚀

AI agents are like having a supercharged assistant by your side—analyzing data, making decisions, and seamlessly integrating with tools and environments.