Setting Up With Facebook AI Similarity Search (FAISS)
Hello, tech enthusiasts! Today, we are embarking on a journey to explore a tool that has been making waves in artificial intelligence: Facebook’s AI Similarity Search (FAISS). If you’ve ever wondered how Spotify finds songs that sound eerily similar to your favorites or how Google Photos manages to group pictures of the same person, you’re about to uncover the mystery.
Similarity search, or nearest neighbor search, is a crucial aspect of many AI and machine learning applications. It’s all about finding the data points most similar to a given query point. For instance, when you search for a song on Spotify, the system needs to find the songs that are most “similar” to your search query—hence, a similarity search.
As useful as similarity search sounds, there’s a catch. Traditional similarity search methods can become dreadfully slow with large amounts of data. That’s where FAISS comes in. It’s a library for efficient similarity search developed by Facebook AI that provides reliable solutions to similarity search problems, especially when dealing with large-scale data.
But enough with the chitchat! This blog post will guide you through setting up FAISS, getting it up and running, and demonstrating its power through a sample similarity search program. So buckle your seatbelts because we’re about to dive deep into the fascinating world of efficient similarity search with FAISS. It’s going to be a fun ride!
Understanding FAISS (Facebook AI Similarity Search)
Now that we’ve whetted our appetites with a quick introduction, let’s delve deeper into FAISS. FAISS, or Facebook AI Similarity Search, is a library of algorithms for vector similarity search and clustering of dense vectors. It’s the brainchild of Facebook’s AI team, which designed it to handle large databases efficiently.
FAISS is built around the concept of “vector similarity.” In layperson’s terms, a vector is essentially a list of numbers, and similarity is about how alike two vectors are. Imagine you’re trying to find a song that matches the mood of your current favorite. Each song can be represented as a vector, with each element of the vector describing a different feature of the song. You can then measure how similar two songs are by computing the distance between their vectors in a high-dimensional space: the smaller the distance, the more similar the songs. Euclidean distance is one of the most common ways to measure this.
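To make the idea of distance concrete, here is a minimal sketch using NumPy. The three features (tempo, energy, acousticness) and all of the values are made up for illustration; they don't come from any real music dataset.

```python
import numpy as np

# Hypothetical song features: [tempo (scaled), energy, acousticness]
favorite = np.array([0.80, 0.90, 0.10], dtype="float32")
candidate_a = np.array([0.75, 0.85, 0.15], dtype="float32")
candidate_b = np.array([0.20, 0.30, 0.90], dtype="float32")

def euclidean_distance(u, v):
    """Return the L2 (Euclidean) distance between two vectors."""
    return float(np.linalg.norm(u - v))

dist_a = euclidean_distance(favorite, candidate_a)
dist_b = euclidean_distance(favorite, candidate_b)

# The smaller the distance, the more similar the songs
closest = "candidate_a" if dist_a < dist_b else "candidate_b"
print(f"distance to A: {dist_a:.3f}, distance to B: {dist_b:.3f}, closest: {closest}")
```

Here candidate A differs from the favorite by only 0.05 in each feature, so it ends up much closer than candidate B.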
Here’s where FAISS flexes its muscles. It provides a way to quickly and accurately compare millions, or even billions, of these vectors. It’s like having a supercharged search engine that can scan through an enormous music library in a blink, pinpointing the songs most similar to your favorite one. Indexed vectors are essential in this process, as they allow the system to efficiently search for the closest matches to a given query vector.
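To see what an index has to speed up, it can help to look at the naive approach first. The sketch below is a plain NumPy brute-force search that compares the query against every stored vector. The function name brute_force_search and the random data are ours, purely for illustration; this is the baseline idea, not how FAISS is implemented internally.

```python
import numpy as np

def brute_force_search(database, query, k):
    """Return (squared distances, row indexes) of the k rows nearest to query."""
    squared = np.sum((database - query) ** 2, axis=1)
    nearest = np.argsort(squared)[:k]
    return squared[nearest], nearest

rng = np.random.default_rng(seed=0)
database = rng.random((10_000, 64), dtype=np.float32)

# Use a slightly perturbed copy of row 42 as the query, so row 42 should rank first
query = database[42] + 0.001

distances, indexes = brute_force_search(database, query, k=3)
print(indexes, distances)
```

Every query touches all 10,000 rows here. That cost grows linearly with the size of the collection, which is exactly the problem that optimized libraries and approximate indexes are meant to address.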
But the magic of FAISS doesn’t stop at music recommendations. Many applications use FAISS, from image recognition and text retrieval to clustering and data analysis. Whenever you have a large amount of data and need to find similar items quickly, FAISS could be your go-to tool.
Setting up FAISS
This section will guide you on how to set up FAISS on a Linux system.
Installing Conda
Before installing FAISS, you need to have Conda installed on your system. Conda is an open-source package and environment management system that runs on Windows, macOS, and Linux.
Follow these steps to install Conda on a Linux system:
Download the Miniconda installer for Linux from the official website.
Verify your installer hashes.
Open a terminal window and run the following command to start the installation:
bash Miniconda3-latest-Linux-x86_64.sh
When the installer starts, it will ask you some questions. If you're not sure about something, go with the default options. You can always change things later.
Once you're done with the installation, close your terminal window and open it again. This will ensure that any changes you make are activated.
Now, you'll want to check if everything is installed properly. To do this, type conda list into your terminal window or Anaconda Prompt and hit enter. If everything works right, you'll see a list of installed packages.
Installing FAISS
You can install FAISS via Conda. The FAISS package has two versions: a CPU-only version (faiss-cpu) and a version that includes both CPU and GPU indices (faiss-gpu). Depending on your needs, you can install either of these versions.
The recommended way to install FAISS is through the PyTorch Conda channel. Here are the commands to install the latest stable release of FAISS.
For the CPU version:
conda install -c pytorch faiss-cpu
For the GPU version:
conda install -c pytorch faiss-gpu
In addition, FAISS is packaged by conda-forge, which is a community-driven packaging ecosystem for Conda. You can install FAISS from conda-forge using the following commands. For the CPU-only version:
conda install -c conda-forge faiss-cpu
For the GPU version:
conda install -c conda-forge faiss-gpu
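Once the install finishes, a quick sanity check from Python can confirm that the package is importable. This is a small sketch; the helper name faiss_status is ours, and it reads the version defensively in case the attribute is missing in your build.

```python
import importlib

def faiss_status():
    """Report whether the faiss module can be imported in this environment."""
    try:
        faiss = importlib.import_module("faiss")
    except ImportError:
        return "faiss is not installed in this environment"
    version = getattr(faiss, "__version__", "unknown")
    return f"faiss {version} imported successfully"

status = faiss_status()
print(status)
```

If you see the "not installed" message, make sure the Conda environment you installed FAISS into is the one that is currently active.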
You can check which channel your Conda packages come from by using the conda list command.
Sample code walkthrough using SQuAD
Let’s use the Stanford Question Answering Dataset (SQuAD) for this demonstration. SQuAD is a popular dataset for natural language processing (NLP) and a great way to illustrate how FAISS works. This dataset contains question-answer pairs, where the answer to each question is a segment of text, or “span,” from the corresponding reading passage. In the search process, query vectors are used to find the most relevant answers by comparing them against the dataset.
Before we dive into the code, let’s first download and prepare the SQuAD dataset:
Download the SQuAD dataset: You can download the dataset from the SQuAD website. For simplicity, we will use SQuAD 1.1. You can download the dataset using the following link: SQuAD 1.1 Train. Download and save the JSON file (train-v1.1.json) in your working directory.
Read the JSON file: Now you can use Python's json library to load the data:
import json
with open('train-v1.1.json', 'r') as file:
    squad_data = json.load(file)
FAISS indexes such as IndexFlatL2 operate on numerical vectors, so the text has to be converted into vectors before it can be indexed and queried.
Importing necessary libraries
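The first step is to import all the necessary libraries. We'll need numpy for numerical operations, faiss for our vector similarity search, json for loading our dataset, and nltk to tokenize our text. Note that word_tokenize relies on NLTK's tokenizer data, which you may need to download once with nltk.download('punkt') (recent NLTK releases may ask for 'punkt_tab' instead).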
import numpy as np
import faiss
import json
from nltk.tokenize import word_tokenize
Loading and preprocessing data
Let's load the SQuAD dataset. It's a JSON file, so we can use the JSON module's load function.
with open('train-v1.1.json', 'r') as file:
    squad_data = json.load(file)
The JSON object is a dictionary with a data key whose value is a list of dictionaries. Each dictionary in that list represents an article and has a paragraphs key holding a list of paragraphs. The text of each paragraph is stored under its context key.
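If you'd like to see that nesting without opening the full file, the sketch below walks a tiny hand-written stand-in with the same shape. The article title, texts, and question ID are invented for illustration, and the summarize helper is ours.

```python
# A tiny hand-written stand-in that mirrors the nesting of train-v1.1.json
sample = {
    "version": "1.1",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of France?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        }
                    ],
                },
                {"context": "France is a country in Europe.", "qas": []},
            ],
        }
    ],
}

def summarize(squad):
    """Count articles, paragraphs, and questions in a SQuAD-style dictionary."""
    articles = squad["data"]
    paragraphs = [p for a in articles for p in a["paragraphs"]]
    questions = [q for p in paragraphs for q in p["qas"]]
    return len(articles), len(paragraphs), len(questions)

counts = summarize(sample)
print(counts)
```

Running summarize(squad_data) on the real file gives you the same three counts for the full dataset.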
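Now, let's preprocess the data. We'll tokenize each paragraph using nltk's word_tokenize function, which splits text into individual words. We'll then represent each paragraph as a binary bag-of-words vector: a vector with one position per vocabulary word, set to 1 if the word occurs in the paragraph and 0 otherwise.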
# Collect every distinct token that appears in any paragraph
vocabulary = set(word for article in squad_data['data'] for paragraph in article['paragraphs'] for word in word_tokenize(paragraph['context']))
word_to_index = {word: index for index, word in enumerate(vocabulary)}

def convert_text_to_vector(text):
    words = word_tokenize(text)
    # float32 is the dtype FAISS expects and uses half the memory of float64
    bow_vector = np.zeros(len(vocabulary), dtype='float32')
    for word in words:
        if word in word_to_index:
            bow_vector[word_to_index[word]] = 1
    return bow_vector

# One dense vector per paragraph; on the full dataset this can need a lot of memory
paragraph_vectors = [convert_text_to_vector(paragraph['context']) for article in squad_data['data'] for paragraph in article['paragraphs']]
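To see what these vectors look like, here is the same idea on a toy corpus. As a simplification, this sketch uses a lowercase whitespace split instead of nltk's word_tokenize, and it sorts the vocabulary so that positions are reproducible. All names prefixed with toy_ are ours.

```python
import numpy as np

def tokenize(text):
    # Simplified stand-in for nltk's word_tokenize: lowercase and split on whitespace
    return text.lower().split()

toy_paragraphs = ["the cat sat", "the dog sat down"]

# Sort the vocabulary so that word positions are reproducible between runs
toy_vocabulary = sorted({word for text in toy_paragraphs for word in tokenize(text)})
toy_word_to_index = {word: i for i, word in enumerate(toy_vocabulary)}

def to_vector(text):
    """Binary bag-of-words: 1.0 where a vocabulary word occurs, 0.0 elsewhere."""
    vector = np.zeros(len(toy_vocabulary), dtype="float32")
    for word in tokenize(text):
        if word in toy_word_to_index:
            vector[toy_word_to_index[word]] = 1.0
    return vector

vec_known = to_vector("the cat sat")
vec_partial = to_vector("a dog")  # "a" is not in the vocabulary and is ignored

print(toy_vocabulary)  # ['cat', 'dog', 'down', 'sat', 'the']
print(vec_known)       # [1. 0. 0. 1. 1.]
print(vec_partial)     # [0. 1. 0. 0. 0.]
```

Notice that word order and word counts are lost, and words outside the vocabulary are silently dropped. That's fine for a demo, but it is one reason richer encodings tend to give better search results.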
Building the index
Now that we have our data in the right format, we can build our FAISS index. We'll use the IndexFlatL2 index type, a basic L2 distance index.
dimension = len(vocabulary)
index = faiss.IndexFlatL2(dimension)
# Convert our list of NumPy arrays to a single 2D array
paragraph_vectors = np.stack(paragraph_vectors).astype('float32')
index.add(paragraph_vectors)
The IndexFlatL2 type requires that we specify the dimension of our data. Since each of our vectors has one position per vocabulary word, the dimension is the size of our vocabulary.
We then add our data to the index using the add method, which requires a 2D NumPy array.
Performing a FAISS vector search
With our index all set up, we can now play detective and find paragraphs in our dataset that most closely match our search query.
Here's our search function:
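# Flat list of paragraph texts, in the same order as the vectors added to the index
paragraph_texts = [paragraph['context'] for article in squad_data['data'] for paragraph in article['paragraphs']]

def search_for_paragraphs(search_term, num_results):
    search_vector = convert_text_to_vector(search_term)
    search_vector = np.array([search_vector]).astype('float32')
    distances, indexes = index.search(search_vector, num_results)
    # Use a loop variable name that does not shadow the global FAISS index
    for i, (distance, paragraph_id) in enumerate(zip(distances[0], indexes[0])):
        print(f"Result {i+1}, Distance: {distance}")
        print(paragraph_texts[paragraph_id])
        print()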
Our search term is "What is the capital of France?" and we want to find 5 results:
search_term = "What is the capital of France?"
search_for_paragraphs(search_term, 5)
The search_for_paragraphs() function first turns our search term into an encoded vector. We then pass this vector to the search method of our index. The method needs a 2D array, so we add an extra dimension to our search vector.
The similarity search method also needs us to say how many results we want (that's what num_results is for). The search method gives us two 2D arrays: one for the distances of the nearest results and one for their indexes. We can use these indexes to find the actual paragraphs in our dataset. Then, we print out each result's ranking, similarity distance, and paragraph text.
And there you have it! This is a basic example of using FAISS to find similar text data. Of course, FAISS can do way more complex things, like approximate search over partitioned or compressed indexes that scale to very large collections. However, this example should give you a good starting point for using FAISS.
Best practices and tips
Got your data? Get to know it: Before you use FAISS, take a moment to get to know your data. Ask questions like these: Is it high-dimensional? Is it sparse (full of gaps) or dense (packed with information)? Knowing your data will help you pick the correct type of FAISS index and determine the best way to get your data ready.
Preprocessing is key: How you prep your data can significantly affect how well FAISS works. For text data, think about using smarter ways of turning words into numbers, like TF-IDF or Word2Vec, instead of just one-hot encoding. For pictures, try using features from a convolutional neural network (CNN).
Pick the best index for you: FAISS has various index types, each with special strengths. Some are great for dealing with data that has lots of dimensions, others are perfect for binary vectors, and some are made for handling big, big data. So make sure you choose the one that will work best for your needs.
Batch your queries: If you have many queries to run against your index, it's more efficient to batch them together and run them all at once. FAISS is optimized for batch processing.
Tune your parameters: FAISS has several tunable parameters, like the number of clusters in the indexing stage and the number of probes in the vector similarity search stage. Don't just stick with the defaults; experiment with different settings to see what works best for your data.
Vector databases vs FAISS
FAISS is a great solution for approximate nearest neighbor (ANN) search. It also provides supporting code for evaluation and parameter tuning related to similarity search and clustering of dense vectors. Still, it has some limitations when you need to store and retrieve tens of millions of vectors while also requiring real-time responses or more advanced vector query features.
Compared to FAISS, purpose-built vector databases like Milvus and Zilliz Cloud can address the challenges mentioned above and have more advanced capabilities in the following areas:
Basic functionalities such as CRUD support, data consistency, and filter search
System availability with strong data persistency and better disaster recovery
System scalability with load balancing support, a distributed architecture that separates computing and storage, and better usability
Role-based access control (RBAC) with multi-tenancy support, SDKs for various programming languages, a RESTful API, and a monitoring system
Milvus is the world’s first and most popular open-source vector database for billion-scale similarity search and AI applications. Milvus can store, index, and manage a billion+ vector embeddings generated by deep neural networks and other machine learning (ML) models. To make vector databases accessible to every developer and organization, Zilliz contributed Milvus to the LF AI & Data Foundation as an incubation-stage project, and it graduated in June 2021.
Milvus Lite is a lightweight version of Milvus that runs locally within your Python application. Based on the popular open-source Milvus vector database, Milvus Lite reuses the core components for vector indexing and query parsing while removing elements designed for high scalability in distributed systems. This design makes a compact and efficient solution ideal for environments with limited computing resources, such as laptops, Jupyter Notebooks, and mobile or edge devices.
Zilliz Cloud is a fully-managed vector database service built on Milvus. With Zilliz Cloud, vector retrieval is ten times faster, and deploying and scaling vector search applications is easier than ever. Zilliz Cloud also offers a free tier, giving every developer access to this cutting-edge technology without requiring any financial commitment.
Conclusion
And there you have it! Together, we've traversed the exciting world of Facebook AI Similarity Search, or FAISS. From understanding what it is and how it works, to setting it up on your system, to walking through some sample code with the SQuAD dataset and seeing how it differs from purpose-built vector databases, we've covered a lot of ground.
Remember, FAISS is an incredibly powerful tool designed to make searching through massive amounts of data not only possible but efficient. Its versatility in accommodating different data types and sizes is a testament to its design.
As you venture forth, armed with this knowledge, remember the best practices and tips we discussed. Understanding your data, choosing the right index, preprocessing your data effectively, batching your queries, and tuning your parameters—all these steps can significantly improve your results.
But don't stop here. Continue exploring, experimenting, and learning. Whether diving deeper into the different index types that FAISS offers, exploring more complex data preprocessing techniques, or experimenting with more sophisticated use cases, there's always more to learn.
This post is written by Keshav Malik, a highly skilled and enthusiastic security engineer. Keshav is passionate about automation, hacking, and exploring different tools and technologies. He loves finding innovative solutions to complex problems and is constantly seeking new opportunities to grow and improve as a professional. He is dedicated to staying ahead of the curve and is always looking for the latest and greatest tools and technologies.