My name is Christy Bergman, and I recently joined Zilliz as a Developer Advocate based in San Francisco. I will be organizing, attending, and speaking at events, writing blogs and code tutorials, improving documentation, and generally helping people (like you hopefully!) learn how to use Milvus, the world’s most popular open-source vector database (measured by GitHub stars).
Before Zilliz, I was a Developer Advocate at Anyscale, the company behind open-source Ray, which makes distributed computing easy for all developers. Before that, I was an AWS AI/ML Specialist Solutions Architect focused on deep learning forecasting.
So why did I choose to join Zilliz? It began with my curiosity about improving ChatGPT chat outputs by injecting context into the prompt. I then started a quest to tinker with embeddings, and from there, I started looking for an easy-to-use, fast vector database. In this blog post, I'll take you through the steps that brought me here.
The doubt: do I really need a vector database?
At the outset, I questioned whether a vector database was even necessary. However, after a few attempts at embedding text into a Python variable, then pickling and unpickling it, I became convinced that embeddings should be cached or persisted somewhere better than a pickle file.
Milvus is a full-fledged database purpose-built from the ground up for vectors. Thousands of developers worldwide use Milvus, which supports many different vector search algorithms and vector-space distance metrics, plus features like insert/upsert, top-K and range search, GPU support, and more. The commercial version, Zilliz Cloud, offers a 99.9% availability SLA and auto-scaling, and runs on AWS and GCP (Azure coming soon). You can also run it in your own cloud account with full SOC2 compliance.
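To make the pickle approach concrete, here is a minimal sketch using only the Python standard library. The tiny three-dimensional vectors and the names `embeddings`, `brute_force_search`, and `cosine_similarity` are hypothetical stand-ins for illustration, not code from my experiments. It shows that pickle only persists the data: every query still has to load everything and score every vector.

```python
import math
import pickle

# Hypothetical toy embeddings; a real model produces hundreds of dimensions.
embeddings = {
    "review_1": [0.9, 0.1, 0.0],
    "review_2": [0.1, 0.9, 0.0],
    "review_3": [0.7, 0.3, 0.1],
}

# Persist and restore the whole dictionary with pickle.
blob = pickle.dumps(embeddings)
restored = pickle.loads(blob)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, store, top_k=2):
    """Score every stored vector against the query and return the best top_k keys."""
    scored = sorted(
        store.items(),
        key=lambda kv: cosine_similarity(query, kv[1]),
        reverse=True,
    )
    return [key for key, _ in scored[:top_k]]

nearest = brute_force_search([1.0, 0.0, 0.0], restored)
```

This works for a handful of vectors, but there is no index, no metadata filtering, and no incremental insert, which is roughly where my doubt started to fade.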
Exploring the options: trying different vector databases
To get embeddings into a vector database, you need: 1) unstructured data, 2) a plan for how to chunk that data, 3) an embedding model, and 4) a vector database.
For the data, I used the IMDB large movie review dataset from the Stanford AI Lab. It is a conveniently processed dataset of 50,000 reviews (sampled at a 50:50 ratio of positive to negative reviews). The data has three columns: movie_index, raw review text, and movie rating.
For the chunking, I chose to keep whole movie reviews intact, unless they were very long. For very long reviews, I used a typical default chunk length of 512, which means long reviews were split into several 512-character pieces. In LLM lingo, this length is called the “chunk size”.
For the embedding model, I tried various models, including the ubiquitous OpenAI embeddings, until I figured out that the easiest approach is to pick the smallest top-ranked model on the Retrieval tab of the MTEB HuggingFace leaderboard, which is an always-moving target. At the time, I settled on “e5-base-v2”. It is actually a good thing that the choice of embedding model is independent of the vector database itself.
For the databases, I tried FAISS, Qdrant, Chroma, Weaviate, Pinecone, and, of course, Milvus. I put them all through the same paces: 1) How easy and fast is it to load directly from pandas into the vector database? 2) How easy, fast, and tunable is it to get good query results from the vector database? 3) How easy and tunable is it to perform RAG using LangChain? (That last one is a future blog post!)
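The chunking rule described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the splitter I actually used: the function name `chunk_review` is hypothetical, and it cuts at fixed character offsets without any overlap or awareness of sentence boundaries.

```python
def chunk_review(text, chunk_size=512):
    """Keep short reviews whole; split long ones into chunk_size-character pieces."""
    if len(text) <= chunk_size:
        return [text]
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

short_chunks = chunk_review("Great movie.")
long_chunks = chunk_review("x" * 1200)
chunk_lengths = [len(chunk) for chunk in long_chunks]
```

A 1,200-character review comes back as three pieces, with the last one shorter than the rest. A library splitter typically adds overlap between neighboring chunks and tries to break on separators, so that a sentence is less likely to be cut in half.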
Milvus stands out
Among all the databases I explored, Milvus caught my attention for several reasons. First, it offered a user-friendly experience, especially when inserting data directly from pandas with metadata, since this is the first approach a Data Scientist might try.
Milvus also caught my attention because of its speed in loading vectors and querying. Milvus was noticeably zippier than the other approaches, some of which were slow even at this small scale. I also took note of the features, and chose the same default settings I used for the other databases (IVF-flat exhaustive search with the L2 metric), even though Milvus supports other choices.
Below is a gist for inserting a pandas data frame into Milvus. The complete code is on my GitHub.
# Complete code is at: https://github.com/christy/ZillizDemos/blob/main/milvus_onboarding/hello_world_milvus.ipynb
###########
# 1. Download to use locally, a sentence transformer from HuggingFace Hub.
###########
import os
from dotenv import load_dotenv, find_dotenv
from huggingface_hub import login
_ = load_dotenv(find_dotenv())
hub_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
login(token=hub_token);
from sentence_transformers import SentenceTransformer
model_name = "BAAI/bge-base-en-v1.5"
retriever = SentenceTransformer(model_name, device="cuda")
# Get the model parameters and save for later.
MAX_SEQ_LENGTH = retriever.get_max_seq_length()
HF_EOS_TOKEN_LENGTH = 1
EMBEDDING_LENGTH = retriever.get_sentence_embedding_dimension()
###########
# 2. Choose a chunking function from LangChain for convenience.
###########
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
chunk_size = MAX_SEQ_LENGTH - HF_EOS_TOKEN_LENGTH
# chunk_overlap must be an integer number of characters.
chunk_overlap = int(np.round(chunk_size * 0.10, 0))
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len,
)
def chunk_text(text):
    chunks = text_splitter.split_text(text)
    # Drop any empty chunks.
    return [chunk for chunk in chunks if chunk]
###########
# 3. Manipulate a pandas dataframe.
# Download: https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
###########
# 1. Batch of data from pandas DataFrame.
batch = df.head(100).copy()
# 2. Change primary key type to string.
batch["movie_index"] = batch["movie_index"].apply(lambda x: str(x))
# 3. Split long reviews into chunks, then make one row per chunk.
batch['chunk'] = batch['text'].apply(chunk_text)
batch = batch.explode('chunk', ignore_index=True)
# 4. Add embeddings as new column in df.
import torch
import torch.nn.functional as F
review_embeddings = torch.tensor(retriever.encode(batch['chunk'].tolist()))
# Normalize embeddings to unit length.
review_embeddings = F.normalize(review_embeddings, p=2, dim=1)
# 5. Convert embeddings to list of `numpy.ndarray`, each containing `numpy.float32` numbers.
converted_values = list(map(np.float32, review_embeddings))
batch['embeddings'] = converted_values
# 6. Reorder columns for convenience, so index first, labels at end.
# Keep only the columns defined in the Milvus schema; the raw `text` column is not stored.
new_order = ["movie_index", "chunk", "embeddings", "label_int", "label"]
batch = batch[new_order]
###########
# 4. Install and import milvus.
###########
!pip install milvus pymilvus
import milvus, pymilvus
from milvus import default_server
from pymilvus import (
    connections, FieldSchema, CollectionSchema, DataType, Collection,
)
###########
# 5. Start a local Milvus server (lite) and connect to it.
###########
default_server.start()
connections.connect(host='127.0.0.1',
                    port=default_server.listen_port,
                    show_startup_banner=True)
###########
# 6. Define schema for data loading into Milvus.
###########
# Set the Milvus collection name.
COLLECTION_NAME = "movies"
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="movie_index", dtype=DataType.VARCHAR, max_length=8),
    FieldSchema(name="chunk", dtype=DataType.VARCHAR, max_length=MAX_SEQ_LENGTH),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=EMBEDDING_LENGTH),
    FieldSchema(name="label_int", dtype=DataType.INT64),
    FieldSchema(name="label", dtype=DataType.VARCHAR, max_length=8),
]
schema = CollectionSchema(fields, "Search imdb movie reviews")
# Creating the collection needs an open connection, so the server is started first.
mc = Collection(COLLECTION_NAME, schema, consistency_level="Eventually")
###########
# 7. Insert data into Milvus.
###########
# Showing how to change the index parameters, instead of using defaults.
index_params = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {
        "M": 16,               # int, 4~64. Max number of graph connections per node.
        "efConstruction": 32,  # int, 8~512. Candidate list size while building the index.
    },
}
mc.create_index("embeddings", index_params)
insert_result = mc.insert(batch)
# After final entity is inserted, call flush to stop growing segments left in memory.
mc.flush()
###########
# 8. Query the database.
###########
# Before conducting a search or a query, you need to load the data into memory.
mc.load()
# Define a sample question about your data.
query = "I'm a medical doctor, what movie should I watch?"
# Embed the query using same embedding model used to create the Milvus collection.
query_embeddings = torch.tensor(retriever.encode([query]))
# Normalize embeddings to unit length.
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
# Convert the embeddings to list of list of np.float32.
query_embeddings = list(map(np.float32, query_embeddings))
# Execute a vector search with HNSW index.
TOP_K = 3
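# M is a build-time index parameter; only ef applies at search time.
search_params = {
    "metric_type": "COSINE",
    "params": {"ef": 32},  # Search-time candidate list size; must be >= TOP_K.
}
results = mc.search(
    data=query_embeddings,
    anns_field="embeddings",
    param=search_params,
    output_fields=["movie_index", "chunk", "label"],
    limit=TOP_K,
    consistency_level="Eventually",
)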
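The listing above normalizes every embedding to unit length before indexing and searching with the COSINE metric. Here is a small standard-library sketch of what that normalization buys you: once vectors have unit length, cosine similarity is just a dot product. The helper names and the two toy vectors are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector so that its L2 norm is 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# After normalization, the plain dot product equals the cosine similarity.
similarity_raw = cosine_similarity(a, b)
similarity_unit = dot(a_unit, b_unit)
```

Both values come out to 0.96 (up to floating-point rounding), which is why normalized embeddings can be compared with either a cosine or an inner-product metric and give the same ranking.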
After my (quick, unofficial) testing, I filled out an online form to apply for a Developer Advocate opening at Zilliz. After I had my last interview, the Zilliz team was quick to make me an offer. So, without hesitation, I accepted!
I have fabulous co-workers. See Yujian Tang’s and Frank Liu’s blogs for why they joined Zilliz. My boss, Chris Churilo, is the secret glue that helps us all work together so efficiently. She uses her knowledge and experience to make us all shine. My colleague Filip Haltmayer has been exceptionally patient and good at explaining all things Milvus and Zilliz to me during my onboarding. Our CEO, Charles Xie, had the original vision in 2017 to build a database specifically for unstructured data.
What’s next
As I get started doing evangelism work for Milvus, I look forward to getting to know all of you in the AI developer community! I’ll be organizing the Unstructured Data Meetup in San Francisco, where I always look for great speakers, venues, and co-hosts (please hit me up!). I’ll be blogging and speaking more (please reach out!). If you want to try open-source Milvus in your unstructured data project, please let me know how I can help!
I’m involved with several data-science-for-good organizations, including DataKind, Women in Data Science, and Code for America’s San Francisco Civic Hack Nights (which introduced me to my local Sonoma Safe Agriculture org, where I created an interactive pesticides map). Please join me as I bring my new vector database learnings to the greater data-science-for-good community, too.