What is Pymilvus?
Introduction
Pymilvus is the Python SDK built for Milvus and Zilliz Cloud. It is a gRPC-based client that uses the common Milvus protobuf shared across all SDKs. For both new and advanced users, it provides access to all features offered by Milvus and is one of our most popular SDKs.
Problems We Were Finding
The key idea behind the Milvus vector database is to give users as many levers as possible to fine-tune the system for their specific use case. However, approximate search is, at its core, approximate; there is no one-size-fits-all algorithm. Each algorithm strikes its own balance between compute, speed, and recall/accuracy, and each can be configured to shift that balance further. Take HNSW vs. IVF-PQ: IVF-PQ is optimized for compute and speed, significantly reducing the memory burden and search times at the cost of significantly reduced recall. HNSW is the opposite; it spends more memory and compute to gain speed and recall. The interesting thing is that their strengths can even be swapped through configuration.

All this configuration is just at the index level. At the cluster level, there are further trade-offs between cost and speed that are unique to each user. Some users might not care about high availability and don't want replication. Some might accept weaker consistency in exchange for speed. Some might want to attach a TTL to their data to reduce storage costs, while others want every operation backed up. This customizability is excellent for large users, where eking out the right balance pays off on billion-scale datasets, but for smaller users, all these levers just add confusion.
On top of these levers, Zilliz Cloud and Milvus are not currently interchangeable because of differences in index types and connection parameters.
What is MilvusClient
MilvusClient is an attempt to simplify the API for the majority of users. Many users do not want to deal with connections, schemas, indexing, loading, parameters, etc. MilvusClient hides all of these details by wrapping the Pymilvus SDK in a simple API that is identical for both Milvus and Zilliz Cloud. At the moment, it offers:
- insert_data()
- upsert_data()
- search_data()
- query_data()
- get_vectors_by_pk()
- delete_by_pk()
- add_partition()
- remove_partition()
These were chosen as the key functions that any basic user would need. Pymilvus offers all of these operations today, but each requires a lot of extra work to use correctly.
insert_data:
Using MilvusClient, there's no need to create a schema for a collection. Instead, the schema is autogenerated from the inserted data: the client determines the required FieldSchema for each field and how to organize them. Many users found creating this schema a pain point, which is why we now do it behind the scenes. Once dynamic schema is supported, we can change the implementation without requiring any changes from the user.
def _infer_fields(self, data):
    """Infer all the fields based on the input data."""
    # TODO: Assuming ordered dict for 3.7
    fields = {}
    # Figure out the datatype of each input field.
    for key, value in data.items():
        # Infer the corresponding datatype of the metadata
        dtype = infer_dtype_bydata(value)
        # Datatype isn't compatible
        if dtype in (DataType.UNKNOWN, DataType.NONE):
            logger.error(
                "Failed to parse schema for collection %s, unrecognized dtype for key: %s",
                self.collection_name,
                key,
            )
            raise ValueError(f"Unrecognized datatype for {key}.")
        # Create an entry under the field name
        fields[key] = {"name": key, "dtype": dtype}
        # Attach extra kwargs required by certain datatypes
        if dtype == DataType.VARCHAR:
            fields[key]["max_length"] = 65_535
    return fields
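To illustrate what this inference produces, here is a minimal, self-contained sketch of the same idea. The `DataType` enum and `infer_dtype_bydata` below are simplified stand-ins for the real pymilvus versions, which cover many more types:

```python
from enum import Enum

# Simplified stand-in for pymilvus' DataType enum (illustrative only).
class DataType(Enum):
    INT64 = 5
    FLOAT = 10
    VARCHAR = 21
    FLOAT_VECTOR = 101
    UNKNOWN = 999

def infer_dtype_bydata(value):
    """Rough sketch of dtype inference from a Python value."""
    if isinstance(value, bool):
        return DataType.UNKNOWN  # kept minimal for this sketch
    if isinstance(value, int):
        return DataType.INT64
    if isinstance(value, float):
        return DataType.FLOAT
    if isinstance(value, str):
        return DataType.VARCHAR
    if isinstance(value, list) and all(isinstance(v, float) for v in value):
        return DataType.FLOAT_VECTOR
    return DataType.UNKNOWN

# A single row of user data is enough to derive every field's type.
row = {"id": 1, "text": "hello", "embedding": [0.1, 0.2, 0.3]}
fields = {k: infer_dtype_bydata(v) for k, v in row.items()}
```

With this in place, the user never has to spell out a FieldSchema by hand; the first inserted row carries all the information needed.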
For MilvusClient, we wanted to stick to a data format similar to other projects in the space: a list of dictionaries. This format is easy to understand and work with, and is equivalent to the Documents found in LlamaIndex and LangChain. However, it is incompatible with pymilvus, whose insert takes columnar data as a list of lists. The columnar format is harder to work with: the lists must be ordered to match the exact order of the schema, which offers no flexibility, and the error messages for incorrectly formatted data are not very helpful.
for k in data:
    for key, value in k.items():
        if key in self.fields:
            insert_dict.setdefault(key, []).append(value)

for i in self.tqdm(range(0, len(data), batch_size), disable=not progress_bar):
    # Convert the dict of columns into a list-of-lists batch for insertion
    try:
        insert_batch = [
            insert_dict[key][i : i + batch_size]
            for key in self.fields
            if key != ignore_pk
        ]
    except KeyError as ex:
        raise ValueError(f"Missing value for field {ex}.") from ex
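Stripped of the client's internals, the row-to-columnar conversion looks like this. The helper below is illustrative, not part of the real client; `field_order` stands in for the schema's field ordering and `ignore_pk` for an auto-generated primary key that should be skipped:

```python
def rows_to_columns(rows, field_order, ignore_pk=None):
    """Convert a list of row dicts into pymilvus-style columnar lists.

    field_order fixes the column order the schema expects; ignore_pk
    skips an auto-generated primary key column. (Illustrative helper.)
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            if key in field_order:
                columns.setdefault(key, []).append(value)
    # Emit columns in schema order, dropping the ignored primary key.
    return [columns[key] for key in field_order if key != ignore_pk]

rows = [
    {"id": 1, "vec": [0.1, 0.2]},
    {"id": 2, "vec": [0.3, 0.4]},
]
batch = rows_to_columns(rows, ["id", "vec"])
# batch == [[1, 2], [[0.1, 0.2], [0.3, 0.4]]]
```

The user only ever sees the list-of-dicts side of this conversion; the ordering constraints of the columnar format stay hidden.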
With all the changes coming in the future for schemas, JSON support, etc., having a wrapper around a simple insert allows us to do the heavy lifting while leaving a simple API for the user.
upsert_data:
As of version 2.2, upsert does not exist in Milvus, so performing an upsert requires a delete followed by an insert. Because many users are looking for this feature, we decided to include it in the client. Once the feature lands in pymilvus, we will be able to switch to it without requiring changes to the user's code.
pks = [x[self.pk_field] for x in data]
self.delete_by_pk(pks, timeout)
ret = self.insert_data(
data=data,
timeout=timeout,
batch_size=batch_size,
partition=partition,
progress_bar=progress_bar,
)
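The delete-then-insert pattern can be modeled end to end with a tiny in-memory store. `TinyStore` below is purely illustrative (not the real client, which talks to a Milvus server), but its `upsert_data` has the same shape as the snippet above:

```python
class TinyStore:
    """In-memory stand-in for a collection keyed by primary key,
    used to illustrate the delete -> insert upsert emulation."""

    def __init__(self, pk_field="id"):
        self.pk_field = pk_field
        self.rows = {}

    def delete_by_pk(self, pks):
        for pk in pks:
            self.rows.pop(pk, None)

    def insert_data(self, data):
        for row in data:
            self.rows[row[self.pk_field]] = row

    def upsert_data(self, data):
        # Same shape as the client: delete existing pks, then insert.
        self.delete_by_pk([row[self.pk_field] for row in data])
        self.insert_data(data)

store = TinyStore()
store.insert_data([{"id": 1, "text": "old"}])
store.upsert_data([{"id": 1, "text": "new"}, {"id": 2, "text": "added"}])
```

Existing rows are replaced and new rows are created in one call, which is exactly the behavior users expect from an upsert.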
search_data:
The key modifications to the search command were default search params and converting the output into a list of dictionaries.
ret = []
for hits in res:
    query_result = []
    for hit in hits:
        ret_dict = {x: hit.entity.get(x) for x in return_fields}
        query_result.append({"score": hit.score, "data": ret_dict})
    ret.append(query_result)
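To see the reshaping in isolation, here is a runnable sketch with a minimal mock of a search hit. The `Hit` class below only mimics the `score`/`entity.get()` surface of pymilvus' result objects; it is not the real type:

```python
from dataclasses import dataclass, field

@dataclass
class Hit:
    """Minimal mock of a search hit (illustrative only)."""
    score: float
    fields: dict = field(default_factory=dict)

    @property
    def entity(self):
        return self

    def get(self, key):
        return self.fields.get(key)

def format_results(res, return_fields):
    """Flatten nested hit objects into plain list-of-dict results."""
    ret = []
    for hits in res:  # one inner list per query vector
        query_result = []
        for hit in hits:
            ret_dict = {x: hit.entity.get(x) for x in return_fields}
            query_result.append({"score": hit.score, "data": ret_dict})
        ret.append(query_result)
    return ret

res = [[Hit(0.12, {"title": "a"}), Hit(0.34, {"title": "b"})]]
out = format_results(res, ["title"])
```

The caller gets back ordinary dictionaries instead of SDK-specific hit objects, which keeps downstream code free of pymilvus types.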
query_data:
The key modification to the query command was converting the output into a list of dictionaries.
get_vectors_by_pk:
Extracting vectors in pymilvus is done through a query, and what many users do not realize is that the query must filter on the primary key. In addition, with a varchar primary key, the values in the query expression must be wrapped in quotation marks, which is tedious and easy to get wrong.
# Varchar pks need double quotes around the values
if self.fields[self.pk_field] == DataType.VARCHAR:
    ids = ['"' + str(entry) + '"' for entry in pks]
    expr = f"""{self.pk_field} in [{','.join(ids)}]"""
else:
    ids = [str(entry) for entry in pks]
    expr = f"{self.pk_field} in [{','.join(ids)}]"
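The same expression-building logic can be pulled out into a standalone helper. `build_pk_expr` is our own illustrative name, not a pymilvus function; it just shows the two quoting branches:

```python
def build_pk_expr(pk_field, pks, is_varchar):
    """Build a boolean expression matching a list of primary keys.

    Varchar keys must be wrapped in double quotes inside the
    expression; numeric keys are interpolated directly.
    (Illustrative helper, not part of the real client.)
    """
    if is_varchar:
        ids = ['"' + str(pk) + '"' for pk in pks]
    else:
        ids = [str(pk) for pk in pks]
    return f"{pk_field} in [{','.join(ids)}]"

expr_int = build_pk_expr("id", [1, 2, 3], is_varchar=False)
# 'id in [1,2,3]'
expr_str = build_pk_expr("doc_id", ["a", "b"], is_varchar=True)
# 'doc_id in ["a","b"]'
```

Generating the expression from the schema's primary-key type means users never have to hand-escape quotation marks themselves.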
delete_by_pk:
Similar to get_vectors_by_pk.
add_partition and remove_partition:
For the partition logic, users previously needed to know that changing partitions requires releasing and reloading the collection. This is now handled behind the scenes.
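The release-modify-reload pattern can be sketched with injected callables standing in for the pymilvus calls (`Collection.release()`, `Collection.create_partition()`, `Collection.load()`); the function name and argument names below are our own, and the real client also has to track whether the collection was loaded to begin with:

```python
def with_collection_released(release, load, modify):
    """Run a partition-modifying operation between a release and a
    reload. The callables stand in for Collection.release(),
    Collection.load(), and e.g. Collection.create_partition()
    (sketch only)."""
    release()
    try:
        modify()
    finally:
        # Reload the collection even if the modification fails.
        load()

log = []
with_collection_released(
    release=lambda: log.append("release"),
    load=lambda: log.append("load"),
    modify=lambda: log.append("create_partition"),
)
```

Wrapping the operation this way guarantees the collection is always loaded again afterward, so a failed partition change never leaves it unqueryable.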
Conclusion:
Overall, the main goal of this client is to add easy-to-use operations that don't exist or are unoptimized on the pymilvus side. As pymilvus improves, we will be able to optimize these operations behind the scenes and keep a simple-to-use API.