TL;DR Milvus Regression in LangChain v0.1.5
PR submitted to fix Milvus integration with LangChain 0.1.5: https://github.com/langchain-ai/langchain/pull/17021
The Issue
You're encountering a "KeyError: 'pk'" error when using LangChain v0.1.5 to connect to Milvus. The error comes from a recent regression in the LangChain Milvus integration, which no longer generates "pk" (primary key) values automatically. Milvus requires a primary key for each document, but the documents produced by your LangChain text splitter do not include one, so the call to the Milvus insert() method fails with "KeyError: 'pk'".
Temporary Solution (Downgrade):
Downgrade to LangChain version <= v0.1.4 until the fix is officially merged.
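To check whether an environment falls in the affected range, a small script can compare the installed version against v0.1.5. This is a minimal sketch: the helper names (`parse_version`, `may_hit_pk_regression`, `installed_langchain_version`) are hypothetical, and the check only encodes the version boundary stated in this article, so it will over-report once a release containing the fix is available.

```python
import itertools
from importlib import metadata
from typing import Optional, Tuple

# First release this article reports as affected (assumption taken from the text).
FIRST_AFFECTED = (0, 1, 5)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.1.5' or '0.1.5rc1' into (0, 1, 5), keeping only leading digits."""
    numbers = []
    for piece in version.split(".")[:3]:
        leading = "".join(itertools.takewhile(str.isdigit, piece))
        numbers.append(int(leading) if leading else 0)
    return tuple(numbers)


def may_hit_pk_regression(version: str) -> bool:
    """True when the version is at or above the first affected release."""
    return parse_version(version) >= FIRST_AFFECTED


def installed_langchain_version() -> Optional[str]:
    """Return the installed langchain version, or None if it is not installed."""
    try:
        return metadata.version("langchain")
    except metadata.PackageNotFoundError:
        return None


version = installed_langchain_version()
if version is None:
    print("langchain is not installed")
elif may_hit_pk_regression(version):
    print(f"langchain {version}: may raise KeyError: 'pk' with Milvus; consider pinning <= 0.1.4")
else:
    print(f"langchain {version}: predates the regression")
```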
Permanent Solution (Fix):
Wait for the official fix, which will update the LangChain Milvus integration to handle cases where "pk" is not present during insert().
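One way to picture "handling a missing pk" is a helper that fills in a primary key column when the insert data lacks one. The sketch below is an illustration only, not the code from the linked PR: `ensure_primary_keys` is a hypothetical name, the data is assumed to be a dict of column lists, and string keys are assumed for the primary key type.

```python
import uuid
from typing import Any, Dict, List


def ensure_primary_keys(
    insert_dict: Dict[str, List[Any]],
    primary_field: str = "pk",
) -> Dict[str, List[Any]]:
    """Return a copy of insert_dict that has one primary key per row."""
    filled = dict(insert_dict)
    if primary_field in filled:
        # Caller supplied keys already; leave them untouched.
        return filled
    # Every column holds one value per row, so any column gives the row count.
    row_count = len(next(iter(insert_dict.values()), []))
    # Assumption: string primary keys; a real schema may require integers.
    filled[primary_field] = [uuid.uuid4().hex for _ in range(row_count)]
    return filled


rows = {"text": ["chunk one", "chunk two"], "vector": [[0.1, 0.2], [0.3, 0.4]]}
filled = ensure_primary_keys(rows)
print(sorted(filled))
```

The helper copies the dict rather than mutating it, so the caller's data stays unchanged if the insert later fails.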
Technical Details:
Milvus requires a primary key field for each document.
LangChain's Milvus vector store schema uses the field "pk" as the primary key.
LangChain's text_splitter.split_documents() does not add a "pk" field by default.
Using the default LangChain chains with LangChain v0.1.5 and Milvus raises "KeyError: 'pk'" during the call to the Milvus insert() method.
Downgrading to LangChain <= v0.1.4 temporarily avoids the conflict.
An upcoming fix for the LangChain v0.1.5 regression will address this issue permanently.
Background:
Recently, a regression was introduced in the Milvus integration shipped with LangChain v0.1.5, which added a new parameter, "auto_id", to the Milvus vector store with a default value of False. Because auto_id is False, the user must provide a primary key value on every call to the insert() method. In LangChain, the default schema defines "pk" as the primary key; however, documents generated by text_splitter.split_documents() lack the "pk" field needed for insertion.
Until the fix PR is merged, you can work around the problem by downgrading LangChain to <= v0.1.4.
The upcoming fix for the LangChain v0.1.5 regression will address this issue permanently by generating the "pk" field automatically when it is not found in the data to be inserted.
Until the fix, with LangChain >= 0.1.5 you may see errors such as:
File "/Library/Python/3.9/site-packages/langchain_community/vectorstores/milvus.py", line 586, in <listcomp>
    insert_list = [insert_dict[x][i:end] for x in self.fields]
KeyError: 'pk'
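The failing line in the traceback can be reproduced without Milvus at all. The sketch below uses stand-in values for `self.fields` and `insert_dict` (the field names and row data are assumptions for illustration) and runs the same list comprehension: the schema lists "pk", the column data does not contain it, and the lookup raises `KeyError`.

```python
from typing import Any, Dict, List

# Hypothetical schema fields and column data, standing in for self.fields and
# insert_dict in the traceback.
fields = ["pk", "text", "vector"]
insert_dict: Dict[str, List[Any]] = {
    "text": ["chunk one", "chunk two"],
    "vector": [[0.1, 0.2], [0.3, 0.4]],
}


def build_batch(columns, field_names, start, end):
    """Slice rows [start:end] from every schema field, as in the traceback."""
    return [columns[name][start:end] for name in field_names]


try:
    build_batch(insert_dict, fields, 0, 2)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))  # KeyError('pk')

# Once a "pk" column is present, the same comprehension succeeds.
insert_dict["pk"] = ["id-1", "id-2"]
batch = build_batch(insert_dict, fields, 0, 2)
print(batch[0])  # ['id-1', 'id-2']
```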
The code sample below demonstrates how LangChain interacts with Milvus to embed and save text documents:
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

# Load text data
loader = TextLoader("result.txt")
documents = loader.load()

# Split documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Create embeddings
embeddings = HuggingFaceEmbeddings()

# Connect to Milvus (set ZILLIZ_CLOUD_URI and ZILLIZ_CLOUD_API_KEY to your
# cluster endpoint and API key before running)
connection_args = {
    'uri': ZILLIZ_CLOUD_URI,
    'token': ZILLIZ_CLOUD_API_KEY,
}

# from_documents() is a class method: it embeds the chunks and inserts them
vector_db = Milvus.from_documents(
    docs,
    embedding=embeddings,
    collection_name='abc',
    connection_args=connection_args,
)

# LangChain v0.1.5 code will fail before reaching this step.
# Perform a similarity search
query = "test"
docs2 = vector_db.similarity_search(query)
Additional Notes:
LangChain's Milvus.from_documents() method does two things:
Calls an embedding method to convert text to embeddings.
Calls the Milvus collection.insert() method to insert data into Milvus, including the pk, vector, and metadata fields of the document.
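The two steps can be sketched with in-memory stand-ins. Nothing here calls Milvus or a real model: `toy_embed` and `ToyCollection` are hypothetical, and the column layout (pk, text, vector, one metadata field) is an assumption chosen to match the fields named above.

```python
from typing import Any, Dict, List


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in for a real embedding model: one 2-d vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


class ToyCollection:
    """In-memory stand-in for a collection that accepts column-wise inserts."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, Any]] = []

    def insert(self, columns: List[List[Any]], field_names: List[str]) -> int:
        # Transpose the columns into one dict per row.
        for values in zip(*columns):
            self.rows.append(dict(zip(field_names, values)))
        return len(self.rows)


texts = ["hello world", "milvus"]
metadatas = [{"source": "a.txt"}, {"source": "b.txt"}]

# Step 1: convert text to embeddings.
vectors = toy_embed(texts)

# Step 2: insert the pk, text, vector, and metadata columns together.
field_names = ["pk", "text", "vector", "source"]
columns = [
    ["id-0", "id-1"],
    texts,
    vectors,
    [m["source"] for m in metadatas],
]
collection = ToyCollection()
collection.insert(columns, field_names)
print(collection.rows[0])
```

Because every column must have one entry per row, a schema field with no matching column (such as a missing "pk") has nothing to contribute to the insert.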
Milvus.from_documents() is a class method defined on LangChain's VectorStore class: https://github.com/langchain-ai/langchain/blob/22d90800c86799a3a385b73ba09608c9b6565e0a/libs/core/langchain_core/vectorstores.py#L499
class VectorStore(ABC):
    @classmethod
    def from_documents(
        cls: Type[VST],
        documents: List[Document],
        embedding: Embeddings,
        **kwargs: Any,
    ) -> VST:
        ...  # body omitted
All the vector database adapters (including Milvus) are derived from this class and implement this method.
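The inheritance pattern can be illustrated with stand-in classes. In the linked source, from_documents() appears to unpack each document's text and metadata and hand them to from_texts(); the sketch mirrors that shape but is not LangChain code. `Doc`, `BaseStore`, and `ListStore` are hypothetical names, and the embedding is assumed to be a plain callable rather than an Embeddings object.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Doc:
    """Minimal stand-in for a document with text and metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class BaseStore(ABC):
    """Simplified stand-in for a vector store base class."""

    @classmethod
    @abstractmethod
    def from_texts(cls, texts, embedding, metadatas=None, **kwargs):
        """Each adapter builds and returns a populated store here."""

    @classmethod
    def from_documents(cls, documents, embedding, **kwargs):
        # Unpack the documents, then delegate to the adapter's from_texts().
        texts = [d.page_content for d in documents]
        metadatas = [d.metadata for d in documents]
        return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)


class ListStore(BaseStore):
    """Toy adapter that keeps (text, vector, metadata) tuples in a list."""

    def __init__(self):
        self.items = []

    @classmethod
    def from_texts(cls, texts, embedding, metadatas=None, **kwargs):
        store = cls()
        vectors = embedding(texts)
        metadatas = metadatas or [{} for _ in texts]
        store.items = list(zip(texts, vectors, metadatas))
        return store


store = ListStore.from_documents(
    [Doc("a", {"n": 1}), Doc("bb")],
    embedding=lambda ts: [[float(len(t))] for t in ts],
)
print(store.items)
```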
"Embeddings" is an abstract class that unifies the interfaces of embedding models: https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/embeddings.py
from abc import ABC, abstractmethod
from typing import List

class Embeddings(ABC):
    """Interface for embedding models."""

    @abstractmethod
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed search docs."""

    @abstractmethod
    def embed_query(self, text: str) -> List[float]:
        """Embed query text."""
If you want to wrap a custom embedding model, you can declare a class that implements those two methods.
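As an illustration, here is a minimal sketch of such a wrapper. `HashEmbeddings` is a hypothetical toy class that derives vectors from a SHA-256 digest, so it carries no semantic meaning and only demonstrates the two-method interface. The fallback to `object` is an assumption added so the sketch also runs where LangChain is not installed.

```python
import hashlib
from typing import List

try:
    from langchain_core.embeddings import Embeddings
except ImportError:
    # Fallback so the sketch runs without LangChain installed.
    Embeddings = object


class HashEmbeddings(Embeddings):
    """Toy deterministic embedding: maps text to `dim` floats in [0, 1].

    `dim` must be at most 32, the length of a SHA-256 digest in bytes.
    """

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i] / 255.0 for i in range(self.dim)]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a batch of document texts."""
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        """Embed a single query string the same way as documents."""
        return self._embed(text)


embedder = HashEmbeddings(dim=4)
doc_vectors = embedder.embed_documents(["first chunk", "second chunk"])
query_vector = embedder.embed_query("first chunk")
print(len(doc_vectors), len(query_vector))
```

Using the same function for documents and queries keeps both in one vector space, which similarity search depends on.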