TL;DR Milvus Regression in LangChain v0.1.5

Feb 12, 2024 · 3 min read

PR submitted to fix Milvus integration with LangChain 0.1.5: https://github.com/langchain-ai/langchain/pull/17021

The Issue

You may encounter a KeyError: 'pk' when using LangChain v0.1.5 to connect to Milvus. The error comes from a recent regression in the Milvus integration shipped with LangChain v0.1.5, which no longer generates "pk" (primary key) values automatically. Milvus requires a primary key for every inserted document, but the documents produced by a LangChain text splitter do not include one, so the integration fails with KeyError: 'pk' while preparing the Milvus insert() call.

Temporary Solution (Downgrade):

Downgrade to LangChain <= v0.1.4 until the fix is officially merged and released.
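Before downgrading, it can help to confirm which version is actually installed. The snippet below is a minimal sketch: the helper names (parse_version, may_be_affected, installed_langchain_version) are hypothetical, and the only fact it encodes is this article's statement that v0.1.5 and later are affected until the fix ships. The first fixed release is not known here, so no upper bound is checked. Since the traceback in this article points into langchain_community, checking that package's version the same way may also be worthwhile.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# First affected release according to this article. The first fixed release
# is not known at the time of writing, so there is no upper bound.
FIRST_AFFECTED = (0, 1, 5)


def parse_version(raw: str) -> Tuple[int, ...]:
    """Turn '0.1.5' (or '0.1.5rc1') into (0, 1, 5), ignoring any suffix."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", raw)
    if match is None:
        raise ValueError(f"unrecognized version string: {raw!r}")
    return tuple(int(part) for part in match.groups())


def may_be_affected(raw_version: str) -> bool:
    """True when the version is at or above the first affected release."""
    return parse_version(raw_version) >= FIRST_AFFECTED


def installed_langchain_version() -> Optional[str]:
    """Return the installed langchain version, or None if it is not installed."""
    try:
        return version("langchain")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_langchain_version()
    if found is None:
        print("langchain is not installed")
    elif may_be_affected(found):
        print(f"langchain {found}: may hit KeyError 'pk'; consider pinning <= 0.1.4")
    else:
        print(f"langchain {found}: predates the regression")
```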

Permanent Solution (Fix):

Wait for the official fix, which will update the LangChain Milvus integration to handle cases where "pk" is not present during insert().

Technical Details:

Milvus requires a primary key field for each document.
The schema used by LangChain's Milvus vector store defines the field "pk" as the primary key.
LangChain’s text_splitter.split_documents() does not add a "pk" field by default.
Using the default LangChain chains with LangChain v0.1.5 and Milvus raises KeyError: 'pk' while the integration prepares its call to the Milvus insert() method.
Downgrading to LangChain <= v0.1.4 temporarily avoids the conflict.
A pending fix for the v0.1.5 regression (the PR linked above) will address this issue permanently.

Background:

LangChain v0.1.5 shipped a regression in its Milvus integration: the release added a new "auto_id" parameter to the Milvus vector store, with a default value of False. With auto_id set to False, the caller must provide a primary key value for every call to the insert() method. In LangChain, the default schema defines "pk" as the primary key; however, documents generated by text_splitter.split_documents() lack the "pk" field needed for insertion.

Until the fix PR is merged, you can work around the problem by downgrading LangChain to <= v0.1.4.

The pending fix for the v0.1.5 regression will address this issue permanently by generating the "pk" field automatically whenever it is missing from the data to be inserted.
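To make the intended behavior concrete, here is a minimal sketch of "generate pk when it is missing". It is not the code from the linked PR. The function name ensure_primary_keys is hypothetical, the column-oriented dictionary mirrors the insert_dict visible in the traceback in this article, and the choice of UUID strings as key values is an assumption for illustration; the real fix may choose a different key type or let Milvus assign ids.

```python
import uuid
from typing import Any, Dict, List


def ensure_primary_keys(
    insert_dict: Dict[str, List[Any]], pk_field: str = "pk"
) -> Dict[str, List[Any]]:
    """Return a copy of insert_dict that is guaranteed to contain pk_field.

    insert_dict maps a field name to a column of values, one value per row.
    Existing keys are left untouched; missing keys are filled with UUID
    strings (an assumption made for this sketch).
    """
    if pk_field in insert_dict:
        return dict(insert_dict)
    if not insert_dict:
        raise ValueError("insert_dict has no columns to size the key column from")
    row_count = len(next(iter(insert_dict.values())))
    filled = dict(insert_dict)
    filled[pk_field] = [uuid.uuid4().hex for _ in range(row_count)]
    return filled


# Two rows produced by a text splitter plus an embedding model: no "pk" column.
columns = {"text": ["chunk one", "chunk two"], "vector": [[0.1, 0.2], [0.3, 0.4]]}
patched = ensure_primary_keys(columns)
print(sorted(patched))  # the "pk" column is now present alongside text and vector
```

The helper copies the dictionary rather than mutating it, so calling code can still inspect the original columns.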

Until the fix lands, with LangChain >= 0.1.5 you may see errors such as:

File "/Library/Python/3.9/site-packages/langchain_community/vectorstores/milvus.py", line 586, in <listcomp>
    insert_list = [insert_dict[x][i:end] for x in self.fields]
KeyError: 'pk'
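The failing line can be reproduced without Milvus or LangChain installed. The sketch below wraps the list comprehension from the traceback in a hypothetical helper, build_insert_batches; the field names, sample values, and batch size are assumptions chosen for illustration. The schema's field list includes "pk", but the data to insert has no such column, so the dictionary lookup fails.

```python
from typing import Any, Dict, List


def build_insert_batches(
    fields: List[str], insert_dict: Dict[str, List[Any]], batch_size: int = 2
) -> List[List[List[Any]]]:
    """Slice each column named in `fields` into batches, as the traceback line does."""
    total = len(next(iter(insert_dict.values())))
    batches = []
    for i in range(0, total, batch_size):
        end = min(i + batch_size, total)
        # Same expression as in the traceback: fails if a schema field has no column.
        batches.append([insert_dict[x][i:end] for x in fields])
    return batches


# Schema fields include the primary key, but the split documents carry none.
fields = ["pk", "text", "vector"]
insert_dict = {
    "text": ["chunk one", "chunk two", "chunk three"],
    "vector": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
}

try:
    build_insert_batches(fields, insert_dict)
except KeyError as exc:
    print(f"KeyError: {exc}")  # prints: KeyError: 'pk'
```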

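The code sample below demonstrates how LangChain interacts with Milvus for embedding and saving text documents:

import os

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

# Zilliz Cloud credentials (reading them from environment variables is an assumption)
ZILLIZ_CLOUD_URI = os.environ["ZILLIZ_CLOUD_URI"]
ZILLIZ_CLOUD_API_KEY = os.environ["ZILLIZ_CLOUD_API_KEY"]

# Load text data
loader = TextLoader("result.txt")
documents = loader.load()

# Split documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Create embeddings
embeddings = HuggingFaceEmbeddings()

# Connect to Milvus, embed the chunks, and insert them.
# from_documents() is a classmethod, so no separate Milvus(...) instance is needed.
connection_args = {
    'uri': ZILLIZ_CLOUD_URI,
    'token': ZILLIZ_CLOUD_API_KEY
}
vector_db = Milvus.from_documents(
    docs,
    embedding=embeddings,
    collection_name='abc',
    connection_args=connection_args,
)

# LangChain v0.1.5 code will fail before reaching this step.
# Perform a similarity search
query = "test"
docs2 = vector_db.similarity_search(query)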

Additional Notes:

LangChain’s Milvus.from_documents() method does two things:

Calls an embedding method to convert each document's text into an embedding vector.
Calls the Milvus collection.insert() method to insert the data into Milvus, including each document's pk, vector, and metadata fields.

Milvus.from_documents() is a method inherited from LangChain's VectorStore class: https://github.com/langchain-ai/langchain/blob/22d90800c86799a3a385b73ba09608c9b6565e0a/libs/core/langchain_core/vectorstores.py#L499

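class VectorStore(ABC):
    @classmethod
    def from_documents(
        cls: Type[VST],
        documents: List[Document],
        embedding: Embeddings,
        **kwargs: Any,
    ) -> VST:
        """Return VectorStore initialized from documents and embeddings."""
        texts = [d.page_content for d in documents]
        metadatas = [d.metadata for d in documents]
        return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)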

All the vector database adapters (including Milvus) derive from this class. They inherit from_documents(), which delegates to each adapter's own from_texts() implementation.

Embeddings is an abstract class that unifies the interface of embedding models: https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/embeddings.py

class Embeddings(ABC):
    """Interface for embedding models."""

    @abstractmethod
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed search docs."""

    @abstractmethod
    def embed_query(self, text: str) -> List[float]:
        """Embed query text."""

If you want to wrap a custom embedding model, you can declare a class that implements those two methods.
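As a rough illustration, the class below implements those two methods with a toy, deterministic "embedding" derived from a SHA-256 hash. The class name HashEmbeddings and the hashing scheme are assumptions made for this sketch; the vectors carry no semantic meaning and only show the shape of the interface. To keep the example runnable without LangChain installed it does not inherit from anything; in a real project the class would subclass langchain_core.embeddings.Embeddings and call an actual model.

```python
import hashlib
from typing import List


class HashEmbeddings:
    """Toy embedding model exposing the two methods of the Embeddings interface."""

    def __init__(self, dim: int = 8) -> None:
        # A SHA-256 digest has 32 bytes, so that is the largest dimension available.
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        # Map the first `dim` digest bytes to floats in [0, 1].
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self.dim]]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed search docs."""
        return [self._embed(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        """Embed query text."""
        return self._embed(text)


embedder = HashEmbeddings(dim=8)
doc_vectors = embedder.embed_documents(["first chunk", "second chunk"])
query_vector = embedder.embed_query("first chunk")
print(len(doc_vectors), len(query_vector))  # prints: 2 8
```

Because both methods share one private helper, a query and a document with identical text map to the same vector, which is the property a similarity search relies on.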
