Experimenting with Different Chunking Strategies via LangChain for LLM Apps
Splitting documents, or "chunking," is among the most challenging problems in building retrieval augmented generation (RAG) applications. What is chunking? Chunking is the process of dividing and organizing information into manageable or meaningful groups that we can then feed into our language models. This sounds pretty simple, but the devil is in the details. Depending on what your text looks like, you’ll want to chunk it up differently before feeding it into your language models. In this tutorial, we look at how different chunking strategies affect the same piece of data. The code for this post can be found in this GitHub Repo on LLM Experimentation.
Why do we even need to split our documents?
In retrieval-augmented generation (RAG) applications, one of the challenges developers face is effectively feeding a long document into large language models (LLMs). As LLMs like GPT-4 continue to advance, their ability to generate high-quality, contextually relevant responses hinges on the quality and structure of the input data they receive. This is where the process of splitting and organizing information into manageable segments becomes indispensable.
Text splitting is not just a technical step; it’s a strategic process that directly influences the performance and reliability of LLM-powered applications. Without proper splitting, large documents can overwhelm LLMs, leading to inaccurate, incomplete, or irrelevant outputs. Developers often struggle with finding the optimal chunk strategy that balances the size and coherence of text segments, ensuring that each chunk retains enough context to be useful, while also being concise enough to be processed effectively by the model.
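To make this concrete, here is a minimal, library-free Python sketch of why chunk boundaries matter (the sample text is made up for illustration): a naive fixed-size split cuts ideas mid-sentence, while a boundary-aware split keeps each thought intact.
# A minimal, library-free illustration: the same text chunked two ways.
text = (
    "A Distinguished Engineer has achieved noteworthy technical accomplishments. "
    "They mentor other engineers and help set the company's technical direction."
)

# Naive fixed-size chunking: 40-character windows that ignore sentence boundaries.
naive_chunks = [text[i:i + 40] for i in range(0, len(text), 40)]

# Boundary-aware chunking: split on sentence ends so each chunk keeps a complete idea.
sentence_chunks = [s.strip() for s in text.split(". ") if s.strip()]

print(naive_chunks)     # chunks cut mid-word and mid-thought
print(sentence_chunks)  # chunks that each carry a whole sentence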
LangChain Text Splitter intro
LangChain is a large language model (LLM) orchestration framework. It also has built-in tools for splitting text and loading documents. Our tutorial involves minimal use of LLMs and mainly revolves around setting some chunk parameters. At a high level, we write a function that takes parameters, loads the doc, and chunks it; the function then prints out the retrieved chunks. To experiment, we run many chunk parameter combinations through it, in particular playing with chunk sizes.
LangChain code imports and setup
This first section focuses on imports and other setup tools. Perhaps the first thing you notice about the code below is that there are a BUNCH of imports. The more commonly used ones are os and dotenv, which simply handle your environment variables, so I won’t cover them. Let’s step through text splitting in LangChain with Python and the pymilvus client.
At the top, you’ll see our three imports for getting the doc in. First, there’s NotionDirectoryLoader, which loads a directory with markdown/Notion docs. Then, we have the Markdown Header and Recursive Character text splitters. These split the text within the markdown doc based on headers (the header splitter), or a set of pre-selected character breaks (the recursive splitter).
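To see how these two splitters interact, here is a minimal sketch using the same classes we import below; the sample markdown string is made up for illustration. The header splitter turns each "##" section into a document with the header stored as metadata, and the recursive splitter then breaks those sections into overlapping character chunks.
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

sample_md = """## Distinguished Engineer
Has achieved noteworthy technical accomplishments while working as an engineer.
## Principal Engineer
Sets the technical direction for a team or product area."""

# Split on level-two headers; each resulting document carries its header as metadata.
header_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("##", "Section")])
header_splits = header_splitter.split_text(sample_md)

# Break each header section into overlapping, fixed-size character chunks.
char_splitter = RecursiveCharacterTextSplitter(chunk_size=64, chunk_overlap=8)
chunks = char_splitter.split_documents(header_splits)

for chunk in chunks:
    print(chunk.metadata, "->", chunk.page_content[:60])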
Next, we’ve got the retriever imports. Milvus is our vector database, OpenAIEmbeddings is our embedding model, and OpenAI is our LLM. The SelfQueryRetriever is the LangChain native retriever that allows a vector database to “query itself”. I wrote more about using LangChain to query a vector database in this piece.
Our last LangChain import is AttributeInfo, which, as one may guess, passes attribute information into the self-query retriever. Lastly, I want to touch on the pymilvus imports. These are strictly for utility reasons; we don’t need them to work with a vector database in LangChain. I use these imports to clean up the database at the end.
The last thing we do before writing the function is load our environment variables and declare some constants. The headers_to_split_on variable is essential - it lists all the headers we expect to see and want to split on in the markdown. path just tells LangChain where to find the Notion docs.
import os
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import Milvus
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from pymilvus import connections, utility
from dotenv import load_dotenv
load_dotenv()
zilliz_uri = os.getenv("ZILLIZ_CLUSTER_01_URI")
zilliz_token = os.getenv("ZILLIZ_CLUSTER_01_TOKEN")
headers_to_split_on = [
    ("##", "Section"),
]
path = './notion_docs'
Building a chunking experimentation function
Building the experimentation function is the most critical part of the tutorial. As mentioned, this function takes some parameters for document ingestion and experimentation. We need to provide the path to the docs, the headers to split on (splitters), the chunk size, the chunk overlap, and whether or not we want to clean up by dropping the collection at the end. The collection drop defaults to True.
If we can avoid it, we want to create and drop collections as sparingly as possible because they incur overhead we could otherwise prevent. You may see the script change as I look for good workarounds.
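One possible workaround, sketched below under the assumption that you keep a stable collection name, is to check with pymilvus whether a collection already exists before recreating it. The collection name here is hypothetical and this check is not part of the tutorial's code.
from pymilvus import connections, utility

# Sketch of a reuse check (an assumption, not the tutorial's approach):
# only create a collection when it doesn't already exist.
connections.connect(uri=zilliz_uri, token=zilliz_token)

collection_name = "EngineeringNotionDoc_128_16"  # hypothetical collection name
if utility.has_collection(collection_name):
    print(f"Reusing existing collection: {collection_name}")
else:
    print(f"{collection_name} not found; Milvus.from_documents will create it")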
This function is very similar to the one we linked above about using Notion with LangChain. The first section loads the document from the path using the NotionDirectoryLoader. Notice that we only grab the page content of the first document (and we only have one page).
Next, we grab our splitters. First, we use the markdown splitter to split on the headers we passed in above. Then, we use our recursive splitter method and split based on the chunk size and overlap.
That’s all the splitting we need. With the splitting done, we give a collection name and initialize a LangChain Milvus instance using the default environment variables, OpenAI embeddings, splits, and the collection name. We also create a list of metadata fields via the AttributeInfo object to tell the self-query retriever we have “sections.”
With all this setup, we get our LLM and pass it into the self-query retriever. From there, the retriever does its magic when we ask it a question about our docs. I’ve also set it up to tell us which chunking strategy we are testing. Finally, we can drop the collection if we’d like.
def test_langchain_chunking(docs_path, splitters, chunk_size, chunk_overlap, drop_collection=True):
    path = docs_path
    loader = NotionDirectoryLoader(path)
    docs = loader.load()
    md_file = docs[0].page_content
    # Let's create groups based on the section headers in our page
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=splitters)
    md_header_splits = markdown_splitter.split_text(md_file)
    # Define our text splitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    all_splits = text_splitter.split_documents(md_header_splits)
    test_collection_name = f"EngineeringNotionDoc_{chunk_size}_{chunk_overlap}"
    vectordb = Milvus.from_documents(documents=all_splits,
                                     embedding=OpenAIEmbeddings(),
                                     connection_args={"uri": zilliz_uri,
                                                      "token": zilliz_token},
                                     collection_name=test_collection_name)
    metadata_fields_info = [
        AttributeInfo(
            name="Section",
            description="Part of the document that the text comes from",
            type="string or list[string]"
        ),
    ]
    document_content_description = "Major sections of the document"
    llm = OpenAI(temperature=0)
    retriever = SelfQueryRetriever.from_llm(llm, vectordb, document_content_description, metadata_fields_info, verbose=True)
    res = retriever.get_relevant_documents("What makes a distinguished engineer?")
    print(f"""Responses from chunking strategy:
{chunk_size}, {chunk_overlap}""")
    for doc in res:
        print(doc)
    # This is just for rough cleanup; we can improve this.
    # There are lots of user considerations to understand for real experimentation use cases, though.
    if drop_collection:
        connections.connect(uri=zilliz_uri, token=zilliz_token)
        utility.drop_collection(test_collection_name)
LangChain tests and results
Alright, now comes the exciting part! Let’s look at the tests and results.
Code to test LangChain chunks
This brief block of code below is how we can run our function for experimentation. I’ve added five experiments. This tutorial tests chunk strategies from 32 to 512 in length by powers of 2 with overlaps running from 4 to 64 also by powers of 2. To test, we loop through the list of tuples and call the function we wrote above.
chunking_tests = [(32, 4), (64, 8), (128, 16), (256, 32), (512, 64)]
for test in chunking_tests:
    test_langchain_chunking(path, headers_to_split_on, test[0], test[1])
Here’s what the entire output looks like. Now let’s take a peek into individual outputs. Remember that our chosen example question is: "What makes a distinguished engineer?"
Length 32, overlap 4
Okay, so from this we can clearly see that 32 is too short. The returned chunk is entirely useless; “Is a Distinguished Engineer” is the most circular reasoning possible.
Length 64, overlap 8
A length of 64 with an overlap of 8 isn’t much better at first glance. It does give us an example of a distinguished engineer, though: Werner Vogels, CTO of Amazon.
Length 128, overlap 16
At 128, we start to see more complete sentences and fewer “engineer.”-type fragments. This isn’t bad; it manages to extract the piece about Werner Vogels and “Has achieved noteworthy technical, professional accomplishments while working as an engineer.” The last entry is actually from the principal engineer section.
One downside here is that we already see special characters like \xa0 and \n popping up. This tells us that perhaps we’re pushing the chunk length too far.
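If those characters are a concern, a small, optional cleanup step (not part of the tutorial code) could normalize the chunk text before embedding, for example:
# Optional cleanup sketch: replace non-breaking spaces and collapse whitespace/newlines.
def clean_chunk_text(text: str) -> str:
    return " ".join(text.replace("\xa0", " ").split())

print(clean_chunk_text("Has achieved\xa0noteworthy\ntechnical accomplishments"))
# -> "Has achieved noteworthy technical accomplishments"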
Length 256, overlap 32
I think this chunking length is definitely too long. It pulls the required entries, but also pulls entries from “Fellow”, “Principal Engineer”, and “Senior Staff Engineer”. The first entry is from Distinguished Engineer though, and it covers three points on it.
Length 512, overlap 64
We already established that 256 is probably too long. However, this 512’s first pull is actually the entire section for distinguished engineers. Now we have a dilemma - do we want individual “lines” or “notes” or to pull an entire section? That depends on your use case.
Summary of experimenting with different chunking strategies
Cool, so in this tutorial we tested five different text splitting strategies with a parameterized approach built around chunk size and chunk overlap, using LangChain in Python. One of the dilemmas we saw from just these five simple chunking strategies is between getting individual tidbits and an entire section back, depending on chunk size. We saw that 128 was pretty good for getting individual “lines” or “notes” about distinguished engineers, but that 512 could get our entire section back. However, 256 wasn’t that good.
These three data points tell us something about text splitting. It’s not just that finding an ideal chunk size is difficult; it’s also a sign that you need to think about what you want from your responses when crafting your chunk sizes.
Note that we haven’t even gotten to testing out different overlaps. Once you have a good chunk size down, checking overlaps is the logical next step. Maybe we’ll cover it in a future tutorial, perhaps with another library. Stay tuned!
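As a sketch of what that next step might look like (the overlap values here are hypothetical and weren’t run in this tutorial), you could hold the chunk size fixed and sweep only the overlap using the same function:
# Hypothetical overlap sweep: fix the chunk size at 128 and vary only the overlap.
overlap_tests = [(128, 0), (128, 16), (128, 32), (128, 64)]
for chunk_size, chunk_overlap in overlap_tests:
    test_langchain_chunking(path, headers_to_split_on, chunk_size, chunk_overlap)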
Chunking References
Chunking and text splitting strategies continue to evolve, so we have started to build a collection of different approaches for you to explore and potentially implement in your application. Enjoy!
- A Guide to Chunking Strategies for Retrieval Augmented Generation (RAG). We explored various facets of chunking strategies within Retrieval-Augmented Generation (RAG) systems in this guide.
- A Beginner's Guide to Website Chunking and Embedding for Your RAG Applications. In this post, we explain how to extract content from a website and use it as context for LLMs in a RAG application. However, before doing so, we need to understand website fundamentals.
- Exploring Three Key Strategies for Building Efficient Retrieval Augmented Generation (RAG). Retrieval Augmented Generation (RAG) is a useful technique for using your own data in an AI-powered chatbot. This blog post walks you through three key strategies to get the most out of RAG.
- Pandas DataFrame: Chunking and Vectorizing with Milvus. If we store all of the data, including the chunk text and the embedding, inside a Pandas DataFrame, we can easily integrate and import them into the Milvus vector database.