Experimenting with Different Chunking Strategies via LangChain

Chunking is among the most challenging problems in building retrieval augmented generation (RAG) applications. Chunking is the process of dividing text into smaller pieces before embedding and storing it. That sounds pretty simple, but the devil’s in the details. Depending on what your text looks like, you’ll want to chunk it up differently. In this tutorial, we look at how different chunking strategies affect the same piece of data. The code for this post can be found in this GitHub Repo on LLM Experimentation.
LangChain chunking intro
LangChain is an LLM orchestration framework. It also has built-in tools for chunking and for loading documents. This chunking tutorial uses LLMs only minimally and mainly revolves around setting chunking parameters. At a high level, we write a function that takes parameters, loads the doc, and chunks it, then prints the retrieved chunks. To experiment, we run many sets of chunking parameters through it.
LangChain chunking code imports and setup
This first section focuses on imports and other setup. Perhaps the first thing you notice about the code below is that there are a BUNCH of imports. The most familiar ones are os and dotenv, so I won’t cover them; they are simply used for your environment variables. Let’s step through the LangChain and pymilvus imports, though.
At the top, you’ll see our three imports for getting the doc in. First, there’s NotionDirectoryLoader, which loads a directory with markdown/Notion docs. Then, we have the Markdown Header and Recursive Character text splitters. These split the text within the markdown doc based on headers (the header splitter) or a set of pre-selected character breaks (the recursive splitter).
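To make the distinction concrete, here is a minimal sketch (using a toy markdown string, not the tutorial’s Notion doc) of what the header splitter produces on its own:
from langchain.text_splitter import MarkdownHeaderTextSplitter

toy_md = "## Intro\nSome introductory text.\n## Details\nMore detailed text."
toy_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("##", "Section")])
for doc in toy_splitter.split_text(toy_md):
    # each result is a Document whose metadata records the header it came from,
    # e.g. {'Section': 'Intro'}
    print(doc.metadata, doc.page_content)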
Next, we’ve got the retriever imports. Milvus is our vector database. OpenAIEmbeddings is our embedding model. OpenAI is our LLM. The SelfQueryRetriever is the LangChain native retriever that allows a vector database to “query itself”. I wrote more about that in this piece about using LangChain to Query a Vector Database.
Our last LangChain import is AttributeInfo, which, as one may guess, passes an attribute with info into the self-query retriever. Lastly, I want to touch on the pymilvus imports. These are strictly for utility reasons; we don’t need them to work with a vector database in LangChain. I use them to clean up the database at the end, as sketched below.
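For example, here’s a rough cleanup sketch (not part of the tutorial script) showing how those pymilvus utilities could drop any leftover test collections; it assumes the zilliz_uri and zilliz_token variables defined below, and the collection-name prefix used later in the function:
from pymilvus import connections, utility

def drop_test_collections(prefix="EngineeringNotionDoc"):
    # connect to the cluster, then drop every collection created by our experiments
    connections.connect(uri=zilliz_uri, token=zilliz_token)
    for name in utility.list_collections():
        if name.startswith(prefix):
            utility.drop_collection(name)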
The last thing we do before writing the function is load our environment variables and declare some constants. The headers_to_split_on variable is essential: it lists all the headers we expect to see and want to split on in the markdown. path simply tells LangChain where to find the Notion docs.
import os
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import Milvus
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from pymilvus import connections, utility
from dotenv import load_dotenv
load_dotenv()
zilliz_uri = os.getenv("ZILLIZ_CLUSTER_01_URI")
zilliz_token = os.getenv("ZILLIZ_CLUSTER_01_TOKEN")
headers_to_split_on = [
    ("##", "Section"),
]
path = './notion_docs'
Building a chunking experimentation function
Building the chunking experimentation function is the most critical part of the tutorial. As mentioned, this function takes some parameters for document ingestion and chunking experimentation. We need to provide the path to the docs, the headers to split on (splitters), the chunk size, the chunk overlap, and whether or not we want to clean up by dropping the collection at the end. The collection drop defaults to True.
If we can avoid it, we want to create and drop collections as sparingly as possible, because each one carries overhead we’d rather not pay. You may see the script change as I look for good workarounds.
This function is very similar to the one we linked above about using Notion with LangChain. The first section loads the document from the path using the NotionDirectoryLoader. Notice that we only grab the page content for the first page (and we only have one page).
Next, we grab our splitters. First, we use the markdown splitter to split on the headers we passed in above. Then, we use our recursive splitter to split based on the chunk size and overlap.
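If you want a feel for what those two parameters do on their own, here is a minimal sketch on a toy string (separate from the tutorial’s function):
from langchain.text_splitter import RecursiveCharacterTextSplitter

toy_text = "A distinguished engineer has achieved noteworthy technical and professional accomplishments."
toy_splitter = RecursiveCharacterTextSplitter(chunk_size=32, chunk_overlap=4)
for chunk in toy_splitter.split_text(toy_text):
    # each chunk is at most ~32 characters; consecutive chunks can share up to 4 characters
    print(repr(chunk))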
That’s all the splitting we need. With the splitting done, we choose a collection name and initialize a LangChain Milvus instance using the environment variables, the OpenAI embeddings, the splits, and the collection name. We also create a list of metadata fields via the AttributeInfo object to tell the self-query retriever that we have “sections.”
With all this set up, we get our LLM and pass it into a self-query retriever. From there, the retriever does its magic when we ask it a question about our docs. I’ve also set it up to tell us which chunking strategy we are testing. Finally, we can drop the collection if we’d like.
def test_langchain_chunking(docs_path, splitters, chunk_size, chunk_overlap, drop_collection=True):
    path = docs_path
    loader = NotionDirectoryLoader(path)
    docs = loader.load()
    md_file = docs[0].page_content
    # Let's create groups based on the section headers in our page
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=splitters)
    md_header_splits = markdown_splitter.split_text(md_file)
    # Define our text splitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    all_splits = text_splitter.split_documents(md_header_splits)
    # Name the collection after the chunking parameters so each experiment is isolated
    test_collection_name = f"EngineeringNotionDoc_{chunk_size}_{chunk_overlap}"
    vectordb = Milvus.from_documents(documents=all_splits,
                                     embedding=OpenAIEmbeddings(),
                                     connection_args={"uri": zilliz_uri,
                                                      "token": zilliz_token},
                                     collection_name=test_collection_name)
    # Tell the self-query retriever about the "Section" metadata field
    metadata_fields_info = [
        AttributeInfo(
            name="Section",
            description="Part of the document that the text comes from",
            type="string or list[string]"
        ),
    ]
    document_content_description = "Major sections of the document"
    llm = OpenAI(temperature=0)
    retriever = SelfQueryRetriever.from_llm(llm, vectordb, document_content_description, metadata_fields_info, verbose=True)
    res = retriever.get_relevant_documents("What makes a distinguished engineer?")
    print(f"""Responses from chunking strategy:
    {chunk_size}, {chunk_overlap}""")
    for doc in res:
        print(doc)
    # this is just for rough cleanup, we can improve this
    # lots of user considerations to understand for real experimentation use cases though
    if drop_collection:
        connections.connect(uri=zilliz_uri, token=zilliz_token)
        utility.drop_collection(test_collection_name)
LangChain chunking tests and results
Alright, now comes the exciting part! Let’s look at the tests and results.
Code to test LangChain chunking
This brief block of code below is how we run our function for experimentation. I’ve added five experiments. This tutorial tests chunk sizes from 32 to 512, increasing by powers of 2, with overlaps running from 4 to 64, also by powers of 2. To test, we loop through the list of tuples and call the function we wrote above.
chunking_tests = [(32, 4), (64, 8), (128, 16), (256, 32), (512, 64)]
for test in chunking_tests:
    test_langchain_chunking(path, headers_to_split_on, test[0], test[1])
Here’s what the entire output looks like. Now let’s take a peek into individual outputs. Remember that our chosen example question is: "What makes a distinguished engineer?"
Chunk length 32, chunk overlap 4
Okay, so from this we can clearly see that 32 is too short. These results are entirely useless. “Is a Distinguished Engineer” is the most circular answer possible.
Chunk length 64, chunk overlap 8
64 and 8 isn’t much better at first glance. It does give us an example of a distinguished engineer, though: Werner Vogels, CTO of Amazon.
Chunk length 128, chunk overlap 16
At 128, we start to see more complete sentences and fewer “engineer.”-type responses. This isn’t bad; it manages to extract the piece about Werner Vogels and “Has achieved noteworthy technical, professional accomplishments while working as an engineer.” The last entry, though, is actually from the principal engineer section.
One downside here is that we already see special characters like \xa0 and \n popping up. This tells us that perhaps we’re pushing the chunk length too far.
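If those characters become a problem, one option (not something this tutorial does) is to normalize each chunk’s text before embedding it, along these lines:
def clean_chunk(text: str) -> str:
    # replace non-breaking spaces and newlines with regular spaces, then collapse runs of whitespace
    return " ".join(text.replace("\xa0", " ").replace("\n", " ").split())

print(clean_chunk("Has achieved\xa0noteworthy technical,\nprofessional accomplishments"))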
Chunk length 256, chunk overlap 32
I think this chunk length is definitely too long. It pulls the required entries, but it also pulls entries from the “Fellow,” “Principal Engineer,” and “Senior Staff Engineer” sections. The first entry is from Distinguished Engineer, though, and it covers three points on it.
Chunk length 512, chunk overlap 64
We already established that 256 is probably too long. However, at 512 the first result is actually the entire section on distinguished engineers. Now we have a dilemma: do we want individual “lines” or “notes,” or do we want to pull back an entire section? That depends on your use case.
Summary of experimenting with different chunking strategies
Cool, so we saw five different chunk size and chunk overlap strategies in this tutorial. One of the dilemmas that emerged from these five simple strategies is the trade-off between getting individual tidbits back and getting an entire section back. We saw that 128 was pretty good for getting individual “lines” or “notes” about distinguished engineers, but that 512 could return our entire section, while 256 wasn’t that good.
These data points tell us something about chunking. It’s not just that finding an ideal chunk size is difficult; it’s also that you need to think about what you want from your responses when crafting your chunk sizes.
Note that we haven’t even gotten to testing out different overlaps. After settling on a good chunk size, checking overlaps is the logical next step, as sketched below. Maybe we’ll cover it in a future tutorial, maybe with another library. Stay tuned!
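As a rough sketch of that next step (not something run in this tutorial), you could hold the chunk size that worked well, 128 here, fixed and sweep the overlap with the same function:
overlap_tests = [0, 8, 16, 32, 64]
for overlap in overlap_tests:
    test_langchain_chunking(path, headers_to_split_on, 128, overlap)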