Unstructured and Zilliz Cloud Integration
Unstructured and Zilliz Cloud integrate to streamline unstructured data processing for AI applications. Unstructured's platform ingests and transforms diverse document formats into AI-ready embeddings, while Zilliz Cloud's high-performance vector database provides scalable storage, indexing, and retrieval.
What is Unstructured
Unstructured provides a platform and tools to ingest and process unstructured documents for Retrieval Augmented Generation (RAG) and model fine-tuning. It supports diverse file formats, including text documents, images, PDFs, and presentations, and offers both a no-code UI platform and serverless API services. This enables users to quickly prepare data for downstream storage, analysis, and machine learning workflows with vector databases and LLM frameworks.
By integrating with Zilliz Cloud (fully managed Milvus), Unstructured creates a powerful, scalable solution for managing and leveraging unstructured data in AI applications — transforming various file types into AI-ready vector embeddings that are stored, indexed, and retrieved at scale for RAG, chatbots, and recommendation systems.
Benefits of the Unstructured + Zilliz Cloud Integration
- Diverse format support: Unstructured processes PDFs, HTML, images, presentations, and more, transforming them into chunked, embedded data ready for storage in Zilliz Cloud's vector database.
- Intelligent partitioning and chunking: Unstructured's partitioning engine uses strategies like "hi_res" for high-quality extraction and "by_title" for semantic chunking, producing well-structured chunks optimized for vector search in Zilliz Cloud.
- No-code and API options: Unstructured offers both a no-code UI platform and serverless API, giving teams flexibility to process data without writing code or programmatically in Python, with results flowing into Zilliz Cloud.
- Scalable AI-ready pipeline: The combination handles billion-scale vector storage and rapid similarity search, supporting production-grade RAG applications that leverage processed unstructured data.
How the Integration Works
Unstructured serves as the data processing layer, handling initial document ingestion, partitioning, and chunking. It reads diverse file formats, extracts content and metadata using configurable strategies, and produces structured text chunks ready for embedding and downstream use.
Zilliz Cloud serves as the vector database layer, storing and indexing the embedded document chunks from Unstructured. It provides high-performance similarity search with low latency, enabling applications to retrieve the most relevant content from large collections of processed documents.
Together, Unstructured and Zilliz Cloud create a complete document-to-search pipeline: Unstructured ingests and partitions documents into semantically meaningful chunks; these chunks are embedded using models such as OpenAI's text-embedding-3-small and stored in Zilliz Cloud along with their metadata. When a user submits a query, Zilliz Cloud retrieves the most relevant chunks through similarity search, and the retrieved context is passed to an LLM to generate an informed response.
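At the heart of the retrieval step is vector similarity. As a minimal illustration of how a vector database scores chunks against a query embedding, here is a pure-Python cosine-similarity sketch (the toy 3-dimensional vectors and example chunk texts are assumptions for demonstration; real embeddings have hundreds or thousands of dimensions):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors, the COSINE metric used later in the guide."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy embeddings standing in for real model output.
chunks = {
    "Milvus supports vector indexes.": [0.9, 0.1, 0.0],
    "Unstructured parses PDFs.": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.0]

# Rank chunks by similarity to the query, highest first.
ranked = sorted(chunks, key=lambda c: cosine_similarity(chunks[c], query), reverse=True)
print(ranked[0])  # → "Milvus supports vector indexes."
```

Zilliz Cloud performs the same kind of ranking, but over billions of vectors with approximate-nearest-neighbor indexes rather than a brute-force scan.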
Step-by-Step Guide
1. Install Dependencies
```shell
pip install -qU "unstructured[pdf]" pymilvus openai
```

For processing all document formats:

```shell
pip install "unstructured[all-docs]"
```

For specific formats (e.g., PDF):

```shell
pip install "unstructured[pdf]"
```

For more installation options, see the Unstructured documentation.

2. Set Up Environment and Clients
Prepare the OpenAI API key and initialize Milvus and OpenAI clients:
```python
import os

from openai import OpenAI
from pymilvus import MilvusClient, DataType

os.environ["OPENAI_API_KEY"] = "sk-***********"

milvus_client = MilvusClient(uri="./milvus_demo.db")
openai_client = OpenAI()


def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
```

As for the argument of `MilvusClient`:

- Setting the `uri` as a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically uses Milvus Lite to store all data in this file.
- If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the Public Endpoint and API Key in Zilliz Cloud.

3. Create Milvus Collection
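If you are targeting Zilliz Cloud rather than Milvus Lite, only the client construction changes; here is a connection-configuration sketch (the endpoint and key values are placeholders, not real credentials — copy yours from the Zilliz Cloud console):

```python
from pymilvus import MilvusClient

# Public Endpoint and API Key from your Zilliz Cloud cluster page (placeholders).
milvus_client = MilvusClient(
    uri="https://<your-cluster>.zillizcloud.com",  # Public Endpoint
    token="<your-api-key>",                        # API Key
)
```

The rest of the guide is unchanged: collection creation, insertion, and search all go through the same `MilvusClient` interface.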
Create a collection with schema and index:
```python
collection_name = "my_rag_collection"

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)

schema = milvus_client.create_schema(auto_id=False, enable_dynamic_field=False)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(
    field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=embedding_dim
)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="metadata", datatype=DataType.JSON)

index_params = MilvusClient.prepare_index_params()
index_params.add_index(
    field_name="vector",
    metric_type="COSINE",
    index_type="AUTOINDEX",
)

milvus_client.create_collection(
    collection_name=collection_name,
    schema=schema,
    index_params=index_params,
    consistency_level="Strong",
)

milvus_client.load_collection(collection_name=collection_name)
```

4. Load Data from Unstructured
Use Unstructured to partition and chunk a local PDF file:
```python
import warnings

from unstructured.partition.auto import partition

warnings.filterwarnings("ignore")

elements = partition(
    filename="./pdf_files/WhatisMilvus.pdf",
    strategy="hi_res",
    chunking_strategy="by_title",
)
```

5. Insert Data into Milvus
Embed the partitioned elements and insert them into the Milvus collection:
```python
data = []
for i, element in enumerate(elements):
    data.append(
        {
            "id": i,
            "vector": emb_text(element.text),
            "text": element.text,
            "metadata": element.metadata.to_dict(),
        }
    )
milvus_client.insert(collection_name=collection_name, data=data)
```

6. Retrieve and Generate RAG Response
Define retrieval and generation functions, then test the pipeline:
```python
def retrieve_documents(question, top_k=3):
    search_res = milvus_client.search(
        collection_name=collection_name,
        data=[emb_text(question)],
        limit=top_k,
        output_fields=["text"],
    )
    return [(res["entity"]["text"], res["distance"]) for res in search_res[0]]


def generate_rag_response(question):
    retrieved_docs = retrieve_documents(question)
    context = "\n".join([f"Text: {doc[0]}\n" for doc in retrieved_docs])
    system_prompt = (
        "You are an AI assistant. Provide answers based on the given context."
    )
    user_prompt = f"""
Use the following pieces of information to answer the question. If the information is not in the context, say you don't know.
Context: {context}
Question: {question}
"""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


question = "What is the Advanced Search Algorithms in Milvus?"
answer = generate_rag_response(question)
print(f"Question: {question}")
print(f"Answer: {answer}")
```

Learn More
- Build a RAG with Milvus and Unstructured — Official Milvus tutorial for building RAG with Unstructured
- Choosing ETL Tools for Unstructured Data to Get AI-ready — Zilliz blog on ETL tools for unstructured data
- Building Multimodal AI Pipelines: A Guide to Unstructured Data — Zilliz blog on multimodal data pipelines
- Unstructured Documentation — Official Unstructured documentation
- Unstructured GitHub Repository — Unstructured source code and community resources