Unstructured and Zilliz Cloud Integration
Unstructured and Zilliz Cloud integrate to streamline unstructured data processing for AI applications. Unstructured's platform ingests and transforms diverse document formats into AI-ready embeddings, while Zilliz Cloud's high-performance vector database provides scalable storage, indexing, and retrieval.
What is Unstructured
Unstructured provides a platform and tools to ingest and process unstructured documents for Retrieval Augmented Generation (RAG) and model fine-tuning. It supports diverse file formats, including text documents, images, PDFs, and presentations, and offers both a no-code UI platform and serverless API services. This enables users to quickly prepare data for downstream storage, analysis, and machine learning workflows with vector databases and LLM frameworks.
By integrating with Zilliz Cloud (fully managed Milvus), Unstructured creates a powerful, scalable solution for managing and leveraging unstructured data in AI applications — transforming various file types into AI-ready vector embeddings that are stored, indexed, and retrieved at scale for RAG, chatbots, and recommendation systems.
Benefits of the Unstructured + Zilliz Cloud Integration
- Diverse format support: Unstructured processes PDFs, HTML, images, presentations, and more, transforming them into chunked, embedded data ready for storage in Zilliz Cloud's vector database.
- Intelligent partitioning and chunking: Unstructured's partitioning engine uses strategies like "hi_res" for high-quality extraction and "by_title" for semantic chunking, producing well-structured chunks optimized for vector search in Zilliz Cloud.
- No-code and API options: Unstructured offers both a no-code UI platform and serverless API, giving teams flexibility to process data without writing code or programmatically in Python, with results flowing into Zilliz Cloud.
- Scalable AI-ready pipeline: The combination handles billion-scale vector storage and rapid similarity search, supporting production-grade RAG applications that leverage processed unstructured data.
How the Integration Works
Unstructured serves as the data processing layer, handling initial document ingestion, partitioning, and chunking. It reads diverse file formats, extracts content and metadata using configurable strategies, and produces structured text chunks ready for embedding and downstream use.
Zilliz Cloud serves as the vector database layer, storing and indexing the embedded document chunks from Unstructured. It provides high-performance similarity search with low latency, enabling applications to retrieve the most relevant content from large collections of processed documents.
Together, Unstructured and Zilliz Cloud create a complete document-to-search pipeline: Unstructured ingests and partitions documents into semantically meaningful chunks; these chunks are embedded using models such as OpenAI's text-embedding-3-small and stored in Zilliz Cloud along with their metadata. When a user submits a query, Zilliz Cloud retrieves the most relevant chunks through similarity search, and the retrieved context is passed to an LLM to generate an informed response.
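At the heart of the retrieval step is vector similarity. As a minimal illustration of how a vector database scores chunks against a query embedding, here is a pure-Python cosine-similarity sketch (the toy 3-dimensional vectors and example chunk texts are assumptions for demonstration; real embeddings have hundreds or thousands of dimensions):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors, the COSINE metric used later in the guide."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy embeddings standing in for real model output.
chunks = {
    "Milvus supports vector indexes.": [0.9, 0.1, 0.0],
    "Unstructured parses PDFs.": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.0]

# Rank chunks by similarity to the query, highest first.
ranked = sorted(chunks, key=lambda c: cosine_similarity(chunks[c], query), reverse=True)
print(ranked[0])  # → "Milvus supports vector indexes."
```

Zilliz Cloud performs the same kind of ranking, but over billions of vectors with approximate-nearest-neighbor indexes rather than a brute-force scan.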
Step-by-Step Guide
1. Install Dependencies
```shell
pip install -qU "unstructured[pdf]" pymilvus openai
```

For processing all document formats:

```shell
pip install "unstructured[all-docs]"
```

For specific formats (e.g., PDF):

```shell
pip install "unstructured[pdf]"
```

For more installation options, see the Unstructured documentation.

2. Set Up Environment and Clients
Prepare the OpenAI API key and initialize Milvus and OpenAI clients:
```python
import os

from openai import OpenAI
from pymilvus import MilvusClient, DataType

os.environ["OPENAI_API_KEY"] = "sk-***********"

milvus_client = MilvusClient(uri="./milvus_demo.db")
openai_client = OpenAI()


def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
```

As for the argument of `MilvusClient`:

- Setting the `uri` as a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically uses Milvus Lite to store all data in this file.
- If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the Public Endpoint and API Key in Zilliz Cloud.

3. Create Milvus Collection
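If you are targeting Zilliz Cloud rather than Milvus Lite, only the client construction changes; here is a connection-configuration sketch (the endpoint and key values are placeholders, not real credentials — copy yours from the Zilliz Cloud console):

```python
from pymilvus import MilvusClient

# Public Endpoint and API Key from your Zilliz Cloud cluster page (placeholders).
milvus_client = MilvusClient(
    uri="https://<your-cluster>.zillizcloud.com",  # Public Endpoint
    token="<your-api-key>",                        # API Key
)
```

The rest of the guide is unchanged: collection creation, insertion, and search all go through the same `MilvusClient` interface.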
Create a collection with schema and index:
```python
collection_name = "my_rag_collection"

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)

schema = milvus_client.create_schema(auto_id=False, enable_dynamic_field=False)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(
    field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=embedding_dim
)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="metadata", datatype=DataType.JSON)

index_params = MilvusClient.prepare_index_params()
index_params.add_index(
    field_name="vector",
    metric_type="COSINE",
    index_type="AUTOINDEX",
)

milvus_client.create_collection(
    collection_name=collection_name,
    schema=schema,
    index_params=index_params,
    consistency_level="Strong",
)

milvus_client.load_collection(collection_name=collection_name)
```

4. Load Data from Unstructured
Use Unstructured to partition and chunk a local PDF file:
```python
import warnings

from unstructured.partition.auto import partition

warnings.filterwarnings("ignore")

elements = partition(
    filename="./pdf_files/WhatisMilvus.pdf",
    strategy="hi_res",
    chunking_strategy="by_title",
)
```

5. Insert Data into Milvus
Embed the partitioned elements and insert them into the Milvus collection:
```python
data = []
for i, element in enumerate(elements):
    data.append(
        {
            "id": i,
            "vector": emb_text(element.text),
            "text": element.text,
            "metadata": element.metadata.to_dict(),
        }
    )
milvus_client.insert(collection_name=collection_name, data=data)
```

6. Retrieve and Generate RAG Response
Define retrieval and generation functions, then test the pipeline:
```python
def retrieve_documents(question, top_k=3):
    search_res = milvus_client.search(
        collection_name=collection_name,
        data=[emb_text(question)],
        limit=top_k,
        output_fields=["text"],
    )
    return [(res["entity"]["text"], res["distance"]) for res in search_res[0]]


def generate_rag_response(question):
    retrieved_docs = retrieve_documents(question)
    context = "\n".join([f"Text: {doc[0]}\n" for doc in retrieved_docs])
    system_prompt = (
        "You are an AI assistant. Provide answers based on the given context."
    )
    user_prompt = f"""
Use the following pieces of information to answer the question. If the information is not in the context, say you don't know.
Context: {context}
Question: {question}
"""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


question = "What is the Advanced Search Algorithms in Milvus?"
answer = generate_rag_response(question)
print(f"Question: {question}")
print(f"Answer: {answer}")
```

Learn More
- Build a RAG with Milvus and Unstructured — Official Milvus tutorial for building RAG with Unstructured
- Choosing ETL Tools for Unstructured Data to Get AI-ready — Zilliz blog on ETL tools for unstructured data
- Building Multimodal AI Pipelines: A Guide to Unstructured Data — Zilliz blog on multimodal data pipelines
- Unstructured Documentation — Official Unstructured documentation
- Unstructured GitHub Repository — Unstructured source code and community resources