Airbyte and Zilliz Cloud Integration
Airbyte and Zilliz Cloud integrate to streamline data ingestion for AI applications, combining Airbyte's open-source data movement infrastructure and its 350+ pre-built connectors with Zilliz Cloud's high-performance vector database to automate data extraction, embedding, and similarity search.
What is Airbyte
Airbyte is an open-source data movement infrastructure for building extract and load (EL) data pipelines. It is designed for versatility, scalability, and ease of use. Airbyte's connector catalog comes "out-of-the-box" with over 350 pre-built connectors that can be used to start replicating data from a source to a destination in just a few minutes. It also features a No-Code Connector Builder, a cloud-managed platform, and multiple interfaces including UI, PyAirbyte (Python library), API, and Terraform Provider.
By integrating with Zilliz Cloud (fully managed Milvus), Airbyte enables seamless data flow from hundreds of data sources — including databases, data warehouses, and SaaS products — into a scalable vector database, handling data ingestion, chunking, formatting, vectorization, indexing, and similarity search in an automated pipeline for building RAG applications, semantic search, and recommendation systems.
Benefits of the Airbyte + Zilliz Cloud Integration
- Extensive data source connectivity: Airbyte connects with 350+ data sources including databases, APIs, and SaaS products like Zendesk, Salesforce, and GitHub, enabling seamless data flow into Zilliz Cloud for vector search applications.
- Automated chunking and embedding: The Milvus destination connector automatically splits records into text chunks, generates vector embeddings using pre-trained models (OpenAI, Cohere), and loads them into Zilliz Cloud — no custom code required.
- Incremental sync: Airbyte's "Incremental | Append + Deduped" sync mode keeps Zilliz Cloud up to date with source data changes while transferring minimal data, ensuring your vector database stays current.
- End-to-end data pipeline: From extraction to embedding to indexing, Airbyte handles the complete data movement pipeline, letting developers focus on building applications rather than managing data infrastructure.
- Flexible deployment: Airbyte is available as cloud-managed or self-managed, pairing with Zilliz Cloud's managed vector database for a fully managed stack or with self-hosted Milvus for on-premise requirements.
How the Integration Works
Airbyte serves as the data movement layer, extracting data from configured source connectors (databases, APIs, SaaS products), processing it through chunking and formatting, generating vector embeddings using models like OpenAI's text-embedding-ada-002, and loading the results into the destination vector database.
Zilliz Cloud serves as the vector database destination, storing and indexing the chunked and embedded data from Airbyte. It provides high-performance similarity search, enabling applications to retrieve semantically related content from the ingested data.
Together, Airbyte and Zilliz Cloud create an automated data-to-search pipeline: Airbyte extracts data from any of its 350+ source connectors, chunks the text, generates embeddings, and loads them into Zilliz Cloud. Applications can then perform similarity search on this data to power use cases like semantic customer support search, RAG-based chatbots, recommendation engines, and knowledge base retrieval.
Step-by-Step Guide
1. Prerequisites
You will need:
- A data source account (e.g., Zendesk) or another source you want to sync data from
- An Airbyte account or local instance
- An OpenAI API key
- A Milvus cluster
- Python 3.10 installed locally
2. Set Up Milvus Cluster
If you have already deployed a Milvus cluster on Kubernetes (for example, with Milvus Operator), you can use it directly. Otherwise, the fastest path is to create a fully managed cluster on Zilliz Cloud.
Create a collection with a suitable name and set the Dimension to 1536 to match the vector dimensionality generated by the OpenAI embeddings service. After creation, record the endpoint and authentication info.
3. Set Up Source in Airbyte
Sign up for an Airbyte cloud account at cloud.airbyte.com or fire up a local instance. Click "New connection" and pick your source connector (e.g., "Zendesk Support"). After clicking "Test and Save," Airbyte will check whether the connection can be established.
4. Set Up Milvus Destination in Airbyte
Pick the "Milvus" connector as the destination. The Milvus connector performs three functions:
- Chunking and Formatting — Splits records into text and metadata. If the text is larger than the specified chunk size, the record is split into multiple chunks that are loaded individually.
- Embedding — Using machine learning models, transforms text chunks into vector embeddings. Supply the OpenAI API key and Airbyte will send each chunk to OpenAI and add the resulting vector.
- Indexing — Loads vectorized chunks into your Milvus cluster. Insert the endpoint and authentication information from your cluster setup.
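The chunking step behaves roughly like the following simplified sketch (a hypothetical illustration, not the connector's actual implementation — the real connector uses configurable chunk sizes and more sophisticated text splitters):

```python
def chunk_record(text: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Split a record's text into overlapping fixed-size chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back by the overlap each time
    return chunks

# A 2,500-character record becomes three chunks, each embedded and
# loaded into Milvus individually.
print(len(chunk_record("a" * 2500)))  # → 3
```

Each resulting chunk is then sent to the embedding model and inserted as its own row, so one long source record can map to several vectors in the collection.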
5. Set Up Stream Sync Flow
Select which "streams" to sync (e.g., "tickets" and "articles" for Zendesk). Choose the "Incremental | Append + Deduped" sync mode so that subsequent runs keep your data in sync while transferring minimal data.
6. Build a Streamlit Application
Install the required packages and build a support form application that queries the Milvus collection:
```shell
pip install streamlit pymilvus openai
```

```python
import os

import openai  # note: this example uses the pre-1.0 OpenAI SDK interface
import pymilvus
import streamlit as st

with st.form("my_form"):
    st.write("Submit a support case")
    text_val = st.text_area("Describe your problem?")
    submitted = st.form_submit_button("Submit")

    if submitted:
        org_id = 360033549136
        # Connect to the cluster that Airbyte loads data into
        pymilvus.connections.connect(
            uri=os.environ["MILVUS_URL"], token=os.environ["MILVUS_TOKEN"]
        )
        collection = pymilvus.Collection("zendesk")

        # Embed the new case with the same model used during ingestion
        embedding = openai.Embedding.create(
            input=text_val, model="text-embedding-ada-002"
        )["data"][0]["embedding"]

        # Search for the two most similar open tickets in the organization
        results = collection.search(
            data=[embedding],
            anns_field="vector",
            param={},
            limit=2,
            output_fields=["_id", "subject", "description"],
            expr=f'status == "new" and organization_id == {org_id}',
        )
        st.write(results[0])

        if len(results[0]) > 0 and results[0].distances[0] < 0.35:
            matching_ticket = results[0][0].entity
            st.write(
                f"This case seems very similar to "
                f"{matching_ticket.get('subject')} "
                f"(id #{matching_ticket.get('_id')}). "
                "Make sure it has not been submitted before"
            )
        else:
            st.write("Submitted!")
```

Run the application:
```shell
export MILVUS_TOKEN=...
export MILVUS_URL=https://...
export OPENAI_API_KEY=sk-...

streamlit run app.py
```

Learn More
- Airbyte: Open-Source Data Movement Infrastructure — Official Milvus tutorial for integrating with Airbyte
- Use Milvus and Airbyte for Similarity Search on All Your Data — Zilliz blog on similarity search with Airbyte
- Higher Data Flow Efficiency with Zilliz Upsert, Kafka, and Airbyte Integration — Zilliz blog on Airbyte integration
- Airbyte Documentation — Official Airbyte documentation
- Airbyte Milvus Destination Connector — Airbyte Milvus connector documentation