Apify and Zilliz Cloud Integration
Apify and Zilliz Cloud integrate to build web data pipelines for AI applications. The integration combines Apify's web scraping and data extraction platform, which offers over 2,000 ready-made Actors, with Zilliz Cloud's high-performance vector database for storing, indexing, and searching crawled web content as vector embeddings.
What is Apify
Apify is a web scraping and data extraction platform that offers an app marketplace with over two thousand ready-made cloud tools, known as Actors. These tools suit use cases such as extracting structured data from e-commerce websites, social media, search engines, online maps, and more. For example, the Website Content Crawler Actor can deeply crawl websites, clean their HTML by removing cookie modals, footers, and navigation, and then transform the HTML into Markdown.
By integrating with Zilliz Cloud (fully managed Milvus), Apify's web scraping capabilities are connected directly to a scalable vector database through the Apify Milvus integration, enabling crawled web data to be automatically converted into vector embeddings and stored for semantic search, RAG-based question answering, recommendation systems, and AI-powered analytics.
Benefits of the Apify + Zilliz Cloud Integration
- Automated web-to-vector pipeline: Apify crawls and extracts web content, and the Milvus integration automatically chunks, embeds, and stores the data in Zilliz Cloud — creating a seamless pipeline from web data to searchable vectors.
- 2,000+ ready-made Actors: Apify's marketplace provides pre-built tools for crawling e-commerce sites, social media, search engines, and more, all of which can feed data into Zilliz Cloud for vector search.
- Incremental updates: The Apify Milvus integration supports incremental updates, only processing new or modified data based on checksums, keeping your vector database current with minimal processing.
- Automatic outdated data removal: The integration can automatically remove data that hasn't been crawled within a specified time, keeping your Zilliz Cloud collection optimized and up-to-date.
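The last two benefits (checksum-based incremental updates and removal of stale data) can be illustrated with a minimal sketch. This is not the integration's actual implementation: the function names, the shape of the `stored` index, and the `expire_after` parameter are all hypothetical and chosen only to show the idea.

```python
import hashlib
from datetime import datetime, timedelta


def checksum(text: str) -> str:
    """Stable fingerprint of a page's content (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(stored: dict, crawled: dict, now: datetime, expire_after: timedelta):
    """Decide what to re-embed, what to leave alone, and what to delete.

    stored:  {url: {"checksum": str, "last_seen": datetime}}  (hypothetical index state)
    crawled: {url: page_text} from the latest crawl
    """
    to_embed, unchanged = [], []
    for url, text in crawled.items():
        known = stored.get(url)
        if known is None or known["checksum"] != checksum(text):
            to_embed.append(url)   # new or modified page: chunk and embed again
        else:
            unchanged.append(url)  # same content: skip the embedding cost

    # Pages missing from this crawl are deleted only once they are old enough.
    to_delete = [
        url
        for url, record in stored.items()
        if url not in crawled and now - record["last_seen"] > expire_after
    ]
    return sorted(to_embed), sorted(unchanged), sorted(to_delete)
```

Only the pages in the first list would incur embedding cost on a re-crawl, which is why checksum comparison keeps updates cheap.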
How the Integration Works
Apify serves as the web scraping and data extraction layer, crawling websites using Actors like the Website Content Crawler. It extracts, cleans, and structures web content into text and metadata, storing results in Apify Datasets ready for downstream processing.
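To make "text and metadata" concrete, the sketch below selects fields from a single dataset item using dotted paths such as `metadata.title`, the same notation this guide later passes in `datasetFields`. The item shape and the helper names `get_field` and `select_fields` are assumptions for illustration, and the sample values are made up.

```python
from typing import Any


def get_field(item: dict, path: str) -> Any:
    """Resolve a dotted path like 'metadata.title' in a nested dict; None if absent."""
    current: Any = item
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def select_fields(item: dict, fields: list) -> dict:
    """Keep only the requested (possibly nested) fields of a dataset item."""
    return {field: get_field(item, field) for field in fields}


# Hypothetical item; real crawler output may carry more fields.
sample_item = {
    "url": "https://milvus.io/",
    "text": "Example page text",
    "metadata": {"title": "Example title"},
}
selected = select_fields(sample_item, ["text", "metadata.title"])
```

With the Apify Python client, real items for a finished run can be fetched with `client.dataset(dataset_id).list_items().items` and passed through the same helper.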
Zilliz Cloud serves as the vector database layer, storing and indexing the embedded web content from Apify. Through the apify/milvus-integration Actor, crawled data is automatically chunked, embedded using OpenAI or other providers, and uploaded to Zilliz Cloud for fast similarity search.
Together, Apify and Zilliz Cloud create a complete web data RAG pipeline: Apify's Website Content Crawler extracts and cleans web content, and the Milvus integration Actor chunks and embeds the text and stores it in Zilliz Cloud. Applications can then use LangChain to retrieve relevant documents from Zilliz Cloud and generate informed answers using an LLM, with incremental updates keeping the pipeline current.
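The chunk, embed, and search stages can be sketched in plain Python to show what each one does. This is a toy stand-in under stated assumptions: a fixed-size character window replaces the Actor's real chunking, a bag-of-words count replaces OpenAI embeddings, and an in-memory ranking replaces Zilliz Cloud's index. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def toy_embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a real pipeline uses a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]
```

Overlapping windows keep a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.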
Step-by-Step Guide
1. Install Dependencies
    $ pip install --upgrade --quiet apify==1.7.2 langchain-core==0.3.5 langchain-milvus==0.1.5 langchain-openai==0.2.0
2. Set Up API Keys
    import os
    from getpass import getpass

    # Prompt for secrets at runtime instead of hard-coding them
    os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
    os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")
3. Set Up Milvus/Zilliz Cloud Connection
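    # Endpoint URI and token of your Milvus or Zilliz Cloud instance
    os.environ["MILVUS_URI"] = getpass("Enter YOUR MILVUS_URI")
    os.environ["MILVUS_TOKEN"] = getpass("Enter YOUR MILVUS_TOKEN")

    # Collection that will hold the crawled, embedded content
    MILVUS_COLLECTION_NAME = "apify"
4. Crawl Websites with Apify's Website Content Crawler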
Use the Apify Python client to run the Website Content Crawler on the Milvus and Zilliz websites:
    from apify_client import ApifyClient

    client = ApifyClient(os.getenv("APIFY_API_TOKEN"))

    actor_id = "apify/website-content-crawler"
    run_input = {
        "crawlerType": "cheerio",
        "maxCrawlPages": 10,
        "startUrls": [{"url": "https://milvus.io/"}, {"url": "https://zilliz.com/"}],
    }

    # Start the Actor and wait for the run to finish
    actor_call = client.actor(actor_id).call(run_input=run_input)
5. Upload Crawled Data to Zilliz Cloud
Use the Apify Milvus integration to chunk, embed, and store the data:
    milvus_integration_inputs = {
        "milvusUri": os.getenv("MILVUS_URI"),
        "milvusToken": os.getenv("MILVUS_TOKEN"),
        "milvusCollectionName": MILVUS_COLLECTION_NAME,
        # Dataset fields to embed and store; dotted paths address nested fields
        "datasetFields": ["text", "metadata.title"],
        # Dataset produced by the crawler run in the previous step
        "datasetId": actor_call["defaultDatasetId"],
        "performChunking": True,
        "embeddingsApiKey": os.getenv("OPENAI_API_KEY"),
        "embeddingsProvider": "OpenAI",
    }

    actor_call = client.actor("apify/milvus-integration").call(
        run_input=milvus_integration_inputs
    )
6. Build the RAG Pipeline and Ask Questions
Define the retrieval-augmented pipeline using LangChain:
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import PromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_milvus.vectorstores import Milvus
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    # Connect to the collection populated by the Milvus integration Actor
    vectorstore = Milvus(
        connection_args={
            "uri": os.getenv("MILVUS_URI"),
            "token": os.getenv("MILVUS_TOKEN"),
        },
        embedding_function=embeddings,
        collection_name=MILVUS_COLLECTION_NAME,
    )

    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template="Use the following pieces of retrieved context to answer the question. If you don't know the answer, "
        "just say that you don't know. \nQuestion: {question} \nContext: {context} \nAnswer:",
    )

    def format_docs(docs):
        # Join the retrieved documents into a single context string
        return "\n\n".join(doc.page_content for doc in docs)

    rag_chain = (
        {
            "context": vectorstore.as_retriever() | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | ChatOpenAI(model="gpt-4o-mini")
        | StrOutputParser()
    )

    question = "What is Milvus database?"
    # Print the answer so it is also visible outside a notebook
    print(rag_chain.invoke(question))
Learn More
- Crawling Websites with Apify and Saving Data to Milvus — Official Milvus tutorial for RAG with Apify
- A Beginner's Guide to Website Chunking and Embedding for RAG Applications — Zilliz tutorial on website chunking for RAG
- Apify Milvus Integration — Apify Actor for Milvus/Zilliz Cloud integration
- Apify Documentation — Official Apify platform documentation
- Website Content Crawler — Apify's Website Content Crawler Actor


