Fireworks AI and Zilliz Cloud Integration
Fireworks AI and Zilliz Cloud can be combined to build high-performance RAG applications. Fireworks AI contributes a generative AI inference platform that serves LLMs and embedding models at industry-leading speed. Zilliz Cloud contributes a scalable vector database for efficient vector storage and similarity search.
What is Fireworks AI
Fireworks AI is a generative AI inference platform offering industry-leading speed and production-readiness for running and customizing models. It provides a variety of generative AI services, including serverless models, on-demand deployments, and fine-tuning capabilities across text, audio, image, and embedding models. Fireworks AI aggregates numerous models with pay-as-you-go pricing, JSON mode, grammar mode, and function calling capabilities, enabling users to easily access and utilize these resources without extensive infrastructure setup.
Integrating Fireworks AI with Zilliz Cloud (fully managed Milvus) pairs Fireworks AI's optimized LLM and embedding models with a scalable vector database. This gives production-ready RAG applications a robust foundation: Zilliz Cloud retrieves the most relevant documents based on vector similarity, and Fireworks AI's LLMs generate accurate, contextual responses from them.
Benefits of the Fireworks AI + Zilliz Cloud Integration
- Industry-leading inference speed: Fireworks AI delivers high-performance LLM and embedding model inference, while Zilliz Cloud provides fast vector retrieval, creating an end-to-end low-latency RAG pipeline.
- OpenAI-compatible API: Fireworks AI exposes an OpenAI-compatible API, so existing OpenAI client code can switch to Fireworks AI by changing the base URL, API key, and model name, while Zilliz Cloud stores and retrieves the resulting embeddings.
- Diverse model selection: Fireworks AI aggregates numerous models including Llama 3.1, DeepSeek, and embedding models like nomic-embed-text, giving developers flexibility to choose the best models for their use case with Zilliz Cloud as the vector backend.
- Production-ready without infrastructure complexity: Both platforms are fully managed, enabling developers to build and deploy RAG applications without managing complex infrastructure for model serving or vector storage.
How the Integration Works
Fireworks AI serves as the AI inference platform, providing both embedding models (e.g., nomic-ai/nomic-embed-text-v1.5) that generate vector representations of text and LLMs (e.g., llama-v3p1-405b-instruct) that generate contextual responses. Both are served through an OpenAI-compatible API, so standard OpenAI client libraries work with it.
Zilliz Cloud serves as the vector database layer, storing and indexing the embeddings generated by Fireworks AI's models. It provides high-performance similarity search (this guide uses the inner product metric), enabling fast retrieval of the most relevant documents from large collections.
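To make inner-product similarity search concrete, here is a brute-force sketch in plain Python: it scores every stored vector against a query vector and keeps the top `k`. The function names are hypothetical, and a production vector database such as Zilliz Cloud typically relies on indexes rather than scanning every vector, so treat this only as an illustration of the ranking rule, not of how Zilliz Cloud is implemented.

```python
# Brute-force sketch of inner-product (IP) top-k search.
# Real vector databases typically use indexes to avoid scanning every vector;
# this toy version scans all of them to show what "top-k by IP" means.
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products; a larger value means more similar under IP."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def top_k_by_ip(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k highest-scoring stored vectors."""
    scored = [(i, inner_product(query, v)) for i, v in enumerate(vectors)]
    # Sort by score, highest first, then keep the first k entries.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
    print(top_k_by_ip([0.8, 0.6], docs, k=2))
```

Note that inner product equals cosine similarity only when the vectors have unit length; with unnormalized vectors, longer vectors score higher.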
Together, Fireworks AI and Zilliz Cloud create a complete RAG solution: documents are embedded using Fireworks AI's embedding models and stored in Zilliz Cloud. When a user asks a question, Fireworks AI embeds the query, Zilliz Cloud retrieves the most relevant context through similarity search, and Fireworks AI's LLM generates an accurate, contextual response based on the retrieved documents.
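The flow just described can be sketched as one function whose three stages (embed, search, generate) are passed in as plain callables. This is a hypothetical illustration rather than the guide's code: `answer_question` and its parameters are invented names. In the step-by-step code, `embed` would correspond to a Fireworks AI embedding call, `search` to a Zilliz Cloud / Milvus search, and `generate` to a Fireworks AI chat completion.

```python
# Hypothetical sketch of the RAG control flow, with each stage injected as a
# function so the flow can be read and tested independently of any service.
from typing import Callable, List, Sequence

Embed = Callable[[str], Sequence[float]]             # text -> vector
Search = Callable[[Sequence[float], int], List[str]]  # (vector, k) -> passages
Generate = Callable[[str, str], str]                  # (context, question) -> answer


def answer_question(
    question: str,
    embed: Embed,
    search: Search,
    generate: Generate,
    top_k: int = 3,
) -> str:
    """Embed the question, retrieve top_k passages, and generate an answer."""
    query_vector = embed(question)          # 1. embed the query
    passages = search(query_vector, top_k)  # 2. retrieve relevant passages
    context = "\n".join(passages)           # 3. join passages into one context
    return generate(context, question)      # 4. answer grounded in the context
```

Keeping the stages injectable lets you exercise the control flow with stubs and swap an embedding model or vector store without touching it.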
Step-by-Step Guide
1. Install Dependencies
```shell
$ pip install --upgrade pymilvus openai requests tqdm
```

2. Prepare the API Key
Fireworks AI exposes an OpenAI-compatible API. Store your Fireworks AI API key in an environment variable:
```python
import os

os.environ["FIREWORKS_API_KEY"] = "***********"
```

3. Prepare the Data
We use the FAQ pages from the Milvus 2.4.x documentation as the private knowledge base for our RAG pipeline:
```shell
$ wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
$ unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs
```

```python
from glob import glob

text_lines = []

# Read each FAQ markdown file and split it into chunks at every "# " marker
for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()

    text_lines += file_text.split("# ")
```

4. Prepare the LLM and Embedding Model
Initialize a client using the OpenAI-compatible API and define the embedding function using the `nomic-ai/nomic-embed-text-v1.5` model:

```python
import os

from openai import OpenAI

# Point the OpenAI client at Fireworks AI's OpenAI-compatible endpoint
fireworks_client = OpenAI(
    api_key=os.environ["FIREWORKS_API_KEY"],
    base_url="https://api.fireworks.ai/inference/v1",
)


def emb_text(text):
    # Return the embedding vector for a single piece of text
    return (
        fireworks_client.embeddings.create(
            input=text, model="nomic-ai/nomic-embed-text-v1.5"
        )
        .data[0]
        .embedding
    )
```

Generate a test embedding, then print its dimension and first ten elements:
```python
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
```

5. Create a Milvus Collection and Insert Data
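The collection code in this step connects to a local Milvus Lite file. To run it against Zilliz Cloud instead, `MilvusClient` accepts the cluster's Public Endpoint as `uri` and an API key as `token`. Below is a minimal sketch, assuming hypothetical environment variable names `ZILLIZ_CLOUD_URI` and `ZILLIZ_CLOUD_API_KEY`, that falls back to Milvus Lite when either is unset; the helper name `milvus_connection_args` is also invented for illustration.

```python
# Minimal sketch: choose MilvusClient connection arguments from the environment.
# ZILLIZ_CLOUD_URI and ZILLIZ_CLOUD_API_KEY are hypothetical variable names;
# set them to your cluster's Public Endpoint and API key from Zilliz Cloud.
import os
from typing import Dict


def milvus_connection_args(local_path: str = "./milvus_demo.db") -> Dict[str, str]:
    """Return keyword arguments for pymilvus.MilvusClient."""
    uri = os.environ.get("ZILLIZ_CLOUD_URI")
    token = os.environ.get("ZILLIZ_CLOUD_API_KEY")
    if uri and token:
        # Zilliz Cloud: Public Endpoint as uri, API key as token
        return {"uri": uri, "token": token}
    # Fallback: a local file path, which MilvusClient serves through Milvus Lite
    return {"uri": local_path}


# Usage (requires pymilvus):
#   from pymilvus import MilvusClient
#   milvus_client = MilvusClient(**milvus_connection_args())
```

Because the code in this step drops any existing `my_rag_collection` before recreating it, double-check the collection name before running it against a shared Zilliz Cloud cluster.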
```python
from pymilvus import MilvusClient
from tqdm import tqdm

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"

# Drop any existing collection with the same name so the demo starts fresh
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

# Inner product (IP) metric; Strong consistency ensures searches see the latest inserts
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",
    consistency_level="Strong",
)

# Embed every text chunk and store it with its id and raw text
data = []
for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
```

As for the arguments of `MilvusClient`:

- Setting the `uri` to a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically uses Milvus Lite to store all data in this file.
- If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the Public Endpoint and API Key in Zilliz Cloud.

6. Build RAG: Retrieve and Generate Response
Search the collection for the question and retrieve its top-3 semantic matches:
```python
question = "How is data stored in milvus?"

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],  # embed the question with the same model
    limit=3,  # return the top-3 matches
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],  # return the stored text with each hit
)

# search_res holds one list of hits per query vector; keep (text, distance) pairs
retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(retrieved_lines_with_distances)
```

Use the `llama-v3p1-405b-instruct` model provided by Fireworks AI to generate a response from the retrieved context:

```python
# Concatenate the retrieved passages into a single context string
context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)

SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

response = fireworks_client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-405b-instruct",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)
```

Learn More
- Build RAG with Milvus and Fireworks AI — Official Milvus tutorial for building RAG with Fireworks AI
- Build RAG Chatbot with LangChain, Milvus, and Fireworks AI — Zilliz RAG tutorial with Fireworks AI
- Fireworks AI Documentation — Official Fireworks AI documentation
- Fireworks AI API Reference — Fireworks AI getting started guide
- Milvus Quickstart — Milvus quickstart documentation


