Build RAG Chatbot with Haystack, Haystack In-memory store, Mistral 7B, and AmazonBedrock cohere embed-multilingual-v3

Introduction to RAG

Retrieval-Augmented Generation (RAG) is a game-changer for GenAI applications, especially in conversational AI. It combines the power of pre-trained large language models (LLMs) like OpenAI’s GPT with external knowledge sources stored in vector databases such as Milvus and Zilliz Cloud, allowing for more accurate, contextually relevant, and up-to-date response generation. A RAG pipeline usually consists of four basic components: a vector database, an embedding model, an LLM, and a framework.

Key Components We'll Use for This RAG Chatbot

This tutorial shows you how to build a simple RAG chatbot in Python using the following components:

Haystack: An open-source Python framework designed for building production-ready NLP applications, particularly question answering and semantic search systems. Haystack excels at retrieving information from large document collections through its modular architecture that combines retrieval and reader components. Ideal for developers creating search applications, chatbots, and knowledge management systems that require efficient document processing and accurate information extraction from unstructured text.
Haystack in-memory store: A very simple in-memory document store with no extra services or dependencies. It is great for experimenting with Haystack, but we do not recommend it for production. For a more scalable solution for your apps or enterprise projects, we recommend Zilliz Cloud, a fully managed vector database service built on the open-source Milvus that offers a free tier supporting up to 1 million vectors.
Mistral 7B: A 7-billion parameter open-source language model optimized for efficiency and versatility in natural language processing. It excels in text generation, summarization, and question answering, balancing performance with lower computational demands. Ideal for chatbots, content creation, code generation, and real-time applications where resource efficiency and rapid deployment are critical.
AmazonBedrock Cohere Embed-Multilingual-v3: A multilingual text embedding model hosted on Amazon Bedrock designed to generate high-dimensional vector representations (1024 dimensions) for text in over 100 languages. It excels at semantic understanding, cross-lingual retrieval, and scalability, making it ideal for multilingual search, content recommendation, clustering, and retrieval-augmented generation (RAG) systems requiring broad language support and semantic accuracy.

By the end of this tutorial, you’ll have a functional chatbot capable of answering questions based on a custom knowledge base.

Note: Since we may use proprietary models in our tutorials, make sure you have the required API key beforehand.

Step 1: Install and Set Up Haystack

pip install haystack-ai markdown-it-py mdit_plain requests

import os
import requests

from haystack import Pipeline
from haystack.components.converters import MarkdownToDocument

from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

Step 2: Install and Set Up Mistral 7B

To use Mistral models, you first need a Mistral API key. You can provide the key in either of two ways:

The api_key init parameter, using the Secret API
The MISTRAL_API_KEY environment variable (recommended)
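
If you go with the environment variable, it can help to fail early with a clear message when the key is missing. The following is a minimal sketch; `require_env` is a hypothetical helper written for this tutorial and is not part of Haystack or the Mistral integration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper for this tutorial, not a Haystack API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Export it before running the tutorial code."
        )
    return value
```

Calling `require_env("MISTRAL_API_KEY")` at the top of your script surfaces a missing key before any pipeline component is created.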

Once you have the API key, install the mistral-haystack package.

pip install mistral-haystack

from haystack_integrations.components.generators.mistral import MistralChatGenerator
from haystack.components.generators.utils import print_streaming_chunk
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

generator = MistralChatGenerator(api_key=Secret.from_env_var("MISTRAL_API_KEY"), streaming_callback=print_streaming_chunk, model='open-mistral-7b')

Step 3: Install and Set Up AmazonBedrock cohere embed-multilingual-v3

Amazon Bedrock is a fully managed service that makes high-performing foundation models from leading AI startups and Amazon available through a unified API.

To use Amazon Bedrock embedding models with Haystack, initialize an AmazonBedrockTextEmbedder (for queries) and an AmazonBedrockDocumentEmbedder (for documents) with the model name. The AWS credentials (aws_access_key_id, aws_secret_access_key, and aws_region_name) should be set as environment variables, picked up from your local AWS configuration, or passed as Secret arguments. Make sure the region you set supports Amazon Bedrock.

Now, let's start installing and setting up models with Amazon Bedrock.

pip install amazon-bedrock-haystack

import os
from haystack_integrations.components.embedders.amazon_bedrock import AmazonBedrockTextEmbedder
from haystack_integrations.components.embedders.amazon_bedrock import AmazonBedrockDocumentEmbedder
from haystack.dataclasses import Document

os.environ["AWS_ACCESS_KEY_ID"] = "..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1" # just an example

# Embeds the user's question at query time.
text_embedder = AmazonBedrockTextEmbedder(
    model="cohere.embed-multilingual-v3",
    input_type="search_query",
)

# Embeds document chunks at indexing time.
document_embedder = AmazonBedrockDocumentEmbedder(
    model="cohere.embed-multilingual-v3",
    input_type="search_document",
)

Step 4: Install and Set Up Haystack In-memory store

from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers import InMemoryEmbeddingRetriever

document_store = InMemoryDocumentStore()
retriever = InMemoryEmbeddingRetriever(document_store=document_store)

Step 5: Build a RAG Chatbot

Now that you've set up all the components, let's build a simple chatbot. We'll use the Milvus introduction doc as a private knowledge base. You can replace it with your own dataset to customize your RAG chatbot.

url = 'https://raw.githubusercontent.com/milvus-io/milvus-docs/refs/heads/v2.5.x/site/en/about/overview.md'
example_file = 'example_file.md'
response = requests.get(url)
with open(example_file, 'wb') as f:
    f.write(response.content)
file_paths = [example_file]  # You can replace it with your own file paths.

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", MarkdownToDocument())
indexing_pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=2))
indexing_pipeline.add_component("embedder", document_embedder)
indexing_pipeline.add_component("writer", DocumentWriter(document_store))
indexing_pipeline.connect("converter", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")
indexing_pipeline.run({"converter": {"sources": file_paths}})

# print("Number of documents:", document_store.count_documents())
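
To make the `split_by="sentence", split_length=2` setting concrete, here is a rough, dependency-free sketch of the idea: split the text into sentences and group them two at a time. The regex-based sentence splitting and the name `chunk_sentences` are simplifying assumptions for illustration; Haystack's own splitter detects sentence boundaries differently.

```python
import re


def chunk_sentences(text: str, split_length: int = 2) -> list[str]:
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Group consecutive sentences into chunks of `split_length`.
    return [
        " ".join(sentences[i : i + split_length])
        for i in range(0, len(sentences), split_length)
    ]


chunks = chunk_sentences(
    "Milvus is a vector database. It stores embeddings. It supports similarity search."
)
for chunk in chunks:
    print(chunk)
```

Small chunks like these keep each embedding focused on one idea, at the cost of producing more vectors to store and search.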
  
question = "What is Milvus?"  # You can replace it with your own question.

retrieval_pipeline = Pipeline()
retrieval_pipeline.add_component("embedder", text_embedder)
retrieval_pipeline.add_component("retriever", retriever)
retrieval_pipeline.connect("embedder", "retriever")

retrieval_results = retrieval_pipeline.run({"embedder": {"text": question}})

# for doc in retrieval_results["retriever"]["documents"]:
#     print(doc.content)
#     print("-" * 10)
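
Conceptually, the retriever scores every stored document embedding against the query embedding and returns the best matches. The sketch below illustrates that idea with plain Python lists and cosine similarity. The function names and the toy 3-dimensional vectors are assumptions for illustration; the real pipeline works on the vectors produced by the embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity = dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2):
    # Score every document, then keep the k highest-scoring ones.
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, made up for illustration only.
docs = {
    "milvus_overview": [0.9, 0.1, 0.0],
    "pricing_page": [0.0, 0.2, 0.9],
    "vector_search": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]
print(top_k(query, docs, k=2))
```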

from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage

# A component instance can belong to only one pipeline, so create fresh
# retriever and text embedder instances for the RAG pipeline.
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
text_embedder = AmazonBedrockTextEmbedder(
    model="cohere.embed-multilingual-v3",
    input_type="search_query",
)

prompt_template = """Answer the following query based on the provided context. If the context does
not include an answer, reply with 'I don't know'.

Query: {{query}}
Documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Answer:
"""

rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
# MistralChatGenerator expects a list of ChatMessage objects, so build a chat prompt.
rag_pipeline.add_component(
    "prompt_builder",
    ChatPromptBuilder(template=[ChatMessage.from_user(prompt_template)]),
)
rag_pipeline.add_component("generator", generator)
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "generator.messages")

results = rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"query": question}})
# Each reply is a ChatMessage object.
print('RAG answer:\n', results["generator"]["replies"][0])
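
To see roughly what the generator receives, it can help to render the prompt by hand. The sketch below mimics the template's loop with plain string formatting rather than the Jinja templating that the prompt builder uses. `build_prompt` and the sample passages are illustrative assumptions.

```python
def build_prompt(query: str, documents: list[str]) -> str:
    # Mirror the tutorial's template: instruction, query, documents, answer cue.
    context = "\n".join(f"- {doc}" for doc in documents)
    return (
        "Answer the following query based on the provided context. "
        "If the context does not include an answer, reply with 'I don't know'.\n"
        f"Query: {query}\n"
        f"Documents:\n{context}\n"
        "Answer:"
    )


# Sample passages made up for illustration.
prompt = build_prompt(
    "What is Milvus?",
    ["Milvus is an open-source vector database.", "It is built for similarity search."],
)
print(prompt)
```

Printing the rendered prompt while debugging makes it easier to tell whether a poor answer comes from retrieval (wrong passages) or from generation (right passages, wrong answer).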

Optimization Tips

As you build your RAG system, optimization is key to ensuring peak performance and efficiency. While setting up the components is an essential first step, fine-tuning each one will help you create a solution that works even better and scales seamlessly. In this section, we’ll share some practical tips for optimizing all these components, giving you the edge to build smarter, faster, and more responsive RAG applications.

Haystack optimization tips

To optimize Haystack in a RAG setup, back your retriever with a scalable vector store such as Milvus, or an index library such as FAISS, for fast similarity search. Tune your document store settings, such as indexing strategy and storage backend, to balance speed and accuracy. Generate embeddings in batches to reduce latency and the number of API calls. Cache results for frequently asked queries to avoid redundant computation. If your pipeline includes an extractive reader, choose a lightweight yet accurate transformer model such as DistilBERT to speed up response times. Apply query rewriting or metadata filtering to improve retrieval quality, so that the most relevant documents reach the generator. Finally, monitor the system with Haystack's evaluation tools and refine your setup iteratively based on real-world query performance.
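
As one illustration of the batching tip, the helper below groups texts into fixed-size batches so that each embedding request can carry several inputs. It is a minimal sketch: `batched` is a hypothetical helper and the batch size of 2 is arbitrary, so check your embedding provider's documented per-request limits before choosing a real value.

```python
from typing import Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    # Yield consecutive slices of at most `batch_size` items.
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


texts = ["chunk one", "chunk two", "chunk three", "chunk four", "chunk five"]
for batch in batched(texts, batch_size=2):
    # In a real pipeline, each batch would be sent to the embedding model in one call.
    print(len(batch), batch)
```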

Haystack in-memory store optimization tips

Haystack in-memory store is a very simple in-memory document store with no extra services or dependencies. We recommend using it only to experiment with RAG pipelines in Haystack, not in production. For a more scalable solution for your apps or enterprise projects, we recommend Zilliz Cloud, a fully managed vector database service built on the open-source Milvus that offers a free tier supporting up to 1 million vectors.

Mistral 7B optimization tips

To enhance Mistral 7B's performance in RAG, prioritize prompt engineering with concise, structured instructions and few-shot examples to guide outputs. Use smaller text chunks (256-512 tokens) for retrieval to reduce noise and improve relevance. If you self-host the model rather than calling a hosted API, you can fine-tune it on domain-specific data using LoRA for efficient adaptation and enable 4-bit quantization with the bitsandbytes library to reduce memory usage without significant accuracy loss. Keep temperature low (0.1-0.3) and top-p high (0.9-0.95) so that answers stay precise and grounded in the retrieved context. Cache frequent queries and precompute embeddings to accelerate inference.
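
The caching tip can be sketched with the standard library. Below, `functools.lru_cache` memoizes answers by question string so that a repeated question skips the expensive call. `answer_question` is a stand-in for running the RAG pipeline, not a real Haystack function, and exact-match caching is an assumption that only helps when users repeat a question nearly verbatim.

```python
from functools import lru_cache


def answer_question(question: str) -> str:
    # Stand-in for an expensive call such as running the RAG pipeline.
    return f"(answer to: {question})"


@lru_cache(maxsize=256)
def cached_answer(normalized_question: str) -> str:
    return answer_question(normalized_question)


def ask(question: str) -> str:
    # Normalize case and whitespace so trivially different inputs share a cache entry.
    return cached_answer(" ".join(question.lower().split()))


first = ask("What is Milvus?")
second = ask("  what is   MILVUS? ")
print(cached_answer.cache_info())
```

If the knowledge base changes, call `cached_answer.cache_clear()` so stale answers are not served.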

AmazonBedrock cohere embed-multilingual-v3 optimization tips

Optimize input preprocessing by normalizing text (lowercasing, removing special characters) and splitting documents into chunks aligned with the model’s 512-token limit. Use batch processing for bulk embeddings to reduce latency and costs. Filter irrelevant content before embedding to improve retrieval quality. For multilingual queries, ensure language-specific stopword removal and consider hybrid retrieval combining semantic and keyword search. Regularly validate embedding quality via cosine similarity checks and align vector dimensions with your database (e.g., PCA for dimensionality reduction). Cache frequent queries and update embeddings periodically to reflect data changes.
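
One simple way to combine semantic and keyword signals is a weighted sum of the two scores. The sketch below assumes you already have a semantic similarity score between 0 and 1 for each document. The keyword score is a plain term-overlap ratio, and the weight `alpha`, the scores, and all names are illustrative assumptions rather than settings from Haystack or Cohere.

```python
def keyword_overlap(query: str, text: str) -> float:
    # Fraction of distinct query terms that also appear in the text.
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def hybrid_score(semantic: float, keyword: float, alpha: float = 0.7) -> float:
    # alpha weights the semantic score; (1 - alpha) weights the keyword score.
    return alpha * semantic + (1.0 - alpha) * keyword


query = "milvus vector database"
candidates = {
    # text -> semantic score (made-up numbers for illustration)
    "milvus is a vector database": 0.80,
    "a guide to relational systems": 0.82,
}
ranked = sorted(
    candidates,
    key=lambda text: hybrid_score(candidates[text], keyword_overlap(query, text)),
    reverse=True,
)
print(ranked[0])
```

Here the keyword signal lifts the passage that literally mentions the query terms above one with a slightly higher semantic score. Production systems often use stronger keyword scoring than this overlap ratio.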

By implementing these tips across your components, you'll be able to enhance the performance and functionality of your RAG system, ensuring it’s optimized for both speed and accuracy. Keep testing, iterating, and refining your setup to stay ahead in the ever-evolving world of AI development.

RAG Cost Calculator: A Free Tool to Calculate Your Cost in Seconds

Estimating the cost of a Retrieval-Augmented Generation (RAG) pipeline involves analyzing expenses across vector storage, compute resources, and API usage. Key cost drivers include vector database queries, embedding generation, and LLM inference.
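
As a rough illustration of how these cost drivers add up, the sketch below sums embedding, LLM, and vector storage costs for one month. Every price and volume is a caller-supplied placeholder, not a real quote from any provider, and the function is an assumption for illustration rather than a description of how any particular calculator works.

```python
from dataclasses import dataclass


@dataclass
class RagCostInputs:
    embedding_tokens: int           # tokens sent to the embedding model
    llm_input_tokens: int           # prompt tokens sent to the LLM
    llm_output_tokens: int          # tokens generated by the LLM
    embedding_price_per_1k: float   # placeholder price per 1,000 embedding tokens
    llm_input_price_per_1k: float   # placeholder price per 1,000 input tokens
    llm_output_price_per_1k: float  # placeholder price per 1,000 output tokens
    vector_storage_monthly: float   # placeholder flat monthly storage/search fee


def estimate_monthly_cost(c: RagCostInputs) -> float:
    # Each usage-based cost is (tokens / 1000) * price per 1,000 tokens.
    embedding = c.embedding_tokens / 1000 * c.embedding_price_per_1k
    llm_input = c.llm_input_tokens / 1000 * c.llm_input_price_per_1k
    llm_output = c.llm_output_tokens / 1000 * c.llm_output_price_per_1k
    return embedding + llm_input + llm_output + c.vector_storage_monthly


# Placeholder numbers only; substitute your providers' actual prices.
example = RagCostInputs(
    embedding_tokens=1000,
    llm_input_tokens=2000,
    llm_output_tokens=500,
    embedding_price_per_1k=1.0,
    llm_input_price_per_1k=2.0,
    llm_output_price_per_1k=4.0,
    vector_storage_monthly=10.0,
)
print(estimate_monthly_cost(example))
```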

RAG Cost Calculator is a free tool that quickly estimates the cost of building a RAG pipeline, including chunking, embedding, vector storage/search, and LLM generation. It also helps you identify cost-saving opportunities and achieve up to 10x cost reduction on vector databases with the serverless option.

Calculate your RAG cost now.


What Have You Learned?

By diving into this tutorial, you’ve unlocked the magic of building a RAG system from scratch using cutting-edge tools! You learned how Haystack acts as the backbone, seamlessly connecting every piece of the puzzle. With Haystack’s In-Memory Store, you saw how to efficiently manage and retrieve vectorized data without the overhead of external databases—perfect for rapid prototyping. Then came Mistral 7B, the powerhouse language model that turns retrieved information into human-like responses, balancing speed and accuracy. But what ties it all together? Amazon Bedrock’s Cohere embed-multilingual-v3 embedding model, which transforms text into rich, multilingual vectors, ensuring your system understands context across languages. Together, these tools create a dynamic RAG pipeline that’s both flexible and scalable, ready to tackle everything from customer support bots to research assistants. You even picked up pro tips like optimizing chunk sizes for better retrieval and tweaking model parameters for cost-performance balance—plus, that free RAG cost calculator you explored? A game-changer for budgeting real-world deployments!

Now imagine what you can build next! Whether you’re refining your pipeline’s accuracy, experimenting with multilingual datasets, or integrating custom data sources, you’ve got the tools to innovate. This tutorial wasn’t just about following steps—it was about empowering you to think bigger. So fire up your code editor, play with those optimization tricks, and let your creativity run wild. The world of AI-driven applications is yours to shape, and with RAG, you’re not just building systems—you’re crafting solutions that think, learn, and adapt. Ready to make an impact? Start tinkering, share your creations, and watch your ideas come to life. The future is waiting—let’s build it together! 🚀

Further Resources

🌟 In addition to this RAG tutorial, unleash your full potential with these incredible resources to level up your RAG skills.

How to Build a Multimodal RAG | Documentation
How to Enhance the Performance of Your RAG Pipeline
Graph RAG with Milvus | Documentation
How to Evaluate RAG Applications - Zilliz Learn
Generative AI Resource Hub | Zilliz

We'd Love to Hear What You Think!

We’d love to hear your thoughts! 🌟 Leave your questions or comments below or join our vibrant Milvus Discord community to share your experiences, ask questions, or connect with thousands of AI enthusiasts. Your journey matters to us!

If you like this tutorial, show your support by giving our Milvus GitHub repo a star ⭐—it means the world to us and inspires us to keep creating! 💖