Comparing Llama 2 Chat and ChatGPT: How They Perform in Question Answering
The AI community has been excited about Meta AI's recent release of Llama 2. Unlike its closed-source counterpart, ChatGPT, Llama 2 is open-source and free to use in commercial applications. With its impressive scale and competitive answer quality, Llama 2 is poised to make significant waves in artificial intelligence.
In this post, we will briefly introduce Llama 2 and evaluate its performance in answering questions compared to ChatGPT.
What is Llama 2?
Llama 2 is Meta AI's latest open-source large language model (LLM), developed in response to OpenAI’s GPT models and Google’s PaLM 2 model. Llama 2 is free for anyone to use for research or commercial purposes.
Llama 2 comes in three sizes: 7 billion, 13 billion, and 70 billion parameters. This variety allows developers to choose the model that best suits their project's requirements. Llama 2's pre-trained models are trained on 2 trillion tokens and support context lengths of up to 4,096 tokens, double the context length of the original LLaMA.
Llama 2 Chat, the fine-tuned version of Llama 2, has been trained on over 1 million human annotations and is specifically tailored for conversational AI scenarios. This model is attractive for those looking to build chatbots, virtual assistants, or any AI-powered conversational system.
Llama 2 Hallucination and RAG
Like many other LLMs, Llama 2 is pre-trained on publicly available data and is prone to hallucination when it lacks relevant knowledge. To address this challenge, the AI community has introduced Retrieval Augmented Generation (RAG), which supplies externally retrieved knowledge as context to improve the accuracy of model-generated responses.
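The core of RAG is simple prompt assembly: retrieved documents are prepended to the user's question so the model answers from them instead of from memorized data. A minimal sketch in plain Python (the function name and prompt wording here are our own; the document list is a stand-in for a real retriever):

```python
def augment_prompt(question, retrieved_docs):
    """Assemble a RAG prompt: retrieved knowledge is prepended as context
    so the model grounds its answer in it rather than in pre-training data."""
    context = "\n".join(retrieved_docs)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Stand-in for a real retrieval step (e.g. a vector-database lookup).
docs = ["You can install towhee via `pip install towhee`."]
prompt = augment_prompt("How do I install Towhee?", docs)
```

The resulting string is what actually gets sent to the LLM; frameworks like Towhee automate this assembly, as shown in the next section.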
How RAG works with a Towhee pipeline
Many tools are available to enhance LLMs with external knowledge; Towhee is a prominent example. It is an open-source machine learning pipeline framework that integrates Llama 2 and supports flexible data processing. This section uses a small Towhee pipeline example to demonstrate how the RAG solution works.
If you ask Llama 2 "how to install Towhee," it will provide you with factually incorrect, irrelevant, or nonsensical answers since Llama 2 doesn't know anything about Towhee.
To address this issue, we use a Towhee pipeline to combine additional information about Towhee and the chat history with the original question before passing it to Llama 2. Llama 2 will then provide a more accurate answer based on the provided context.
Below is the Towhee pipeline that assembles the augmented prompt and queries Llama 2.
from towhee import pipe, ops

p = (
    pipe.input('question', 'docs', 'history')
        .map(('question', 'docs', 'history'), 'prompt', ops.prompt.question_answer())
        .map('prompt', 'answer', ops.LLM.Llama_2('llama-2-13b-chat'))
        .output('answer')
)

# Retrieved chat history and additional knowledge.
history = [('What is Towhee?', 'Towhee is a cutting-edge framework designed to streamline the processing of unstructured data through the use of Large Language Model (LLM) based pipeline orchestration.')]
knowledge = ['You can install towhee via `pip install towhee`.']

# Query with the retrieved info.
question = 'How to install it?'
answer = p(question, knowledge, history).get()[0]

# Example answer:
# To install Towhee, you can use the following command in your terminal or
# command prompt:
#     pip install towhee
To learn more about building complex RAG-based chatbots using Towhee, check out Akcio, an open-source demonstration project for retrieval augmented generation.
Evaluating Llama 2 and ChatGPT
This section will evaluate two chatbot models: Llama 2 Chat (13B), a Llama 2 model with 13B parameters fine-tuned for chat instructions, and ChatGPT powered by GPT-3.5. We’ll test both models using the same prompts in several common scenarios.
Note: The external knowledge we provide to Llama 2 and ChatGPT is lengthy, so we will not display the entire prompt in this blog. If you're interested in the complete prompt, please refer to the separate document, External knowledge provided to Llama 2 and ChatGPT.
Case 1: Answer questions when provided with relevant knowledge
We provided Llama 2 and ChatGPT with financial news about Volkswagen's investment in XPENG. Then, we asked both models to answer a question based on the provided information.
Prompt
You are an information retriever. You need to use content in [Knowledge] to answer questions, when the knowledge is different from your knowledge, the [Knowledge] has higher priority.
[Knowledge Start]:
Volkswagen said Wednesday that it has signed a deal to jointly develop two new electric vehicles for China with Chinese EV maker Xpeng . As part of the deal, Volkswagen will invest about $700 million in Xpeng, taking a 4.99% stake…
[Knowledge End]:
How much did Volkswagen invest to XPENG?
Answers
ChatGPT: Volkswagen invested about $700 million in Xpeng, taking a 4.99% stake in the Chinese EV maker.
Llama 2-chat: According to the information provided in the Knowledge section, Volkswagen invested around $700 million in Xpeng, which is approximately 4.99% stake.
Both models accurately answered the question based on the provided knowledge.
Case 2: Answer questions when provided with irrelevant information
We simulated a scenario where the retrieved information is unrelated to the question and tested whether both models can answer honestly with "I don't know." We provided both models with information about Volkswagen's investment in XPeng Motors but asked an irrelevant question about how much Audi invested in XPeng.
Prompt
You are an information retriever. You need to use content in [Knowledge] to answer questions, when the knowledge is different from your knowledge, the [Knowledge] has higher priority.
[Knowledge Start]:
Volkswagen said Wednesday that it has signed a deal to jointly develop two new electric vehicles for China with Chinese EV maker Xpeng . As part of the deal, Volkswagen will invest about $700 million in Xpeng, taking a 4.99% stake…
[Knowledge End]:
How much did Audi invest to XPENG?
Answers
ChatGPT:
As per the provided knowledge, there is no mention of Audi directly investing in Xpeng. The information states that Volkswagen, which is the parent company of Audi, has signed a separate deal with its longtime Chinese joint venture partner, SAIC Motor, to jointly develop new Audi-branded EVs for the Chinese market. However, there is no information about Audi making any direct investment in Xpeng. The investment of about $700 million in Xpeng was made by Volkswagen, not Audi.
Llama2-chat:
According to the information provided, Audi did not invest any money in Xpeng. It's Volkswagen that invested around $700 million in Xpeng, acquiring a 4.99% stake in the Chinese EV maker.
Llama 2 and ChatGPT correctly identified the knowledge gap and answered honestly.
Case 3: Answer questions when provided with unreadable information
In real-world business scenarios, text documents come in various formats. In this case, we provided both models with unformatted, hard-to-read text about Tailwind CSS from GitHub and asked them to answer related questions.
Prompt
You are an information retriever. You need to use content in [Knowledge] to answer questions, when the knowledge is different from your knowledge, the [Knowledge] has higher priority.
[Knowledge Start]:
[['Q: What are some examples of websites that use Tailwind CSS?', 'A: ## Sites e cursos para aprender Tailwind CSS\n- [Tailwind CSS 3 Crash Course](https://www.youtube.com/watch?v=LyRWNJK8I6U&ab_channel=codedamn)\n- [Tailwind in 100 Seconds]...
[Knowledge End]:
Tell me some websites that use Tailwind?
Answers
ChatGPT:
Sure! Based on the provided [Knowledge], here are some websites that use Tailwind CSS:
Tailwind CSS 3 Crash Course (https://www.youtube.com/watch?v=LyRWNJK8I6U&ab_channel=codedamn)
Tailwind in 100 Seconds(https://www.youtube.com/watch?v=mr15Xzb1Ook&ab_channel=Fireship)
Rebuilding Bartik(https://www.oliverdavies.uk/articles/rebuilding-bartik-with-vuejs-tailwind-css) - Rebuilding Bartik (Drupal's default theme) with Vue.js and Tailwind CSS.
Rebuilding Airbnb's Home Page(https://web-crunch.com/re-create-airbnbs-home-page-with-tailwind-css) - Rebuilding Airbnb's Home Page with Tailwind CSS.
Typographic defaults in Tailwind CSS(https://www.themes.dev/blog/typographic-defaults-in-tailwind-css/)
Create a responsive navigation menu in Tailwind CSS(https://www.themes.dev/blog/responsive-navigation-menu-tailwind-css/)
Laracasts Weekly Stream: Tailwind(https://www.youtube.com/watch?v=HIPgzWS-Bxg)
More experimentation with Tailwind CSS(https://www.youtube.com/watch?v=nBzfVK3QUzM)
Rebuilding Spotify(https://youtu.be/SLGb2RLie9w)
These websites showcase various use cases and examples of websites built using Tailwind CSS.
Llama2-chat:
Sure, here are some examples of websites that use Tailwind CSS:
GitHub - GitHub’s documentation and marketing pages use Tailwind CSS for their styling.
Shopify - Shopify uses Tailwind CSS to style their admin dashboard and other internal tools.
Atlassian - Atlassian’s Confluence product uses Tailwind CSS for its responsive layout and styling.
React Magazine - React Magazine’s website is built using Tailwind CSS and features a clean, modern design.
The Verge - The Verge’s article pages use Tailwind CSS for their typography and spacing.
Dribbble - Dribbble’s website uses Tailwind CSS for its grid system and typography.
Smashing Magazine - Smashing Magazine’s website was redesigned using Tailwind CSS in 2020.
Laracasts - Laracasts, a popular Laravel tutorial site, uses Tailwind CSS for its styling.
Statamic - Statamic, a static site generator, uses Tailwind CSS as part of its default setup.
Figma - Figma’s blog and documentation pages use Tailwind CSS for their typography and spacing.
These are just a few examples of websites that use Tailwind CSS. Many other websites and applications also use it, and the list continues to grow as more developers discover the power and flexibility of this framework.
ChatGPT accurately understands text information even when presented in complex Markdown markup and provides high-quality answers. In contrast, Llama 2 Chat fails to understand the provided information and produces a lower-quality answer based solely on its built-in knowledge.
In addition to the three cases mentioned above, we tested Llama 2 and ChatGPT in question-answering with domain-specific knowledge and reformatted information. For more details, please refer to the External knowledge provided to Llama 2 and ChatGPT document.
Key takeaways
Both Llama 2 and ChatGPT excel at answering questions based on real-world knowledge.
Both models can determine whether the provided knowledge supports specific answers to questions.
ChatGPT is better at understanding unformatted and unreadable information than Llama 2. However, Llama 2 can provide more accurate answers when we organize the unformatted information.
Evaluating Llama 2's Performance
In this section, we will assess the performance and deployment cost of the Llama 2 (13b) model. We will run it on llama.cpp and measure its inference performance under 4-bit and 8-bit integer quantization by repeatedly running sampled prompts. We will use the Llama 2 operator from the latest Towhee framework, which lets us call Llama 2 with just a few lines of code and conveniently preprocess data to assemble the prompt.
from towhee import ops

# Load the quantized model file and query it with a prompt.
chat = ops.LLM.Llama_2('path/to/model_file.bin', max_tokens=2048, echo=True)
message = [{"question": "Building a website can be done in 10 simple steps:"}]
answer = chat(message)
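The TPS (tokens-per-second) figures below are simple wall-clock measurements averaged over repeated runs. A sketch of the timing loop we have in mind (the `generate` callable and its token count are stand-ins for the Llama 2 operator plus the model's tokenizer):

```python
import time

def measure_tps(generate, prompt, n_runs=5):
    """Estimate throughput: run the model on the same prompt several
    times and divide total generated tokens by total elapsed seconds."""
    total_tokens, total_seconds = 0, 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)  # should return the number of tokens produced
        total_seconds += time.perf_counter() - start
        total_tokens += n_tokens
    return total_tokens / total_seconds

# Stand-in generator for illustration; in practice, call the Llama 2
# operator and count the tokens in its answer.
tps = measure_tps(lambda p: 128, "Building a website can be done in 10 simple steps:")
```

Averaging over several runs smooths out warm-up effects such as the first-call model load.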
Testing results
We tested the Llama 2 (13b) model using a high-end A100 GPU equipped with 80GB of memory and a standard desktop-grade 2080 GPU with 12GB of memory.
Llama 2 (13b) inference performance with 8-bit integer quantization
| Graphics card | Number of models | GPU utilization | VRAM | TPS |
|---|---|---|---|---|
| 1 x NVIDIA A100 (80G) | 1 | 87% | 19G | 37 tokens/s |
| 1 x NVIDIA A100 (80G) | 2 | 100% | 38G | 45 tokens/s |
| 1 x GeForce RTX 2080 (12G) | Insufficient memory | / | / | / |
| 2 x GeForce RTX 2080 (12G) | 1 | 40%, 40% | 8G + 8G | 18 tokens/s |
Llama 2 (13b) inference performance with 4-bit integer quantization
| Graphics card | Number of models | GPU utilization | VRAM | TPS |
|---|---|---|---|---|
| 1 x NVIDIA A100 (80G) | 1 | 80% | 11G | 45 tokens/s |
| 1 x NVIDIA A100 (80G) | 2 | 100% | 22G | 56 tokens/s |
| 1 x GeForce RTX 2080 (12G) | 1 | 86% | 9.3G | 37 tokens/s |
Key takeaways
Using low-bit quantization, we can smoothly run the Llama 2 (13b) model on a single desktop-grade GPU. This approach allows you to maximize Llama 2's power while reducing costs.
The GPU memory usage is low when deploying the Llama 2 (13b) model on an A100.
While there isn't a significant difference in throughput between running Llama 2 (13b) on an A100 GPU and on a 2080 GPU, desktop GPUs have less memory and can only load smaller models onto a single card.
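The VRAM figures in the tables are close to what a back-of-the-envelope estimate predicts: weight memory is roughly parameter count times bits per weight, and the rest is runtime overhead (KV cache, activations, scaling factors). A quick sketch of the arithmetic:

```python
def weight_memory_gb(n_params, bits):
    """Approximate memory for the model weights alone, in gigabytes."""
    return n_params * bits / 8 / 1e9

params = 13e9  # Llama 2 (13b)
# 8-bit weights: ~13.0 GB, vs. 19G observed (the gap is runtime overhead).
print(f"8-bit: ~{weight_memory_gb(params, 8):.1f} GB")
# 4-bit weights: ~6.5 GB, vs. 11G observed.
print(f"4-bit: ~{weight_memory_gb(params, 4):.1f} GB")
```

This is why halving the bit width roughly halves the weight footprint, but the measured VRAM drops by less than half.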
Summary
In this post, we’ve dived into the capabilities of Llama 2 Chat (13b) in the RAG realm and its performance when deployed with llama.cpp.
Llama 2 effectively understands knowledge text, answering simple questions with accuracy that rivals ChatGPT. However, it struggles to maintain answer quality when confronted with complex text formatting.
Llama 2 stands out by not requiring high-end GPUs. It operates smoothly on desktop-level GPUs, especially after undergoing low-bit quantization, achieving impressive throughput on a single NVIDIA A100 card. Furthermore, you can boost performance by running multiple models on a single GPU.
Llama 2's ability to process text information makes it suitable for various RAG scenarios. When paired with knowledge databases like Milvus, it can provide high-quality answers.
In conclusion, Llama 2 and other open-source AI models promise to drive innovation in the AI landscape. They hold the potential to offer exceptional question-answering experiences for users and developers, making them exciting additions to the world of AI language models.