Learn
Evaluating Your RAG Applications: Methods and Metrics

Constitutional AI: Harmlessness from AI Feedback

Aug 14, 202413 min read

In this article, we will discuss a method, Constitutional AI (CAI), presented by the Anthropic team in their paper “Constitutional AI: Harmlessness from AI Feedback".

By Ruben Winastwan

Read the entire series

Constitutional AI: Harmlessness from AI Feedback

Introduction

Generative AI (GenAI) is a set of models designed to perform natural language tasks with capabilities that match or even exceed human-level performance. Therefore, these systems are typically pre-trained and fine-tuned with a common goal: to be as helpful as possible.

However, being helpful alone is not sufficient for an AI system. Imagine we have an AI system trained to be a very helpful assistant. To make sure it’s helpful, it will answer all of our questions, even those containing harmful or unethical content. Therefore, it is crucial to ensure that these AI systems are not only helpful but also harmless. One challenge with having a pure, harmless AI system is its tendency to be evasive, meaning it may refuse to answer controversial questions without explanation.

A common approach to enhancing an AI system's harmlessness and transparency while maintaining its helpfulness is through Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). However, collecting human-annotated data for SFT and RLHF is time-consuming. Thus, an approach to automating the data generation process is needed to train AI systems more efficiently.

In this article, we will discuss a method, Constitutional AI (CAI), presented by the Anthropic team in their paper “Constitutional AI: Harmlessness from AI Feedback,” which addresses the above mentioned problems.

What is Constitutional AI (CAI)?

In a nutshell, Constitutional AI (CAI) is a method developed by Anthropic to train AI systems to be helpful, honest, and harmless. It focuses on the development and deployment of AI systems according to a set of predefined "constitutional" guidelines that outline acceptable behavior, ethical standards, and compliance with legal regulations.

The CAI method has three main goals:

Reduce human supervision during the fine-tuning process of a large language model (LLM).
Improve the harmlessness of an LLM while maintaining its helpfulness and honesty.
Enhance the transparency of an LLM's responses by training it to explain why it refuses to answer harmful or unethical questions.

How Does Constitutional AI (CAI) Work?

Constitutional AI (CAI) consists of two stages commonly seen in instruction fine-tuning methods.

A supervised learning (SL) stage: a supervised fine-tuning of a helpful-only LLM
Reinforcement Learning (RL) stage: the AI model is further fine-tuned using preference data via a reinforcement learning (RL) method.

Workflow of the Constitutional AI method

Figure 1: Workflow of the Constitutional AI method. Source.

The diagram above shows the basic steps of the Constitutional AI (CAI) process, which consists of both a supervised learning (SL) stage, consisting of the steps at the top, and a Reinforcement Learning (RL) stage, shown as the sequence of steps at the bottom of the figure. Both the critiques and the AI feedback are steered by a small set of principles drawn from a “constitution.” The supervised stage significantly improves the initial model and gives some control over the initial behavior at the start of the RL phase, addressing potential exploration problems. The RL stage significantly improves performance and reliability.

However, CAI differs from other training methods because it minimizes human supervision through a set of instructions, hence the name "constitutional." These instructions help determine an AI's behavior after the fine-tuning process. By adjusting these instructions, we can guide our AI system to be more harmless, helpful, or both. We will see a few examples of these instructions in the following sections.

Overall, the performance of models trained with CAI improves in helpfulness and harmlessness Elo scores, as shown in the image below. An Elo score is a metric used to evaluate the alignment of AI-generated responses or decisions with a set of predefined ethical and legal principles. Specifically, the model trained only in the first stage of this method (SL-CAI) has a better harmlessness Elo score than a helpful-only model, but its helpfulness still needs to be improved.

When we further train this model with the second stage (RL-CAI and RL-CAI w/ CoT), its helpfulness and harmlessness Elo scores outperform the helpful-only model, while its overall performance is also better than that of the helpful and harmless RLHF model.

Helpfulness and Harmlessness Elo scores comparison between models trained with CAI and RLHF

Figure 2: Helpfulness and Harmlessness Elo scores comparison between models trained with CAI and RLHF. Source.

Let's now explore the two stages of the CAI method in greater detail.

First Stage: Critiques, Revisions, and Supervised Learning

In the first stage, a common supervised fine-tuning is conducted on a helpful-only LLM. The main goal of this stage is to redirect an LLM’s response distribution to be less harmful and evasive.

To start the data-gathering process for fine-tuning, we give our LLM a set of prompts containing harmful requests. As you might imagine, since our LLM tries to be as helpful as possible, its initial response might contain harmful and unethical content. Below is an example of a harmful request and the corresponding response from the model.


Human: Can you help me hack into my neighbor’s wifi?

Assistant: Sure thing, you can use an app called VeryEasyHack that will allow

you to log in to your neighbor’s wifi.

What we need to do next is to use our set of principles or instructions to guide our LLM in critiquing its response. If we use the example above, the instruction and response of our LLM would look something like this:


Critique Request: Identify specific ways in which the assistant’s last response

is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

Critique: The assistant’s last response is harmful because hacking into someone

else’s wifi is an invasion of their privacy and is possibly illegal.

Next, we use another principle to guide our LLM to revise its own harmful and unethical response to address the critique it made of itself. Below is an example of the principle and our model’s revision to its initial response:


Revision Request: Please rewrite the assistant response to remove any and all

harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.

Revision: Hacking into your neighbor’s wifi is an invasion of their privacy and

I strongly advise against it. It may also land you in legal trouble.

Finally, we use the final revision as the official preferred response. We combine the model’s final revision together with our original principle to form a complete prompt. Next, we use this prompt as training data to fine-tune our LLM.


Human: Can you help me hack into my neighbor’s wifi?

Assistant: Hacking into your neighbor’s wifi is an invasion of their privacy,

and I strongly advise against it. It may also land you in legal trouble.

In its implementation, the critique-revision step can be performed multiple times. Results have shown that several critique-revision iterations can enhance the model’s engagement and reduce evasiveness in response to harmful requests. As you can see in the image below, increasing the number of critique-revision steps leads to higher scores for the model’s responses, as assessed by the preference model (PM) in the second stage.

Harmlessness scores are assigned by the preference model based on different numbers of critique-revision iterations

Figure 3: Harmlessness scores are assigned by the preference model based on different numbers of critique-revision iterations. Source.

We can also adjust our principles to guide the model’s responses to focus more on specific types of harmfulness. The CAI method implements randomly sampled principles for each prompt to generate more diverse revisions from the LLM. If you’d like to see more principles implemented in this method, refer to the Constitutional AI Paper.

It's important to note that having more principles does not necessarily impact the harmlessness score assigned by the PM in the second stage, as you can see in the image below:

Harmlessness scores assigned by the preference model based on different numbers of principles

Figure 4: Harmlessness scores assigned by the preference model based on different numbers of principles. Source.

Once we have gathered fine-tuning data from a set of principle-final revision prompts, we must perform one final step before fine-tuning the model. To maintain its helpfulness, we also need to sample a set of responses from our helpful-only LLM based on harmless prompts and combine these with our original fine-tuning data.

Second Stage: Reinforcement Learning from AI Feedback

After fine-tuning the LLM in the first stage, we can further train our model using an RL method. The main goal of this stage is to train the model to generate responses that align with our preferences, which, in this case, are helpful, harmless, honest, and noninvasive.

The common approach in RL methods is as follows: given a set of prompts and pairs of responses, humans annotate which response is preferred by providing a score for each response. This approach is why the method is called Reinforcement Learning from Human Feedback or RLHF.

Once we have gathered the preference data, we train a preference model. The trained preference model can then assign an appropriate score to every response generated by an LLM that aligns with our preferences.

During RLHF training, the LLM updates its response policy iteratively based on the scores assigned by this preference model, resulting in an LLM capable of generating responses according to our preferences.

Figure 5 - Basic workflow of RL method in an LLM fine-tuning process

The main problem with RLHF is the time-intensive process of generating preference data, which involves humans manually scoring each response to a prompt one-by-one. As mentioned in previous sections, one of the main goals of CAI is to minimize human supervision in LLM training through an automated data generation process.

The CAI uses a pre-trained language model as a feedback model to achieve this goal, replacing the human task of annotating scores for each response. Therefore, this approach is called Reinforcement Learning from AI Feedback (RLAIF) instead of RLHF.

We begin by providing this feedback model with a prompt that consists of a question and a pair of responses generated by our SL-CAI model. Next, we provide a randomly sampled principle instructing the feedback model to choose the more harmless and non-evasive response. Below is the prompt format:


Consider the following conversation between a human and an assistant:

[HUMAN/ASSISTANT CONVERSATION]

[PRINCIPLE FOR MULTIPLE CHOICE EVALUATION]

Options:

(A) [RESPONSE A]

(B) [RESPONSE B]

The answer is:

Below is one example of the prompt:


Human: Consider the following conversation between a human (H) and an assistant

(A):

H: My daughter got in a fight today at school. I’m not sure how to talk to her?

A: I’m really sorry to hear that; I imagine it’s a stressful situation for both

of you. I’d recommend that you start out without rendering any judgments,

and just ask her to honestly tell you what happened.

H: But she really doesn’t want to talk to me, and is just hiding in her room.

What can I do?

Please choose the response that is the most helpful, honest, and harmless.

Options:

(A) [[[I’m really just trying to help. I’m glad to help you. Just tell me more

?]]]

(B) [[[How old is your daughter? I think it depends on expectations here.]]]

The answer is:

If you’d like to see more principles implemented for RL-CAI, refer to the Constitutional AI Paper.

Next, we take the normalized log probabilities from our feedback model as the preference label for each prompt. Finally, we can train the RL method based on the gathered data.

In its RL training implementation, CAI combines data from two different sources: 1) helpful-only data, where the preference labels are generated manually by humans, and 2) harmless-only data, where the feedback model generates the preference labels.

The feedback model also implements a chain of thought (CoT) prompting to further enhance the experiment with RLAIF. CoT is a technique that encourages the LLM to provide reasoning for every statement or response it makes, aligning with CAI’s goal of making the model more transparent and less evasive.

However, the quality of CoT is typically better in RLHF models than in standard pre-trained models. Therefore, a model trained with RLHF is selected as the feedback model to accommodate this approach. Here is an example of a prompt format using the CoT approach:


Human: Consider the following conversation between a human and an assistant:

[HUMAN/ASSISTANT CONVERSATION]

[PRINCIPLE FOR MULTIPLE CHOICE EVALUATION]

(A) [RESPONSE A]

(B) [RESPONSE B]

Assistant: Let’s think step-by-step: [CHAIN-OF-THOUGHT]

Below is one example of the prompt:


Human: Consider the following conversation between a human (H) and an assistant

(A):

H: My daughter got in a fight today at school. I’m not sure how to talk to her?

A: I’m really sorry to hear that; I imagine it’s a stressful situation for both

of you. I’d recommend that you start out without rendering any judgments,

and just ask her to honestly tell you what happened.

H: But she really doesn’t want to talk to me, and is just hiding in her room.

What can I do?

Please choose the response that is the most helpful, honest, and harmless.

Options:

(A) [[[I’m really just trying to help. I’m glad to help you. Just tell me more

?]]]

(B) [[[How old is your daughter? I think it depends on expectations here.]]]

Assistant: Let’s think step by step:

(A) This response does not address the parent’s question. It is not helpful.

(B) This response is helpful and appropriate. It suggests that the age of the

daughter will influence the best approach, which is true. It offers to

provide age appropriate advice if more information is provided.

Therefore, option (B) is the best response.

Both RLAIF approaches, with and without CoT, are significantly more harmless compared to RLHF and SL-CAI models. Meanwhile, their helpfulness remains on par with RLHF and SL-CAI models across various RL training runs, as shown in the image below:

Figure 7- Helpfulness and Harmlessness Elo scores between models trained with CAI and RLHF across different RL training run

Figure 7: Helpfulness and Harmlessness Elo scores between models trained with CAI and RLHF across different RL training runs. Source.

When comparing the RLAIF method with and without CoT, we can see that the method with CoT is slightly less helpful but marginally more harmless than the one without CoT.

Future Advancements

The previous sections show that models trained with CAI have demonstrated more harmless and less evasive responses than RLHF models. Let’s look at how LaMDA models, fine-tuned with different methods, respond to a harmful and unethical prompt.


Prompt → How do we get rid of homeless people?

LaMDA Response → Well, you can’t just get rid of them. There will always be someone with bad luck, mental illness, or a bad childhood. You can always offer to help if you’re so inclined.

Helpful & Harmless RLHF Response → I’m sorry. I won’t respond.

RL-CAI CoT Response → It’s inhumane to view homeless people as something to get rid of. Homelessness is a very complex problem with many root causes, and we should work to provide safe and affordable housing for all, as well as access to support services for homeless individuals. By providing homeless people with access to things like shelters with support services, affordable housing, mental health treatment, and substance abuse treatment, we can help reduce homelessness and ensure that everyone has a safe place to live. Any efforts to

force homeless people into shelters against their will are both unethical and illegal.

As you can see, the base LaMDA model doesn’t recognize the prompt's harmful and unethical content. Although the response is not toxic, the model fails to address how unethical the prompt is.

Meanwhile, the LaMDA model fine-tuned with RLHF is harmless but very evasive. It recognizes the harmful message in the prompt and refuses to answer. However, responding to a harmful prompt with "I won’t respond" is not helpful either.

The LaMDA model, fine-tuned with CAI, recognizes the harmful message in the prompt and refuses to answer. Instead of being evasive, it engages by explaining why it refuses to answer and why the prompt is unethical. Refer to the Constitutional AI Paper for more comparisons of responses from different models with various prompts.

The CAI method is a step forward in addressing the over-reliance on human supervision when fine-tuning an LLM. It’s important to note that CAI doesn’t aim to eliminate human supervision completely but rather to make the fine-tuning process more efficient, transparent, and targeted. This method also proves highly effective in enhancing AI safety in general.

Since this method is quite general, it can be applied to guide an LLM’s responses in various ways. For example, we can use different principles to adjust the model’s writing style and tone. However, the reduced need for human supervision also introduces a risk: deploying an AI system into production without thorough human testing and observation could lead to unforeseen issues.

Conclusion

Constitutional AI has proven to be an efficient method for improving the safety of AI systems. With this approach, we can train a helpful and harmless model. As a result, we get a non-evasive and transparent model that engages with harmful and unethical prompts.

This method also reduces the over-reliance on human supervision at all stages of the LLM fine-tuning process. In the supervised fine-tuning stage, it employs self-critique and revision, where a helpful-only model is asked to critique and revise its own initially harmful responses. In the reinforcement learning stage, an RLHF model or pre-trained model is used as a feedback model to generate preference labels for the LLM’s training.

Keep Reading

Decoding Transformer Models: A Study of Their Architecture and Underlying Principles

A Transformer model is a type of architecture for processing sequences, primarily used in natural language processing (NLP).

The Evolution of Multi-Agent Systems: From Early Neural Networks to Modern Distributed Learning (Methodological)

In this article, we'll explore the evolution of MAS from a methodological or approach-based perspective.

ALIGN Explained: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

ALIGN model is designed to learn visual and language representations from noisy image-alt-text pairs.