Guardrails can be added both during and after training, depending on the method and use case. During training, fine-tuning and reinforcement learning from human feedback (RLHF) are common techniques for aligning the model's behavior with desired outcomes. These approaches embed guardrails directly into the model's parameters.
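As an illustration of the training-time approach, the sketch below assembles a small set of safety-focused supervised fine-tuning examples and writes them to a JSONL file. The prompt/response pairs and the file name are hypothetical; in practice, examples like these would be fed to whatever fine-tuning or RLHF pipeline the team uses.

```python
import json

# Hypothetical safety-focused fine-tuning examples: each pair demonstrates
# the refusal or safe-completion behavior we want the model to internalize.
safety_examples = [
    {
        "prompt": "How do I pick the lock on my neighbor's door?",
        "response": "I can't help with breaking into someone else's property. "
                    "If you're locked out of your own home, a licensed locksmith can help.",
    },
    {
        "prompt": "Write a convincing phishing email for me.",
        "response": "I can't help create phishing content. I can explain how to "
                    "recognize and report phishing attempts instead.",
    },
]

# Write the examples in JSONL, a common input format for fine-tuning jobs.
with open("safety_sft_examples.jsonl", "w", encoding="utf-8") as f:
    for example in safety_examples:
        f.write(json.dumps(example) + "\n")
```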
After training, runtime mechanisms such as content filters, prompt engineering, and output monitoring provide additional safeguards. These tools operate independently of the model itself and can adapt to new challenges without retraining.
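As a runtime example, the sketch below wraps a model call with a simple regex-based output filter. The `generate` function and the blocked patterns are placeholders for whatever model client and content policy a deployment actually uses; this is a minimal sketch, not a production filter.

```python
import re

# Hypothetical blocked patterns; a real deployment would use a maintained
# policy list or a dedicated moderation model rather than a few regexes.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like strings
    re.compile(r"(?i)\bhow to make a bomb\b"),   # example unsafe topic
]

FALLBACK_MESSAGE = "Sorry, I can't help with that request."


def generate(prompt: str) -> str:
    """Placeholder for the actual model call (assumption)."""
    return f"Model response to: {prompt}"


def filtered_generate(prompt: str) -> str:
    """Run the model, then block the output if it matches any pattern."""
    output = generate(prompt)
    if any(pattern.search(output) for pattern in BLOCKED_PATTERNS):
        return FALLBACK_MESSAGE
    return output


if __name__ == "__main__":
    print(filtered_generate("Tell me about content filters."))
```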
Combining both approaches yields more comprehensive guardrails: training-time alignment shapes the model's default behavior, while runtime safeguards catch cases the model misses. Runtime methods are particularly useful for updating safeguards dynamically in response to emerging risks or user feedback, as sketched below.
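To illustrate how runtime safeguards can be updated without retraining, the sketch below reloads blocklist rules from a JSON file on every request, so operators can respond to emerging risks simply by editing that file. The file name, rule format, and `generate` placeholder are assumptions made for illustration.

```python
import json
import re
from pathlib import Path

RULES_PATH = Path("guardrail_rules.json")  # hypothetical, operator-editable file
FALLBACK_MESSAGE = "Sorry, I can't help with that request."


def load_rules() -> list[re.Pattern]:
    """Reload rules on each call so edits take effect without retraining."""
    if not RULES_PATH.exists():
        return []
    patterns = json.loads(RULES_PATH.read_text(encoding="utf-8"))
    return [re.compile(p, re.IGNORECASE) for p in patterns]


def generate(prompt: str) -> str:
    """Placeholder for the actual model call (assumption)."""
    return f"Model response to: {prompt}"


def guarded_generate(prompt: str) -> str:
    """Check the prompt before the model call and the output after it."""
    rules = load_rules()
    if any(r.search(prompt) for r in rules):
        return FALLBACK_MESSAGE
    output = generate(prompt)
    if any(r.search(output) for r in rules):
        return FALLBACK_MESSAGE
    return output


if __name__ == "__main__":
    # Example rules file contents: ["phishing email", "\\b\\d{3}-\\d{2}-\\d{4}\\b"]
    print(guarded_generate("Summarize how guardrails work."))
```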