LLM guardrails can implement token-level filtering by analyzing a response as it is generated and blocking specific tokens (or words) that violate safety guidelines or ethical standards. Because it operates at this granular level, token-level filtering can prevent problematic words, phrases, or terms from being generated regardless of the surrounding context.
For example, if a user requests explicit content, the guardrails can block offensive tokens such as profanity or explicit language before they are output. This helps ensure that harmful or inappropriate content does not reach the final response, even when it appears inside an otherwise ordinary sentence. Token-level filtering can also reduce the generation of biased or discriminatory terms by excluding specific words from the model's output.
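To make the mechanism concrete, here is a minimal sketch of a token-level filter applied to a streaming model output. The token stream, the placeholder blocklist entries, and the `filter_tokens` helper are all illustrative assumptions, not part of any particular guardrail library; the point is simply that each token is checked against a blocklist before it is emitted.

```python
# Minimal sketch of token-level filtering on a streamed response.
# The blocklist terms and the fake token stream below are placeholders.

from typing import Iterable, Iterator, Set

BLOCKED_TOKENS: Set[str] = {"slur_a", "slur_b", "expletive"}  # placeholder terms

def filter_tokens(token_stream: Iterable[str],
                  blocklist: Set[str] = BLOCKED_TOKENS) -> Iterator[str]:
    """Yield tokens from the model, masking any that appear on the blocklist."""
    for token in token_stream:
        # Normalize before comparing so casing or trailing punctuation
        # does not let a blocked term slip through.
        normalized = token.strip().lower().strip(".,!?")
        if normalized in blocklist:
            yield "[filtered]"  # suppress the blocked token before output
        else:
            yield token

if __name__ == "__main__":
    # Stand-in for a model's streamed output, one token per element.
    fake_stream = ["This", "answer", "contains", "expletive", "content"]
    print(" ".join(filter_tokens(fake_stream)))
    # -> "This answer contains [filtered] content"
```

In practice the same idea can be pushed into the decoding step itself, for instance by masking the logits of blocked token IDs so they are never sampled, rather than scrubbing them after generation.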
Token-level filtering is effective against certain types of harmful content, but it requires continuous updates to keep pace with changes in language and usage. As language evolves, the blocklists behind the guardrails must be expanded to cover newly emerging offensive terms and problematic phrases so that filtering remains effective over time.
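One common way to support those ongoing updates is to keep the blocklist outside the model, in a file or configuration store that moderators can edit without redeploying anything. The sketch below assumes a plain-text file (one term per line); the file name `blocklist.txt` is a placeholder.

```python
# Sketch of an updatable blocklist loaded from a plain-text file,
# so new terms take effect without changing the model or the filter code.

from pathlib import Path
from typing import Set

def load_blocklist(path: str = "blocklist.txt") -> Set[str]:
    """Reload the current blocklist; returns an empty set if the file is missing."""
    file = Path(path)
    if not file.exists():
        return set()
    return {
        line.strip().lower()
        for line in file.read_text(encoding="utf-8").splitlines()
        if line.strip() and not line.startswith("#")  # skip blanks and comments
    }
```

Reloading this set on a schedule, or whenever the file changes, lets the token-level filter stay current as new offensive terms appear.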