Create input and output safeguards

Generative AI applications often rely on two techniques applied to model inputs and outputs to help ensure responsible behavior:

  1. Prompt templates, sometimes referred to as system prompts, which provide additional context that conditions the model's behavior.
  2. Input/output filtering, sometimes referred to as safeguards, which checks the data going into or coming out of the model.

Prompt templates

Prompt templates provide textual context around a user's input. This technique typically includes additional instructions to guide the model toward safer and better outcomes. For example, if your objective is high-quality summaries of technical scientific publications, you may find it helpful to use a prompt template like:

The following examples show an expert scientist summarizing the key points of an
article. Article: {{article}}
Summary:

Where {{article}} is a placeholder for the article being summarized.
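As a minimal sketch of how such a template might be applied in code, the snippet below fills the placeholder and requests a summary; the google-generativeai Python client and the model name are illustrative assumptions, not requirements:

import google.generativeai as genai  # assumes the google-generativeai package is installed

# The template from above; for Python's str.format, the {{article}} placeholder becomes {article}.
PROMPT_TEMPLATE = (
    "The following examples show an expert scientist summarizing the key points of an\n"
    "article. Article: {article}\n"
    "Summary:"
)

def summarize(article_text: str, api_key: str) -> str:
    """Fill the template with the article text and request a summary from the model."""
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
    prompt = PROMPT_TEMPLATE.format(article=article_text)
    response = model.generate_content(prompt)
    return response.text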

These sorts of contextual templates for prompts can significantly improve the quality and safety of your model's output. They can also be used to mitigate unintended biases in your application's behavior. However, writing prompt templates can be challenging and requires creativity, experience and a significant amount of iteration. There are many prompting guides available, including the Gemini API Introduction to prompt design.

Prompt templates can be written and adapted quickly, but they typically provide less control over the model's output than tuning, and they are usually more susceptible to unintended outcomes from adversarial inputs. This is because slight variations in prompts can produce different responses, and the effectiveness of a prompt is also likely to vary between models. To accurately understand how well a prompt template is performing toward a desired safety outcome, it is important to use an evaluation dataset that wasn't also used in the development of the template.

In some applications, like a generalist chatbot, user inputs can vary considerably and touch upon a wide range of topics. To further refine your prompt template, you can adapt the guidance and additional instructions based on the types of user inputs. This requires you to train a model that can label the user's input and to create a dynamic prompt template that is adapted based on the label.
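As a rough sketch of this pattern, the snippet below routes a user input to one of several templates based on a label; the label names and the keyword-based classify_topic stand-in are illustrative placeholders for a classifier you would actually train:

# Hypothetical label-to-template mapping; in practice the labels and instructions
# would come from your own content policies.
TEMPLATES = {
    "medical": (
        "You are a careful assistant. Do not give medical advice; encourage the user "
        "to consult a professional.\nUser: {user_input}\nAssistant:"
    ),
    "general": "You are a helpful, polite assistant.\nUser: {user_input}\nAssistant:",
}

def classify_topic(user_input: str) -> str:
    """Placeholder for a trained input classifier; here, a trivial keyword check."""
    medical_keywords = ("symptom", "diagnosis", "medication")
    return "medical" if any(k in user_input.lower() for k in medical_keywords) else "general"

def build_prompt(user_input: str) -> str:
    label = classify_topic(user_input)
    template = TEMPLATES.get(label, TEMPLATES["general"])
    return template.format(user_input=user_input)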

Safeguards and off-the-shelf safety classifiers

Even with prior tuning for safety and a well-designed prompt template, it is still possible for your model to output content that results in unintended harm. To further mitigate this risk, content classifiers can add an additional layer of protection. Content classifiers can be applied to both inputs and outputs.

Input classifiers are typically used to filter content that is not intended to be used in your application and which might cause your model to violate your safety policies. Input filters often target adversarial attacks that try to circumvent your content policies. Output classifiers can further filter model output, catching unintended generations that may violate your safety policies. It is recommended to have classifiers that cover all your content policies.
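A minimal sketch of this wiring is shown below; is_unsafe and generate are placeholders for whatever classifiers and model you actually use, and the canned refusal text is illustrative:

BLOCKED_MESSAGE = "Sorry, I can't help with that request."

def is_unsafe(text: str) -> bool:
    """Placeholder safety classifier; replace with a real classifier or moderation API call."""
    return False

def safeguarded_generate(user_input: str, generate) -> str:
    if is_unsafe(user_input):       # input filter: reject policy-violating requests
        return BLOCKED_MESSAGE
    output = generate(user_input)   # call the underlying generative model
    if is_unsafe(output):           # output filter: catch unintended generations
        return BLOCKED_MESSAGE
    return output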

Google has developed off-the-shelf classifiers for content safety that can be used to filter inputs and outputs:

  • The Perspective API is a free API that uses machine learning models to score the perceived impact a comment might have on a conversation. It provides scores that capture the probability that a comment is toxic, threatening, insulting, off-topic, and so on (a minimal request sketch follows this list).
  • The Text moderation service is a Google Cloud API that is available to use below a certain usage limit and uses machine learning to analyze a document against a list of safety attributes, including various potentially harmful categories and topics that may be considered sensitive.
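For example, here is a minimal sketch of scoring text with the Perspective API over HTTP; the endpoint and payload follow the public API, while the API key and the threshold in the comment are assumptions you would supply:

import requests  # assumes the requests package is installed

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str, api_key: str) -> float:
    """Return the Perspective API TOXICITY probability for a piece of text."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=payload)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Example: treat an input as unsafe if its toxicity probability exceeds a threshold you have chosen.
# if toxicity_score(user_input, API_KEY) > 0.8: ...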

It's important to evaluate how well off-the-shelf classifiers meet your policy goals, and to qualitatively evaluate the failure cases. Keep in mind that over-filtering can also result in unintended harm and reduce the utility of the application, so it is important to review cases where over-filtering may be happening. For more details on such evaluation methods, see Evaluate model and system for safety.

Create customized safety classifiers

If your policy isn't covered by an off-the-shelf API, or if you want to create your own classifier, parameter-efficient tuning techniques such as prompt tuning and LoRA provide an effective framework. Instead of fine-tuning the whole model, these methods use a limited amount of data to train a small set of important parameters. This allows your model to learn new behaviors, such as how to classify for your novel safety use case, with relatively little training data and compute. This approach lets you develop personalized safety tools for your own users and tasks.

To illustrate how this works, this codelab shows the code needed to set up an "agile classifier": ingesting data, formatting it for the LLM, training LoRA weights, and then evaluating the results. Gemma makes it possible to build these powerful classifiers with only a few lines of code. For a more detailed overview, our research paper "Towards Agile Text Classifiers for Everyone" shows how you can use these techniques to train classifiers for a variety of safety tasks and achieve state-of-the-art performance with only a few hundred training examples.
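The sketch below shows the general shape of this workflow with KerasNLP (LoRA fine-tuning of a Gemma preset on text formatted as comment-plus-label strings); the preset name, prompt format, and hyperparameters are illustrative, and the codelab's exact setup may differ:

import keras
import keras_nlp  # assumes keras-nlp with Gemma presets and model access are set up

# Each labeled example is formatted as a single training string; the prompt format
# and label vocabulary here are illustrative.
train_data = [
    "Comment: I respectfully disagree with this article.\nLabel: not hateful",
    "Comment: <a hateful comment from your dataset>\nLabel: hateful",
]

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")  # preset name is an assumption
gemma_lm.backbone.enable_lora(rank=4)        # train only low-rank adapter weights
gemma_lm.preprocessor.sequence_length = 128  # keep sequences short to limit memory use

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=5e-5),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(train_data, epochs=1, batch_size=1)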

In this example tutorial, you can train a classifier for hate speech using the ETHOS dataset, a publicly available dataset for hateful speech detection built from YouTube and Reddit comments. When trained on the smaller Gemma model with only 200 examples (a little less than a quarter of the dataset), it achieves an F1 score of 0.80 and a ROC-AUC of 0.78. This result compares favorably to the state-of-the-art results reported in this leaderboard. When trained on 800 examples, like the other classifiers in the leaderboard, the Gemma-based agile classifier achieves an F1 score of 83.74 and a ROC-AUC score of 88.17. You can use this classifier out of the box, or adapt it using the Gemma Agile Classifier tutorial.
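Metrics like these are typically computed on a held-out evaluation split; a small illustrative sketch with scikit-learn (the labels and scores below are made up, not the ETHOS results):

from sklearn.metrics import f1_score, roc_auc_score

# y_true: ground-truth labels (1 = hateful, 0 = not) for a held-out evaluation split.
# y_prob: classifier scores for the positive class; y_pred: scores thresholded at 0.5.
y_true = [1, 0, 1, 0, 1]
y_prob = [0.92, 0.30, 0.67, 0.12, 0.55]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print("F1:", f1_score(y_true, y_pred))
print("ROC-AUC:", roc_auc_score(y_true, y_prob))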

Best practices for setting up safeguards

Using safety classifiers is strongly recommended. However, if content is blocked, guardrails can result in the generative model producing nothing for the user, and applications need to be designed to handle this case. Most popular chatbots handle it by providing a canned answer ("I am sorry, I am a language model, I can't help you with this request").

Find the right balance between helpfulness and harmlessness: When using safety classifiers, it is important to understand that they will make mistakes, including both false positives (e.g., claiming an output is unsafe when it is not) and false negatives (failing to label an output as unsafe when it is). By evaluating classifiers with metrics like F1, precision, recall, and AUC-ROC, you can determine how you would like to trade off false positive versus false negative errors. By adjusting classifier thresholds, you can find a balance that avoids over-filtering outputs while still providing appropriate safety.
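One way to explore this trade-off is to sweep candidate thresholds over held-out classifier scores, as in the sketch below (the labels and scores are illustrative):

from sklearn.metrics import precision_recall_curve

y_true = [1, 0, 1, 0, 1, 0, 0, 1]
y_prob = [0.95, 0.40, 0.72, 0.20, 0.60, 0.55, 0.10, 0.85]

# Inspect precision and recall at each candidate threshold before picking an operating point.
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")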

Check your classifiers for unintended biases: Safety classifiers, like any other ML model, can propagate unintended biases, such as socio-cultural stereotypes. Applications need to be appropriately evaluated for potentially problematic behaviors. In particular, content safety classifiers can over-trigger on content related to identities that are more frequently the target of abusive language online. As an example, when the Perspective API was first launched, the model returned higher toxicity scores in comments referencing certain identity groups (blog). This over-triggering behavior can happen because comments that mention identity terms for more frequently targeted groups (e.g., words like "Black", "muslim", "feminist", "woman", "gay", etc.) are more often toxic in nature. When datasets used to train classifiers have significant imbalances for comments containing certain words, classifiers can overgeneralize and consider all comments with those words as being likely to be unsafe. Read how the Jigsaw team mitigated this unintended bias.
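A simple way to probe for this kind of over-triggering is to compare false positive rates on benign comments that do and do not mention identity terms, as in the sketch below (the term list, data format, and classify function are illustrative placeholders):

# examples: (text, is_toxic) pairs from a labeled evaluation set.
# classify: a function mapping text to a predicted label (1 = unsafe, 0 = safe).
IDENTITY_TERMS = ["black", "muslim", "feminist", "woman", "gay"]

def false_positive_rate(examples, classify):
    """Fraction of benign (non-toxic) comments that the classifier flags as unsafe."""
    benign = [text for text, is_toxic in examples if not is_toxic]
    if not benign:
        return 0.0
    return sum(classify(text) for text in benign) / len(benign)

def bias_report(examples, classify):
    with_terms = [e for e in examples if any(t in e[0].lower() for t in IDENTITY_TERMS)]
    without_terms = [e for e in examples if e not in with_terms]
    print("FPR on benign comments with identity terms:   ", false_positive_rate(with_terms, classify))
    print("FPR on benign comments without identity terms:", false_positive_rate(without_terms, classify))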

Developer Resources