Generative AI applications often rely on input and output data filtering, sometimes referred to as safeguards, to help ensure responsible model behavior. Input and output filtering techniques check that the data going into or coming out of the model complies with the policies you define for your application. Input classifiers typically filter content that is not intended for use in your application and that might cause your model to violate your safety policies. Input filters often target adversarial attacks that try to circumvent your content policies. Output classifiers work alongside safety training to further filter model output, catching generated content that may violate your safety policies. It is recommended that your classifiers cover all of your content policies.
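As a minimal sketch of how the two filtering stages fit around a model call, the Python below wraps a generation function with an input check and an output check. The names `guarded_generate`, `GuardedResult`, `toy_filter`, and `toy_model` are hypothetical and exist only for illustration; a real application would call trained classifiers and a real model instead of the keyword check and echo function used here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A classifier here is any callable that returns True when the text
# violates a policy. This signature is an assumption for illustration.
Classifier = Callable[[str], bool]


@dataclass
class GuardedResult:
    text: Optional[str]        # model output, or None if blocked
    blocked_by: Optional[str]  # "input", "output", or None


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    input_filter: Classifier,
    output_filter: Classifier,
) -> GuardedResult:
    """Run the input check, then the model, then the output check."""
    if input_filter(prompt):
        # The prompt never reaches the model.
        return GuardedResult(text=None, blocked_by="input")
    output = generate(prompt)
    if output_filter(output):
        # The model ran, but its output is withheld.
        return GuardedResult(text=None, blocked_by="output")
    return GuardedResult(text=output, blocked_by=None)


# Toy stand-ins: a keyword check and an echoing "model".
def toy_filter(text: str) -> bool:
    return "forbidden" in text.lower()


def toy_model(prompt: str) -> str:
    return f"Echo: {prompt}"


result = guarded_generate("hello", toy_model, toy_filter, toy_filter)
print(result)
```

Recording which stage blocked the content (`blocked_by`) is a design choice that makes it easier to review input and output filtering separately later.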
Ready-made safeguards
Even with prior tuning for safety and a well-designed prompt template, it is still possible for your model to output content that results in unintended harm. Ready-made content classifiers add a layer of protection that further reduces this risk for certain types of policy violations.
ShieldGemma
ShieldGemma is a set of ready-made, instruction-tuned, open-weights content classifier models, built on Gemma 2, that can determine whether user-provided, model-generated, or mixed content violates a content safety policy. ShieldGemma is trained to identify four harms (sexual content, dangerous content, harassment, and hate speech) and comes in three size variants (2B, 9B, and 27B parameters) that let you balance speed, performance, and generalizability for your deployment. See the model card for more about the differences between these variants.
Safeguard your models with ShieldGemma
You can use ShieldGemma models in the following frameworks.
- KerasNLP, with model checkpoints available from Kaggle. Check out the ShieldGemma in Keras Colab to get started.
- Hugging Face Transformers, with model checkpoints available from Hugging Face Hub. Check out the ShieldGemma in Transformers Colab to get started.
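One way to turn a ShieldGemma verdict into a score is sketched below, under assumptions. The classifier is prompted to answer "Yes" or "No" to whether the content violates a policy, and the logits for those two tokens at the first generated position are converted into a probability with a softmax. This follows our reading of the model card rather than anything stated on this page, so treat the approach, the function names `yes_probability` and `violates`, and the 0.5 threshold as assumptions. Obtaining the logits from KerasNLP or Transformers is left out; the Colabs cover that part.

```python
import math


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Softmax over the two logits, returning P("Yes"), read as P(violation)."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def violates(yes_logit: float, no_logit: float, threshold: float = 0.5) -> bool:
    """Flag content when P("Yes") meets the (hypothetical) threshold."""
    return yes_probability(yes_logit, no_logit) >= threshold


# Made-up logits, for illustration only.
print(yes_probability(2.0, -1.0))
print(violates(-1.0, 2.0))
```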
API-based
Google provides API-based classifiers for content safety that can be used to filter system inputs and outputs:
- The Perspective API is a free API that uses machine learning models to score the perceived impact a comment might have on a conversation. It provides scores that capture the probability that a comment is toxic, threatening, insulting, or off-topic.
- The Text Moderation Service is a Google Cloud API that uses machine learning to analyze a document against a list of safety attributes, including various potentially harmful categories and topics that may be considered sensitive. It is available to use below a certain usage limit.
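As an illustration of working with an API-based classifier, the sketch below builds a Perspective API request body and reads the summary scores from a response. The helper names `build_request` and `summary_scores`, the hand-written `sample_response`, and the 0.8 cutoff are assumptions for illustration; no network call is made here.

```python
import json
from typing import Dict, Iterable

ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text: str, attributes: Iterable[str]) -> str:
    """Serialize a comment analysis request body as JSON."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {name: {} for name in attributes},
    }
    return json.dumps(body)


def summary_scores(response: dict) -> Dict[str, float]:
    """Extract the summary score for each attribute in a response."""
    return {
        name: data["summaryScore"]["value"]
        for name, data in response.get("attributeScores", {}).items()
    }


# A hand-written sample response, not real API output.
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.91, "type": "PROBABILITY"}},
        "INSULT": {"summaryScore": {"value": 0.40, "type": "PROBABILITY"}},
    }
}
scores = summary_scores(sample_response)
# 0.8 is a hypothetical cutoff; choose yours by evaluating against your policy.
flagged = [name for name, value in scores.items() if value >= 0.8]
print(flagged)
```

In a real integration, the body from `build_request` would be sent as a POST request to `ANALYZE_URL` with your API key passed as the `key` query parameter.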
It's important to evaluate how well ready-made classifiers meet your policy goals and to qualitatively evaluate the failure cases. Over-filtering can also cause unintended harm and reduce the utility of the application, so review the cases where over-filtering may be happening as well. For more details on such evaluation methods, see Evaluate model and system for safety.
Create customized safety classifiers
There are several reasons that a ready-made safeguard might not be a good fit for your use case, such as having a policy that isn't supported, or wanting to further tune your safeguard with data you've observed affecting your system. In these cases, agile classifiers provide an efficient and flexible framework for creating custom safeguards by tuning models, such as Gemma, to fit your needs. They also give you complete control over where and how they are deployed.
Gemma Agile Classifier Tutorials
The agile classifiers codelab and tutorial use LoRA to fine-tune a Gemma model to act as a content moderation classifier using the KerasNLP library. Using only 200 examples from the ETHOS dataset, this classifier achieves an F1 score of 0.80 and a ROC-AUC score of 0.78, which compares favorably to state-of-the-art leaderboard results. When trained on 800 examples, like the other classifiers on the leaderboard, the Gemma-based agile classifier achieves an F1 score of 83.74 and a ROC-AUC score of 88.17 (reported on a 0 to 100 scale). You can adapt the tutorial instructions to further refine this classifier, or to create your own custom safety classifiers.
Best practices for setting up safeguards
Using safety classifiers as safeguards is strongly recommended. However, when a safeguard blocks content, the generative model may return nothing to the user. Applications need to be designed to handle this case. Most popular chatbots handle it by providing canned answers (for example, "I am sorry, I am a language model, I can't help you with this request").
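Handling the blocked case can be as simple as mapping it to a canned answer. The sketch below assumes the application already knows whether content was blocked and at which stage; the function `reply_for_user`, the `blocked_by` values, and the message strings are all hypothetical.

```python
from typing import Optional

# Hypothetical fallback messages keyed by the stage that blocked the content.
FALLBACKS = {
    "input": "Sorry, I can't help with that request.",
    "output": "Sorry, I couldn't produce a response for that. Please try rephrasing.",
}
DEFAULT_FALLBACK = "Sorry, something went wrong. Please try again."


def reply_for_user(model_text: Optional[str], blocked_by: Optional[str]) -> str:
    """Return the model text, or a canned message if it was blocked or empty."""
    if blocked_by is None and model_text:
        return model_text
    return FALLBACKS.get(blocked_by, DEFAULT_FALLBACK)


print(reply_for_user("Here is your answer.", None))
print(reply_for_user(None, "input"))
```

Falling back to a default message for empty output or an unrecognized stage means the user always receives some reply.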
Find the right balance between helpfulness and harmlessness: When using safety classifiers, it is important to understand that they will make mistakes, including both false positives (e.g. claiming an output is unsafe when it is not) and false negatives (failing to label an output as unsafe when it is). By evaluating classifiers with metrics like F1, Precision, Recall, and AUC-ROC, you can determine how you would like to trade off false positive versus false negative errors. By adjusting the classifier threshold, you can find a balance that avoids over-filtering outputs while still providing appropriate safety.
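The effect of the threshold can be seen in a small worked example. The sketch below computes precision, recall, and F1 at several thresholds; the scores and labels are made up for illustration, and the function name `precision_recall_f1` is our own.

```python
from typing import List, Tuple


def precision_recall_f1(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float, float]:
    """Treat score >= threshold as a prediction of "unsafe" (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up scores and ground-truth labels (1 = unsafe), for illustration only.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.25, 0.50, 0.75):
    p, r, f1 = precision_recall_f1(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 removes the false positives (less over-filtering) but misses half of the unsafe examples, which is the trade-off described above.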
Check your classifiers for unintended biases: Safety classifiers, like any other ML model, can propagate unintended biases, such as socio-cultural stereotypes. Applications need to be appropriately evaluated for potentially problematic behaviors. In particular, content safety classifiers can over-trigger on content related to identities that are more frequently the target of abusive language online. As an example, when the Perspective API was first launched, the model returned higher toxicity scores for comments referencing certain identity groups (blog). This over-triggering behavior can happen because comments that mention identity terms for more frequently targeted groups (e.g., words like "Black", "muslim", "feminist", "woman", "gay") are more often toxic in nature. When datasets used to train classifiers have significant imbalances for comments containing certain words, classifiers can overgeneralize and treat all comments with those words as likely to be unsafe. Read how the Jigsaw team mitigated this unintended bias.
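One simple check for over-triggering is to compare false positive rates on benign comments that mention an identity term against benign comments that do not. The sketch below is an assumption-laden illustration, not the Jigsaw team's method: the data is made up, the substring matching is deliberately naive, and the names `false_positive_rate` and `fpr_by_subgroup` are hypothetical.

```python
from typing import Dict, List, Tuple

# (comment, is_toxic ground truth, classifier score). Made-up data.
Example = Tuple[str, bool, float]


def false_positive_rate(examples: List[Example], threshold: float) -> float:
    """Share of non-toxic examples that the classifier flags anyway."""
    benign = [e for e in examples if not e[1]]
    if not benign:
        return 0.0
    flagged = sum(1 for e in benign if e[2] >= threshold)
    return flagged / len(benign)


def fpr_by_subgroup(
    examples: List[Example], terms: List[str], threshold: float
) -> Dict[str, float]:
    """False positive rate for comments mentioning each term, plus a baseline."""
    result = {}
    for term in terms:
        subset = [e for e in examples if term.lower() in e[0].lower()]
        result[term] = false_positive_rate(subset, threshold)
    background = [
        e for e in examples
        if not any(t.lower() in e[0].lower() for t in terms)
    ]
    result["(no listed term)"] = false_positive_rate(background, threshold)
    return result


examples = [
    ("I am a gay man and I love this film", False, 0.72),
    ("Proud to be a feminist", False, 0.35),
    ("Great weather today", False, 0.05),
    ("This recipe is terrible", False, 0.30),
    ("You are an idiot", True, 0.93),
]
report = fpr_by_subgroup(examples, ["gay", "feminist"], threshold=0.5)
print(report)
```

A subgroup whose false positive rate is well above the baseline is a signal to inspect the training data balance for that term. Real evaluations need far more examples per subgroup than this toy set.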
Developer Resources
- Perspective API: To identify toxic content.
- Text moderation service: For Google Cloud customers.