Create input and output safeguards

Generative AI applications often rely on two input and output techniques to help ensure responsible model behavior:

  1. Prompt templates, sometimes referred to as system prompts, that provide additional context for the model to condition its behavior.
  2. Input/Output filtering, sometimes referred to as safeguards, which check the data going into or coming out of the model.

Prompt templates

Prompt templates provide textual context to a user's input. This technique typically includes additional instructions to guide the model toward safer and better outcomes. For example, if your objective is high quality summaries of technical scientific publications, you may find it helpful to use a prompt template like:

The following examples show an expert scientist summarizing the key points of an
article. Article: {{article}}

Where {{article}} is a placeholder for the article being summarized.
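As a minimal sketch of how such a template might be filled in application code, the snippet below substitutes an article into the placeholder. The function name `render_prompt` and the sample article text are illustrative assumptions, not part of any particular SDK.

```python
# The template text comes from the example above; the helper is hypothetical.
TEMPLATE = (
    "The following examples show an expert scientist summarizing the key "
    "points of an article. Article: {{article}}"
)


def render_prompt(template: str, article: str) -> str:
    """Substitute the article text for the {{article}} placeholder."""
    return template.replace("{{article}}", article)


# Placeholder article text, used only for illustration.
prompt = render_prompt(TEMPLATE, "Example article text.")
print(prompt)
```

Plain string replacement is used here so that braces inside the article text are left untouched.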

These sorts of contextual templates for prompts can significantly improve the quality and safety of your model's output. They can also be used to mitigate unintended biases in your application's behavior. However, writing prompt templates can be challenging and requires creativity, experience and a significant amount of iteration. There are many prompting guides available, including the Gemini API Introduction to prompt design.

Prompt templates can be quickly written and adapted but they typically provide less control over the model's output compared to tuning. Prompt templates are usually more susceptible to unintended outcomes from adversarial inputs. This is because slight variations in prompts can produce different responses and the effectiveness of a prompt is also likely to vary between models. To accurately understand how well a prompt template is performing toward a desired safety outcome, it is important to use an evaluation dataset that wasn't also used in the development of the template.
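One way to keep evaluation data separate from the data used while iterating on a template is to split it once, up front. The sketch below is an assumption about how that could look; the function name, the 30% held-out fraction, and the example prompts are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_dev_and_heldout(
    examples: Sequence[str], heldout_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle with a fixed seed and return (development set, held-out set)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_heldout = int(len(shuffled) * heldout_fraction)
    return shuffled[n_heldout:], shuffled[:n_heldout]


# Made-up prompts standing in for a real evaluation dataset.
prompts = [f"example prompt {i}" for i in range(10)]
dev_set, heldout_set = split_dev_and_heldout(prompts)

# Iterate on the template using dev_set only; score it on heldout_set.
print(len(dev_set), len(heldout_set))
```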

In some applications, like a generalist chatbot, user inputs can vary considerably and touch upon a wide range of topics. To further refine your prompt template, you can adapt the guidance and additional instructions based on the types of user inputs. This requires you to train a model that can label the user's input and to create a dynamic prompt template that is adapted based on the label.
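A dynamic prompt template of this kind could be sketched as follows. The labels, the instruction text, and the keyword-based `keyword_labeler` are hypothetical; the labeler is only a stand-in for the trained model described above.

```python
from typing import Callable, Dict

# Hypothetical labels and instructions; a real application would define its own.
TEMPLATES: Dict[str, str] = {
    "medical": (
        "Answer cautiously and recommend consulting a professional. "
        "User: {user_input}"
    ),
    "general": "Answer helpfully and concisely. User: {user_input}",
}


def keyword_labeler(user_input: str) -> str:
    """Stand-in for a trained input classifier (simple keyword match)."""
    medical_terms = ("dose", "symptom", "medication")
    text = user_input.lower()
    return "medical" if any(term in text for term in medical_terms) else "general"


def build_prompt(
    user_input: str, labeler: Callable[[str], str] = keyword_labeler
) -> str:
    """Pick the template that matches the input's label, then fill it in."""
    label = labeler(user_input)
    template = TEMPLATES.get(label, TEMPLATES["general"])
    return template.format(user_input=user_input)


print(build_prompt("What dose is safe?"))
print(build_prompt("Tell me a joke"))
```

Passing the labeler as a parameter makes it straightforward to swap the keyword match for a trained classifier later.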

Safeguards and off-the-shelf safety classifiers

Even with prior tuning for safety and a well designed prompt template, it is still possible for your model to output content that results in unintended harm. To further ameliorate this, content classifiers can add an additional layer of protection. Content classifiers can be applied to both inputs and outputs.

Input classifiers are typically used to filter content that is not intended to be used in your application and which might cause your model to violate your safety policies. Input filters often target adversarial attacks that try to circumvent your content policies. Output classifiers can further filter model output, catching unintended generations that may violate your safety policies. It is recommended to have classifiers that cover all your content policies.
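The flow of input and output filtering around a model call can be sketched as below. All names here (`GuardedResult`, `guarded_generate`, and the toy model and classifier) are hypothetical; in practice the two checks would be real classifiers covering your content policies.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GuardedResult:
    text: Optional[str]        # model output, or None when blocked
    blocked_at: Optional[str]  # "input", "output", or None


def guarded_generate(
    user_input: str,
    generate: Callable[[str], str],
    input_is_unsafe: Callable[[str], bool],
    output_is_unsafe: Callable[[str], bool],
) -> GuardedResult:
    """Check the input, call the model, then check the output."""
    if input_is_unsafe(user_input):
        return GuardedResult(None, "input")
    output = generate(user_input)
    if output_is_unsafe(output):
        return GuardedResult(None, "output")
    return GuardedResult(output, None)


# Toy stand-ins for a model and a safety classifier.
def echo_model(prompt: str) -> str:
    return f"You said: {prompt}"


def contains_blocked_term(text: str) -> bool:
    return "forbidden" in text.lower()


result = guarded_generate(
    "hello", echo_model, contains_blocked_term, contains_blocked_term
)
print(result)
```

Recording the stage at which content was blocked helps when reviewing classifier failure cases later.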

Google has developed off-the-shelf classifiers for content safety that can be used to filter inputs and outputs:

  • The Perspective API is a free API that uses machine learning models to score the perceived impact a comment might have on a conversation. It provides scores that capture the probability of whether a comment is toxic, threatening, insulting, off-topic, etc.
  • The Text moderation service is a Google Cloud API that is available to use below a certain usage limit and uses machine learning to analyze a document against a list of safety attributes, including various potentially harmful categories and topics that may be considered sensitive.
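As a rough sketch of working with the Perspective API, the helpers below build a request body and read scores out of a response. The endpoint and JSON field names reflect the API's public documentation as understood at the time of writing and should be verified before use; the helper names and the sample response are hand-written assumptions, and no network call is made.

```python
import json
from typing import Dict, List

# Assumed endpoint; an API key is passed as a `key` query parameter.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text: str, attributes: List[str]) -> str:
    """Serialize a comment and the attributes to score as a JSON body."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {name: {} for name in attributes},
    }
    return json.dumps(body)


def extract_scores(response: Dict) -> Dict[str, float]:
    """Map each returned attribute name to its summary score."""
    return {
        name: data["summaryScore"]["value"]
        for name, data in response.get("attributeScores", {}).items()
    }


# Hand-written sample in the assumed response shape, not real API output.
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.12, "type": "PROBABILITY"}}
    }
}

print(build_request("Example comment.", ["TOXICITY"]))
print(extract_scores(sample_response))
```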

It's important to evaluate how well off-the-shelf classifiers meet your policy goals, and to qualitatively evaluate the failure cases. Over-filtering can also cause unintended harm and reduce the utility of the application, so review the cases where it may be happening as well. For more details on such evaluation methods, see Evaluate model and system for safety.

Create customized safety classifiers

If your policy isn't covered by an off-the-shelf API or if you want to create your own classifier, parameter efficient tuning techniques such as prompt-tuning and LoRA provide an effective framework. In these methods, instead of fine-tuning the whole model, you can use a limited amount of data to train a small set of important parameters of the model. This allows your model to learn new behaviors, like how to classify for your novel safety use-case, with relatively little training data and compute power. This approach lets you develop personalized safety tools for your own users and tasks.
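To give a sense of why LoRA needs so few trainable parameters: it keeps a weight matrix W frozen and learns a low-rank update B·A, where B is d_out × r and A is r × d_in. The sketch below only counts parameters; the layer dimensions and rank are illustrative assumptions, not the sizes of any particular model.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix W is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 2048, 2048, 4
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```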

To illustrate how this works, this codelab shows the code needed to set up an "agile classifier." The codelab shows the steps of ingesting data, formatting it for the LLM, training LoRA weights, and then evaluating your results. Gemma makes it possible to build these powerful classifiers with only a few lines of code. For a more detailed overview, our research paper "Towards Agile Text Classifiers for Everyone" shows how you can use these techniques to train a variety of safety tasks to achieve state of the art performance with only a few hundred training examples.

In this example tutorial, you can train a classifier for hate speech using the ETHOS dataset, a publicly available dataset for detection of hateful speech, built from YouTube and Reddit comments. When trained on the smaller Gemma model with only 200 examples (a little less than ¼ of the dataset), it achieves an F1 score of 0.80 and a ROC-AUC of 0.78. This result compares favorably to the state-of-the-art results reported in this leaderboard. When trained on 800 examples, like the other classifiers in the leaderboard, the Gemma-based agile classifier achieves an F1 score of 83.74 and a ROC-AUC score of 88.17 (both on a 0 to 100 scale). You can use this classifier out of the box, or adapt it using the Gemma Agile Classifier tutorial.

Best practices for setting up safeguards

Using safety classifiers is strongly recommended. However, when a guardrail blocks content, the generative model may produce nothing for the user. Applications need to be designed to handle this case. Most popular chatbots handle this by providing canned answers ("I am sorry, I am a language model, I can't help you with this request").
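Handling the blocked case can be as simple as the sketch below. The `respond` function and the convention that `None` signals blocked content are assumptions made for illustration; the fallback text is the canned answer quoted above.

```python
from typing import Optional

FALLBACK_MESSAGE = (
    "I am sorry, I am a language model, I can't help you with this request"
)


def respond(model_output: Optional[str]) -> str:
    """Return the model output, or a canned answer when it was blocked or empty."""
    if model_output is None or not model_output.strip():
        return FALLBACK_MESSAGE
    return model_output


print(respond("Here is a summary of the article."))
print(respond(None))
```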

Find the right balance between helpfulness and harmlessness: When using safety classifiers, it is important to understand that they will make mistakes, including both false positives (e.g. claiming an output is unsafe when it is not) and false negatives (failing to label an output as unsafe when it is). By evaluating classifiers with metrics like F1, Precision, Recall, and AUC-ROC, you can determine how you would like to trade off false positive versus false negative errors. By adjusting the classifier's threshold, you can find a balance that avoids over-filtering outputs while still providing appropriate safety.
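The effect of the threshold on these metrics can be seen in a small sketch. The scores and labels below are made up for illustration, with `True` meaning the example is actually unsafe; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float, float]:
    """Flag an example as unsafe when its score is at or above the threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Made-up classifier scores and ground-truth labels (True = unsafe).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]

for threshold in (0.25, 0.5, 0.75):
    p, r, f = precision_recall_f1(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy data, raising the threshold removes false positives (less over-filtering) at the cost of missing an unsafe example.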

Check your classifiers for unintended biases: Safety classifiers, like any other ML model, can propagate unintended biases, such as socio-cultural stereotypes. Applications need to be appropriately evaluated for potentially problematic behaviors. In particular, content safety classifiers can over-trigger on content related to identities that are more frequently the target of abusive language online. As an example, when the Perspective API was first launched, the model returned higher toxicity scores for comments referencing certain identity groups (blog). This over-triggering can happen because comments that mention identity terms for more frequently targeted groups (e.g., words like "Black", "muslim", "feminist", "woman", "gay") are more often toxic in nature. When datasets used to train classifiers have significant imbalances for comments containing certain words, classifiers can overgeneralize and treat all comments with those words as likely to be unsafe. Read how the Jigsaw team mitigated this unintended bias.
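One simple check for this kind of over-triggering is to compare false positive rates between safe comments that mention identity terms and safe comments that do not. The sketch below uses hypothetical group names and made-up scores, and it is not the method used by the Jigsaw team.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def false_positive_rate_by_group(
    examples: Iterable[Tuple[str, float, bool]], threshold: float
) -> Dict[str, float]:
    """Per-group share of safe examples that the classifier flags as unsafe.

    Each example is (group, classifier score, is_actually_unsafe).
    """
    flagged: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, score, is_unsafe in examples:
        if is_unsafe:
            continue  # false positives are only defined on safe examples
        total[group] += 1
        if score >= threshold:
            flagged[group] += 1
    return {group: flagged[group] / total[group] for group in total}


# Made-up evaluation data for illustration only.
examples = [
    ("mentions_identity_term", 0.7, False),
    ("mentions_identity_term", 0.3, False),
    ("no_identity_term", 0.2, False),
    ("no_identity_term", 0.4, False),
    ("no_identity_term", 0.9, True),
]
rates = false_positive_rate_by_group(examples, threshold=0.5)
print(rates)
```

A large gap between the groups' rates suggests the classifier is keying on the identity terms rather than on the content being unsafe.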

Developer Resources