Safety settings

Overview

This guide describes the PaLM API adjustable safety settings available for the text service. During the prototyping stage, you can adjust safety settings across six dimensions to quickly assess whether your application requires a more or less restrictive configuration. By default, safety settings block content with a medium and/or high probability of being unsafe across all six dimensions. This baseline safety is designed to work for most use cases, so you should only adjust your safety settings if doing so is consistently required for your application.

Safety filters

In addition to the adjustable safety filters, the PaLM API has built-in protections against core harms, such as content that endangers child safety. These types of harm are always blocked and cannot be adjusted.

The adjustable safety filters cover the following categories:

  • Derogatory
  • Toxic
  • Sexual
  • Violent
  • Medical
  • Dangerous

These settings allow you, the developer, to determine what is appropriate for your use case. For example, if you're building video game dialogue, you may deem it acceptable to allow more content that's rated as violent or dangerous due to the nature of the game; a brief sketch follows the table below. Here are a few other example use cases that may need some flexibility in these safety settings:

Use Case                       Category
Anti-Harassment Training App   Derogatory, Sexual, Toxic
Medical Exam Study Pal         Medical
Screenplay Writer              Violent, Sexual, Medical, Dangerous
Toxicity classifier            Toxic, Derogatory
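
For instance, a video game dialogue generator could relax the Violent and Dangerous categories to Block few while keeping the defaults for everything else. The following is a minimal sketch; it assumes the Python client library shown in the Code examples section, and the resulting list is what you would pass as the safety_settings argument of your GenerateText call:

from google.generativeai.types import safety_types

safety_settings = [
    # Relax Violent and Dangerous to "Block few": only content with a HIGH
    # probability of being unsafe in these categories is blocked.
    {
        "category": safety_types.HarmCategory.HARM_CATEGORY_VIOLENCE,
        "threshold": safety_types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    },
    {
        "category": safety_types.HarmCategory.HARM_CATEGORY_DANGEROUS,
        "threshold": safety_types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    },
]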

Probability vs severity

The PaLM API blocks content based on the probability of the content being unsafe, not its severity. This is important to consider because some content can have a low probability of being unsafe even though its severity of harm could still be high. For example, compare the following sentences:

  1. The robot punched me.
  2. The robot slashed me up.

Sentence 1 might result in a higher probability of being unsafe, but you might consider sentence 2 to be of higher severity in terms of violence.

Given this, it is important for each developer to carefully test and consider what level of blocking is needed to support their key use cases while minimizing harm to end users.

Safety settings

Safety settings are part of the request you send to the text service. They can be adjusted for each request you make to the API. The following table lists the categories that you can set and describes the type of harm that each category encompasses.

Category     Description
Derogatory   Negative or harmful comments targeting identity and/or protected attributes.
Toxic        Content that is rude, disrespectful, or profane.
Sexual       Contains references to sexual acts or other lewd content.
Violent      Describes scenarios depicting violence against an individual or group, or general descriptions of gore.
Dangerous    Promotes, facilitates, or encourages harmful acts.
Medical      Content that is related to medical topics.

You can see these definitions in the API reference as well.

The following table describes the block settings you can adjust for each category. For example, if you set the block setting to Block few for the Derogatory category, everything that has a high probability of being derogatory content is blocked, but anything with a lower probability is allowed.

If not set, the default block setting is Block some or Block most depending on the policy category.

Threshold (Google AI Studio)       Threshold (API)                    Description
Block none                         BLOCK_NONE                         Always show regardless of probability of unsafe content
Block few                          BLOCK_ONLY_HIGH                    Block when high probability of unsafe content
Block some                         BLOCK_MEDIUM_AND_ABOVE             Block when medium or high probability of unsafe content (default for Sexual, Violent, Dangerous, and Medical)
Block most                         BLOCK_LOW_AND_ABOVE                Block when low, medium, or high probability of unsafe content (default for Derogatory and Toxic)
(no Google AI Studio equivalent)   HARM_BLOCK_THRESHOLD_UNSPECIFIED   Threshold is unspecified; block using the default threshold

You can set these settings for each request you make to the text service. See the HarmBlockThreshold API reference for details.
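
As a minimal sketch of this per-request setting, the "Block few for Derogatory" example above maps to BLOCK_ONLY_HIGH in the Python client library (model and prompt are placeholders for your own values; see the Code examples section for the full request):

import google.generativeai as genai
from google.generativeai.types import safety_types

completion = genai.generate_text(
    model=model,
    prompt=prompt,
    safety_settings=[
        # "Block few" for Derogatory: only content with a HIGH probability
        # of being derogatory is blocked for this request.
        {
            "category": safety_types.HarmCategory.HARM_CATEGORY_DEROGATORY,
            "threshold": safety_types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
        },
    ],
)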

Safety feedback

If content was blocked, the response from the API contains the reason it was blocked in the ContentFilter.reason field. If the reason was related to safety, then the response also contains a SafetyFeedback field which includes the safety settings that were used for that request as well as a safety rating. The safety rating includes the category and the probability of the harm classification. The content that was blocked is not returned.

The probability returned corresponds to the block confidence levels shown in the following table:

Probability   Description
NEGLIGIBLE    Content has a negligible probability of being unsafe
LOW           Content has a low probability of being unsafe
MEDIUM        Content has a medium probability of being unsafe
HIGH          Content has a high probability of being unsafe

For example, if the content was blocked because the toxicity category had a high probability, the safety rating returned would have the category set to TOXICITY and the harm probability set to HIGH.
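
As a sketch of how you might act on this in the Python client, assuming the request from the Code examples section below and that each rating exposes its category and probability as dictionary keys (as described above):

for feedback in completion.safety_feedback:
    rating = feedback["rating"]
    # Example check: was the content rated as a high probability of toxicity?
    if (rating["category"] == safety_types.HarmCategory.HARM_CATEGORY_TOXICITY
            and rating["probability"] == safety_types.HarmProbability.HIGH):
        print("Blocked: high probability of toxic content.")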

Safety settings in Google AI Studio

You can set these settings in Google AI Studio as well. In the Run settings, click Edit safety settings, and use the knobs to adjust each setting.

A No Content message appears if the content is blocked. To see more details, hold the pointer over No Content and click Safety.

Code examples

This section shows how to use the safety settings in code using the Python client library.

Request example

The following is a Python code snippet showing how to set safety settings in your GenerateText call. This sets the harm categories Derogatory and Violent to BLOCK_LOW_AND_ABOVE, which blocks any content that has a low or higher probability of being derogatory or violent.

import google.generativeai as genai
from google.generativeai.types import safety_types

completion = genai.generate_text(
    model=model,
    prompt=prompt,
    safety_settings=[
        # Block content with a low or higher probability of being derogatory.
        {
            "category": safety_types.HarmCategory.HARM_CATEGORY_DEROGATORY,
            "threshold": safety_types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        },
        # Block content with a low or higher probability of being violent.
        {
            "category": safety_types.HarmCategory.HARM_CATEGORY_VIOLENCE,
            "threshold": safety_types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        },
    ],
)
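
If nothing is blocked, you can read the top generated candidate from the completion object; in the Python client this is the result field:

# Top candidate text (None if the prompt or response was blocked).
print(completion.result)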

Response example

The following code snippet shows how to parse the safety feedback from the response. Note that the safety feedback will be empty unless the reason for blocking was one of the safety dimensions.

# First check the content filter reason.
for content_filter in completion.filters:
    print(content_filter["reason"])

# If any of the reasons is "safety", then the safety_feedback field will be
# populated.
for feedback in completion.safety_feedback:
    print(feedback["rating"])
    print(feedback["setting"])

Next steps

  • See the API reference to learn more about the full API.
  • Review the safety guidance for a general look at safety considerations when developing with LLMs.
  • Learn more about assessing probability versus severity from the Jigsaw team.
  • Learn more about the products that contribute to safety solutions like the Perspective API.
  • You can use these safety settings to create a toxicity classifier. See the classification example to get started.