Safety settings

This guide describes the adjustable safety settings available for the Gemini API. During the prototyping stage, you can adjust safety settings on four dimensions to quickly assess whether your application requires a more or less restrictive configuration. By default, safety settings block content (including prompts) with a medium or higher probability of being unsafe across any dimension. This baseline is designed to work for most use cases, so you should only adjust your safety settings if your application consistently requires it.

Safety filters

In addition to the adjustable safety filters, the Gemini API has built-in protections against core harms, such as content that endangers child safety. These types of harm are always blocked and cannot be adjusted.

The adjustable safety filters cover the following categories:

  • Harassment
  • Hate speech
  • Sexually explicit
  • Dangerous

These settings allow you, the developer, to determine what is appropriate for your use case. For example, if you're building video game dialogue, you may deem it acceptable to allow more content that's rated as dangerous due to the nature of the game. Here are a few other example use cases that may need some flexibility in these safety settings:

Use case | Category
Anti-Harassment Training App | Hate speech, Sexually explicit
Screenplay Writer | Sexually explicit, Dangerous
Toxicity classifier | Harassment, Dangerous

Probability versus severity

The Gemini API blocks content based on the probability of content being unsafe, not the severity. This is important to consider because some content can have a low probability of being unsafe even though the severity of harm could still be high. For example, compare these two sentences:

  1. The robot punched me.
  2. The robot slashed me up.

Sentence 1 might result in a higher probability of being unsafe, but you might consider sentence 2 to be of higher severity in terms of violence.

Given this, it is important for each developer to carefully test and consider what level of blocking is needed to support their key use cases while minimizing harm to end users.

Safety Settings

Safety settings are part of the request you send to the generative service. The settings can be adjusted for each request you make to the API. The following table lists the categories that you can set and describes the type of harm that each category encompasses.

Category | Description
Harassment | Negative or harmful comments targeting identity and/or protected attributes.
Hate speech | Content that is rude, disrespectful, or profane.
Sexually explicit | Contains references to sexual acts or other lewd content.
Dangerous | Promotes, facilitates, or encourages harmful acts.

These definitions are in the API reference as well. The Gemini models only support HARM_CATEGORY_HARASSMENT, HARM_CATEGORY_HATE_SPEECH, HARM_CATEGORY_SEXUALLY_EXPLICIT, and HARM_CATEGORY_DANGEROUS_CONTENT. The other categories are used by PaLM 2 (Legacy) models.

The following table describes the block settings you can adjust for each category. For example, if you set the block setting to Block few for the Hate speech category, everything that has a high probability of being hate speech is blocked, but anything with a lower probability is allowed.

If not set, the default block setting is Block some for all categories.

Threshold (Google AI Studio) | Threshold (API) | Description
Block none | BLOCK_NONE | Always show regardless of probability of unsafe content
Block few | BLOCK_ONLY_HIGH | Block when high probability of unsafe content
Block some | BLOCK_MEDIUM_AND_ABOVE | Block when medium or high probability of unsafe content
Block most | BLOCK_LOW_AND_ABOVE | Block when low, medium or high probability of unsafe content
(none) | HARM_BLOCK_THRESHOLD_UNSPECIFIED | Threshold is unspecified, block using default threshold

You can set these settings for each request you make to the generative service. See the HarmBlockThreshold API reference for details.
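As a rough illustration of how the thresholds in the table relate to a probability rating, the sketch below models the blocking decision in plain Python. It is not the service's implementation: the names `would_block`, `MIN_BLOCKED_LEVEL`, and `PROBABILITY_ORDER` are hypothetical, and the four probability levels (NEGLIGIBLE, LOW, MEDIUM, HIGH) are assumed to be ordered from least to most likely unsafe.

```python
# Probability levels, assumed to be ordered from least to most likely unsafe.
PROBABILITY_ORDER = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]

# Lowest probability level that each threshold blocks (None means never block).
# This mapping mirrors the threshold table; it is an illustrative model only.
MIN_BLOCKED_LEVEL = {
    "BLOCK_NONE": None,
    "BLOCK_ONLY_HIGH": "HIGH",
    "BLOCK_MEDIUM_AND_ABOVE": "MEDIUM",
    "BLOCK_LOW_AND_ABOVE": "LOW",
}

# "Block some" is the documented default when no threshold is set.
DEFAULT_THRESHOLD = "BLOCK_MEDIUM_AND_ABOVE"


def would_block(probability, threshold="HARM_BLOCK_THRESHOLD_UNSPECIFIED"):
    """Return True if content rated `probability` is blocked under `threshold`."""
    if threshold == "HARM_BLOCK_THRESHOLD_UNSPECIFIED":
        threshold = DEFAULT_THRESHOLD
    floor = MIN_BLOCKED_LEVEL[threshold]
    if floor is None:
        return False
    return PROBABILITY_ORDER.index(probability) >= PROBABILITY_ORDER.index(floor)


print(would_block("MEDIUM", "BLOCK_ONLY_HIGH"))  # "Block few" lets MEDIUM through
print(would_block("MEDIUM"))                     # the default blocks MEDIUM
```

A lookup table keeps the mapping next to the documented names, which makes it easy to compare against the threshold table when testing different configurations.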

Safety feedback

generateContent returns a GenerateContentResponse, which includes safety feedback.

Prompt feedback is included in promptFeedback. If promptFeedback.blockReason is set, then the content of the prompt was blocked.

Response candidate feedback is included in finishReason and safetyRatings. If response content was blocked and the finishReason was SAFETY, you can inspect safetyRatings for more details. The safety rating includes the category and the probability of the harm classification. The content that was blocked is not returned.

The probabilities returned correspond to the block confidence levels shown in the following table:

Probability | Description
NEGLIGIBLE | Content has a negligible probability of being unsafe
LOW | Content has a low probability of being unsafe
MEDIUM | Content has a medium probability of being unsafe
HIGH | Content has a high probability of being unsafe

For example, if the content was blocked due to the harassment category having a high probability, the safety rating returned would have category equal to HARASSMENT and harm probability set to HIGH.
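The feedback fields described above can be inspected with a few dictionary lookups. The following is a minimal sketch that works on a response already decoded into a Python dict. The helper name `explain_safety_block` and the sample data are hypothetical, and the sketch assumes that category values use the full enum names such as HARM_CATEGORY_HARASSMENT.

```python
def explain_safety_block(response):
    """Return a short explanation if safety filters blocked something, else None."""
    # A set blockReason means the prompt itself was blocked.
    block_reason = response.get("promptFeedback", {}).get("blockReason")
    if block_reason:
        return f"prompt blocked: {block_reason}"

    # Otherwise look for a candidate that finished because of safety.
    for candidate in response.get("candidates", []):
        if candidate.get("finishReason") == "SAFETY":
            flagged = [
                f"{rating['category']}={rating['probability']}"
                for rating in candidate.get("safetyRatings", [])
                if rating["probability"] != "NEGLIGIBLE"
            ]
            return "response blocked: " + ", ".join(flagged)
    return None


# Hypothetical response data, shaped like the fields named in this section.
sample = {
    "candidates": [
        {
            "finishReason": "SAFETY",
            "safetyRatings": [
                {"category": "HARM_CATEGORY_HARASSMENT", "probability": "HIGH"},
                {"category": "HARM_CATEGORY_HATE_SPEECH", "probability": "NEGLIGIBLE"},
            ],
        }
    ]
}
print(explain_safety_block(sample))
```

Checking the prompt feedback first matters because a blocked prompt may come back without any candidates to inspect.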

Safety settings in Google AI Studio

You can also adjust safety settings in Google AI Studio, but you cannot turn them off. To do so, in the Run settings, click Edit safety settings, then use the knobs to adjust each setting.

A No Content message appears if the content is blocked. To see more details, hold the pointer over No Content and click Safety.

Code examples

This section shows how to use the safety settings in code using the Python client library.

Request example

The following Python snippet shows how to set safety settings in your GenerateContent call. It sets the Harassment and Hate speech harm categories to BLOCK_LOW_AND_ABOVE, which blocks any content that has a low or higher probability of being harassment or hate speech.

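import google.generativeai as genai
import PIL.Image
from google.generativeai.types import HarmCategory, HarmBlockThreshold

# Assumes the API key is already configured, for example with genai.configure().
# 'cookies.jpg' is a placeholder path for the image referenced in the prompt.
img = PIL.Image.open('cookies.jpg')

model = genai.GenerativeModel(model_name='gemini-1.5-flash')
response = model.generate_content(
    ['Do these look store-bought or homemade?', img],
    # Block anything rated LOW or higher for these two categories.
    safety_settings={
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    }
)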
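If you assemble requests by hand rather than through the client library, the same settings might be expressed as plain data. The sketch below is one possible way to do that, not the client library's own method: the helper name `build_safety_settings` is hypothetical, and the `safetySettings` list of `category` and `threshold` objects is an assumption about the JSON request shape that should be checked against the API reference.

```python
import json

# The four categories that Gemini models support, per the section above.
SUPPORTED_CATEGORIES = {
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
}


def build_safety_settings(overrides):
    """Turn a {category: threshold} mapping into a list of setting objects.

    Hypothetical helper; the output layout is an assumed request shape.
    """
    unsupported = sorted(set(overrides) - SUPPORTED_CATEGORIES)
    if unsupported:
        raise ValueError(f"Unsupported categories: {unsupported}")
    # Sort by category name so the output is deterministic.
    return [
        {"category": category, "threshold": threshold}
        for category, threshold in sorted(overrides.items())
    ]


settings = build_safety_settings({
    "HARM_CATEGORY_HATE_SPEECH": "BLOCK_LOW_AND_ABOVE",
    "HARM_CATEGORY_HARASSMENT": "BLOCK_LOW_AND_ABOVE",
})
print(json.dumps({"safetySettings": settings}, indent=2))
```

Validating category names up front surfaces typos locally instead of in an API error, and categories left out of the mapping keep the default block setting.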

Response example

The following shows a code snippet for parsing the safety feedback from the response.

try:
    print(response.text)
except ValueError:
    # If the response doesn't contain text, check if the prompt was blocked.
    print(response.prompt_feedback)
    # A blocked prompt may return no candidates, so guard before indexing.
    if response.candidates:
        # Also check the finish reason to see if the response was blocked.
        print(response.candidates[0].finish_reason)
        # If the finish reason was SAFETY, the safety ratings have more details.
        print(response.candidates[0].safety_ratings)

Next steps

  • See the API reference to learn more about the full API.
  • Review the safety guidance for a general look at safety considerations when developing with LLMs.
  • Learn more about assessing probability versus severity from the Jigsaw team.
  • Learn more about the products that contribute to safety solutions like the Perspective API.
  • You can use these safety settings to create a toxicity classifier. See the classification example to get started.