Overview
This guide describes the PaLM API adjustable safety settings available for the text service. During the prototyping stage, you can adjust safety settings on six dimensions to quickly assess whether your application requires a more or less restrictive configuration. By default, safety settings block content with a medium or high probability of being unsafe across all six dimensions. This baseline is designed to work for most use cases, so you should only adjust your safety settings if it's consistently required for your application.
Safety filters
In addition to the adjustable safety filters, the PaLM API has built-in protections against core harms, such as content that endangers child safety. These types of harm are always blocked and cannot be adjusted.
The adjustable safety filters cover the following categories:
- Derogatory
- Toxic
- Sexual
- Violent
- Medical
- Dangerous
These settings allow you, the developer, to determine what is appropriate for your use case. For example, if you're building video game dialogue, you may deem it acceptable to allow more content that's rated as violent or dangerous due to the nature of the game (see the sketch after the following table). Here are a few other example use cases that may need some flexibility in these safety settings:
Use Case | Categories |
---|---|
Anti-Harassment Training App | Derogatory, Sexual, Toxic |
Medical Exam Study Pal | Medical |
Screenplay Writer | Violent, Sexual, Medical, Dangerous |
Toxicity classifier | Toxic, Derogatory |
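For instance, for a game-dialogue use case like the one above, you might relax only the violent and dangerous categories and leave everything else at the defaults. The following is a minimal sketch using the Python client library and the same safety_settings structure as the request example later in this guide; the model name and prompt are placeholders, and the HARM_CATEGORY_DANGEROUS member name and the completion.result field are assumptions you should confirm against the API reference.

import google.generativeai as genai
from google.generativeai.types import safety_types

# Allow more violent or dangerous content by blocking only when the
# probability of unsafe content is high.
completion = genai.generate_text(
    model="models/text-bison-001",  # placeholder model name
    prompt="Write a threatening line for the villain of a fantasy game.",
    safety_settings=[
        {
            "category": safety_types.HarmCategory.HARM_CATEGORY_VIOLENCE,
            "threshold": safety_types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
        },
        {
            "category": safety_types.HarmCategory.HARM_CATEGORY_DANGEROUS,
            "threshold": safety_types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
        },
    ],
)
print(completion.result)  # assumed convenience field; blocked content is not returned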
Probability vs severity
The PaLM API blocks content based on the probability of content being unsafe and not the severity. This is important to consider because some content can have low probability of being unsafe even though the severity of harm could still be high. For example, comparing the sentences:
- The robot punched me.
- The robot slashed me up.
Sentence 1 might result in a higher probability of being unsafe but you might consider sentence 2 to be a higher severity in terms of violence.
Given this, it is important for each developer to carefully test and consider what level of blocking is needed to support their key use cases while minimizing harm to end users.
Safety settings
Safety settings are part of the request you send to the text service. They can be adjusted for each request you make to the API. The following table lists the categories that you can set and describes the type of harm that each category encompasses.
Categories | Descriptions |
---|---|
Derogatory | Negative or harmful comments targeting identity and/or protected attributes. |
Toxic | Content that is rude, disrespectful, or profane. |
Sexual | Contains references to sexual acts or other lewd content. |
Violent | Describes scenarios depicting violence against an individual or group, or general descriptions of gore. |
Dangerous | Promotes, facilitates, or encourages harmful acts. |
Medical | Content that is related to medical topics. |
You can see these definitions in the API reference as well.
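In the Python client library, these categories are expected to be exposed as members of a HarmCategory enum in the safety_types module, following the naming pattern used in the request example later in this guide. The exact member names below are assumptions; confirm them in the API reference.

from google.generativeai.types import safety_types

# The six adjustable categories, assuming the naming pattern from the
# request example below (verify against the API reference).
ADJUSTABLE_CATEGORIES = [
    safety_types.HarmCategory.HARM_CATEGORY_DEROGATORY,
    safety_types.HarmCategory.HARM_CATEGORY_TOXICITY,
    safety_types.HarmCategory.HARM_CATEGORY_SEXUAL,
    safety_types.HarmCategory.HARM_CATEGORY_VIOLENCE,
    safety_types.HarmCategory.HARM_CATEGORY_MEDICAL,
    safety_types.HarmCategory.HARM_CATEGORY_DANGEROUS,
]

for category in ADJUSTABLE_CATEGORIES:
    print(category.name)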
The following table describes the block settings you can adjust for each category. For example, if you set the block setting to Block few for the Derogatory category, everything that has a high probability of being derogatory content is blocked. But anything with a lower probability is allowed.
If not set, the default block setting is Block some or Block most depending on the policy category.
Threshold (Google AI Studio) | Threshold (API) | Description |
---|---|---|
Block none | BLOCK_NONE | Always show regardless of probability of unsafe content |
Block few | BLOCK_ONLY_HIGH | Block when high probability of unsafe content |
Block some (Default for sexual, violent, dangerous and medical) | BLOCK_MEDIUM_AND_ABOVE | Block when medium or high probability of unsafe content |
Block most (Default for derogatory and toxic) | BLOCK_LOW_AND_ABOVE | Block when low, medium or high probability of unsafe content |
N/A | HARM_BLOCK_THRESHOLD_UNSPECIFIED | Threshold is unspecified; block using the default threshold |
You can set these settings for each request you make to the text service. See the HarmBlockThreshold API reference for details.
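As a purely illustrative sketch, the defaults in the table can be restated as an explicit safety_settings list. You would not normally need to do this, because the service applies these defaults whenever you don't pass safety_settings; the enum member names are assumptions based on the request example later in this guide.

from google.generativeai.types import safety_types

# Restating the documented defaults (illustrative only; the API already
# applies these when no safety_settings are provided).
DEFAULT_THRESHOLDS = {
    # Block most: low, medium, or high probability.
    safety_types.HarmCategory.HARM_CATEGORY_DEROGATORY:
        safety_types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    safety_types.HarmCategory.HARM_CATEGORY_TOXICITY:
        safety_types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    # Block some: medium or high probability.
    safety_types.HarmCategory.HARM_CATEGORY_SEXUAL:
        safety_types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    safety_types.HarmCategory.HARM_CATEGORY_VIOLENCE:
        safety_types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    safety_types.HarmCategory.HARM_CATEGORY_DANGEROUS:
        safety_types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    safety_types.HarmCategory.HARM_CATEGORY_MEDICAL:
        safety_types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
}

safety_settings = [
    {"category": category, "threshold": threshold}
    for category, threshold in DEFAULT_THRESHOLDS.items()
]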
Safety feedback
If content was blocked, the response from the API contains the reason it was blocked in the ContentFilter.reason field. If the reason was related to safety, then the response also contains a SafetyFeedback field, which includes the safety settings that were used for that request as well as a safety rating. The safety rating includes the category and the probability of the harm classification. The content that was blocked is not returned.
The probability returned corresponds to the block confidence levels shown in the following table:
Probability | Description |
---|---|
NEGLIGIBLE | Content has a negligible probability of being unsafe |
LOW | Content has a low probability of being unsafe |
MEDIUM | Content has a medium probability of being unsafe |
HIGH | Content has a high probability of being unsafe |
For example, if the content was blocked due to the toxicity category having a high probability, the safety rating returned would have the category equal to TOXICITY and the harm probability set to HIGH.
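As a sketch of what that looks like in the Python client, the snippet below checks a completion (returned by a GenerateText call, as shown in the Code examples section) for a high-probability toxicity rating. The dictionary keys "category" and "probability" and the HarmProbability enum are assumptions based on the fields described above; verify them against the SafetyFeedback reference.

from google.generativeai.types import safety_types

# Look for a high-probability toxicity rating in the safety feedback
# (key names are assumptions; see the SafetyFeedback reference).
for feedback in completion.safety_feedback:
    rating = feedback["rating"]
    if (rating["category"] == safety_types.HarmCategory.HARM_CATEGORY_TOXICITY
            and rating["probability"] == safety_types.HarmProbability.HIGH):
        print("Blocked: high probability of toxic content.")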
Safety settings in Google AI Studio
You can set these settings in Google AI Studio as well. In the Run settings, click Edit safety settings:
Then use the knobs to adjust each setting:
A No Content message appears if the content is blocked. To see more details, hold the pointer over No Content and click Safety.
Code examples
This section shows how to use the safety settings in code with the Python client library.
Request example
The following Python code snippet shows how to set safety settings in your GenerateText call. It sets the harm categories Derogatory and Violence to BLOCK_LOW_AND_ABOVE, which blocks any content that has a low or higher probability of being violent or derogatory.
import google.generativeai as genai
from google.generativeai.types import safety_types

# Block anything with a low or higher probability of being derogatory or violent.
completion = genai.generate_text(
    model=model,
    prompt=prompt,
    safety_settings=[
        {
            "category": safety_types.HarmCategory.HARM_CATEGORY_DEROGATORY,
            "threshold": safety_types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        },
        {
            "category": safety_types.HarmCategory.HARM_CATEGORY_VIOLENCE,
            "threshold": safety_types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
        },
    ],
)
Response example
The following shows a code snippet for parsing the safety feedback from the response. Note that the safety feedback will be empty unless the reason for blocking was one of the safety dimensions.
# First check the content filter reason.
for content_filter in completion.filters:
    print(content_filter["reason"])

# If any of the reasons is "safety", then the safety_feedback field will be
# populated.
for feedback in completion.safety_feedback:
    print(feedback["rating"])
    print(feedback["setting"])
Next steps
- See the API reference to learn more about the full API.
- Review the safety guidance for a general look at safety considerations when developing with LLMs.
- Learn more about assessing probability versus severity from the Jigsaw team.
- Learn more about the products that contribute to safety solutions like the Perspective API.
- You can use these safety settings to create a toxicity classifier. See the classification example to get started.