[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["缺少我需要的資訊","missingTheInformationINeed","thumb-down"],["過於複雜/步驟過多","tooComplicatedTooManySteps","thumb-down"],["過時","outOfDate","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["示例/程式碼問題","samplesCodeIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2024-10-23 (世界標準時間)。"],[],[],null,["# Safeguard your models\n\n\u003cbr /\u003e\n\nGenerative artificial intelligence (GenAI) products are relatively new and\ntheir behaviors can vary more than earlier forms of software. The safeguards\nthat protect your product from misuse of GenAI capabilities must adapt in\nkind. This guide describes how you can employ content policy compliance\ncheckers and watermarking tools to protect your GenAI-enabled products.\n\nContent policy compliance\n-------------------------\n\nEven with prior [tuning for safety](/responsible/docs/alignment#tuning) and a well designed\n[prompt template](/responsible/docs/alignment#prompting), it is possible for your GenAI\nproduct to output content that results in unintended harm. GenAI products often\nrely on *input and output filtering* to ensure responsible model behavior. These\ntechniques check the data going into or coming out of the model complies with\nyour [policies](/responsible/docs/design#define-policies), often by performing additional\n[safety training](/responsible/docs/alignment#tuning) to create a content classifier model.\n\nInput classifiers are used to filter content that is directly or which might\ninduce your model generate content that violates your content policies. Input\nfilters often target adversarial attacks that try to circumvent your content\npolicies.\n\nOutput classifiers filter model output, catching generated content that violates\nyour safety policies. Careful monitoring of your content rejection behaviors can\nsurface new classes of prompts that can be used to augment or improve input\nfilters.\n\nIt's recommended to have classifiers that cover all of your content policies.\nYou may be able to achieve this using [ready-made classifiers](#ready-made), or\nyou may need to create [custom classifiers](#agile-classifiers) that support\nyour specific policies.\n\nBalance is also key. Over-filtering can result in unintended harm, or reduce the\nutility of the application; be sure to review the cases where over-filtering may\nbe happening. See the [safety evaluation guide](/responsible/docs/evaluation) for more.\n\n### Ready-made content policy classifiers\n\nReady-made content classifiers add an additional layer of protection to the\nmodel's inherent safety training, further mitigating the potential for certain\ntypes of policy violations. They generally come in two varieties:\n\n1. **Self-hosted classifiers** , such as [ShieldGemma](/responsible/docs/safeguards/shieldgemma), can be downloaded and hosted on a variety of architectures, including Cloud platforms like Google Cloud, privately-owned hardware, and some classifiers can even run on-device for mobile applications.\n2. **API-based classifiers** are provided as services that offer high-volume, low-latency classification against a variety of policies. Google provides three services that may be of interest:\n - [Checks AI Safety](https://checks.google.com/ai-safety/?utm_source=GenAITK&utm_medium=Link&utm_campaign=AI_Toolkit) provides compliance assessments and dashboards supporting model evaluation and monitoring. 
It's important to evaluate how well ready-made classifiers meet your policy
goals, and to qualitatively evaluate the failure cases.

### Custom content policy classifiers

Ready-made content policy classifiers are an excellent start, but they have
limitations, including:

- A fixed policy taxonomy that may not map to or cover all of your content policies.
- Hardware and connectivity requirements that may not be appropriate for the environment your GenAI-powered application will be deployed in.
- Pricing and other usage restrictions.

Custom content policy classifiers may be one way to address these limitations,
and the [agile classifiers](/responsible/docs/safeguards/agile-classifiers) method provides an
efficient and flexible framework for creating them. As this method tunes a model
for safety purposes, be sure to review the
[model tuning basics](/responsible/docs/alignment#tuning).

Identify AI-generated content with SynthID Text watermarks
----------------------------------------------------------

GenAI can generate a wider array of highly diverse content at scales previously
unimagined. While the majority of this use is for legitimate purposes, there is
concern that it could contribute to misinformation and misattribution problems.
Watermarking is one technique for mitigating these potential impacts. Watermarks
that are imperceptible to humans can be applied to AI-generated content, and
detection models can score arbitrary content to indicate the likelihood that it
has been watermarked.

[SynthID](https://deepmind.google/technologies/synthid/) is a Google DeepMind technology that watermarks and
identifies AI-generated content by embedding digital watermarks directly into
AI-generated images, audio, text, or video. [SynthID Text](/responsible/docs/safeguards/synthid) is
available for production use in [Hugging Face Transformers](https://huggingface.co/blog/synthid-text); check out
the [research paper](https://www.nature.com/articles/s41586-024-08025-4) and [docs](/responsible/docs/safeguards/synthid) to learn more
about how to use SynthID in your application.

[Google Cloud](https://cloud.google.com/blog/products/ai-machine-learning/vertex-ai-next-2023-announcements) provides SynthID watermarking capabilities for
other modalities, such as [Imagen-generated imagery](https://cloud.google.com/blog/products/ai-machine-learning/a-developers-guide-to-imagen-3-on-vertex-ai),
to Vertex AI customers.
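Returning to SynthID Text: the following is a minimal sketch of generating watermarked
text with Hugging Face Transformers. It assumes a recent Transformers release that ships
`SynthIDTextWatermarkingConfig`; the model ID, the watermarking keys, and the prompt are
illustrative placeholders, not recommended values.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    SynthIDTextWatermarkingConfig,
)

MODEL_ID = "google/gemma-2-2b-it"  # Assumption: any causal LM you already serve.

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# The watermark is parameterized by a private list of integer keys and an
# n-gram length. Keep production keys secret and reuse them at detection time.
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57, 29],  # Placeholder keys.
    ngram_len=5,
)

inputs = tokenizer(["Write a short product description."], return_tensors="pt")
outputs = model.generate(
    **inputs,
    watermarking_config=watermarking_config,
    do_sample=True,  # The watermark is applied during sampling.
    max_new_tokens=128,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

Detecting the watermark later uses the same keys with a separately trained detector; see
the SynthID Text docs linked above for the detection workflow.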
Best practices for setting up safeguards
----------------------------------------

Using safety classifiers as safeguards is strongly recommended. However,
guardrails can result in the generative model not producing anything for the
user if the content is blocked. Applications need to be designed to handle this
case. Most popular chatbots handle it by providing canned answers ("I am
sorry, I am a language model, I can't help you with this request").

**Find the right balance between helpfulness and harmlessness**: When using
safety classifiers, it is important to understand that they will make mistakes,
including both false positives (e.g., claiming an output is unsafe when it is
not) and false negatives (failing to label an output as unsafe when it is). By
evaluating classifiers with metrics like F1, precision, recall, and AUC-ROC, you
can determine how you would like to trade off false positive versus false
negative errors. By changing the threshold of your classifiers, you can find an
ideal balance that avoids over-filtering outputs while still providing
appropriate safety (a minimal threshold-selection sketch appears at the end of
this page).

**Check your classifiers for unintended biases:** Safety classifiers, like any
other ML model, can propagate unintended biases, such as socio-cultural
stereotypes. Applications need to be appropriately evaluated for potentially
problematic behaviors. In particular, content safety classifiers can
over-trigger on content related to identities that are more frequently the
target of abusive language online. As an example, when the Perspective API was
first launched, the model returned higher toxicity scores for comments
referencing certain identity groups ([blog](https://medium.com/jigsaw/unintended-bias-and-names-of-frequently-targeted-groups-8e0b81f80a23)). This over-triggering
behavior can happen because comments that mention identity terms for more
frequently targeted groups (e.g., words like "Black", "muslim", "feminist",
"woman", "gay", etc.) are more often toxic in nature. When datasets used to
train classifiers have significant imbalances for comments containing certain
words, classifiers can overgeneralize and consider all comments with those words
as being likely to be unsafe. Read how the Jigsaw team
[mitigated](https://medium.com/jigsaw/identifying-machine-learning-bias-with-updated-data-sets-7c36d6063a2c) this unintended bias.

Developer Resources
-------------------

- [SynthID](https://deepmind.google/technologies/synthid/): Tools for watermarking and identifying AI-generated content.
- [Checks AI Safety](https://checks.google.com/ai-safety/?utm_source=GenAITK&utm_medium=Link&utm_campaign=AI_Toolkit): AI safety compliance assessments and dashboards.
- [Perspective API](https://developers.perspectiveapi.com/): Machine learning models to identify toxic content.
- [Text Moderation Service](https://cloud.google.com/natural-language/docs/moderating-text): Text safety analysis for Google Cloud customers.
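As a companion to the "Find the right balance" guidance above, here is a minimal sketch
of selecting a blocking threshold from labeled evaluation data. The scores, labels, and
the 0.95 precision floor are illustrative assumptions; in practice you would use your
own human-rated evaluation set and policy targets.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# classifier_scores: probability of "unsafe" from your safety classifier.
# true_labels: 1 = violates policy, 0 = compliant (human-rated ground truth).
classifier_scores = np.array([0.05, 0.20, 0.35, 0.55, 0.70, 0.90, 0.95])
true_labels = np.array([0, 0, 0, 1, 0, 1, 1])

precision, recall, thresholds = precision_recall_curve(true_labels, classifier_scores)

# Pick the threshold that maximizes recall (catches the most violations) while
# keeping precision high enough to avoid over-filtering compliant content.
MIN_PRECISION = 0.95  # Illustrative policy target.
eligible = precision[:-1] >= MIN_PRECISION
# (If nothing meets the precision floor, relax the target or improve the classifier.)
best_threshold = thresholds[eligible][np.argmax(recall[:-1][eligible])]
print(f"Selected blocking threshold: {best_threshold:.2f}")
```

Sweeping the threshold this way makes the false positive versus false negative trade-off
explicit, instead of relying on a classifier's default cutoff.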