Analyze model behavior with interpretability tools

While a responsible approach to AI should include safety policies, techniques to improve model safety, and transparency artifacts, being responsible with generative AI should not simply mean following a checklist. Generative AI products are relatively new, and the behavior of an application can vary more than that of earlier forms of software. For this reason, you should probe the machine learning models being used, examine examples of the model's behavior, and investigate surprises.

Today, prompting is as much art as it is science, but there are tools that can help you empirically improve prompts for large language models, such as the Learning Interpretability Tool (LIT). LIT is an open-source platform developed for visualizing, understanding, and debugging AI/ML models. Below is an example of how LIT can be used to explore Gemma's behavior, anticipate potential issues, and improve its safety.

You can install LIT on your local machine, in Colab, or on Google Cloud. To get started with LIT, import your model and an associated dataset (e.g., a safety evaluation dataset) in Colab. LIT will generate a set of outputs for the dataset using your model and provide you with a user interface to explore the model's behavior.
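
For example, in a Colab notebook this setup might look like the following sketch. It uses LIT's documented Dataset and LitWidget notebook APIs, but the GemmaModel wrapper and the prompt contents are placeholders that you would replace with the model wrapper and safety-evaluation data from the codelab below.

```python
# A minimal sketch of loading a model and a small prompt dataset into LIT
# from a notebook. Requires `pip install lit-nlp`.
from lit_nlp import notebook
from lit_nlp.api import dataset as lit_dataset
from lit_nlp.api import types as lit_types


class PromptDataset(lit_dataset.Dataset):
  """Wraps a list of prompt strings as a LIT dataset."""

  def __init__(self, prompts):
    self._examples = [{"prompt": p} for p in prompts]

  def spec(self):
    return {"prompt": lit_types.TextSegment()}


# Illustrative prompts; in practice, load your safety evaluation dataset.
datasets = {"safety_eval": PromptDataset([
    "Analyze a menu item in a restaurant. ...",
])}

# Placeholder: use the LIT model wrapper for your model (see the codelab below).
models = {"gemma": GemmaModel()}

# Render the LIT UI inline in the notebook.
widget = notebook.LitWidget(models, datasets, height=800)
widget.render()
```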

To try this yourself, see the Analyze Gemma Models with LIT codelab, which you can also run in Google Colab.

Animation of Learning Interpretability Tool (LIT) user interface

This image shows LIT's user interface. The Datapoint Editor at the top allows users to edit their prompts. At the bottom, the LM Salience module allows them to check saliency results.

Identify errors in complex prompts

Two of the most important prompting techniques for high-quality LLM-based prototypes and applications are few-shot prompting (including examples of the desired behavior in the prompt) and chain-of-thought prompting (including a form of explanation or reasoning before the final output of the LLM). However, creating an effective prompt is often still challenging.

Consider an example of helping someone assess if they will like a food based on their tastes. An initial prototype chain-of-thought prompt-template might look like this:

Analyze a menu item in a restaurant.


## For example:


Taste-likes: I've a sweet-tooth
Taste-dislikes: Don't like onions or garlic
Suggestion: Onion soup
Analysis: it has cooked onions in it, which you don't like.
Recommendation: You have to try it.


Taste-likes: I've a sweet-tooth
Taste-dislikes: Don't like onions or garlic
Suggestion: Baguette maison au levain
Analysis: Home-made leaven bread in France is usually great
Recommendation: Likely good.


Taste-likes: I've a sweet-tooth
Taste-dislikes: Don't like onions or garlic
Suggestion: Macaron in France
Analysis: Sweet with many kinds of flavours
Recommendation: You have to try it.


## Now analyse one more example:


Taste-likes: {{users-food-like-preferences}}
Taste-dislikes: {{users-food-dislike-preferences}}
Suggestion: {{menu-item-to-analyse}}
Analysis:

Did you spot any issues with this prompt? LIT will help you examine the prompt with the LM Salience module.

Use sequence salience for debugging

Salience is computed at the smallest possible level (i.e., for each input token), but LIT can aggregate token salience into larger, more interpretable spans, such as lines, sentences, or words. Learn more about salience and how to use it to identify unintended biases in our Interactive Saliency Explorable.
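
To make that aggregation concrete, here is a small illustration of the idea (not LIT's implementation): per-token salience scores are pooled into coarser spans by summing them. The tokens, scores, and break point below are invented for the example.

```python
from typing import Sequence


def aggregate_salience(tokens: Sequence[str],
                       scores: Sequence[float],
                       span_breaks: Sequence[int]) -> list[float]:
  """Sums token-level salience over spans ending at the given break indices."""
  span_scores = []
  start = 0
  for end in list(span_breaks) + [len(tokens)]:
    span_scores.append(sum(scores[start:end]))
    start = end
  return span_scores


tokens = ["Onion", "soup", ".", "Avoid", "."]
scores = [0.40, 0.35, 0.05, 0.15, 0.05]
# Pool five token scores into two spans (break after the third token):
print(aggregate_salience(tokens, scores, span_breaks=[3]))
# -> [0.8, 0.2] (up to floating-point rounding)
```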

Let's start by giving the prompt a new example input for the prompt-template variables:

{{users-food-like-preferences}} = Cheese
{{users-food-dislike-preferences}} = Can't eat eggs
{{menu-item-to-analyse}} = Quiche Lorraine
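
Filling these placeholders amounts to simple string substitution before the prompt is sent to the model. Here is a minimal sketch of that step (assuming the {{...}} placeholders are replaced verbatim); only the final block of the template is shown.

```python
# Last block of the prompt template shown earlier; the preceding instructions
# and few-shot examples are omitted for brevity.
PROMPT_TEMPLATE = """Taste-likes: {{users-food-like-preferences}}
Taste-dislikes: {{users-food-dislike-preferences}}
Suggestion: {{menu-item-to-analyse}}
Analysis:"""

example = {
    "users-food-like-preferences": "Cheese",
    "users-food-dislike-preferences": "Can't eat eggs",
    "menu-item-to-analyse": "Quiche Lorraine",
}

prompt = PROMPT_TEMPLATE
for name, value in example.items():
  prompt = prompt.replace("{{" + name + "}}", value)
print(prompt)
```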

Once this is done, you can observe a surprising model completion:

Taste-likes: Cheese
Taste-dislikes: Can't eat eggs
Suggestion: Quiche Lorraine
Analysis: A savoury tart with cheese and eggs
Recommendation: You might not like it, but it's worth trying.

Why is the model suggesting you eat something that you clearly said you can't eat?

Sequence salience can help highlight the root problem, which is in our few-shot examples. In the first example, the chain-of-thought reasoning in the analysis section doesn't match the final recommendation. An analysis of "It has cooked onions in it, which you don't like" is paired with a recommendation of "You have to try it".

LIT user interface showing prompt sequence saliency analysis

This highlights an error in the initial prompt: the recommendation ("You have to try it") was accidentally copied into the first few-shot example. You can see the salience strength in the prompt from the darkness of the purple highlight. The highest salience is on the first few-shot example, specifically on the lines corresponding to Taste-likes, Analysis, and Recommendation. This suggests that the model is relying on these lines most when making its final, incorrect recommendation.

This example also highlights that early prototyping can reveal risks you might not think of ahead of time, and that the error-prone nature of language models means you have to proactively design for errors. This is discussed further in our People + AI Guidebook for designing with AI.

Test hypotheses to improve model behavior

LIT lets you test changes to prompts within the same interface. In this instance, try adding a constitution to improve model behavior. A constitution, in this context, is a set of principles included in the prompt to help guide the model's generation. Recent methods even enable interactive derivation of constitutional principles.

Let's use this idea to improve the prompt further. Use LIT's Datapoint Editor to add a section with principles for the generation at the top of the prompt, which now starts as follows:

Analyze a menu item in a restaurant.

* The analysis should be brief and to the point.
* It should provide a clear statement of suitability for someone with
  specific dietary restrictions.
* It should reflect the person's tastes

## For example:

Taste-likes: I've a sweet-tooth
Taste-dislikes: Don't like onions or garlic
Suggestion: Onion soup
Analysis: it has cooked onions in it, which you don't like.
Recommendation: Avoid.

With this update, you can rerun the example and observe a very different output:

Taste-likes: Cheese
Taste-dislikes: Can't eat eggs
Suggestion: Quiche Lorraine
Analysis: This dish contains eggs, which you can't eat.
Recommendation: Not suitable for you.

You can then re-examine the prompt salience to get a sense of why this change is happening:

LIT user interface showing prompt saliency analysis

In this example, "Not suitable for you" is being influenced by the principle that the analysis should "provide a clear statement of suitability for someone with specific dietary restrictions" and by the explanatory analysis statement noting that the dish contains eggs (the so-called chain of thought).

Include non-technical teams in model probing and exploration

Interpretability is meant to be a team effort, spanning expertise across policy, legal, and more. As you've seen, LIT's visual medium and interactive ability to examine salience and explore examples can help different stakeholders share and communicate findings. This can enable you to bring in a broader diversity of teammates for model exploration, probing, and debugging. Exposing them to these technical methods can enhance their understanding of how models work. In addition, a more diverse set of expertise in early model testing can also help uncover undesirable outcomes that can be improved.

Summary

When you find problematic examples in your model evaluations, bring them into LIT for debugging. Start by analyzing the largest sensible unit of content that logically relates to the modeling task. Use the visualizations to see where the model is attending to the prompt content correctly or incorrectly, and then drill down into smaller units of content to further describe the incorrect behavior you're seeing and identify possible fixes.

Developer resources