Understand and count tokens

Gemini and other generative AI models process input and output at a granularity called a token.

For Gemini models, a token is equivalent to about 4 characters, and 100 tokens are equal to about 60-80 English words.

About tokens

Tokens can be single characters like z or whole words like cat. Long words are broken up into several tokens. The set of all tokens used by the model is called the vocabulary, and the process of splitting text into tokens is called tokenization.

When billing is enabled, the cost of a call to the Gemini API is determined in part by the number of input and output tokens, so knowing how to count tokens can be helpful.


Context windows

The models available through the Gemini API have context windows that are measured in tokens. The context window defines how much input you can provide and how much output the model can generate. You can determine the size of the context window by calling the getModels endpoint or by looking in the models documentation.

For example, the gemini-2.0-flash model has an input limit of about 1,000,000 tokens and an output limit of about 8,000 tokens, which means its context window is 1,000,000 tokens.

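The following sketch queries those limits with the Go SDK (a minimal sketch, assuming the SDK's client.Models.Get method, which wraps the getModels endpoint):

ctx := context.Background()
client, err := genai.NewClient(ctx, &genai.ClientConfig{
	APIKey:  os.Getenv("GEMINI_API_KEY"),
	Backend: genai.BackendGeminiAPI,
})
if err != nil {
	log.Fatal(err)
}

// Assumption: Models.Get returns model metadata, including token limits.
modelInfo, err := client.Models.Get(ctx, "gemini-2.0-flash", nil)
if err != nil {
	log.Fatal(err)
}
fmt.Println("input_token_limit:", modelInfo.InputTokenLimit)
fmt.Println("output_token_limit:", modelInfo.OutputTokenLimit)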

Count tokens

All input to and output from the Gemini API is tokenized, including text, image files, and other non-text modalities.

You can count tokens in the following ways:

  • Call CountTokens with the input of the request.
    This returns the total number of tokens in the input only. You can make this call before sending the input to the model to check the size of your requests.

  • Use the UsageMetadata attribute on the response object after calling GenerateContent.
    This returns the total number of tokens in both the input and the output: total_token_count.
    It also returns the token counts of the input and output separately: prompt_token_count (input tokens) and candidates_token_count (output tokens).

    If you are using a thinking model, such as the Gemini 2.5 series, the tokens used during the thinking process are returned in thoughts_token_count. If you use context caching, the cached token count is returned in cached_content_token_count.

Count text tokens

If you call CountTokens with a text-only input, it returns the token count of the text in the input only (total_tokens). You can make this call before calling GenerateContent to check the size of your requests.

Another option is calling GenerateContent and then using the UsageMetadata attribute on the response object to get the following:

  • The separate token counts of the input (prompt_token_count), the cached content (cached_content_token_count) and the output (candidates_token_count)
  • The token count for the thinking process (thoughts_token_count)
  • The total number of tokens in both the input and the output (total_token_count)

ctx := context.Background()
client, err := genai.NewClient(ctx, &genai.ClientConfig{
	APIKey:  os.Getenv("GEMINI_API_KEY"),
	Backend: genai.BackendGeminiAPI,
})
if err != nil {
	log.Fatal(err)
}
prompt := "The quick brown fox jumps over the lazy dog."

// Convert prompt to a slice of *genai.Content using the helper.
contents := []*genai.Content{
	genai.NewContentFromText(prompt, genai.RoleUser),
}
countResp, err := client.Models.CountTokens(ctx, "gemini-2.0-flash", contents, nil)
if err != nil {
	log.Fatal(err)
}
fmt.Println("total_tokens:", countResp.TotalTokens)

response, err := client.Models.GenerateContent(ctx, "gemini-2.0-flash", contents, nil)
if err != nil {
	log.Fatal(err)
}
usageMetadata, err := json.MarshalIndent(response.UsageMetadata, "", "  ")
if err != nil {
	log.Fatal(err)
}
fmt.Println(string(usageMetadata))

Count multi-turn (chat) tokens

If you call CountTokens with the chat history, it returns the total token count of the text from each role in the chat (total_tokens).

Another option is calling SendMessage and then using the UsageMetadata attribute on the response object to get the following:

  • The separate token counts of the input (prompt_token_count), the cached content (cached_content_token_count) and the output (candidates_token_count)
  • The token count for the thinking process (thoughts_token_count)
  • The total number of tokens in both the input and the output (total_token_count)

To see how large your next conversational turn will be, append it to the history when you call CountTokens.

ctx := context.Background()
client, err := genai.NewClient(ctx, &genai.ClientConfig{
	APIKey:  os.Getenv("GEMINI_API_KEY"),
	Backend: genai.BackendGeminiAPI,
})
if err != nil {
	log.Fatal(err)
}

// Initialize chat with some history.
history := []*genai.Content{
	{Role: genai.RoleUser, Parts: []*genai.Part{{Text: "Hi my name is Bob"}}},
	{Role: genai.RoleModel, Parts: []*genai.Part{{Text: "Hi Bob!"}}},
}
chat, err := client.Chats.Create(ctx, "gemini-2.0-flash", nil, history)
if err != nil {
	log.Fatal(err)
}

firstTokenResp, err := client.Models.CountTokens(ctx, "gemini-2.0-flash", chat.History(false), nil)
if err != nil {
	log.Fatal(err)
}
fmt.Println(firstTokenResp.TotalTokens)

resp, err := chat.SendMessage(ctx, genai.Part{
	Text: "In one sentence, explain how a computer works to a young child.",
})
if err != nil {
	log.Fatal(err)
}
fmt.Printf("%#v\n", resp.UsageMetadata)

// Append an extra user message and recount.
extra := genai.NewContentFromText("What is the meaning of life?", genai.RoleUser)
hist := chat.History(false)
hist = append(hist, extra)

secondTokenResp, err := client.Models.CountTokens(ctx, "gemini-2.0-flash", hist, nil)
if err != nil {
	log.Fatal(err)
}
fmt.Println(secondTokenResp.TotalTokens)

Count multimodal tokens

All input to the Gemini API is tokenized, including text, image files, and other non-text modalities. Note the following key points about how the Gemini API tokenizes multimodal input:

  • With Gemini 2.0, image inputs with both dimensions <=384 pixels are counted as 258 tokens. Images larger in one or both dimensions are cropped and scaled as needed into tiles of 768x768 pixels, each counted as 258 tokens. Prior to Gemini 2.0, images used a fixed 258 tokens.

  • Video and audio files are converted to tokens at the following fixed rates: video at 263 tokens per second and audio at 32 tokens per second.
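
As a rough illustration of these rates, the following sketch estimates token counts client-side. The image tile count is an assumption: the exact crop-and-scale strategy is not specified here, so ceiling division over 768x768 tiles is only an approximation.

// Rough client-side token estimates for multimodal input.
// Assumption: the tile count is approximated by ceiling division;
// the API's actual crop/scale strategy may differ.
func estimateImageTokens(width, height int) int {
	// Gemini 2.0 rule: small images (both dimensions <= 384 px)
	// cost a flat 258 tokens.
	if width <= 384 && height <= 384 {
		return 258
	}
	// Larger images are tiled into 768x768 crops, 258 tokens each.
	tilesX := (width + 767) / 768
	tilesY := (height + 767) / 768
	return tilesX * tilesY * 258
}

// Video is tokenized at a fixed 263 tokens per second.
func estimateVideoTokens(seconds float64) int {
	return int(seconds * 263)
}

// Audio is tokenized at a fixed 32 tokens per second.
func estimateAudioTokens(seconds float64) int {
	return int(seconds * 32)
}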

Media resolutions

Gemini 3 Pro Preview introduces granular control over multimodal vision processing with the media_resolution parameter. The media_resolution parameter determines the maximum number of tokens allocated per input image or video frame. Higher resolutions improve the model's ability to read fine text or identify small details, but increase token usage and latency.
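
For example, a minimal sketch, assuming the Go SDK exposes this as a MediaResolution field on GenerateContentConfig with constants such as genai.MediaResolutionLow:

config := &genai.GenerateContentConfig{
	// Assumption: lower resolutions reduce per-image/frame token usage.
	MediaResolution: genai.MediaResolutionLow,
}
response, err := client.Models.GenerateContent(ctx, "gemini-3-pro-preview", contents, config)
if err != nil {
	log.Fatal(err)
}
fmt.Println(response.Text())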

For more details about the parameter and how it can impact token calculations, see the media resolution guide.

Image files

If you call CountTokens with a text-and-image input, it returns the combined token count of the text and the image in the input only (total_tokens). You can make this call before calling GenerateContent to check the size of your requests. You can also optionally call CountTokens on the text and the file separately.

Another option is calling GenerateContent and then using the UsageMetadata attribute on the response object to get the following:

  • The separate token counts of the input (prompt_token_count), the cached content (cached_content_token_count) and the output (candidates_token_count)
  • The token count for the thinking process (thoughts_token_count)
  • The total number of tokens in both the input and the output (total_token_count)

Example that uses an uploaded image from the File API:

ctx := context.Background()
client, err := genai.NewClient(ctx, &genai.ClientConfig{
	APIKey:  os.Getenv("GEMINI_API_KEY"),
	Backend: genai.BackendGeminiAPI,
})
if err != nil {
	log.Fatal(err)
}

file, err := client.Files.UploadFromPath(
	ctx,
	filepath.Join(getMedia(), "organ.jpg"),
	&genai.UploadFileConfig{
		MIMEType: "image/jpeg",
	},
)
if err != nil {
	log.Fatal(err)
}
parts := []*genai.Part{
	genai.NewPartFromText("Tell me about this image"),
	genai.NewPartFromURI(file.URI, file.MIMEType),
}
contents := []*genai.Content{
	genai.NewContentFromParts(parts, genai.RoleUser),
}

tokenResp, err := client.Models.CountTokens(ctx, "gemini-2.0-flash", contents, nil)
if err != nil {
	log.Fatal(err)
}
fmt.Println("Multimodal image token count:", tokenResp.TotalTokens)

response, err := client.Models.GenerateContent(ctx, "gemini-2.0-flash", contents, nil)
if err != nil {
	log.Fatal(err)
}
usageMetadata, err := json.MarshalIndent(response.UsageMetadata, "", "  ")
if err != nil {
	log.Fatal(err)
}
fmt.Println(string(usageMetadata))

Video or audio files

Audio and video are each converted to tokens at the following fixed rates:

  • Video: 263 tokens per second
  • Audio: 32 tokens per second

If you call CountTokens with a text-and-video/audio input, it returns the combined token count of the text and the video/audio file in the input only (total_tokens). You can make this call before calling GenerateContent to check the size of your requests. You can also optionally call CountTokens on the text and the file separately.

Another option is calling GenerateContent and then using the UsageMetadata attribute on the response object to get the following:

  • The separate token counts of the input (prompt_token_count), the cached content (cached_content_token_count) and the output (candidates_token_count)
  • The token count for the thinking process (thoughts_token_count)
  • The total number of tokens in both the input and the output (total_token_count)

ctx := context.Background()
client, err := genai.NewClient(ctx, &genai.ClientConfig{
	APIKey:  os.Getenv("GEMINI_API_KEY"),
	Backend: genai.BackendGeminiAPI,
})
if err != nil {
	log.Fatal(err)
}

file, err := client.Files.UploadFromPath(
	ctx,
	filepath.Join(getMedia(), "Big_Buck_Bunny.mp4"),
	&genai.UploadFileConfig{
		MIMEType: "video/mp4",
	},
)
if err != nil {
	log.Fatal(err)
}

// Poll until the video file is fully processed (state becomes ACTIVE).
for file.State != genai.FileStateActive {
	fmt.Println("Processing video...")
	fmt.Println("File state:", file.State)
	time.Sleep(5 * time.Second)

	file, err = client.Files.Get(ctx, file.Name, nil)
	if err != nil {
		log.Fatal(err)
	}
}

parts := []*genai.Part{
	genai.NewPartFromText("Tell me about this video"),
	genai.NewPartFromURI(file.URI, file.MIMEType),
}
contents := []*genai.Content{
	genai.NewContentFromParts(parts, genai.RoleUser),
}

tokenResp, err := client.Models.CountTokens(ctx, "gemini-2.0-flash", contents, nil)
if err != nil {
	log.Fatal(err)
}
fmt.Println("Multimodal video/audio token count:", tokenResp.TotalTokens)

response, err := client.Models.GenerateContent(ctx, "gemini-2.0-flash", contents, nil)
if err != nil {
	log.Fatal(err)
}
usageMetadata, err := json.MarshalIndent(response.UsageMetadata, "", "  ")
if err != nil {
	log.Fatal(err)
}
fmt.Println(string(usageMetadata))

System instructions and tools

System instructions and tools also count towards the total token count for the input.

If you use system instructions, the total_tokens count increases to reflect the addition of SystemInstruction.

If you use function calling, the total_tokens count increases to reflect the addition of tools.
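
A minimal sketch, assuming the Go SDK's CountTokensConfig exposes SystemInstruction and Tools fields; the multiply function declaration is a hypothetical example:

contents := []*genai.Content{
	genai.NewContentFromText("The quick brown fox jumps over the lazy dog.", genai.RoleUser),
}

// Count with a system instruction attached: total_tokens increases.
withSI, err := client.Models.CountTokens(ctx, "gemini-2.0-flash", contents,
	&genai.CountTokensConfig{
		SystemInstruction: genai.NewContentFromText("You are a cat. Your name is Neko.", genai.RoleUser),
	})
if err != nil {
	log.Fatal(err)
}
fmt.Println("with system instruction:", withSI.TotalTokens)

// Count with a function declaration attached as a tool: total_tokens
// increases again. The "multiply" declaration is hypothetical.
multiply := &genai.FunctionDeclaration{
	Name:        "multiply",
	Description: "Returns the product of two numbers.",
}
withTools, err := client.Models.CountTokens(ctx, "gemini-2.0-flash", contents,
	&genai.CountTokensConfig{
		Tools: []*genai.Tool{{FunctionDeclarations: []*genai.FunctionDeclaration{multiply}}},
	})
if err != nil {
	log.Fatal(err)
}
fmt.Println("with tools:", withTools.TotalTokens)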