
Gemma 4

Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

Gemma 4 is released under the Apache-2.0 license. For more details, see the Gemma 4 Model Card.

🔴 What's New: Multi-Token Prediction

Multi-Token Prediction (MTP) is a new performance optimization that speeds up decoding on both CPU and GPU backends with no loss in output quality.

Performance Gains:
- GPU: Up to 2.2x decode speedup on mobile GPUs.
- CPU: Up to 1.5x speedup on mobile CPUs, and significant acceleration on SME-enabled hardware (e.g., M4 MacBooks).

Recommendations: Enable MTP for all tasks on GPU backends and for the Gemma4-E4B model on CPU. For the Gemma4-E2B model on CPU, MTP is most valuable for rewrite, summarize, and coding tasks; enable it selectively there, because it may cause a slight slowdown during freeform prompting or generative tasks.
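
The recommendations above can be written down as a small lookup, which may be handy in a launch script that decides whether to turn MTP on. This is only a sketch: `mtp_recommended` is a hypothetical helper, not part of litert-lm, and the backend, model, and task labels are illustrative names chosen here. It does not set any MTP flag itself; see the platform-specific guides for how to enable MTP.

```shell
# Hypothetical helper (not part of litert-lm): encodes the MTP
# recommendations as a yes / selective / unknown answer.
# Usage: mtp_recommended <backend: gpu|cpu> <model: E2B|E4B> <task>
mtp_recommended() {
  case "$1" in
    gpu) echo "yes" ;;                # recommended for all tasks on GPU
    cpu)
      case "$2" in
        E4B) echo "yes" ;;            # recommended for E4B on CPU
        E2B)
          case "$3" in
            rewrite|summarize|coding) echo "yes" ;;
            *) echo "selective" ;;    # may be slightly slower for freeform tasks
          esac ;;
        *) echo "unknown" ;;          # model not covered by the recommendations
      esac ;;
    *) echo "unknown" ;;              # backend not covered by the recommendations
  esac
}

mtp_recommended gpu E2B freeform    # prints "yes"
mtp_recommended cpu E2B summarize   # prints "yes"
mtp_recommended cpu E2B freeform    # prints "selective"
```

A result of "selective" means it is worth measuring decode speed with and without MTP for that workload before enabling it.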

To try it out, see the platform-specific guides.

Get Started

Chat with Gemma4-E2B, hosted on the Hugging Face LiteRT Community.

uv tool install litert-lm

litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  --prompt="What is the capital of France?"

Deploy from Safetensors

Follow these steps to deploy Gemma 4 from your own safetensors checkpoint (for example, after fine-tuning the model for your use case):

Convert to the .litertlm format:

uv tool install litert-torch-nightly

litert-torch export_hf \
  --model=google/gemma-4-E2B-it \
  --output_dir=/tmp/gemma4_2b \
  --externalize_embedder \
  --jinja_chat_template_override=litert-community/gemma-4-E2B-it-litert-lm

Deploy using LiteRT-LM cross-platform APIs:

litert-lm run \
  /tmp/gemma4_2b/model.litertlm \
  --prompt="What is the capital of France?"

Performance Summary

Gemma-4-E2B

Model Size: 2.58 GB

Additional technical details are in the Hugging Face model card.

Platform (Device)	Backend	Prefill (tokens/s)	Decode (tokens/s)	Time to First Token (seconds)	Peak CPU Memory (MB)
Android (S26 Ultra)	CPU	557	47	1.8	1733
Android (S26 Ultra)	GPU	3808	52	0.3	676
iOS (iPhone 17 Pro)	CPU	532	25	1.9	607
iOS (iPhone 17 Pro)	GPU	2878	56	0.3	1450
Linux (Arm 2.3 & 2.8 GHz, NVIDIA GeForce RTX 4090)	CPU	260	35	4	1628
Linux (Arm 2.3 & 2.8 GHz, NVIDIA GeForce RTX 4090)	GPU	11234	143	0.1	913
macOS (MacBook Pro M4)	CPU	901	42	1.1	736
macOS (MacBook Pro M4)	GPU	7835	160	0.1	1623
Windows (Intel LunarLake)	CPU	435	30	2.4	3505
Windows (Intel LunarLake)	GPU	3751	48	0.3	3540
IoT (Raspberry Pi 5 16GB)	CPU	133	8	7.8	1546
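
The prefill and decode rates in the table can be combined into a rough estimate of total response time: prompt tokens divided by the prefill rate, plus output tokens divided by the decode rate. The sketch below assumes constant throughput and ignores fixed costs such as model loading, so treat the result as a ballpark figure rather than a measurement. `estimate_seconds` is a hypothetical helper, and the 1024-token prompt and 256-token reply are an illustrative workload, not values taken from the benchmark setup.

```shell
# Hypothetical helper: rough end-to-end generation time in seconds.
# Usage: estimate_seconds <prompt_tokens> <output_tokens> <prefill_rate> <decode_rate>
# Rates are in tokens per second, as listed in the table.
estimate_seconds() {
  awk -v p="$1" -v o="$2" -v pf="$3" -v dc="$4" \
    'BEGIN { printf "%.1f\n", p / pf + o / dc }'
}

# Gemma-4-E2B on Android (S26 Ultra), using the rates from the table.
estimate_seconds 1024 256 3808 52   # GPU: prints 5.2
estimate_seconds 1024 256 557 47    # CPU: prints 7.3
```

For this workload the decode term dominates on both backends, which is why the much higher GPU prefill rate shortens the total time by only about two seconds.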

Gemma-4-E4B

Model Size: 3.65 GB

Additional technical details are in the Hugging Face model card.

Platform (Device)	Backend	Prefill (tokens/s)	Decode (tokens/s)	Time to First Token (seconds)	Peak CPU Memory (MB)
Android (S26 Ultra)	CPU	195	18	5.3	3283
Android (S26 Ultra)	GPU	1293	22	0.8	710
iOS (iPhone 17 Pro)	CPU	159	10	6.5	961
iOS (iPhone 17 Pro)	GPU	1189	25	0.9	3380
Linux (Arm 2.3 & 2.8 GHz, NVIDIA GeForce RTX 4090)	CPU	82	18	12.6	3139
Linux (Arm 2.3 & 2.8 GHz, NVIDIA GeForce RTX 4090)	GPU	7260	91	0.2	1119
macOS (MacBook Pro M4 Max)	CPU	277	27	3.7	890
macOS (MacBook Pro M4 Max)	GPU	2560	101	0.4	3217
Windows (Intel LunarLake)	CPU	173	17	6.0	9372
Windows (Intel LunarLake)	GPU	1202	25	0.9	7147
IoT (Raspberry Pi 5 16GB)	CPU	51	3	20.5	3069