Google unveils Gemma 4 12B for local AI agents, coding, and multimodal reasoning


Google DeepMind has introduced Gemma 4 12B, a new open-weight multimodal model designed to bring agentic intelligence directly to laptops with mobile-first efficiency and advanced reasoning.

Gemma 4 12B sits between the edge-friendly E4B model and the larger 26B Mixture-of-Experts (MoE) model, offering strong performance with a reduced memory footprint. It is also the first mid-sized model in the series to feature native audio input support.

The Gemma family has now crossed 150 million downloads, with developers building use cases ranging from wearable robotic arms for physical assistance to enterprise-grade AI security systems.

Key features of Gemma 4 12B

Gemma 4 12B introduces a unified encoder-free multimodal architecture, where vision and audio inputs flow directly into the LLM backbone without separate encoders. This reduces latency and memory overhead compared to traditional multimodal systems.

  • Vision processing: Replaces the vision encoder with a lightweight embedding module using a single matrix multiplication, positional embeddings, and normalizations
  • Audio processing: Removes the audio encoder entirely and projects raw audio signals into the same token space as text

The model delivers benchmark performance close to the 26B MoE model while using less than half the memory footprint, enabling multi-step reasoning and agentic workflows on laptops with 16GB VRAM or unified memory.

Gemma 4 12B is released under the Apache 2.0 license and includes Multi-Token Prediction (MTP) drafters to improve inference speed and reduce latency.

It supports advanced agentic capabilities such as:

  • Autonomous data processing
  • Generating rich visual insights
  • Building fully functional webpages
  • Executing everyday tool use and workflows
  • Multi-step reasoning and structured task execution

A new Gemma Skills Repository is also introduced, providing an official library of reusable skills designed specifically for building agentic systems with Gemma models.

Run state-of-the-art agents locally

Gemma 4 12B delivers near-26B MoE performance on benchmarks while requiring significantly lower memory, making it suitable for:

  • Local AI agents
  • On-device reasoning systems
  • Private offline workflows
  • Edge and laptop-based AI applications
Experience a uniquely efficient unified architecture

Traditional multimodal systems rely on separate encoders for vision and audio, which increases latency and memory usage. Gemma 4 12B removes this limitation through a fully unified design.

  • No separate encoders for vision or audio
  • Direct processing inside LLM backbone
  • Reduced latency and memory consumption
  • Improved cross-modal reasoning consistency

Vision pipeline

Vision is handled through a lightweight embedding module with a single matrix multiplication, positional embeddings, and normalizations, replacing the full vision encoder.

Audio pipeline

Audio is processed by removing the encoder entirely and projecting raw audio signals directly into the same embedding space as text tokens.

Performance Benchmarks

Gemma 4 12B shows performance differences across Linux and macOS GPU environments, measuring prefill speed, decode speed, latency, and memory usage.

Linux

  • Device: AMD Radeon™ AI PRO R9700
  • Backend: GPU
  • Prefill: 662.32 tokens/sec
  • Decode: 66.26 tokens/sec
  • Time-to-first-token: 1.56 sec
  • Model size: 6235 MB
  • GPU Memory: 8064.2 MB

macOS

  • Device: MacBook Pro M4
  • Backend: GPU
  • Prefill: 243.55 tokens/sec
  • Decode: 29.56 tokens/sec
  • Time-to-first-token: 4.2 sec
  • Model size: 6235 MB
  • GPU Memory: 7763 MB
Get started today

Developers can try Gemma 4 12B using:

  • LM Studio
  • Ollama
  • Google AI Edge Gallery
  • Google AI Edge Eloquent
  • LiteRT-LM

They can also:

  • Download weights from Hugging Face and Kaggle
  • Review developer documentation and quick start notebook
  • Use frameworks like Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM
  • Fine-tune using Unsloth
  • Spin up production endpoints using Google Cloud

Gemma Skills Repository

The model includes an official Skills Repository, designed to help developers build agentic systems using reusable Gemma capabilities.

Bringing Gemma 4 12B to your laptop

Gemma 4 12B is designed for local execution on everyday machines using the Google AI Edge stack.

This enables:

  • Autonomous data processing
  • Generating rich visual insights
  • Building fully functional webpages
  • Everyday tool execution
  • Fully local agent workflows
Coding and advanced workflows

Gemma 4 12B supports advanced local execution capabilities including:

  • Python code generation from natural language prompts
  • Local execution of scripts and data analysis
  • Automatic chart generation from datasets
  • Self-correcting code generation in a single turn
  • Complex 3D rendering tasks with dependency handling
  • End-to-end webpage generation

In coding tests, the model can generate outputs such as charts from datasets (e.g., comparing baby names across years) and even render 3D scenes with full dependency setup and correction in a single prompt.

Dictation and voice-driven editing

Google AI Edge Eloquent is a fully on-device macOS application that transforms speech into structured writing.

It provides:

  • System-wide voice dictation via hotkeys
  • Fully local transcription of audio and video files
  • Voice-based text editing (Voice Edit feature)

Users can issue commands such as:

  • “Restructure these notes into an executive summary”
  • “Translate this into Hindi”

Gemma 4 12B improves instruction following, scope adherence, and output quality compared to previous models, with a reported 60%+ improvement.

LiteRT-LM and local serving

LiteRT-LM introduces a new serve command, turning it into a drop-in local LLM server.

This allows:

  • Standard API endpoints for local models
  • Drop-in replacement for hosted LLM servers
  • Integration with tools like Continue, Aider, OpenCode, Hermes, and Pi
  • Fully local agent workflows
  • Zero-code model deployment
Deployment options

Gemma 4 12B can be deployed across:

  • LM Studio, Ollama, Hugging Face Transformers, llama.cpp, MLX, SGLang, vLLM
  • Google Cloud endpoints
  • Gemini Enterprise Agent Platform Model Garden
  • Cloud Run and Google Kubernetes Engine (GKE)
Availability

Gemma 4 12B is available as an open-weight model under the Apache 2.0 license and can be downloaded from Hugging Face and Kaggle.

It is optimized for laptops with 16GB memory and supports fully offline multimodal AI workflows.

It is integrated across the Google AI Edge ecosystem, including macOS tools such as AI Edge Gallery, Eloquent, and LiteRT-LM CLI, enabling local-first AI experiences while keeping all data on-device.