DiffusionGemma: Google's Fastest Local LLM Runs on Your GPU

Google just dropped an experimental open model that breaks the autoregressive paradigm — and it’s another sign of Google’s AI momentum ramping up fast. It’s called DiffusionGemma. It generates text via diffusion — parallel 256-token “canvases” that iteratively denoise — instead of the usual token-by-token approach.

The result: up to 4x faster inference on local GPUs. 1000+ tokens per second on an H100. 700+ on an RTX 5090. Fits in 18GB RAM at 4-bit quantization. Apache 2.0 license.

Meet DiffusionGemma ⚡ Our latest experimental open model (Apache 2.0) that generates text up to 4x faster.

Instead of predicting and typing just one word at a time like most language models, it drafts and refines entire blocks of text simultaneously.

Here’s how it works 🧵 ↓ pic.twitter.com/RhtatPMpvG
— Google (@Google) June 10, 2026

Oh, and it’s multimodal. 256K context. 140+ languages. Processes images and 60-second video clips. The Google blog barely mentions any of this — Unsloth’s docs spill the real spec sheet.

I’ve been tracking local LLM tooling for a while. This is the first diffusion language model from Google that you can actually download and run on consumer hardware today; earlier open diffusion LLMs such as LLaDA and Dream came out of research groups. That’s a milestone, even if the quality isn’t “best in class.”

What DiffusionGemma Actually Is

DiffusionGemma is a 26B-parameter Mixture of Experts model (4B active per forward pass) built on the Gemma 4 family with a Gemini Diffusion research head. Google announced it June 10, 2026.

Instead of predicting the next token, it generates a 256-token “canvas” in parallel, then iteratively denoises it over up to 48 steps. Every token attends to every other token in that canvas (bi-directional attention), so the model can revise any position between steps. Adaptive stopping ends the loop early once entropy falls below a threshold of 0.005.
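To make that loop concrete, here’s a toy sketch in Python. It is not Google’s sampler or the llama.cpp implementation: `fake_denoise_step` is a hypothetical stand-in for a model forward pass, and averaging per-token entropy across the canvas is my assumption about how the threshold gets applied. Only the numbers (256 tokens, up to 48 steps, 0.005) come from the spec.

```python
import math

CANVAS_TOKENS = 256        # canvas size from the spec
MAX_STEPS = 48             # upper bound on denoising steps
ENTROPY_THRESHOLD = 0.005  # adaptive-stopping threshold


def entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fake_denoise_step(canvas, sharpen=2.0):
    """Hypothetical stand-in for a model pass: sharpen every distribution."""
    new_canvas = []
    for probs in canvas:
        powered = [p ** sharpen for p in probs]
        total = sum(powered)
        new_canvas.append([p / total for p in powered])
    return new_canvas


def denoise(canvas):
    """Refine the whole canvas at once until it is confident enough."""
    for step in range(1, MAX_STEPS + 1):
        canvas = fake_denoise_step(canvas)
        mean_entropy = sum(entropy(p) for p in canvas) / len(canvas)
        if mean_entropy < ENTROPY_THRESHOLD:
            break  # adaptive stopping: skip the remaining steps
    return canvas, step, mean_entropy


# Start from a noisy canvas: each token slightly prefers option 0 of 4.
noisy = [[0.4, 0.2, 0.2, 0.2] for _ in range(CANVAS_TOKENS)]
final, steps_used, final_entropy = denoise(noisy)
print(f"stopped after {steps_used} steps, mean entropy {final_entropy:.6f}")
```

The point of the sketch is the control flow: every position is updated on every step, and the loop exits as soon as the canvas is confident rather than always burning all 48 steps.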

It also supports Gemma 4-style thinking mode with a <|think|> token that emits an internal reasoning channel before the final answer.

The Hardware Reality

Quantization	RAM/VRAM Needed	Target Hardware
Q4_K_M / UD-Q4_K_XL	15–17 GB	RTX 5090, RTX 4090 (24GB), H100
Q8_0	27–29 GB	Dual 24GB+ GPUs, Mac Studio M1 Max/Ultra
BF16 / FP16	52 GB	Datacenter only

The 4-bit Q4_K_M is the sweet spot for consumer gear. Unsloth hosts the GGUF at unsloth/diffusiongemma-26B-A4B-it-GGUF with Apache 2.0 confirmed on the model card.
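If you want to sanity-check the table, the weight sizes fall out of simple arithmetic: total parameters times bits per weight. Here’s a rough sketch. The 8.5 bits for Q8_0 follows from its block layout, but the 4.85 figure for Q4_K_M is an approximate average I’m assuming, and none of this counts KV cache or runtime overhead.

```python
PARAMS = 26e9  # total parameters (26B); memory follows this, not the 4B active

# Approximate bits per weight. BF16 is exact. Q8_0 stores 32 int8 weights
# plus one fp16 scale per block, which works out to 8.5 bits. The Q4_K_M
# value is an assumed rough average, since that format mixes block types.
BITS_PER_WEIGHT = {"BF16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def weight_gb(params, bits):
    """Size of the weights alone, in decimal gigabytes."""
    return params * bits / 8 / 1e9


sizes = {name: weight_gb(PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()}
for name, gb in sizes.items():
    print(f"{name:7s} ~{gb:.1f} GB")
```

All three estimates land inside the ranges in the table, which is a decent sign the table is describing weights only and that you should leave headroom on top.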

Critical catch: You can’t just ollama run diffusiongemma. Standard llama.cpp doesn’t support the diffusion sampler yet. You need a custom build with PR #24423 (entropy_bounded_denoising sampler, linear_decay temperature 0.8→0.4, adaptive stopping). Unsloth Studio support is WIP.
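The “linear_decay temperature 0.8→0.4” part is easy to picture. Below is a minimal sketch of what such a schedule could look like over 48 steps. The function name and the choice to interpolate over the step index are my assumptions; I haven’t confirmed how the PR actually implements it.

```python
def linear_decay_temperature(step, total_steps, start=0.8, end=0.4):
    """Temperature for a 0-indexed step, falling linearly from start to end."""
    if total_steps <= 1:
        return start
    fraction = step / (total_steps - 1)
    return start + (end - start) * fraction


schedule = [linear_decay_temperature(s, 48) for s in range(48)]
print(f"first={schedule[0]:.2f} middle={schedule[24]:.2f} last={schedule[-1]:.2f}")
```

The intuition: sample more loosely while the canvas is still mostly noise, then tighten up as the text settles.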

Benchmarks: The Quality Trade-off Is Real

Google explicitly says: “For applications that demand maximum quality, we recommend deploying standard Gemma 4.” Unsloth’s benchmarks confirm it. It’s a similar playbook to how Google has been positioning against OpenAI — speed and accessibility over peak intelligence.

Text (vs Gemma 4 26B-A4B):
- MMLU Pro: 77.6% vs 82.6%
- Codeforces ELO: 1429 vs 1718
- LiveCodeBench v6: 69.1% vs 77.1%
- GPQA Diamond: 73.2% vs 82.3%
- AIME 2026: 69.1% vs 88.3%
- MRCR 8-needle 128K: 32.0% vs 44.1%

Where DiffusionGemma edges ahead:
- HLE no tools: 11.0% vs 8.7%
- AIME 2025 (Google blog): 23.3% vs 20.0%

Vision (vs Gemma 4 26B-A4B):
- MMMU Pro: 54.3% vs 73.8%
- OmniDocBench: 0.319 vs 0.149 (lower better)
- MATH-Vision: 70.5% vs 82.4%

This isn’t a “better model.” It’s a faster model for specific workloads.
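To see where the trade-off bites hardest, here’s a quick sketch that turns the percentage scores above into point gaps. The numbers are copied from the lists in this section; I left out Codeforces ELO and OmniDocBench because they use different scales, and the AIME 2025 figure because it comes from a different source.

```python
# (DiffusionGemma, Gemma 4 26B-A4B) scores in percent, from the lists above.
SCORES = {
    "MMLU Pro": (77.6, 82.6),
    "LiveCodeBench v6": (69.1, 77.1),
    "GPQA Diamond": (73.2, 82.3),
    "AIME 2026": (69.1, 88.3),
    "MRCR 8-needle 128K": (32.0, 44.1),
    "HLE no tools": (11.0, 8.7),
    "MMMU Pro": (54.3, 73.8),
    "MATH-Vision": (70.5, 82.4),
}

# Negative gap means DiffusionGemma trails Gemma 4 on that benchmark.
gaps = {name: round(dg - g4, 1) for name, (dg, g4) in SCORES.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:20s} {gap:+.1f} pts")
```

Sorted this way, the biggest deficits are on the math and vision-reasoning benchmarks, which lines up with Google’s own “use Gemma 4 for maximum quality” guidance.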

What It’s Actually Good For

Speed-critical interactive local workflows. That’s the positioning, and the benchmarks back it.

Strongest use cases:
- In-line editing and code infilling (bi-directional attention = every token sees future tokens)
- Rapid iteration / non-linear text structures
- OCR, document parsing, chart understanding (visual token budgets 70–1120)
- UI screenshot analysis, video frame analysis (up to 60s at 1fps)
- Constraint satisfaction tasks like Sudoku (Unsloth fine-tune demo solves it perfectly)
- Agentic workflows where latency kills UX
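The infilling point in the list above is worth a quick illustration. Here’s a tiny sketch comparing which positions a token can see under a causal mask versus a bidirectional one. These helper functions are mine, written for illustration only, and the 6-token canvas is just for readability.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token in the canvas may attend to every other token."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """Positions that token i is allowed to see."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


n = 6    # tiny canvas for readability; the real canvas is 256 tokens
gap = 2  # pretend position 2 is the hole being infilled
print("causal:       ", visible_positions(causal_mask(n), gap))
print("bidirectional:", visible_positions(bidirectional_mask(n), gap))
```

With the causal mask, the hole at position 2 only sees what came before it. With the bidirectional mask it also sees positions 3 through 5, which is exactly what you want when the text after the gap already exists.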

Don’t use it for:
- General reasoning, complex math, heavy coding tasks
- Production systems where peak intelligence matters
- Drop-in replacement for your current llama.cpp setup

How to Run It (TL;DR)

Easiest path (when ready): Unsloth Studio — auto-configures the diffusion sampler.

Manual path today: Build llama.cpp from source with PR #24423 applied:

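# Clone llama.cpp and check out the PR branch (needs the GitHub CLI, gh)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
gh pr checkout 24423
# Configure with CUDA, then build only the diffusion CLI target
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli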

Download the model:

pip install -U "huggingface_hub[cli]"
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --include "*Q4_K_M*"

Run with diffusion sampler:

./build/bin/llama-diffusion-cli \
    -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
    -ngl 99 -cnv -n 2048

Add --diffusion-visual to watch the canvas denoise in real time. It’s wild.
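If you’d rather toggle the visual mode from a script, here’s a small hypothetical helper that assembles the same command. `build_command` is my own name, not part of llama.cpp, and it only uses the flags already shown above. It prints the command instead of running it.

```python
import shlex

BINARY = "./build/bin/llama-diffusion-cli"
MODEL = (
    "unsloth/diffusiongemma-26B-A4B-it-GGUF/"
    "diffusiongemma-26B-A4B-it-Q4_K_M.gguf"
)


def build_command(visual=False, max_tokens=2048, gpu_layers=99):
    """Assemble the argv list for the run command shown above."""
    cmd = [BINARY, "-m", MODEL, "-ngl", str(gpu_layers), "-cnv", "-n", str(max_tokens)]
    if visual:
        cmd.append("--diffusion-visual")  # watch the canvas denoise live
    return cmd


print(shlex.join(build_command(visual=True)))
```

To actually launch it, pass the list to `subprocess.run(build_command(), check=True)` from the llama.cpp directory once the build and download steps are done.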

Tony’s Take

This is the first diffusion LLM from Google that you can actually download and run locally on consumer hardware. That matters.

The multimodal reveal is the real story. Google’s blog buried it — “text generation” in the headline, 256K context and vision in the Unsloth docs. This isn’t a “fast text model.” It’s a fast multimodal model that happens to use diffusion.

The quality trade-off is honest. Google says “use Gemma 4 for production.” Respect that. DiffusionGemma is for interactive speed, not peak intelligence.

The inference friction is real. You can’t just run it today without a custom llama.cpp build. That limits adoption to tinkerers — for now.

Apache 2.0 changes the calculus. Commercial use = builders can actually ship products on this. That’s rare for experimental Google drops.

What to Watch Next

llama.cpp PR #24423 merge + release — enables standard local inference
Unsloth Studio diffusiongemma support — removes config friction
Google org HF model cards — 2B/9B variants would be more accessible
Community benchmarks on RTX 4090/5090, Mac Studio M-series
Fine-tuning results beyond Sudoku (code, vision, reasoning)
Any Google follow-up: Gemma 4 Diffusion? API access?

Bottom Line

If you’ve got an RTX 5090/4090 and want to tinker with the fastest local LLM on the planet — bookmark the Unsloth docs and watch the llama.cpp PR. For everyone else: wait for the tooling to catch up, or stick with Gemma 4 for actual work.

Reviewed & Written By

Tony Simons

Independent tech reviewer and creator of Tony Reviews Things. 14 years of hands-on testing, software auditing, and workflow automation. I test the gear so you don't waste your money on junk.
