Google just dropped an experimental open model that breaks the autoregressive paradigm — and it’s another sign of Google’s AI momentum ramping up fast. It’s called DiffusionGemma. It generates text via diffusion — parallel 256-token “canvases” that iteratively denoise — instead of the usual token-by-token approach.
The result: up to 4x faster inference on local GPUs. 1000+ tokens per second on an H100. 700+ on an RTX 5090. Fits in 18GB RAM at 4-bit quantization. Apache 2.0 license.
Oh, and it’s multimodal. 256K context. 140+ languages. Processes images and 60-second video clips. The Google blog barely mentions any of this — Unsloth’s docs spill the real spec sheet.
I’ve been tracking local LLM tooling for a while. This is the first diffusion language model you can actually download and run on consumer hardware today. That’s a milestone, even if the quality isn’t “best in class.”
What DiffusionGemma Actually Is
DiffusionGemma is a 26B-parameter Mixture of Experts model (4B active per forward pass) built on the Gemma 4 family with a Gemini Diffusion research head. Google announced it June 10, 2026.
Instead of predicting the next token, it generates a 256-token “canvas” in parallel, then iteratively denoises it over up to 48 steps. Every token attends to every other token in that canvas (bi-directional attention), so later steps can revise earlier guesses. Adaptive stopping ends the loop early once entropy drops below a threshold of 0.005.
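To make the loop concrete, here is a toy sketch in plain Python. It is not Google's or llama.cpp's implementation: `toy_model_step` is a hypothetical stand-in that already knows the answer, the 8-token vocabulary is made up, and I'm assuming the stop criterion is mean per-token entropy across the canvas, which the docs quoted here don't spell out. Only the canvas size, step cap, and threshold come from the article.

```python
import math
import random

CANVAS_LEN = 256      # canvas size quoted in the article
MAX_STEPS = 48        # step cap quoted in the article
ENTROPY_STOP = 0.005  # stopping threshold quoted in the article
VOCAB_SIZE = 8        # toy vocabulary (assumption, for illustration only)


def entropy(probs):
    """Shannon entropy (nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_model_step(dist, target, gain):
    """Hypothetical stand-in for a denoiser: shift mass toward `target`."""
    weights = [p * (gain if tok == target else 1.0) for tok, p in enumerate(dist)]
    total = sum(weights)
    return [w / total for w in weights]


def denoise_canvas(target, seed=0):
    """Refine every position in parallel until mean entropy is low enough."""
    rng = random.Random(seed)
    # Start from pure noise: a uniform distribution at every position.
    canvas = [[1.0 / VOCAB_SIZE] * VOCAB_SIZE for _ in target]
    # Positions converge at different speeds, as easy and hard tokens would.
    gains = [rng.uniform(1.5, 3.0) for _ in target]
    mean_h = math.log(VOCAB_SIZE)
    steps = 0
    for steps in range(1, MAX_STEPS + 1):
        # One "forward pass" updates all positions at once.
        canvas = [toy_model_step(d, t, g) for d, t, g in zip(canvas, target, gains)]
        mean_h = sum(entropy(d) for d in canvas) / len(canvas)
        if mean_h < ENTROPY_STOP:  # adaptive stopping (assumed: mean entropy)
            break
    # Read out the most likely token at each position.
    tokens = [max(range(VOCAB_SIZE), key=d.__getitem__) for d in canvas]
    return tokens, steps, mean_h


target = [i % VOCAB_SIZE for i in range(CANVAS_LEN)]
tokens, steps, mean_h = denoise_canvas(target)
print(f"stopped after {steps} of {MAX_STEPS} steps, mean entropy {mean_h:.4f}")
```

The point of the toy is the shape of the loop: each pass touches all 256 positions, and the loop exits as soon as the canvas is confident instead of always burning the full step budget.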
It also supports Gemma 4-style thinking mode with a <|think|> token that emits an internal reasoning channel before the final answer.
The Hardware Reality
| Quantization | RAM/VRAM Needed | Target Hardware |
|---|---|---|
| Q4_K_M / UD-Q4_K_XL | 15–17 GB | RTX 5090, RTX 4090 (24GB), H100 |
| Q8_0 | 27–29 GB | Dual 24GB+ GPUs, Mac Studio M1 Max/Ultra |
| BF16 / FP16 | 52 GB | Datacenter only |
The 4-bit Q4_K_M is the sweet spot for consumer gear. Unsloth hosts the GGUF at unsloth/diffusiongemma-26B-A4B-it-GGUF with Apache 2.0 confirmed on the model card.
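The table's numbers are easy to sanity-check with weights-only arithmetic. A rough sketch, with two caveats: the Q4_K_M bits-per-weight figure is an assumed average (the real value varies by model), and this ignores KV cache, activations, and runtime overhead. All 26B parameters count here, since an MoE model still has to hold every expert in memory even though only 4B are active per pass.

```python
TOTAL_PARAMS = 26e9  # 26B total parameters, per the article

# Approximate bits per weight for each format.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 8.5,     # 32 int8 weights plus one fp16 scale per block
    "Q4_K_M": 4.85,  # assumption: typical average for this mixed format
}


def weight_gb(params, bits_per_weight):
    """Weights-only footprint in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weight_gb(TOTAL_PARAMS, bpw):.1f} GB")
```

Under those assumptions each estimate lands inside the range the table gives for its row, which suggests the table is mostly counting weights.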
Critical catch: You can’t just ollama run diffusiongemma. Standard llama.cpp doesn’t support the diffusion sampler yet. You need a custom build with PR #24423 (entropy_bounded_denoising sampler, linear_decay temperature 0.8→0.4, adaptive stopping). Unsloth Studio support is WIP.
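For a feel of what a linear_decay schedule from 0.8 to 0.4 looks like, here is a minimal sketch. The function name and the exact interpolation (0-indexed steps, endpoints included) are my assumptions, not code lifted from the PR.

```python
def linear_decay_temperature(step, total_steps, start=0.8, end=0.4):
    """Temperature for a 0-indexed denoising step, decaying linearly."""
    if total_steps < 1 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if total_steps == 1:
        return start
    frac = step / (total_steps - 1)  # 0.0 at the first step, 1.0 at the last
    return start + (end - start) * frac


# Hotter sampling early (explore), cooler sampling late (commit).
schedule = [linear_decay_temperature(s, 48) for s in range(48)]
print(f"first={schedule[0]:.2f} middle={schedule[24]:.2f} last={schedule[-1]:.2f}")
```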
Benchmarks: The Quality Trade-off Is Real
Google explicitly says: “For applications that demand maximum quality, we recommend deploying standard Gemma 4.” Unsloth’s benchmarks confirm it. It’s a similar playbook to how Google has been positioning against OpenAI — speed and accessibility over peak intelligence.
Text (vs Gemma 4 26B-A4B):
- MMLU Pro: 77.6% vs 82.6%
- Codeforces ELO: 1429 vs 1718
- LiveCodeBench v6: 69.1% vs 77.1%
- GPQA Diamond: 73.2% vs 82.3%
- AIME 2026: 69.1% vs 88.3%
- MRCR 8-needle 128K: 32.0% vs 44.1%

Where DiffusionGemma edges ahead:
- HLE no tools: 11.0% vs 8.7%
- AIME 2025 (Google blog): 23.3% vs 20.0%

Vision (vs Gemma 4 26B-A4B):
- MMMU Pro: 54.3% vs 73.8%
- OmniDocBench: 0.319 vs 0.149 (lower is better)
- MATH-Vision: 70.5% vs 82.4%
This isn’t a “better model.” It’s a faster model for specific workloads.
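If you want to see where the trade-off bites hardest, a few lines of Python will rank the gaps. The scores are copied from the percentage benchmarks above. I left out Codeforces ELO and OmniDocBench because they aren't percentages.

```python
# (DiffusionGemma, Gemma 4 26B-A4B) scores in percent, from the lists above.
SCORES = {
    "MMLU Pro": (77.6, 82.6),
    "LiveCodeBench v6": (69.1, 77.1),
    "GPQA Diamond": (73.2, 82.3),
    "AIME 2026": (69.1, 88.3),
    "MRCR 8-needle 128K": (32.0, 44.1),
    "MMMU Pro": (54.3, 73.8),
    "MATH-Vision": (70.5, 82.4),
    "HLE no tools": (11.0, 8.7),
}


def point_gaps(scores):
    """Percentage-point gap (DiffusionGemma minus Gemma 4), worst first."""
    gaps = {name: round(dg - g4, 1) for name, (dg, g4) in scores.items()}
    return sorted(gaps.items(), key=lambda item: item[1])


for name, gap in point_gaps(SCORES):
    print(f"{name:<20} {gap:+.1f} pts")
```

The biggest deficits are MMMU Pro and AIME 2026, at roughly 19 points each. That lines up with the "not for heavy reasoning or vision" read.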
What It’s Actually Good For
Speed-critical interactive local workflows. That’s the positioning, and the benchmarks back it.
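Here's the back-of-envelope version of that claim. It is a sketch that assumes the quoted throughput is sustained for the whole reply and that canvases are produced one after another. Neither is confirmed by the sources. The throughput figures are the article's "1000+" and "700+" floors, so the times are upper estimates.

```python
import math

CANVAS_TOKENS = 256  # tokens per canvas, per the article
MAX_STEPS = 48       # denoising step cap, per the article


def generation_budget(n_tokens, tokens_per_second):
    """Return (seconds, canvases, worst-case denoising passes) for a reply."""
    seconds = n_tokens / tokens_per_second
    canvases = math.ceil(n_tokens / CANVAS_TOKENS)
    return seconds, canvases, canvases * MAX_STEPS


for gpu, tps in {"H100": 1000, "RTX 5090": 700}.items():
    secs, canvases, passes = generation_budget(2048, tps)
    print(f"{gpu}: ~{secs:.1f}s, {canvases} canvases, at most {passes} passes")
```

A plain token-by-token decoder needs one sequential pass per token, so 2048 passes for the same reply. Even in the worst case the diffusion loop needs far fewer sequential passes, and adaptive stopping can cut that further.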
Strongest use cases:
- In-line editing and code infilling (bi-directional attention = every token sees future tokens)
- Rapid iteration / non-linear text structures
- OCR, document parsing, chart understanding (visual token budgets 70–1120)
- UI screenshot analysis, video frame analysis (up to 60s at 1fps)
- Constraint satisfaction tasks like Sudoku (Unsloth fine-tune demo solves it perfectly)
- Agentic workflows where latency kills UX

Don’t use it for:
- General reasoning, complex math, heavy coding tasks
- Production systems where peak intelligence matters
- Drop-in replacement for your current llama.cpp setup
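The video and visual-token numbers in that list combine into a quick context-budget check. Two loud assumptions here: that the 70–1120 visual token budget applies per frame, and that "256K" means 262,144 tokens. The docs quoted in this post don't confirm either, so treat this as a sketch.

```python
FPS = 1                        # sampling rate quoted in the article
MAX_CLIP_SECONDS = 60          # clip length quoted in the article
TOKENS_PER_FRAME = (70, 1120)  # visual token budget range quoted in the article
CONTEXT_TOKENS = 256 * 1024    # assumption: "256K" means 262,144 tokens


def video_token_range(seconds=MAX_CLIP_SECONDS, fps=FPS):
    """Low and high visual-token totals, assuming the budget is per frame."""
    frames = seconds * fps
    low, high = TOKENS_PER_FRAME
    return frames * low, frames * high


low, high = video_token_range()
print(f"{low} to {high} tokens, up to {high / CONTEXT_TOKENS:.0%} of context")
```

If the per-frame assumption holds, a full 60-second clip at the top budget would still leave most of the context window free for text.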
How to Run It (TL;DR)
Easiest path (when ready): Unsloth Studio — auto-configures the diffusion sampler.
Manual path today: Build llama.cpp from source with PR #24423 applied:

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
gh pr checkout 24423   # requires the GitHub CLI (gh)
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli
```

Download the model:

```bash
pip install -U "huggingface_hub[cli]"
# --include takes a glob pattern, so the wildcards are needed to match the .gguf file
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
  --local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
  --include "*Q4_K_M*"
```

Run with diffusion sampler:

```bash
./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048
```
Add --diffusion-visual to watch the canvas denoise in real time. It’s wild.
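If you'd rather drive the CLI from a script, a thin wrapper like this works as a starting point. It is a sketch: the helper names are mine, the paths are the ones used in the steps above, and it only passes flags that appear in this post. Nothing here is an official API.

```python
import subprocess
from pathlib import Path

# Paths as used in the build and download steps above.
BINARY = Path("build/bin/llama-diffusion-cli")
MODEL = Path(
    "unsloth/diffusiongemma-26B-A4B-it-GGUF/"
    "diffusiongemma-26B-A4B-it-Q4_K_M.gguf"
)


def build_command(max_tokens=2048, gpu_layers=99, visual=False):
    """Assemble the argv list using only flags that appear in this article."""
    cmd = [str(BINARY), "-m", str(MODEL),
           "-ngl", str(gpu_layers), "-cnv", "-n", str(max_tokens)]
    if visual:
        cmd.append("--diffusion-visual")
    return cmd


def run(visual=False):
    """Launch the CLI if both files exist; otherwise report what is missing."""
    missing = [p for p in (BINARY, MODEL) if not p.exists()]
    if missing:
        raise FileNotFoundError(f"missing: {', '.join(map(str, missing))}")
    return subprocess.run(build_command(visual=visual), check=True)


print(" ".join(build_command(visual=True)))
```

Call `run(visual=True)` from the llama.cpp checkout directory once the build and download have finished.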
Tony’s Take
This is the first diffusion LLM you can actually download and run locally on consumer hardware. That matters.
The multimodal reveal is the real story. Google’s blog buried it — “text generation” in the headline, 256K context and vision in the Unsloth docs. This isn’t a “fast text model.” It’s a fast multimodal model that happens to use diffusion.
The quality trade-off is honest. Google itself recommends standard Gemma 4 for applications that demand maximum quality. Respect that. DiffusionGemma is for interactive speed, not peak intelligence.
The inference friction is real. You can’t just run it today without a custom llama.cpp build. That limits adoption to tinkerers — for now.
Apache 2.0 changes the calculus. Commercial use = builders can actually ship products on this. That’s rare for experimental Google drops.
What to Watch Next
- llama.cpp PR #24423 merge + release — enables standard local inference
- Unsloth Studio diffusiongemma support — removes config friction
- Google org HF model cards — 2B/9B variants would be more accessible
- Community benchmarks on RTX 4090/5090, Mac Studio M-series
- Fine-tuning results beyond Sudoku (code, vision, reasoning)
- Any Google follow-up: Gemma 4 Diffusion? API access?
Bottom Line
If you’ve got an RTX 5090/4090 and want to tinker with the fastest local LLM on the planet — bookmark the Unsloth docs and watch the llama.cpp PR. For everyone else: wait for the tooling to catch up, or stick with Gemma 4 for actual work.
Sources: Google Blog announcement, Unsloth DiffusionGemma docs, Hugging Face model card (Unsloth org), Google DeepMind Gemini Diffusion page, @googlegemma X announcement.



