· engineering · 6 min read
Google's Gemma 4 is Apache 2.0, and That Changes the Local Model Conversation
Gemma 4 ships under Apache 2.0 with native audio, vision, reasoning, and function calling. Here is what it means for teams running local models and building AI-powered products.

Google just released Gemma 4, and the headline feature isn’t a benchmark number. It’s the license.
Gemma 4 ships under Apache 2.0. Not a custom license with restrictions buried in the fine print, not a “you can use it but don’t compete with us” arrangement. A proper, no-strings Apache 2.0 license. You can take these models, fine-tune them, deploy them commercially, modify them however you want. For a model family this capable coming out of Google, that’s a first.
The timing is interesting too. Some of the Chinese open model providers have been pulling back on openness with their latest releases. Google appears to be moving in the opposite direction.
What’s actually in the box
Gemma 4 is a family of four models split into two tiers.
The workstation tier includes a 31 billion parameter dense model and a 26 billion parameter mixture-of-experts (MoE) model. The MoE variant uses 128 tiny experts but only activates about 3.8 billion parameters per token, so you get roughly the quality of a dense 27B model at closer to the compute cost of a 4B one. Both workstation models support a 256K context window.
The edge tier is the E2B and E4B, designed to run on phones, Raspberry Pis, and Jetson Nanos. These are the two models in the family that support audio natively, and they come with a 128K context window.
All four models include vision support, chain-of-thought reasoning, and function calling baked in at the architecture level.
The license matters more than you think
We’ve been running local models for clients in situations where data can’t leave their infrastructure. Until now, that meant weighing up trade-offs between model quality and licensing restrictions. Llama had its own usage restrictions. Qwen had questions around its license terms for commercial use. Mistral had different terms again.
Gemma 3 was capable, but the license pushed a lot of developers toward alternatives. Gemma 4 under Apache 2.0 removes that friction entirely. If you’re building a product, running a local deployment, or fine-tuning for a specific use case, you don’t need to check with a lawyer first.
For the kind of work we do at WebArt Design, building custom AI features into client products, that simplifies things considerably. We can pick models based on capability and fit, without worrying about whether the license will cause problems twelve months later.
Multimodality built in, not bolted on
Previous generations of open models generally handled text well and maybe offered basic vision support. If you wanted audio, you were bolting on Whisper or some other ASR pipeline as a separate system. Function calling was usually a matter of hoping the model cooperated with your prompt template.
Gemma 4 builds all of this in natively. Vision, audio (on the edge models), reasoning, and function calling are part of the architecture, not afterthoughts.
The vision encoder has been redesigned with native aspect ratio processing. You can feed in a document, a screenshot, or a photo at its actual dimensions, and the model handles it properly. That matters for OCR, document understanding, and any workflow where you’re processing real-world images rather than resized squares.
The audio encoder on the edge models has been compressed significantly compared to the previous Gemma 3n, dropping from 681 million parameters down to 305 million and from 390MB on disk to 87MB. Frame duration went from 160ms down to 40ms, which should make transcription noticeably more responsive. The edge models also support speech-to-translated-text, so you can speak in one language and get output in another, all on-device.
Function calling that actually works for agents
This is the part that matters most if you’re building agentic systems. Previous open models handled function calling by being good at instruction following and then hoping for the best. Gemma 4 has function calling optimised from scratch for multi-turn agentic workflows with multiple tools.
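To make "multi-turn agentic workflows" concrete: the loop is declare tools, let the model emit a tool call, execute it, feed the result back. A minimal sketch of the execution side, using the JSON-schema style tool declaration that most chat APIs (including Ollama's `tools` parameter) share. The tool itself is an illustrative stand-in, not anything Gemma 4 ships with:

```python
import json
from typing import Any, Callable

# Registry mapping tool names to Python callables. This tool is an
# illustrative stand-in for whatever your agent actually exposes.
TOOLS: dict[str, Callable[..., Any]] = {
    "get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

# JSON-schema style declaration, passed in the `tools` field of a chat request
# so the model knows what it can call and with which arguments.
TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the shipping status of an order",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }
]


def dispatch_tool_call(call: dict) -> dict:
    """Execute one model-issued tool call and wrap the result as a tool message."""
    name = call["function"]["name"]
    args = call["function"]["arguments"]
    if isinstance(args, str):  # some runtimes return arguments as a JSON string
        args = json.loads(args)
    result = TOOLS[name](**args)
    # Append this message to the conversation and call the model again.
    return {"role": "tool", "name": name, "content": json.dumps(result)}
```

In a real agent you run this in a loop: each tool message goes back into the conversation history until the model answers in plain text instead of calling another tool.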
The benchmark results back this up. On the τ2-bench agentic tool use benchmark (retail), the 31B model hits 86.4% and the 26B MoE hits 85.5%. For comparison, Gemma 3 27B scored 6.6% on the same test. That’s not an incremental improvement. The model went from barely functional at tool use to competitive with commercial APIs.
We’ve been building custom AI interfaces for clients with tens of thousands of daily users, and reliable function calling is one of the hardest problems to solve with open models. If the Gemma 4 numbers hold up in production, it could replace commercial API calls for a lot of the agentic workflows we currently run through hosted providers.
Running it locally
Google is releasing quantisation-aware training (QAT) checkpoints alongside the model weights. This means the quantised versions maintain higher quality than post-training quantisation would give you, which matters when you're deploying on consumer hardware.
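A quick back-of-envelope helps here: weight footprint is just parameter count times bits per weight. This sketch deliberately ignores KV cache, activations, and runtime overhead, which all add more on top:

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB.

    Ignores KV cache, activations, and runtime overhead, so treat the
    result as a floor, not a budget.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# The 26B MoE at 4-bit: every expert must be resident in memory even
# though only ~3.8B parameters are active per token.
print(round(weight_footprint_gb(26, 4), 1))  # 13.0
print(round(weight_footprint_gb(31, 4), 1))  # 15.5
```

So at 4-bit, the MoE's weights fit on a 16GB consumer GPU with some room for context, while the dense 31B is tight on the same card, which matches Google's workstation positioning.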
The MoE model at 26B total parameters with only 3.8B active should be very runnable on consumer GPUs. It’s already available on Ollama, LM Studio, Hugging Face, and Kaggle. The 31B dense model will need more headroom, but Google is positioning it as a workstation-class model for local coding assistants, IDE copilots, or small-server deployments serving multiple users.
The edge models are very small. The E2B can run on a phone with near-zero latency, completely offline. If you’re building an on-device voice assistant or any edge application where data can’t touch the cloud, these are the models to test first.
What we’re watching
The benchmark numbers are strong. The 31B model scores 89.2% on AIME 2026 (mathematics), 80% on LiveCodeBench v6 (competitive coding), and 84.3% on GPQA Diamond (scientific knowledge). The MoE model trails only slightly behind on most of these, which is impressive given the difference in active parameters.
But benchmarks only tell part of the story. What we care about is how these models behave in production, under real load, with real users asking unpredictable things. We’ll be running the MoE model through our own evaluation pipeline over the coming weeks, particularly for function calling reliability and multi-turn conversation coherence.
Google says Gemma 4 is built from their Gemini 3 research, meaning architecture innovations from their flagship commercial models are now available in the open weights versions. If that claim holds up in practice, this could shift the calculus on when it makes sense to run a local model versus calling a commercial API.
What this means if you’re buying, not building
If you’ve been avoiding open models because of licensing headaches or capability gaps, Gemma 4 removes most of those objections. Apache 2.0 licensing, multimodal support that doesn’t require bolting on extra systems, function calling that actually works, and models that run on hardware you already own.
It won’t replace commercial APIs for everything. But for applications where data needs to stay on your infrastructure, or where per-inference costs are eating into margins, Gemma 4 makes the local deployment option a lot more credible than it was six months ago.
If you’re weighing up local model options or building AI features into your product, we’re happy to walk through whether Gemma 4 fits your setup.