
Phase 0.5: How Should We Actually Serve Gemma 4?

The story in one sentence

The 96 GB of aggregate VRAM across four L4s was plenty for Gemma 4 31B on paper, so the question was never "does it fit" but "how should we serve it" — which precision, which parallelism strategy, and what throughput profile would suit an agentic harness making many small calls over long runs.

Why this was a separate phase

Even though the model was going to fit, the shape of how it fit mattered a lot. An agentic harness calls the model constantly: Architect planning, Worker executing, Reflector diagnosing. Latency compounds across hundreds of turns, and memory layout determines whether other workloads can share the box. Phase 0.5 was scoped as just the serving question: spin up Gemma 4 on the XR7620, measure actual memory footprints at different precisions and tensor parallelism settings, and pick a configuration that left room for the rest of the work.

The question and why it was non-trivial

The flagship Gemma 4 model in the 2026 release is the 31B dense variant. At bf16 that's roughly 62 GB of weights, not counting KV cache; at NVFP4 the published figure is ~15.5 GB. The XR7620 has 4× NVIDIA L4 at 24 GB each, so there is clearly enough aggregate VRAM either way. What wasn't obvious was how the model should be split across those four GPUs and which precision would give the best quality-per-token-per-second trade.
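The fit arithmetic is worth writing out. A minimal sketch of the naive weight-only math (KV cache, activations, and framework overhead all excluded):

```python
# Naive weight-only VRAM math for a 31B-parameter model.
PARAMS = 31e9

bf16_gb = PARAMS * 2 / 1e9      # 2 bytes per parameter
nvfp4_gb = PARAMS * 0.5 / 1e9   # 4 bits per parameter, the published figure
aggregate_gb = 4 * 24           # four L4s at 24 GB each

print(f"bf16 weights:   {bf16_gb:.1f} GB")   # 62.0 GB
print(f"NVFP4 weights:  {nvfp4_gb:.1f} GB")  # 15.5 GB
print(f"aggregate VRAM: {aggregate_gb} GB")  # 96 GB
```

Either precision clears the aggregate budget; the interesting question is how the weights distribute across individual 24 GB cards.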

The complication: no NVLink. The four L4s are on PCIe Gen4 with ~32 GB/s effective inter-GPU bandwidth. Tensor parallelism normally wants much more bandwidth than that, so a wider TP degree looked likely to bleed throughput to all-reduce overhead. The theory suggested sticking to low TP with quantization and leaving two GPUs free for other workloads. The theory was only partly right.
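A back-of-envelope estimate shows why PCIe-only tensor parallelism raised concern. Every number below is an assumption for illustration — layer count, per-all-reduce latency, and decode rate are not measured figures — but the shape of the problem holds: decode-time all-reduces are tiny messages, so per-operation latency rather than bandwidth dominates, and it recurs twice per layer per token:

```python
# Rough cost of tensor-parallel all-reduces over PCIe (no NVLink).
# All figures are assumptions for illustration, not measurements:
LAYERS = 60            # assumed transformer layer count
OPS_PER_LAYER = 2      # one all-reduce after attention, one after the MLP
LATENCY_US = 50.0      # assumed per-all-reduce latency over PCIe Gen4
DECODE_TOK_S = 15.0    # assumed decode rate, to size the token budget

comm_ms = LAYERS * OPS_PER_LAYER * LATENCY_US / 1000   # per generated token
token_ms = 1000 / DECODE_TOK_S

print(f"all-reduce latency per token: ~{comm_ms:.0f} ms")
print(f"share of a {token_ms:.1f} ms token budget: {comm_ms / token_ms:.0%}")
```

Under these assumptions, communication overhead eats a visible single-digit percentage of every token, and it does not shrink as TP widens — hence the instinct to keep TP narrow.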

A second angle worth checking: if NVFP4 really did fit in ~15.5 GB, the whole model would live on a single L4 with headroom. That would open the door to running multiple model instances in parallel on different GPUs, side-stepping tensor parallelism entirely.

The first surprise: the NVFP4 VRAM math is wrong

The naive calculation is straightforward. A 31B model at 4-bit precision is 31B × 4 bits / 8 bits per byte = 15.5 GB. Published recipes cited this number. It is, however, incorrect for how NVFP4 actually works on L4-class hardware.

Two things happen that the naive calculation ignores:

  1. The attention layers stay in bf16. NVFP4 is a hybrid quantization: the MLP weights (the bulk of the parameters) are quantized to FP4, but attention is preserved at full precision because attention is where reasoning chains live and quantizing it measurably degrades quality. For a 31B model this means several billion parameters (roughly 9 GB of weights) stay at 16-bit rather than 4-bit; add the per-block quantization scale factors and the actual VRAM footprint is ~22 GB, not 15.5 GB.

  2. The L4 doesn't have native FP4 tensor cores. L4 is Ada Lovelace (compute capability 8.9). It can store the FP4 weights compactly, but at computation time it has to dequantize them to bf16 before running the matmul, because this generation's tensor cores have no FP4 path. The compute is therefore bf16 regardless of the storage format: NVFP4 saves memory on this hardware, but not compute cycles.
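The corrected arithmetic can be sketched as follows. The attention-parameter split is an assumption for illustration (the real Gemma 4 layer dimensions would refine it); the point is that keeping any multi-billion-parameter slice at bf16 moves the total well past the naive figure:

```python
# Hybrid NVFP4 footprint: MLP weights at 4 bits, attention kept at bf16.
# The 4.5B unquantized-parameter split is an assumption for illustration.
TOTAL = 31e9
ATTN = 4.5e9                                   # params left at bf16

naive_gb = TOTAL * 0.5 / 1e9                   # everything at 4 bits
hybrid_gb = (ATTN * 2 + (TOTAL - ATTN) * 0.5) / 1e9

print(f"naive:  {naive_gb:.1f} GB")            # 15.5 GB
print(f"hybrid: {hybrid_gb:.1f} GB")           # ~22 GB before scales/overhead
```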

The discovery that NVFP4 is really 22 GB on L4, not 15.5, is documented in gotchas/nvfp4-vram-math. It's the kind of detail that is obvious if you read the modelopt quantization recipe carefully but surprising if you trust the one-line published figure.

What actually worked

After testing four configurations on the actual hardware (documented in journey/02-model-strategy), the winning configuration for Phase 0.5 was:

Gemma 4 31B NVFP4, tensor-parallel=2, across GPUs 0+1.

  • ~22 GB per GPU after loading
  • KV cache fits in the remaining ~2 GB with --max-model-len 4096
  • 15.1 tok/s sustained at 512 output tokens
  • Cold load time ~180 seconds
  • GPUs 2+3 left free for other concurrent workloads
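As a concrete sketch, a launch along these lines would pin vLLM to GPUs 0+1. The model path is a placeholder and the exact quantization plumbing depends on the checkpoint; only tensor_parallel_size=2 and max_model_len=4096 come from the configuration above:

```python
import os

# Pin the process to GPUs 0+1 before any CUDA initialization,
# leaving GPUs 2+3 untouched for concurrent workloads.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="gemma-4-31b-nvfp4",   # placeholder checkpoint path
    tensor_parallel_size=2,      # TP=2 across the two visible L4s
    max_model_len=4096,          # keeps KV cache within the remaining headroom
)

out = llm.generate(["Plan the next step."], SamplingParams(max_tokens=512))
print(out[0].outputs[0].text)
```

Setting CUDA_VISIBLE_DEVICES before importing vLLM is what keeps the other two GPUs invisible to the engine.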

This validated Phase 0.5. Gemma 4 31B served well on this hardware, at roughly 3× human reading speed, with two GPUs still free for other workloads. The question "how should we serve Gemma 4 for an agentic harness on this box" had a concrete answer that the project could build on, and Phase 1 could start.
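The reading-speed comparison is easy to sanity-check. The reading rate and tokens-per-word figures below are assumptions, not measurements from this phase:

```python
# Sanity check of the "~3x human reading speed" claim.
GEN_TOK_S = 15.1          # measured sustained throughput from this phase
WORDS_PER_MIN = 250       # assumed adult reading speed
TOK_PER_WORD = 1.3        # assumed tokenizer fertility for English prose

reading_tok_s = WORDS_PER_MIN * TOK_PER_WORD / 60
print(f"{GEN_TOK_S / reading_tok_s:.1f}x reading speed")
```

Under these assumptions the ratio comes out just under 3×, consistent with the claim.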

What Phase 0.5 did NOT validate

Important caveats about what this phase was and wasn't:

  • It did NOT validate any specific architectural choice for the harness. The agentic structure was still open at this point.
  • It did NOT validate that this model configuration was the final one. See journey/12-bf16-tp4-full-precision for a later discovery — that bf16 tp=4 across all four GPUs matches NVFP4 tp=2 throughput at full precision — which reshaped the ultimate production configuration.
  • It did NOT validate multi-tenant deployment. Running multiple models concurrently on different GPU groups would come later, after the agent role assignments were understood.

Phase 0.5 answered one question: can the model run, at a usable speed, on this hardware? Yes. Move to Phase 1.

What this phase taught us

Two things worth writing down for anyone doing similar validation work:

  1. Always test the actual hardware, not the published numbers. The NVFP4 VRAM math was wrong, and we only caught it by loading the model and watching nvidia-smi. Published recipes assume specific GPU generations; if you are on different silicon, the published numbers may not apply. Run the command, watch the memory.

  2. Phase 0.5 gates are worth the discipline. It's tempting to skip "can it run" validation and go straight to architecture because the answer feels obvious. But when the answer is "actually no, and here is why," discovering it at Phase 0.5 saves you from writing a bunch of architecture that has to be undone. It's cheap discipline with a big downside avoided.