The Model Strategy: bf16, NVFP4, and the VRAM Reality

I tried four different configurations of the Gemma 4 31B model on L4 GPUs before finding the one that actually works, and the answer (NVFP4 + tp=2) turned out to be both the most practical AND the most compelling demo story.

What I planned (ADR-0015 original)

The original model lineup followed the official vLLM Gemma 4 recipe verbatim:

- Architect + Worker: Gemma 4 31B-IT, bf16, tensor_parallel_size=2
- Auditor: Gemma 4 E4B, bf16
- Sentry: Gemma 4 E2B, bf16

The recipe says tp=2 for the 31B. I assumed this meant 2 L4s.

The VRAM reality

The recipe was written for A100/H100 (80 GB per GPU). On L4 (24 GB per GPU):

Config	VRAM per GPU	Fits on L4?	Result
31B bf16, tp=2	~31 GB	No (31 > 24)	OOM
31B bf16, pp=2	~31 GB	No	OOM
31B bf16, tp=4	~15.5 GB	Yes but uses ALL GPUs	No room for E4B/E2B
31B NVFP4, tp=1	~22 GB	No (fills L4, no KV cache)	OOM
31B NVFP4, tp=2	~11 GB	Yes, 13 GB headroom	15.1 tok/s

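At bf16 (2 bytes per parameter), the 31B model's weights come to about 62 GB. Split evenly across 2 L4s (48 GB combined), that is still 31 GB per GPU, which exceeds the 24 GB capacity before any KV cache is allocated. The official recipe simply doesn't work on L4 hardware.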
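The fit check behind the table is simple division, sketched below. The footprints (62 GB for bf16, 22 GB for NVFP4) and the 24 GB L4 capacity are the figures quoted above; the helper names and the even-split assumption are illustrative only. The estimate covers weights alone and ignores KV cache, activations, and runtime overhead, so positive headroom is a necessary condition, not a guarantee.

```python
L4_VRAM_GB = 24.0  # per-GPU capacity quoted in the text


def per_gpu_weights_gb(model_gb: float, num_gpus: int) -> float:
    """Weights-only footprint per GPU, assuming an even split (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be >= 1")
    return model_gb / num_gpus


def headroom_gb(model_gb: float, num_gpus: int, vram_gb: float = L4_VRAM_GB) -> float:
    """VRAM left per GPU after weights; a negative value means the weights alone do not fit."""
    return vram_gb - per_gpu_weights_gb(model_gb, num_gpus)


# (label, total weight footprint in GB, number of GPUs the weights are split across)
configs = [
    ("31B bf16, tp=2", 62.0, 2),
    ("31B bf16, tp=4", 62.0, 4),
    ("31B NVFP4, tp=1", 22.0, 1),
    ("31B NVFP4, tp=2", 22.0, 2),
]

for label, model_gb, gpus in configs:
    share = per_gpu_weights_gb(model_gb, gpus)
    spare = headroom_gb(model_gb, gpus)
    print(f"{label}: {share:.1f} GB/GPU, headroom {spare:+.1f} GB")
```

The NVFP4 tp=1 case shows why weights-only headroom is not enough: 2 GB remains on paper, but that leaves no practical room for KV cache.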

The NVFP4 discovery

The question was whether NVFP4 quantization could actually run on L4s.

NVIDIA published nvidia/Gemma-4-31B-IT-NVFP4 on HuggingFace, produced by modelopt v0.37.0. Key findings:

NVIDIA's NVFP4 recipe keeps all 60 self-attention layers at full bf16 precision; only the MLP/FFN layers are quantized to FP4. This is a quality-preserving choice, on the view that attention is where reasoning chains live.
On-disk: 31 GB. In-VRAM: 22 GB, not the naive 15.5 GB estimate (31B parameters × 4 bits / 8 bits per byte). The full-precision attention accounts for the difference.
On a single L4: OOM. 22 GB model fills the 24 GB GPU completely, leaving nothing for KV cache.
With tp=2: 11 GB per GPU, 13 GB headroom. This is the sweet spot.
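The gap between the naive 15.5 GB estimate and the observed 22 GB can be turned into a rough estimate of how much of the model stays in bf16. The sketch below solves a two-bucket model (bf16 at 2 bytes per parameter, FP4 at 0.5 bytes per parameter) for the bf16 share. It is a back-of-envelope check, not a description of the actual checkpoint layout: it ignores NVFP4 scale factors, embeddings, and runtime buffers, and the function name is made up for illustration.

```python
def implied_bf16_fraction(
    params_billions: float,
    observed_gb: float,
    hi_bytes: float = 2.0,   # bf16: 2 bytes per parameter
    lo_bytes: float = 0.5,   # FP4: 4 bits = 0.5 bytes per parameter
) -> float:
    """Solve observed = params * (f * hi + (1 - f) * lo) for the bf16 fraction f."""
    all_lo_gb = params_billions * lo_bytes   # everything quantized to FP4
    all_hi_gb = params_billions * hi_bytes   # nothing quantized
    return (observed_gb - all_lo_gb) / (all_hi_gb - all_lo_gb)


naive_gb = 31 * 0.5                        # the naive all-FP4 estimate from the text
fraction = implied_bf16_fraction(31, 22)   # 22 GB is the in-VRAM figure from the text

print(f"naive all-FP4 estimate: {naive_gb:.1f} GB")
print(f"implied bf16 share of parameters: {fraction:.1%}")
```

Under these simplifying assumptions, a bit under one seventh of the parameters would have to be held in bf16 to explain the 22 GB footprint.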

The quality test

I tested the NVFP4 31B with a STIG remediation prompt (V-257844: SSH FIPS key exchange). The model:

- Produced a structured remediation plan with backup, fix, rollback, and validation steps
- Named the FIPS-validated key exchange algorithms
- Mentioned fips-mode-setup --check as a precondition
- Included sshd -t for config syntax validation
- Provided the exact ssh -vv validation command

On this prompt, NVFP4 showed no visible degradation in reasoning, which is consistent with the recipe's attention-preservation strategy working as intended.

TP vs PP comparison

The question was which way to spread the model over the two L4s: split every layer across both GPUs, or load the first part of the layer stack on one GPU and the rest on the other?

Tensor Parallelism (TP): every layer split across both GPUs, with at least one all-reduce per layer over PCIe, so 60 or more per forward pass (Megatron-style implementations typically use two per layer)
Pipeline Parallelism (PP): layers stacked, 1 transfer per forward pass
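To make the traffic difference concrete, the sketch below counts communication events per forward pass for the two schemes. The 60-layer figure is the layer count quoted earlier; `allreduces_per_layer` is left as a parameter because the exact number depends on the implementation, and both function names are illustrative. It counts events only, not bytes moved or latency.

```python
def tp_allreduces(num_layers: int, allreduces_per_layer: int = 1) -> int:
    """All-reduce operations per forward pass under tensor parallelism."""
    return num_layers * allreduces_per_layer


def pp_transfers(pp_size: int) -> int:
    """Point-to-point activation transfers per forward pass: one per stage boundary."""
    return max(pp_size - 1, 0)


NUM_LAYERS = 60  # layer count quoted in the text

print("TP, 1 all-reduce per layer: ", tp_allreduces(NUM_LAYERS))
print("TP, 2 all-reduces per layer:", tp_allreduces(NUM_LAYERS, 2))
print("PP=2 stage transfers:       ", pp_transfers(2))
```

Each tensor-parallel all-reduce is a synchronization point across both GPUs, which is why the count matters on a PCIe-only link.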

On non-NVLink L4s, PP should be better (less PCIe traffic). But:

PP=2 crashed with an IntermediateTensors compatibility bug in vLLM 0.19.0's Gemma 4 implementation. TP=2 works correctly at 15.1 tok/s.
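For reference, a TP=2 launch might look like the sketch below, which only assembles the command line and does not start a server. `vllm serve`, `--tensor-parallel-size`, `--gpu-memory-utilization`, and `--max-model-len` are standard vLLM CLI options; the default values used here are placeholders rather than settings taken from this setup, and the helper name is made up.

```python
import shlex
from typing import List, Optional


def build_vllm_serve_cmd(
    model: str,
    tp_size: int = 2,
    gpu_memory_utilization: float = 0.90,  # placeholder, not a measured setting
    max_model_len: Optional[int] = None,   # omitted unless set explicitly
) -> List[str]:
    """Assemble a `vllm serve` argv list (hypothetical helper; nothing is executed)."""
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


cmd = build_vllm_serve_cmd("nvidia/Gemma-4-31B-IT-NVFP4")
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Returning the argument list rather than a single string keeps it safe to pass straight to `subprocess.run` without shell quoting issues.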

At 15.1 tok/s, generation runs at roughly 3× typical human reading speed (assuming about 5 tok/s for a reader), which is fast enough for interactive agent workflows and live demos.

Measured results

31B NVFP4, tp=2: 15.1 tok/s, 21.9 GB/GPU, 180s cold load
E4B bf16: 19-26 tok/s, 20.5 GB, 150s cold load
E2B bf16: fast (sub-second for 106 tokens), 20.6 GB, 90s cold load
31B bf16, tp=2: OOM (31 GB/GPU > 24 GB L4 capacity)
31B NVFP4, tp=1: OOM (22 GB model, no room for KV cache)
31B NVFP4, pp=2: vLLM bug (IntermediateTensors)

Key artifacts

ADR-0015 (revised) — the full model lineup with measured data
docs/whitepaper/notes.md — raw measured results section
nvidia/Gemma-4-31B-IT-NVFP4 — the NVIDIA-published model we use
Memory: project_triton_version_gap.md — context for future sessions