Gotcha: NVFP4 31B is 22 GB in VRAM, not the naive 15.5 GB estimate¶

Symptom¶

Attempting to load nvidia/Gemma-4-31B-IT-NVFP4 on a single L4 (24 GB VRAM):

ERROR: Failed to load model - not enough GPU memory.
GPU 0 has a total capacity of 22.03 GiB of which 17.06 MiB is free.
Process has 22.01 GiB memory in use.

Root cause¶

The naive estimate (31B × 4 bits / 8 bits per byte = ~15.5 GB) is wrong because NVIDIA's NVFP4 recipe does NOT quantize everything.

Per the model's config.json → quantization_config: - Quantized (FP4): MLP/FFN layers only (weights AND activations, group_size=16) - NOT quantized (stays bf16): ALL 60 self-attention layers, the LM head, and the vision encoder

The self-attention layers account for roughly 30-40% of the model's parameters. At bf16 (2 bytes per param), they contribute significantly to the VRAM footprint. The actual in-VRAM usage:

Component	Precision	Approx VRAM
MLP/FFN weights	FP4	~8-9 GB
Self-attention weights	bf16	~10-12 GB
LM head + embeddings	bf16	~1-2 GB
Runtime overhead	—	~1 GB
Total	—	~22 GB

This leaves zero room for KV cache on a 24 GB L4.

Why NVIDIA did this¶

Keeping attention at full precision is a quality-preserving design choice. Self-attention is where reasoning chains, long-range dependencies, and contextual coherence live. Quantizing attention degrades these capabilities more than quantizing the MLP layers.

We confirmed this empirically: the NVFP4 model produced Architect-grade STIG remediation plans with proper reasoning chains, structured output, and accurate domain knowledge.

The fix¶

Use tensor_parallel_size=2 with NVFP4. This splits the 22 GB across 2 L4s → ~11 GB per GPU → 13 GB headroom for KV cache. This is actually the best configuration for L4 hardware: the quantization makes it POSSIBLE to fit on 2 GPUs (bf16 at 62 GB doesn't fit even on 2 L4s), while the tp=2 provides enough headroom for useful context.

The lesson¶

Never estimate VRAM from model parameter count alone. Always check: 1. The quantization_config in config.json — what's actually quantized? 2. The ignore list — what stays at full precision? 3. Test empirically — VRAM usage after loading is the ground truth

Environment¶

Model: nvidia/Gemma-4-31B-IT-NVFP4 (modelopt v0.37.0)
On-disk: 31 GB (4 safetensors shards)
In-VRAM: 22 GB (measured via nvidia-smi)
GPU: NVIDIA L4, 23,034 MiB (22.5 GiB)