The Infrastructure Gap — Model Release ≠ Infrastructure Readiness¶

The insight¶

One of the most practical findings from Phase 1 is that a model release does not mean the surrounding inference infrastructure is ready on the same day. This is a structural property of open-source AI infrastructure, not a criticism of any vendor.

The specific gap I hit¶

Gemma 4 released 2026-04-02. I started building 2026-04-08. One week after release:

Component	Status	Gap
vLLM engine	Day-0 support in v0.19.0 ✓	None
HuggingFace transformers	`gemma4` model type NOT in 4.57.6 (the current PyPI release)	Requires `>=4.58` or install from git
NVIDIA Triton Inference Server	vLLM backend in 26.03 built against vLLM 0.17.1; upgrading breaks the backend	Blocked until Triton 26.04 (late April)
NVIDIA Gemma 4 NVFP4 quantization	Published on HuggingFace ✓	None — NVIDIA was ready Day-0
HuggingFace model gating	No gate — Apache 2.0, free download ✓	None — policy change from Gemma 1-3

The model itself worked. The engine that runs it worked. But the ecosystem around it — the model-config library, the serving orchestrator, the container images — lagged by days to weeks.

Why this matters for edge deployments¶

Air-gapped and sovereign edge deployments must plan for this gap because:

They can't just pip install from PyPI on a classified network. Every dependency must be pre-staged and validated. If the validated transformers version is 4.57.6 and the model needs 4.58, the model is unusable until the next validation cycle.
They pin infrastructure versions for stability. A Federal site running Triton 26.03 won't upgrade to 26.04 on Day-0 — they'll wait for their own validation, which might be 26.06 or later.
The gap is structural, not accidental. The model, the inference engine, the model-config library, and the serving orchestrator are four independent release trains maintained by four different teams (Google DeepMind, vLLM project, HuggingFace, NVIDIA). Synchronizing them is nobody's job.

How gemma-forge handles it¶

The approach: maintain the ability to compose components at different release cadences rather than pinning to a single vendor's stack.

gemma-forge/vllm:latest is a derived Dockerfile that decouples the transformers version from the vLLM container version
The harness talks the OpenAI-compatible API, which is stable across vLLM versions and will be the same when Triton catches up
The systemd units are swappable between vLLM-direct and Triton without touching the harness code
Model weights are stored in a host-level catalog (/data/triton/weights/) that outlives any single container version

Key artifacts¶

docs/whitepaper/notes.md → "The infrastructure gap" section
ADR-0014 → documents the Triton gap and the workaround
Memory: project_triton_version_gap.md
infra/vllm/Dockerfile → the derived image that bridges the gap