The Infrastructure Gap — Model Release ≠ Infrastructure Readiness¶
The insight¶
One of the most practical findings from Phase 1 is that a model release does not mean the surrounding inference infrastructure is ready on the same day. This is a structural property of open-source AI infrastructure, not a criticism of any vendor.
The specific gap I hit¶
Gemma 4 released 2026-04-02. I started building 2026-04-08. One week after release:
| Component | Status | Gap |
|---|---|---|
| vLLM engine | Day-0 support in v0.19.0 ✓ | None |
| HuggingFace transformers | gemma4 model type NOT in 4.57.6 (the current PyPI release) |
Requires >=4.58 or install from git |
| NVIDIA Triton Inference Server | vLLM backend in 26.03 built against vLLM 0.17.1; upgrading breaks the backend | Blocked until Triton 26.04 (late April) |
| NVIDIA Gemma 4 NVFP4 quantization | Published on HuggingFace ✓ | None — NVIDIA was ready Day-0 |
| HuggingFace model gating | No gate — Apache 2.0, free download ✓ | None — policy change from Gemma 1-3 |
The model itself worked. The engine that runs it worked. But the ecosystem around it — the model-config library, the serving orchestrator, the container images — lagged by days to weeks.
Why this matters for edge deployments¶
Air-gapped and sovereign edge deployments must plan for this gap because:
-
They can't just
pip installfrom PyPI on a classified network. Every dependency must be pre-staged and validated. If the validated transformers version is 4.57.6 and the model needs 4.58, the model is unusable until the next validation cycle. -
They pin infrastructure versions for stability. A Federal site running Triton 26.03 won't upgrade to 26.04 on Day-0 — they'll wait for their own validation, which might be 26.06 or later.
-
The gap is structural, not accidental. The model, the inference engine, the model-config library, and the serving orchestrator are four independent release trains maintained by four different teams (Google DeepMind, vLLM project, HuggingFace, NVIDIA). Synchronizing them is nobody's job.
How gemma-forge handles it¶
The approach: maintain the ability to compose components at different release cadences rather than pinning to a single vendor's stack.
gemma-forge/vllm:latestis a derived Dockerfile that decouples the transformers version from the vLLM container version- The harness talks the OpenAI-compatible API, which is stable across vLLM versions and will be the same when Triton catches up
- The systemd units are swappable between vLLM-direct and Triton without touching the harness code
- Model weights are stored in a host-level catalog (
/data/triton/weights/) that outlives any single container version
Key artifacts¶
docs/whitepaper/notes.md→ "The infrastructure gap" section- ADR-0014 → documents the Triton gap and the workaround
- Memory:
project_triton_version_gap.md infra/vllm/Dockerfile→ the derived image that bridges the gap