Journey: The Inference Layer Evolution¶
The story in one sentence¶
I started with Triton Inference Server for dynamic model management, discovered it couldn't run Gemma 4 yet, pivoted to plain vLLM containers, and ended up with a cleaner operational model that's ready for Triton when it catches up.
What I planned (ADR-0001, ADR-0014)¶
The original PRD specified Ollama. During the interview phase I rejected Ollama (not production-grade for multi-GPU serving) and picked vLLM as the inference engine. This was validated: vLLM has Day-0 Gemma 4 support, is Apache-2, air-gappable, and is what Red Hat ships as RHAIIS.
Then a critical insight surfaced: vLLM doesn't support dynamic loading and unloading of models. The XR7620 needed to be a multi-demo host where different demos load different model sets without redeploying containers.
This led to the Triton Inference Server with EXPLICIT model control
mode architecture (ADR-0014). Triton wraps vLLM as a backend and adds:
- Runtime model load/unload via REST API
- A shared model catalog (/data/triton/models/)
- systemd-managed processes, one per GPU (ADR-0013)
- The "watch the L4s warm up" demo theater
I researched this thoroughly. A subagent verified:
- Triton vLLM backend exists, maintained by NVIDIA
- EXPLICIT mode mechanics (POST /v2/repository/models/pip install -U vllm inside
the container
What went wrong¶
Triton 26.03 (released 2026-03-27) ships vLLM 0.17.1. Gemma 4
released April 2 — five days AFTER the Triton cut. The model type
gemma4 is not in transformers 4.57.6 or vLLM 0.17.1.
I tried upgrading vLLM inside the Triton container:
- pip install -U vllm → installed vLLM 0.19.0
- But this broke the Triton vLLM backend code:
ModuleNotFoundError: No module named 'vllm.inputs.data'
- The backend at /opt/tritonserver/backends/vllm/utils/request.py
imports internal vLLM modules that were reorganized between 0.17.1
and 0.19.0
- Upgrading vLLM without upgrading the Triton backend code = broken
I also discovered that transformers>=4.58 is needed for the gemma4
model type. The stock transformers 4.57.6 in both Triton and
vllm-openai containers doesn't have it.
What I decided¶
Pivot to plain vllm/vllm-openai containers for serving, with a
custom gemma-forge/vllm:latest Dockerfile that bakes in
transformers>=4.58.
The Triton infrastructure (systemd units, model repository layout, install scripts) stays in the repo as scaffolding for when Triton 26.04 ships. The harness code talks the OpenAI Chat Completions API, which is the same whether Triton or vLLM is behind the endpoint — so the swap is transparent.
Three role-named systemd units were created:
- gemma-forge-architect.service — 31B NVFP4 tp=2, GPUs 0+1, port 8050
- gemma-forge-auditor.service — E4B bf16, GPU 2, port 8060
- gemma-forge-sentry.service — E2B bf16, GPU 3, port 8070
Makefile targets: make demo-up / make demo-down / make demo-status
Verified end-to-end: all 3 endpoints responding, all 4 GPUs loaded,
make demo-down frees all GPUs to 0 MiB, 29 existing Docker containers
untouched.
What I learned¶
-
"Day-0 model support" from one vendor does not guarantee Day-0 support from the rest of the stack. The model, the inference engine, the model-config library (transformers), and the serving orchestrator (Triton) are four independent release trains. Sovereign edge operators must treat them as such.
-
The "shared host service" pattern survived the pivot. Even though the serving layer switched from Triton to plain vLLM, the concept of
/data/triton/as a host-level model catalog, systemd-managed inference services, andmake demo-up/downas the operational interface — all of that transferred cleanly. -
Triton's EXPLICIT mode is the right long-term architecture. When 26.04 ships, the systemd units swap back and runtime model swap via API returns with zero harness code changes. The pivot was tactical, not strategic.
Key artifacts¶
- ADR-0001 (superseded by ADR-0014) — why vLLM as the engine
- ADR-0014 — Triton as shared host service (the target architecture)
infra/vllm/Dockerfile— the production image with transformers fixinfra/vllm/systemd/gemma-forge-*.service— role-named unitsinfra/vllm/scripts/serve.sh— generic vLLM serve wrapper- Memory:
project_triton_version_gap.md— tracks the blocker