id: journey-08-model-architecture-revision type: journey title: "Journey: Model Architecture Revision — From "One GPU Per Role" to "Right Model Per Role"" date: 2026-04-10 tags: [L3-model, L4-orchestration, refactor] related: - journey/02-model-strategy - journey/11-the-missing-reflector one_line: "I started with four models assigned to four GPUs because there were four GPUs, recognized it as hardware-first thinking, and redesigned around what each agent role actually needs."
Journey: Model Architecture Revision — From "One GPU Per Role" to "Right Model Per Role"¶
The story in one sentence¶
I started with four models assigned to four GPUs because there were four GPUs, caught it as hardware-first thinking, and redesigned around what each agent role actually NEEDS.
What was wrong¶
The original lineup (ADR-0015) assigned models by size: - GPU 0+1: 31B Dense (biggest) → Architect + Worker - GPU 2: E4B (medium) → Auditor - GPU 3: E2B (smallest) → Sentry (never wired in)
Three problems surfaced:
-
Every GPU was utilized just because it was there. The assignment was hardware-first, not architecture-first. The Sentry GPU was loaded but idle — pure waste.
-
The Auditor would benefit from increased intelligence. The Auditor makes the HARDEST decision in the loop (keep or revert) but ran on the second-weakest model. That's like giving a code reviewer a junior developer's brain.
-
The "audit" was mostly a pass/fail test. The Auditor's entire job was: call healthcheck, read "HEALTHY" or "UNHEALTHY", decide. A bash script could do that. The LLM was a wrapper around a three-word string.
The architectural insight¶
The roles in the Ralph loop have fundamentally different cognitive needs:
| Role | Cognitive task | What it needs |
|---|---|---|
| Architect | Planning, strategy, selection | Strong reasoning, broad knowledge |
| Worker | Code generation, tool use | Strong code gen, structured output |
| Auditor | Evaluation, judgment, verification | Different perspective from the creator |
The Auditor doesn't need to be SMARTER than the Architect. It needs to THINK DIFFERENTLY. Same model evaluating its own work has the same blind spots. A different model family catches systematic biases.
This is the red team / blue team principle applied to agentic AI.
The revised architecture¶
| GPU(s) | Role | Model | Why |
|---|---|---|---|
| 0+1 | Architect + Worker | Gemma 4 31B NVFP4 (Google) | Flagship for planning + code gen |
| 2 | Auditor | Nemotron-3-Nano-30B NVFP4 (NVIDIA) | Different model family for cross-evaluation |
| 3 | Available | — | Future skills / mission-flexible |
The expanded Auditor¶
The Auditor was also redesigned from a liveness checker to a real auditor:
Old: check_health → HEALTHY/UNHEALTHY → pass/fail New: check_health + stig_check_rule + read_recent_journal + revert
The expanded Auditor: 1. Checks mission app health (liveness) 2. Re-scans the specific STIG rule to verify the fix WORKED 3. Reads recent journal entries for side effects 4. Makes a judgment call with reasoning 5. Reverts if any of the above fail
This justifies the stronger model — a bash script can check liveness, but reasoning about whether a journal warning is a real problem or benign noise requires actual intelligence.
Why Nemotron specifically¶
- NVIDIA-official NVFP4 — not a community quant, defensible for Federal
- Different model family — Llama 3.3 derivative, different training data from Gemma
- 30B total / 3B active — MoE architecture, fast per-token inference
- Trained with 33% synthetic data for tool calling — directly relevant
- Fits on single L4 — ~16-17 GB NVFP4, leaves headroom for KV cache
- ~7-12% behind Gemma 4 on benchmarks — acceptable for evaluation role
Why GPU 3 stays free¶
"Available" beats forcing a role. The demo story: "3 GPUs running 2 model families for 3 agent roles, with a 4th GPU available for additional skills. This XR7620 isn't maxed out — it has room for the next mission."
Having headroom to grow is more important than seeing a config 100% utilized.