The Journey

First-person field notes on how gemma-forge was built. Chronological, honest, specific. Each entry is scoped to a single moment in the project (a decision, a discovery, a refactor, or a postmortem) and is meant to be readable on its own.

How to read this

Chronological: entries are numbered in the order they happened. Decimal numbers (00.5, 06.5, etc.) mark mid-project entries added retroactively to cover moments the original numbering missed.
Self-contained: each entry starts with a one-sentence hook and a "why this is its own entry" section that explains what this moment is about.
Cross-linked: every entry lists the related entries at the top (via frontmatter) and links to them in the body where relevant.
Tagged: every entry has layer, pattern, moment, and optional domain tags so you can find entries by topic in the site search.

Entries, in order

Phase 0 — Starting from scratch

00. The Origin of gemma-forge — why Ralph loops, why Gemma 4, why STIG as the anchor, and what the project explicitly is not.
00.5. How Should We Serve Gemma 4? — picking the serving strategy: precision, tensor parallelism, and the NVFP4 VRAM math surprise on L4.

Phase 1 — The inference layer

01. The Inference Layer Evolution — Triton was the first choice, we pivoted to vLLM, and we kept the Triton scaffolding for when it catches up.
02. Model Strategy — four configurations of the 31B tested on real hardware, and the one that worked.

Phase 2 — The target VM

04. VM Provisioning — OpenTofu + libvirt v0.9.7 + Rocky 9, and an hour of debugging a GRUB hang caused by missing ACPI features.

Phase 3 — The harness

06. Tool Calling — getting Gemma 4 to actually call tools through vLLM and ADK, and realizing our first "loop" was a script pretending to be an agent.
06.5. The Stateful Loop Refactor — replacing ADK's LoopAgent with a Python-driven outer loop and fresh per-turn sessions.
07. The Skills System — pulling STIG-specific logic into a skill manifest so other use cases are a folder-copy away.
07.5. Virsh Console Fallback — the out-of-band recovery path for when SSH+sudo is broken, and the honest documentation of its current bug.

Phase 4 — Iterating on the architecture

08. Model Architecture Revision — moving away from hardware-first role assignment to judgment-based roles.
09. The Nemotron Experiment — cross-model Auditor role, why it worked technically, and why we walked it back.
10. The Parallelism Maze — every path we tried was blocked by a different constraint until only one option remained.

Phase 5 — Observability

03. Observability — the OpenTelemetry stack, the dual-purpose decision.
03.5. The LiteLLM Decision — the March 2026 supply chain incident, and the OTel-pure architecture that came out of it.
05. Infrastructure Gap — what "Day-0 model support" actually means when the surrounding stack hasn't caught up.
12.5. Structured Run Logger — the boring-sounding JSONL decision that became the backbone of everything downstream.
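
For readers new to the format behind entry 12.5: JSONL means one self-contained JSON object per line, appended as events happen. The sketch below is illustrative only and is not the project's logger. `log_event` and the field names (`ts`, `event`, `rule_id`) are hypothetical.

```python
import io
import json
import time


def log_event(stream, event_type, **fields):
    """Append one event to a JSONL stream as a single line of JSON.

    Hypothetical helper: the record shape is an assumption, not the
    schema gemma-forge actually uses.
    """
    record = {"ts": time.time(), "event": event_type, **fields}
    # One object per line keeps the file appendable and greppable.
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


# Usage: write two events to an in-memory stream, then read them back.
buf = io.StringIO()
log_event(buf, "attempt_start", rule_id="example_rule")
log_event(buf, "attempt_end", rule_id="example_rule", outcome="escalated")
events = [json.loads(line) for line in buf.getvalue().splitlines()]
```

Because every line parses on its own, downstream tools can stream the file without loading the whole run into memory.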

Phase 6 — The reflexion architecture

11. The Missing Reflector — realizing three agents wasn't actually reflexion and adding the fourth.
12. bf16 TP=4 Full Precision — the unexpected benchmark result that reshaped our production configuration.
13. The Retry Budget That Wasn't Ralph — replacing the attempt counter with a wall-clock budget.
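
The change described in entry 13, trading an attempt counter for a wall-clock budget, can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the harness code. `run_with_budget`, `attempt_fn`, and the injectable `clock` parameter are hypothetical names.

```python
import time
from typing import Callable, Optional


def run_with_budget(
    attempt_fn: Callable[[int], bool],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Retry attempt_fn until it succeeds or the time budget is spent.

    Returns the 1-based number of the attempt that succeeded, or None
    if the budget ran out first. There is deliberately no cap on the
    number of attempts; only elapsed time stops the loop.
    """
    deadline = clock() + budget_seconds
    attempt = 0
    while clock() < deadline:
        attempt += 1
        if attempt_fn(attempt):
            return attempt
    return None
```

The clock is passed in so the loop can be tested with a fake clock instead of real sleeps. `time.monotonic` is the default because it is unaffected by system clock adjustments.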

Phase 7 — The overnight run and its aftermath

14. The Overnight Run — 10 hours, 2 rules remediated, 26 escalated, four architectural flaws discovered.
15. The Test as Architecture Discovery — the discipline reframe that turned verification tests into property tests.
15.5. The Test Pass in Practice — 99 tests across 7 tiers, the real bugs caught, the honest gaps.
16. Capturing Lightning — why the journal became the memory, and what happens when you don't stop to write it down.
17. The v3 Fix Pass — the narrative of the five architectural changes, in the order we made them.

Phase 8 — The second overnight run and v4

18. The Second Overnight Run — 93 rules remediated (78%), the time-waste ratio in the other 26, and three architectural findings for v4.
19. Standing on Whose Shoulders? — research validation of our choices, the literature landscape, and the v4 interface extraction decision.
20. The Interface Extraction — ripping the engine apart mid-flight: five interfaces, a STIG runtime, and 75 tests that still passed.
21. The Task Graph — from flat queue to live DAG: dependency awareness, conflict detection, and a React Flow visualization.
22. Context Graphs and the Memory Question — the research spiral from decision provenance to NIST requirements to "do we even need a database?" — and how the clutch mechanism answered the question.
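
To make the "flat queue to DAG" idea from entry 21 concrete, here is a minimal sketch of dependency-aware ordering using the standard library's `graphlib`. It is an assumption-laden illustration, not the project's task graph: `ready_order` and the task names are hypothetical, and conflict detection is not covered.

```python
from graphlib import TopologicalSorter


def ready_order(dependencies):
    """Return tasks in an order that respects their dependencies.

    `dependencies` maps each task name to the set of tasks that must
    finish before it. Raises graphlib.CycleError if the graph has a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Usage: a three-task chain where each task depends on the previous one.
order = ready_order({
    "configure_audit": {"install_audit"},
    "lock_audit_rules": {"configure_audit"},
    "install_audit": set(),
})
```

A flat queue would process these tasks in whatever order they were enqueued. The graph version always runs a task after the tasks it depends on.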

Phase 9 — The first complete run and cross-run learning

23. The First Complete Run — 270 rules, 13.5 hours, 85 remediated, 157 escalated — and the discovery that the cross-run memory system was storing everything but teaching nothing.
24. Run 2 — Cross-Run Learning — the fix landed: 59 rules flipped from escalated to remediated, the fix rate jumped 35% → 58%, then Run 2 exposed a new cascade and the uncomfortable question of whether memory that was right yesterday can be wrong tomorrow.
25. Run 3 — When the Learning Curve Bends — 60% fix rate, diminishing returns, the environment fidelity problem showing up in real data.

Phase 10 — The memory-architecture pivot

A note on "V1" and "V2": Entries 26-30 use these labels to refer to the cross-run memory architecture, not the harness (which has been on v5 throughout this arc). V1 memory is the category-level dream pass built in entries 26-27; V2 memory is the structured-tip rewrite built in entry 30. Run 4 (entry 28) ended V1; Run 5 tests V2.

26. The Little Engine That Could Needs Real Databases (and a Nap) — catching up to the 2026 memory-systems frontier: Graphiti-on-Neo4j for Reflective, Postgres replacing SQLite, and a dream pass between runs as the distinctive contribution.
27. Building the Dream Pass — one session, four bugs, a closed loop: Postgres + Neo4j + dream pass built end-to-end, progressive testing caught every crash before Run 4.
28. Run 4: When the Dream Pass Passes the Wrong Test — the V1 memory algorithm worked as plumbing but produced no aggregate gain; the per-rule analysis exposed three compounding architectural failures that V2 must address.

Phase 11 — The V2 memory rewrite

29. The Classifier Cheat and the Honesty Check — a near-miss while building V2: a prompt-tuned classifier that almost snuck a thumb-on-scale into the lesson backfill, and why the fix was not a better prompt but no classifier at all.
30. Building V2 — The Memory Rewrite — the seven-commit, one-day rewrite of the cross-run memory layer (V2, replacing V1's dream pass): structured tips, rule-prefix similarity, per-(tip, rule) hit-tracking, history-based eviction. Ships alongside V1 so Run 5 carries both rankings per prompt. Ends with seven graded bets for Run 5.

Phase 12 — Run 5, the deferred registry, and the CVE pivot

31. Run 5: Grading the Bets — four of seven bets won, three lost. V2 is neutral on aggregate (+0.1pp) but better on loop shape (escalated attempts −35%). The immutable cascade moved from position 50 to position 11 and killed the aggregate; the ordering constraint for Run 6 emerges from that diagnosis.
32. Three Tips, a Dead Clutch, and a Registry — Run 5's post-mortem exposed three separate architectural problems the aggregate hid: tips without causal mechanisms mislead the Worker, the clutch has been dead code since V5, and neither consolidation pass auto-runs. Seeded docs/deferred.md as the debt registry.
33. The Second Skill: CVE Response — the pivot: every Track A runtime tuning we were about to ship was STIG-specific vocabulary dressed as architecture work. 90 minutes of research collapsed CVE from 1-2 weeks to one day (Vuls exists, ATLANTIS is open source, CVE-Bench doesn't fit our regime). The commercial landscape check confirmed no vendor ships autonomous host-level execution.
34. Run 6: Ordering Works, Runtime Doesn't — fix rate 61.9% (+5.6pp vs Run 5). The ordering constraint completely closed the audit_rules_immutable cascade (position 84/84 vs 11/83). Mechanism field held at 100% across 781 tips. Auto-consolidation retired 356 low-utility tips. Cost: +4.8h wall time (14.3→19.1h). Runtime is now the binding constraint.
35. Building the CVE Skill in a Day — eleven hours from decision to working MVP. Zero edits to the ralph loop; three harness extension points (FailureMode enum, ordering predicate, skill-dir map). Five ATLANTIS patterns adopted, six skipped. MVP smoke 3/3 first-try — validated plumbing, not architectural value. Four predictions for the first full CVE run.
36. Per-Family Reboot Batching: The Architectural Decision Before We Built It — CVE Run 1 closed 29/29 non-reboot advisories first-try; the reboot-verify smoke proved the architecture fires end-to-end but exposed that batch-all loses attribution on failure. The per-package-family rewrite (pre-implementation) — safer than batch-all, faster than per-advisory, the production architecture the whitepaper deserves.
37. Per-Family Reboot Batching Lands: 44 Advisories, Zero Escalations, One Clean Sweep — the design from entry 36 shipped in a single sprint. Four files moved (interfaces.py, ralph.py, the CVE runtime, the Architect prompt). The smoke against a stock Rocky 9 VM remediated 44/44 advisories in 35.5 minutes — 29 non-reboot + 15 reboot-required batched into 2 families (core-userland + kernel) with per-item attribution and safest-first ordering. The ghost-escalation bug and the Architect-SKIPs-reboot-items bug both got fixed and both stayed dead. resolve_deferred phase: 190s for 2 reboots across 15 items. The production story the whitepaper claims is now the story the code executes.
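
The per-family batching in entries 36 and 37 can be pictured with a small sketch: group reboot-required advisories by package family, order the families safest-first, and keep the advisory IDs attached to each batch so a failure can be attributed to specific items. Everything here is hypothetical and is not the code in interfaces.py or ralph.py: the `Advisory` type, the `FAMILY_RISK` ranking (including the assumption that core-userland is safer than kernel), and `plan_reboot_batches` exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    advisory_id: str
    family: str        # e.g. "core-userland" or "kernel"
    needs_reboot: bool


# Assumed risk ranking: lower numbers are treated as safer and go first.
FAMILY_RISK = {"core-userland": 0, "kernel": 1}


def plan_reboot_batches(advisories):
    """Group reboot-required advisories by family, safest family first.

    Returns a list of (family, [advisory_id, ...]) pairs. Each pair is
    one reboot; the ID list preserves per-item attribution.
    """
    batches = defaultdict(list)
    for adv in advisories:
        if adv.needs_reboot:
            batches[adv.family].append(adv.advisory_id)
    # Unknown families sort after the known ones, then alphabetically.
    ordered = sorted(
        batches,
        key=lambda fam: (FAMILY_RISK.get(fam, len(FAMILY_RISK)), fam),
    )
    return [(fam, batches[fam]) for fam in ordered]
```

Compared with batch-all, a failed verification after a reboot points at one family's items rather than every pending advisory. Compared with per-advisory reboots, the reboot count drops to one per family.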

Architecture overview — the same content organized by layer instead of by time.
Failure modes in reflexive agent harnesses — the project-agnostic contribution piece.
Improvement proposals — the per-fix engineering docs that accompanied the v3 pass.
Gotchas — the atomic "X breaks Y because Z" lessons.