The Journey
First-person field notes on how gemma-forge was built. Chronological, honest, specific. Each entry is scoped to a single moment in the project (a decision, a discovery, a refactor, or a postmortem) and is meant to be readable on its own.
How to read this
- Chronological: entries are numbered in the order they happened. Decimal numbers (00.5, 06.5, etc.) mark mid-project entries added retroactively to cover moments the original numbering missed.
- Self-contained: each entry starts with a one-sentence hook and a "why this is its own entry" section that explains what this moment is about.
- Cross-linked: every entry lists the related entries at the top (via frontmatter) and links to them in the body where relevant.
- Tagged: every entry has layer, pattern, moment, and optional domain tags so you can find entries by topic in the site search.
Entries, in order
Phase 0 — Starting from scratch
- 00. The Origin of gemma-forge — why Ralph loops, why Gemma 4, why STIG as the anchor, and what the project explicitly is not.
- 00.5. How Should We Serve Gemma 4? — picking the serving strategy: precision, tensor parallelism, and the NVFP4 VRAM math surprise on L4.
Phase 1 — The inference layer
- 01. The Inference Layer Evolution — Triton was the first choice, we pivoted to vLLM, and we kept the Triton scaffolding for when it catches up.
- 02. Model Strategy — four configurations of the 31B tested on real hardware, and the one that worked.
Phase 2 — The target VM
- 04. VM Provisioning — OpenTofu + libvirt v0.9.7 + Rocky 9, and an hour of debugging a GRUB hang caused by missing ACPI features.
Phase 3 — The harness
- 06. Tool Calling — getting Gemma 4 to actually call tools through vLLM and ADK, and realizing our first "loop" was a script pretending to be an agent.
- 06.5. The Stateful Loop Refactor — replacing ADK's LoopAgent with a Python-driven outer loop and fresh per-turn sessions.
- 07. The Skills System — pulling STIG-specific logic into a skill manifest so other use cases are a folder-copy away.
- 07.5. Virsh Console Fallback — the out-of-band recovery path for when SSH+sudo is broken, and the honest documentation of its current bug.
Phase 4 — Iterating on the architecture
- 08. Model Architecture Revision — moving away from hardware-first role assignment to judgment-based roles.
- 09. The Nemotron Experiment — cross-model Auditor role, why it worked technically, and why we walked it back.
- 10. The Parallelism Maze — every path we tried was blocked by a different constraint until only one option remained.
Phase 5 — Observability
- 03. Observability — the OpenTelemetry stack, the dual-purpose decision.
- 03.5. The LiteLLM Decision — the March 2026 supply chain incident, and the OTel-pure architecture that came out of it.
- 05. Infrastructure Gap — what "Day-0 model support" actually means when the surrounding stack hasn't caught up.
- 12.5. Structured Run Logger — the boring-sounding JSONL decision that became the backbone of everything downstream.
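For readers who have not used JSONL for run logs, the idea behind entry 12.5 can be sketched in a few lines. This is a minimal illustration, not the project's logger: the function names `log_event` and `read_events` and the field names (`ts`, `event`, `rule_id`, `attempt`, `outcome`) are assumptions made for the example.

```python
import json
import time
from pathlib import Path
from typing import Any


def log_event(path: Path, event_type: str, **fields: Any) -> None:
    """Append one structured event as a single JSON line.

    The record layout is hypothetical; only the one-object-per-line
    shape is the point of the sketch.
    """
    record = {"ts": time.time(), "event": event_type, **fields}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def read_events(path: Path) -> list[dict[str, Any]]:
    """Load every event back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each line is a complete JSON object, a run that crashes midway still leaves a readable log up to the last completed write, and downstream analysis can filter events with ordinary line-oriented tools.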
Phase 6 — The reflexion architecture
- 11. The Missing Reflector — realizing three agents wasn't actually reflexion and adding the fourth.
- 12. bf16 TP=4 Full Precision — the unexpected benchmark result that reshaped our production configuration.
- 13. The Retry Budget That Wasn't Ralph — replacing the attempt counter with a wall-clock budget.
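The change described in entry 13, swapping an attempt counter for a wall-clock budget, can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the harness code: `run_with_time_budget`, its parameters, and the injectable `clock` are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Optional


def run_with_time_budget(
    attempt: Callable[[int], bool],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Retry `attempt` until it succeeds or the time budget is spent.

    Returns the 1-based number of the attempt that succeeded, or None
    if the budget ran out first (the caller would escalate).
    """
    deadline = clock() + budget_seconds
    n = 0
    # The loop is bounded by elapsed time, not by a fixed attempt count.
    while clock() < deadline:
        n += 1
        if attempt(n):
            return n
    return None
```

With a time bound, cheap attempts can repeat many times while slow ones consume the budget quickly. The deadline is only checked between attempts, so a single long attempt can still overrun it.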
Phase 7 — The overnight run and its aftermath
- 14. The Overnight Run — 10 hours, 2 rules remediated, 26 escalated, four architectural flaws discovered.
- 15. The Test as Architecture Discovery — the discipline reframe that turned verification tests into property tests.
- 15.5. The Test Pass in Practice — 99 tests across 7 tiers, the real bugs caught, the honest gaps.
- 16. Capturing Lightning — why the journal became the memory, and what happens when you don't stop to write it down.
- 17. The v3 Fix Pass — the narrative of the five architectural changes, in the order we made them.
Phase 8 — The second overnight run and v4
- 18. The Second Overnight Run — 93 rules remediated (78%), the time-waste ratio in the other 26, and three architectural findings for v4.
- 19. Standing on Whose Shoulders? — research validation of our choices, the literature landscape, and the v4 interface extraction decision.
- 20. The Interface Extraction — ripping the engine apart mid-flight: five interfaces, a STIG runtime, and 75 tests that still passed.
- 21. The Task Graph — from flat queue to live DAG: dependency awareness, conflict detection, and a React Flow visualization.
- 22. Context Graphs and the Memory Question — the research spiral from decision provenance to NIST requirements to "do we even need a database?" — and how the clutch mechanism answered the question.
Phase 9 — The first complete run and cross-run learning
- 23. The First Complete Run — 270 rules, 13.5 hours, 85 remediated, 157 escalated — and the discovery that the cross-run memory system was storing everything but teaching nothing.
- 24. Run 2 — Cross-Run Learning — the fix landed: 59 rules flipped from escalated to remediated, the fix rate jumped 35% → 58%, then Run 2 exposed a new cascade and the uncomfortable question of whether memory that was right yesterday can be wrong tomorrow.
- 25. Run 3 — When the Learning Curve Bends — 60% fix rate, diminishing returns, the environment fidelity problem showing up in real data.
Phase 10 — The memory-architecture pivot
A note on "V1" and "V2": Entries 26-30 use these labels to refer to the cross-run memory architecture, not the harness (which has been on v5 throughout this arc). V1 memory is the category-level dream pass built in entries 26-27; V2 memory is the structured-tip rewrite built in entry 30. Run 4 (entry 28) ended V1; Run 5 tests V2.
- 26. The Little Engine That Could Needs Real Databases (and a Nap) — catching up to the 2026 memory-systems frontier: Graphiti-on-Neo4j for Reflective, Postgres replacing SQLite, and a dream pass between runs as the distinctive contribution.
- 27. Building the Dream Pass — one session, four bugs, a closed loop: Postgres + Neo4j + dream pass built end-to-end, progressive testing caught every crash before Run 4.
- 28. Run 4: When the Dream Pass Passes the Wrong Test — the V1 memory algorithm worked as plumbing but produced no aggregate gain; the per-rule analysis exposed three compounding architectural failures that V2 must address.
Phase 11 — The V2 memory rewrite
- 29. The Classifier Cheat and the Honesty Check — a near-miss while building V2: a prompt-tuned classifier that almost snuck a thumb-on-scale into the lesson backfill, and why the fix was not a better prompt but no classifier at all.
- 30. Building V2 — The Memory Rewrite — the seven-commit, one-day rewrite of the cross-run memory layer (V2, replacing V1's dream pass): structured tips, rule-prefix similarity, per-(tip, rule) hit-tracking, history-based eviction. Ships alongside V1 so Run 5 carries both rankings per prompt. Ends with seven graded bets for Run 5.
Phase 12 — Run 5, the deferred registry, and the CVE pivot
- 31. Run 5: Grading the Bets — four of seven bets won, three lost. V2 is neutral on aggregate (+0.1pp) but better on loop shape (escalated attempts −35%). The immutable cascade moved from position 50 to position 11 and killed the aggregate; the ordering constraint for Run 6 emerges from that diagnosis.
- 32. Three Tips, a Dead Clutch, and a Registry — Run 5's postmortem exposed three separate architectural problems the aggregate hid: tips without causal mechanisms mislead the Worker, the clutch has been dead code since V5, and neither consolidation pass auto-runs. Seeded docs/deferred.md as the debt registry.
- 33. The Second Skill: CVE Response — the pivot: every Track A runtime tuning we were about to ship was STIG-specific vocabulary dressed as architecture work. 90 minutes of research collapsed CVE from 1-2 weeks to one day (Vuls exists, ATLANTIS is open source, CVE-Bench doesn't fit our regime). The commercial landscape check confirmed no vendor ships autonomous host-level execution.
- 34. Run 6: Ordering Works, Runtime Doesn't — fix rate 61.9% (+5.6pp vs Run 5). The ordering constraint completely closed the audit_rules_immutable cascade (position 84/84 vs 11/83). Mechanism field held at 100% across 781 tips. Auto-consolidation retired 356 low-utility tips. Cost: +4.8h wall time (14.3→19.1h). Runtime is now the binding constraint.
- 35. Building the CVE Skill in a Day — eleven hours from decision to working MVP. Zero edits to the ralph loop; three harness extension points (FailureMode enum, ordering predicate, skill-dir map). Five ATLANTIS patterns adopted, six skipped. MVP smoke 3/3 first-try — validated plumbing, not architectural value. Four predictions for the first full CVE run.
- 36. Per-Family Reboot Batching: The Architectural Decision Before We Built It — CVE Run 1 closed 29/29 non-reboot advisories first-try; the reboot-verify smoke proved the architecture fires end-to-end but exposed that batch-all loses attribution on failure. The per-package-family rewrite (pre-implementation) — safer than batch-all, faster than per-advisory, the production architecture the whitepaper deserves.
- 37. Per-Family Reboot Batching Lands: 44 Advisories, Zero Escalations, One Clean Sweep — the design from entry 36 shipped in a single sprint. Four files moved (interfaces.py, ralph.py, the CVE runtime, the Architect prompt). The smoke against a stock Rocky 9 VM remediated 44/44 advisories in 35.5 minutes: 29 non-reboot + 15 reboot-required batched into 2 families (core-userland + kernel) with per-item attribution and safest-first ordering. The ghost-escalation bug and the Architect-SKIPs-reboot-items bug both got fixed and both stayed dead. resolve_deferred phase: 190s for 2 reboots across 15 items. The production story the whitepaper claims is now the story the code executes.
Related
- Architecture overview — the same content organized by layer instead of by time.
- Failure modes in reflexive agent harnesses — the project-agnostic contribution piece.
- Improvement proposals — the per-fix engineering docs that accompanied the v3 pass.
- Gotchas — the atomic "X breaks Y because Z" lessons.