
The Journey

First-person field notes on how gemma-forge was built. Chronological, honest, specific. Each entry is scoped to a single moment in the project (a decision, a discovery, a refactor, or a postmortem) and is meant to be readable on its own.

How to read this

  • Chronological: entries are numbered in the order they happened. Decimal numbers (00.5, 06.5, etc.) mark mid-project entries added retroactively to cover moments the original numbering missed.
  • Self-contained: each entry starts with a one-sentence hook and a "why this is its own entry" section that explains what this moment is about.
  • Cross-linked: every entry lists the related entries at the top (via frontmatter) and links to them in the body where relevant.
  • Tagged: every entry has layer, pattern, moment, and optional domain tags so you can find entries by topic in the site search.

Entries, in order

Phase 0 — Starting from scratch

Phase 1 — The inference layer

Phase 2 — The target VM

  • 04. VM Provisioning — OpenTofu + libvirt v0.9.7 + Rocky 9, and an hour of debugging a GRUB hang caused by missing ACPI features.

Phase 3 — The harness

  • 06. Tool Calling — getting Gemma 4 to actually call tools through vLLM and ADK, and realizing our first "loop" was a script pretending to be an agent.
  • 06.5. The Stateful Loop Refactor — replacing ADK's LoopAgent with a Python-driven outer loop and fresh per-turn sessions.
  • 07. The Skills System — pulling STIG-specific logic into a skill manifest so other use cases are a folder-copy away.
  • 07.5. Virsh Console Fallback — the out-of-band recovery path for when SSH+sudo is broken, and the honest documentation of its current bug.

Phase 4 — Iterating on the architecture

Phase 5 — Observability

Phase 6 — The reflexion architecture

Phase 7 — The overnight run and its aftermath

Phase 8 — The second overnight run and v4

  • 18. The Second Overnight Run — 93 rules remediated (78%), the time-waste ratio in the other 26, and three architectural findings for v4.
  • 19. Standing on Whose Shoulders? — research validation of our choices, the literature landscape, and the v4 interface extraction decision.
  • 20. The Interface Extraction — ripping the engine apart mid-flight: five interfaces, a STIG runtime, and 75 tests that still passed.
  • 21. The Task Graph — from flat queue to live DAG: dependency awareness, conflict detection, and a React Flow visualization.
  • 22. Context Graphs and the Memory Question — the research spiral from decision provenance to NIST requirements to "do we even need a database?" — and how the clutch mechanism answered the question.

Phase 9 — The first complete run and cross-run learning

  • 23. The First Complete Run — 270 rules, 13.5 hours, 85 remediated, 157 escalated — and the discovery that the cross-run memory system was storing everything but teaching nothing.
  • 24. Run 2 — Cross-Run Learning — the fix landed: 59 rules flipped from escalated to remediated, and the fix rate jumped from 35% to 58%. Run 2 then exposed a new cascade and raised the uncomfortable question of whether memory that was right yesterday can be wrong tomorrow.
  • 25. Run 3 — When the Learning Curve Bends — 60% fix rate, diminishing returns, the environment fidelity problem showing up in real data.

Phase 10 — The memory-architecture pivot

A note on "V1" and "V2": Entries 26-30 use these labels to refer to the cross-run memory architecture, not the harness (which has been on v5 throughout this arc). V1 memory is the category-level dream pass built in entries 26-27; V2 memory is the structured-tip rewrite built in entry 30. Run 4 (entry 28) ended V1; Run 5 tests V2.

Phase 11 — The V2 memory rewrite

  • 29. The Classifier Cheat and the Honesty Check — a near-miss while building V2: a prompt-tuned classifier that almost snuck a thumb-on-scale into the lesson backfill, and why the fix was not a better prompt but no classifier at all.
  • 30. Building V2 — The Memory Rewrite — the seven-commit, one-day rewrite of the cross-run memory layer (V2, replacing V1's dream pass): structured tips, rule-prefix similarity, per-(tip, rule) hit-tracking, history-based eviction. Ships alongside V1 so Run 5 carries both rankings per prompt. Ends with seven graded bets for Run 5.

Phase 12 — Run 5, the deferred registry, and the CVE pivot

  • 31. Run 5: Grading the Bets — four of seven bets won, three lost. V2 is neutral on aggregate (+0.1pp) but better on loop shape (escalated attempts −35%). The immutable cascade moved from position 50 to position 11 and killed the aggregate; the ordering constraint for Run 6 emerges from that diagnosis.
  • 32. Three Tips, a Dead Clutch, and a Registry — Run 5's post-mortem exposed three separate architectural problems the aggregate hid: tips without causal mechanisms mislead the Worker, the clutch has been dead code since V5, and neither consolidation pass auto-runs. Seeded docs/deferred.md as the debt registry.
  • 33. The Second Skill: CVE Response — the pivot: every Track A runtime tuning we were about to ship was STIG-specific vocabulary dressed as architecture work. 90 minutes of research collapsed CVE from 1-2 weeks to one day (Vuls exists, ATLANTIS is open source, CVE-Bench doesn't fit our regime). The commercial landscape check confirmed no vendor ships autonomous host-level execution.
  • 34. Run 6: Ordering Works, Runtime Doesn't — fix rate 61.9% (+5.6pp vs Run 5). The ordering constraint completely closed the audit_rules_immutable cascade (position 84/84 vs 11/83). Mechanism field held at 100% across 781 tips. Auto-consolidation retired 356 low-utility tips. Cost: +4.8h wall time (14.3→19.1h). Runtime is now the binding constraint.
  • 35. Building the CVE Skill in a Day — eleven hours from decision to working MVP. Zero edits to the ralph loop; three harness extension points (FailureMode enum, ordering predicate, skill-dir map). Five ATLANTIS patterns adopted, six skipped. MVP smoke 3/3 first-try — validated plumbing, not architectural value. Four predictions for the first full CVE run.
  • 36. Per-Family Reboot Batching: The Architectural Decision Before We Built It — CVE Run 1 closed 29/29 non-reboot advisories first-try; the reboot-verify smoke proved the architecture fires end-to-end but exposed that batch-all loses attribution on failure. The per-package-family rewrite (pre-implementation) — safer than batch-all, faster than per-advisory, the production architecture the whitepaper deserves.
  • 37. Per-Family Reboot Batching Lands: 44 Advisories, Zero Escalations, One Clean Sweep — the design from entry 36 shipped in a single sprint. Four files moved (interfaces.py, ralph.py, the CVE runtime, the Architect prompt). The smoke against a stock Rocky 9 VM remediated 44/44 advisories in 35.5 minutes — 29 non-reboot + 15 reboot-required batched into 2 families (core-userland + kernel) with per-item attribution and safest-first ordering. The ghost-escalation bug and the Architect-SKIPs-reboot-items bug both got fixed and both stayed dead. resolve_deferred phase: 190s for 2 reboots across 15 items. The production story the whitepaper claims is now the story the code executes.
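
For readers who want a concrete picture of what entries 36 and 37 mean by per-family batching, here is a minimal sketch. It is an illustration under assumptions, not the code in ralph.py or the CVE runtime: the Advisory type, the FAMILY_RISK ranking, the function names, and the placeholder advisory IDs are all hypothetical. It shows only the idea as described above: group reboot-required advisories by package family, order families safest-first, reboot once per family, and still record a verdict per advisory.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical risk ranking (an assumption): lower number = safer, handled first.
FAMILY_RISK = {"core-userland": 0, "kernel": 1}


@dataclass(frozen=True)
class Advisory:
    advisory_id: str    # placeholder IDs, not real advisories
    family: str         # package family the advisory belongs to
    needs_reboot: bool  # True if the fix is only verifiable after a reboot


def plan_reboot_batches(advisories):
    """Group reboot-required advisories by family, safest family first."""
    by_family = defaultdict(list)
    for adv in advisories:
        if adv.needs_reboot:
            by_family[adv.family].append(adv.advisory_id)
    # Unknown families sort after all ranked ones, then alphabetically.
    ordered = sorted(
        by_family,
        key=lambda fam: (FAMILY_RISK.get(fam, len(FAMILY_RISK)), fam),
    )
    return [(fam, by_family[fam]) for fam in ordered]


def attribute_results(batches, verify):
    """Record one verdict per advisory after each family's single reboot."""
    results = {}
    for _family, advisory_ids in batches:
        # A real harness would reboot the VM once here; omitted in this sketch.
        for advisory_id in advisory_ids:
            results[advisory_id] = verify(advisory_id)
    return results


advisories = [
    Advisory("ADV-0001", "kernel", True),
    Advisory("ADV-0002", "core-userland", True),
    Advisory("ADV-0003", "core-userland", False),  # non-reboot, never batched
    Advisory("ADV-0004", "kernel", True),
]

batches = plan_reboot_batches(advisories)
# Stand-in verifier: pretend ADV-0004 fails its post-reboot check.
results = attribute_results(batches, verify=lambda aid: aid != "ADV-0004")
print(batches)
print(results)
```

Because verdicts are keyed by advisory rather than by batch, a failure inside the kernel family can be pinned on one item instead of the whole reboot, which is the attribution that batch-all loses.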