The Journey

First-person field notes of how GemmaForge was built. Chronological, honest, specific. Each entry is scoped to a single moment in the project — a decision, a discovery, a refactor, or a postmortem — and is meant to be readable on its own.

How to read this

  • Chronological: entries are numbered in the order they happened. Decimal numbers (00.5, 06.5, etc.) mark mid-project entries added retroactively to cover moments the original numbering missed.
  • Self-contained: each entry opens with a one-sentence hook and a short section explaining why the moment deserves its own entry.
  • Cross-linked: every entry lists its related entries in frontmatter, rendered at the top of the page, and links to them in the body where relevant.
  • Tagged: every entry has layer, pattern, moment, and optional domain tags so you can find entries by topic in the site search.
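As a rough sketch of what an entry's frontmatter could look like under the conventions above (the exact field names and values here are illustrative assumptions, not the site's actual schema):

```yaml
# Hypothetical frontmatter for entry 06.5 — field names are illustrative,
# not the site's real schema.
title: "The Stateful Loop Refactor"
entry: "06.5"
related: ["06", "07"]   # cross-links shown at the top of the entry
layer: harness          # which layer of the stack this moment touches
pattern: stateful-loop  # the recurring pattern the entry illustrates
moment: refactor        # decision / discovery / refactor / postmortem
# domain: (optional)
```

The tag fields are what the site search indexes, so browsing by `layer` or `moment` is an alternative to reading front to back.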

Entries, in order

Phase 0 — Starting from scratch

Phase 1 — The inference layer

Phase 2 — The target VM

  • 04. VM Provisioning — OpenTofu + libvirt v0.9.7 + Rocky 9, and an hour of debugging a GRUB hang caused by missing ACPI features.

Phase 3 — The harness

  • 06. Tool Calling — getting Gemma 4 to actually call tools through vLLM and ADK, and realizing our first "loop" was a script pretending to be an agent.
  • 06.5. The Stateful Loop Refactor — replacing ADK's LoopAgent with a Python-driven outer loop and fresh per-turn sessions.
  • 07. The Skills System — pulling STIG-specific logic into a skill manifest so other use cases are a folder-copy away.
  • 07.5. Virsh Console Fallback — the out-of-band recovery path for when SSH+sudo is broken, and the honest documentation of its current bug.

Phase 4 — Iterating on the architecture

Phase 5 — Observability

Phase 6 — The reflexion architecture

Phase 7 — The overnight run and its aftermath

Phase 8 — The second overnight run and v4

  • 18. The Second Overnight Run — 93 rules remediated (78%), the time-waste ratio in the other 26, and three architectural findings for v4.
  • 19. Standing on Whose Shoulders? — research validation of our choices, the literature landscape, and the v4 interface extraction decision.
  • 20. The Interface Extraction — ripping the engine apart mid-flight: five interfaces, a STIG runtime, and 75 tests that still passed.
  • 21. The Task Graph — from flat queue to live DAG: dependency awareness, conflict detection, and a React Flow visualization.
  • 22. Context Graphs and the Memory Question — the research spiral from decision provenance to NIST requirements to "do we even need a database?" — and how the clutch mechanism answered the question.

Phase 9 — The first complete run and cross-run learning

  • 23. The First Complete Run — 270 rules, 13.5 hours, 85 remediated, 157 escalated — and the discovery that the cross-run memory system was storing everything but teaching nothing.
  • 24. Run 2 — Cross-Run Learning — the fix landed: 59 rules flipped from escalated to remediated and the fix rate jumped from 35% to 58%; then Run 2 exposed a new cascade and the uncomfortable question of whether memory that was right yesterday can be wrong tomorrow.
  • 25. Run 3 — When the Learning Curve Bends — 60% fix rate, diminishing returns, the environment fidelity problem showing up in real data.