System Architecture
This architecture wasn't designed top-down. It emerged from building the system, running it, watching it fail, and fixing what broke. Thirty-seven journal entries of iteration — plus two shipping skills (STIG and CVE Response) — is where we ended up, not where we planned to be on day one. Every component choice has a story behind it, and most of those stories involve trying something else first.
The rest of this page is the map. Two diagrams give the mental model: a flow view of the reflexion loop and a responsibility view of the harness/skill boundary. A colored 5-layer stack below shows what gemma-forge runs at each layer. Two tables at the bottom consolidate industry alternatives and the architectural patterns that are the actual transfer value — the ideas travel even when the tool choices don't.
The architecture at a glance
Two views of the same system:
- The reflexion loop — how one work item moves through the four agent roles. This is the flow the live dashboard renders.
- The skill boundary — what lives in the fixed harness versus what a skill provides through five Protocol interfaces. This is the thesis: the core doesn't change, the skills plug in.
The reflexion + Ralph loop
```mermaid
flowchart LR
    Scan([WorkQueue.scan<br/>produces item queue]) --> Arch
    Arch[Architect<br/><i>picks item + plans</i>] --> Work[Worker<br/><i>applies fix</i>]
    Work --> Eval{"Evaluator<br/><i>deterministic</i><br/><i>(not LLM)</i>"}
    Eval -->|pass| Done[Remediated<br/>+ checkpoint saved]
    Eval -->|clean failure| Refl[Reflector<br/><i>distills lesson</i>]
    Eval -->|health failure| Revert[Revert snapshot]
    Eval -->|deferrable<br/>e.g. needs_reboot| Defer[Post-loop:<br/>resolve_deferred]
    Revert --> Refl
    Refl -->|distilled lesson<br/>Architect re-plans| Arch
    Done -.->|next item| Arch
    classDef scan fill:#0f172a,stroke:#9CA3AF,color:#E5E7EB
    classDef architect fill:#0b2942,stroke:#3B82F6,color:#DBEAFE
    classDef worker fill:#422006,stroke:#F59E0B,color:#FDE68A
    classDef eval fill:#1f2937,stroke:#6B7280,color:#E5E7EB
    classDef reflector fill:#1f1b2e,stroke:#A855F7,color:#EDE9FE
    classDef done fill:#064e3b,stroke:#10B981,color:#ECFDF5
    classDef fail fill:#7f1d1d,stroke:#EF4444,color:#FEE2E2
    classDef defer fill:#78350f,stroke:#F59E0B,color:#FFFBEB
    class Scan scan
    class Arch architect
    class Work worker
    class Eval eval
    class Refl reflector
    class Done done
    class Revert fail
    class Defer defer
```
Two feedback loops are overlaid here.
- Solid arrow (Reflector → Architect) — reflexion within a single item. When an attempt fails, the Reflector distills a lesson, and the Architect re-plans for the next attempt on the same item with that lesson in context. The lesson lives in episodic memory (per-item, ephemeral).
- Dashed arrow (Remediated → Architect) — Ralph persistence across items. When an item finishes (pass or escalate), the harness picks the next item from the queue and grinds on. The outer loop stops only when the queue is empty or the wall-clock budget is exhausted.
Neither arrow is where cross-run memory gets written. That happens at item completion (promoting distilled lessons into semantic memory for this run) and at end-of-run (promoting the run's best signal, both successful approaches and failed-attempt lessons, into persistent memory via the dream pass). The four-tier memory diagram further down this page shows that full lifecycle.
Colors match the live dashboard's agent pipeline: Architect blue,
Worker amber, Reflector purple. All three are LLM roles running
on the same Gemma 4 deployment through fresh ADK sessions per
turn. The Evaluator is gray because it's not an LLM — it's
whatever skill-provided code decides whether the target is now in
the desired state (OpenSCAP for STIG, dnf updateinfo for CVE,
etc.).
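The two overlaid loops can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the contents of gemma_forge/harness/ralph.py: the function name `ralph_loop`, the callable signatures, the `max_attempts` cap, and the outcome strings are all invented for the example, and the revert and deferred branches are left out.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List


def ralph_loop(
    queue: Deque[str],
    architect: Callable[[str, List[str]], str],   # (item, lessons) -> plan
    worker: Callable[[str], None],                # applies the plan
    evaluator: Callable[[str], bool],             # deterministic check, not an LLM
    reflector: Callable[[str, str], str],         # (item, plan) -> lesson
    max_attempts: int = 3,                        # hypothetical per-item cap
    budget_s: float = 60.0,                       # hypothetical wall-clock budget
) -> Dict[str, str]:
    """Outer loop walks the queue; inner loop retries one item with lessons."""
    outcomes: Dict[str, str] = {}
    deadline = time.monotonic() + budget_s
    # Ralph persistence: keep going until the queue is empty or time runs out.
    while queue and time.monotonic() < deadline:
        item = queue.popleft()
        lessons: List[str] = []          # episodic memory: per-item, ephemeral
        outcomes[item] = "escalated"     # default if every attempt fails
        # Reflexion: each failed attempt feeds a lesson into the next plan.
        for _ in range(max_attempts):
            plan = architect(item, lessons)
            worker(plan)
            if evaluator(item):
                outcomes[item] = "remediated"
                break
            lessons.append(reflector(item, plan))
    return outcomes
```

Keeping `lessons` local to one iteration of the outer loop mirrors the per-item scope of episodic memory: nothing from one item leaks into the next unless a separate promotion step copies it.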
The skill boundary
The harness is fixed. Skills plug in through five Protocol interfaces.
The table below shows the same contract implemented two ways — once
for STIG, once for CVE. The Protocol column is the constant; the
right two columns change completely between skills. None of the
differences between STIG and CVE reach the Ralph loop; both skills
boot from the same ./bin/forge run <skill> entry point.
The amber-tinted bottom row is the extension point CVE added. It
took one new dataclass (DeferredItemOutcome), one new callback
type (EmitEvent), and three new FailureMode enum values —
landing in a single commit to the harness. STIG never touches any
of it. Adding a third skill follows the same recipe: implement the
five interfaces, declare whether you need resolve_deferred, and
./bin/forge run <your-skill> boots the exact same Ralph loop.
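As a rough picture of what such a contract can look like in Python, the sketch below uses `typing.Protocol`. The interface names, `WorkQueue.scan`, `resolve_deferred`, `DeferredItemOutcome`, and `EmitEvent` appear on this page; every other method name, signature, and dataclass field is a guess made for illustration, as are `run_deferred` and `ToyRebootSkill`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, runtime_checkable

# Hypothetical signatures throughout; only the names listed in the
# lead-in come from the page itself.


@runtime_checkable
class WorkQueue(Protocol):
    def scan(self) -> List[str]: ...            # produce the item queue


@runtime_checkable
class Executor(Protocol):
    def apply(self, item: str, plan: str) -> str: ...


@runtime_checkable
class Evaluator(Protocol):
    def evaluate(self, item: str) -> bool: ...  # deterministic, not an LLM


@runtime_checkable
class Checkpoint(Protocol):
    def save(self, label: str) -> None: ...
    def restore(self, label: str) -> None: ...


@runtime_checkable
class SkillRuntime(Protocol):
    def describe(self) -> str: ...


@dataclass
class DeferredItemOutcome:
    item: str
    passed: bool


EmitEvent = Callable[[str, Dict[str, object]], None]


def run_deferred(skill: object, items: List[str], emit: EmitEvent) -> List[DeferredItemOutcome]:
    """Call the optional hook only when the skill provides it."""
    hook = getattr(skill, "resolve_deferred", None)
    if hook is None:
        return []
    return hook(items, emit)


class ToyRebootSkill:
    """Toy skill: satisfies WorkQueue and opts in to resolve_deferred."""

    def scan(self) -> List[str]:
        return ["kernel-update", "glibc-update"]

    def resolve_deferred(self, items: List[str], emit: EmitEvent) -> List[DeferredItemOutcome]:
        emit("deferred_batch_start", {"count": len(items)})
        return [DeferredItemOutcome(item=i, passed=True) for i in items]
```

Structural typing is the point of the sketch: `ToyRebootSkill` never inherits from a harness class, so a skill that does not need the deferred phase simply omits the method and the harness-side check returns an empty list.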
The four memory tiers
Memory flows outward from each agent turn into progressively longer-lived stores. Each tier is scoped to a specific lifespan and has a distinct retrieval discipline.
```mermaid
flowchart TD
    W["<b>Working</b> — per-attempt<br/>Raw agent messages, tool call results.<br/>Cleared each turn via fresh ADK session."]
    E["<b>Episodic</b> — per-item<br/>Attempt history + distilled lessons.<br/>Retrieved into prompts on subsequent attempts."]
    S["<b>Semantic</b> — per-run<br/>Cross-item banned patterns + strategic lessons.<br/>Token-budgeted into every prompt this run."]
    P["<b>Persistent (V2)</b> — cross-run<br/>Structured tips with causal mechanism.<br/>Postgres + Neo4j / Graphiti. Per-(tip, rule) utility tracking."]
    W -->|Reflector distills lesson| E
    E -->|on item success<br/>promote strategic lessons| S
    S -->|end-of-run consolidation<br/>+ dream pass| P
    P -.->|rule-prefix retrieval<br/>at new item / new run| E
    classDef working fill:#3A5A8C,stroke:#93C5FD,color:#DBEAFE
    classDef episodic fill:#1E3A5F,stroke:#60A5FA,color:#DBEAFE
    classDef semantic fill:#0A2955,stroke:#3B82F6,color:#DBEAFE
    classDef persistent fill:#020A24,stroke:#1E3A8A,color:#DBEAFE
    class W working
    class E episodic
    class S semantic
    class P persistent
```
The dashed arrow from Persistent back to Episodic is the whole point of the V2 memory rewrite: tips from prior runs get pulled into the current item's context via rule-prefix similarity, so a fresh run on Day 2 starts smarter than a fresh run on Day 1 without any code changes. The dream pass promotes raw lessons into structured tips with causal mechanism fields at run-end; the Phase H eviction policy retires tips once enough evidence shows they are low-utility. See ADR-0016 for why SQLite (V1) was retired, and journey/30 for the V2 rewrite details.
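The promotion path between tiers can be modeled with plain in-memory containers. The sketch below is an illustration only: the class `TieredMemory`, its method names, the example rule IDs, and the `min_prefix` threshold are assumptions, "rule-prefix similarity" is approximated as shared leading characters of two rule IDs, and the Postgres, Neo4j / Graphiti, dream-pass, and eviction pieces are not represented.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TieredMemory:
    working: List[str] = field(default_factory=list)                 # per-attempt
    episodic: Dict[str, List[str]] = field(default_factory=dict)     # per-item
    semantic: List[Tuple[str, str]] = field(default_factory=list)    # per-run
    persistent: List[Tuple[str, str]] = field(default_factory=list)  # cross-run

    def distill(self, rule_id: str, lesson: str) -> None:
        """Working -> Episodic: keep the lesson, drop the raw turn."""
        self.episodic.setdefault(rule_id, []).append(lesson)
        self.working.clear()

    def promote_on_success(self, rule_id: str) -> None:
        """Episodic -> Semantic (simplified: promotes every lesson)."""
        for lesson in self.episodic.pop(rule_id, []):
            self.semantic.append((rule_id, lesson))

    def consolidate(self) -> None:
        """Semantic -> Persistent at end of run."""
        self.persistent.extend(self.semantic)
        self.semantic.clear()

    def retrieve(self, rule_id: str, min_prefix: int = 8) -> List[str]:
        """Persistent -> Episodic: tips whose rule ID shares a long prefix."""
        scored = []
        for tip_rule, tip in self.persistent:
            shared = len(os.path.commonprefix([rule_id, tip_rule]))
            if shared >= min_prefix:
                scored.append((shared, tip))
        # Longest shared prefix first; ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [tip for _, tip in scored]
```

With this toy model, a lesson learned on a hypothetical rule `sshd_set_idle_timeout` would be retrievable for `sshd_set_keepalive` in a later run, because the two IDs share the prefix `sshd_set_`.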
The event substrate: structured JSONL run logger
Every event the harness emits — agent turns, evaluator verdicts,
memory retrievals, checkpoint operations, tool calls, deferred-
verification progress, GPU snapshots — lands as a single JSON line
in runs/run-<timestamp>.jsonl. That file is not a debug log. It
is the substrate.
Why this decision mattered
The structured run logger was built early (see journey/12.5) and became load-bearing for almost everything that came after:
- Live dashboard — the /api/live-stream SSE endpoint tails the active JSONL and ships events to the UI.
- Replay UI — client-paced replay streams the same JSONL through a RAF loop, so historical runs render with the same fidelity as live ones.
- Cross-run memory mining — the V2 dream pass and consolidation phase read past JSONLs to distill structured tips. No JSONL, no cross-run learning.
- NIST-grade decision provenance — the JSONL is the audit trail: every autonomous action, why it was taken, what the evaluator found, and what lesson was stored. Compliance is not a bolt-on; it's how the architecture works. See the alignment with the NIST AI Agent Standards Initiative.
- Post-mortems — journey entries from Run 6 onward cite specific elapsed_s offsets in JSONL files. The narrative record and the execution record are the same record.
The rule that keeps the substrate useful: no event is
observable that isn't JSONL-captured. When we added
family-level progress events to resolve_deferred
(entry 37),
they were JSONL-first; the UI rendering came after. OTel
adds distributed tracing on top for SRE-style performance
debugging, but JSONL is the authoritative record.
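An append-only JSONL substrate of this kind needs very little code. The sketch below is an assumption-laden illustration rather than the project's logger: the class `RunLogger`, the `replay` helper, and the `type` field are hypothetical, while `elapsed_s` is the field name cited above.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict, Iterator


class RunLogger:
    """Append one JSON object per line; the file is the record of the run."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.start = time.monotonic()

    def emit(self, event_type: str, **payload: Any) -> Dict[str, Any]:
        record = {
            "type": event_type,                                     # hypothetical field
            "elapsed_s": round(time.monotonic() - self.start, 3),   # offset into the run
            **payload,
        }
        # Open in append mode per event so a crash never loses earlier lines.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, sort_keys=True) + "\n")
        return record


def replay(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield events in the order they were written, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Because every consumer (dashboard, replay, memory mining, audit) reads the same file through something like `replay`, adding a new event type is a writer-side change only; readers that do not know the type can ignore it.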
The 5-Layer Stack with Components
Component-by-component view of where each piece of gemma-forge lives. Layer bands match the 5-Layer Enterprise AI Partner Map colors used elsewhere on the site so the visual language is consistent.
[Figure: 5-layer stack of gemma-forge components. Surviving labels: "dnf advisory, per-family reboot batching"; "gemma_forge/harness/ralph.py — the outer reflexion loop, skill-agnostic"; "FunctionTool for tool calls"; "resolve_deferred / EmitEvent".]
Industry Alternatives
How each layer maps to the broader ecosystem. If you can't use what gemma-forge uses, these are the entries you'd look at instead. The architectural patterns further down apply regardless of which vendor or open-source alternative you pick — they are properties of the layer, not of any specific implementation. That is the transfer value of this project: the ideas travel, even when the tool choices don't.
| Layer | Open source | Enterprise | gemma-forge |
|---|---|---|---|
| 5 — Application | Open WebUI | Harvey, Veeva AI, Glean | STIG + CVE Response skills, Dashboard, this Journal |
| 4 — Orchestration | LangChain / LangGraph, LlamaIndex, CrewAI, Google ADK | Microsoft Agent Framework | Ralph Loop + ADK + five Protocol interfaces + V2 memory (Postgres + Neo4j / Graphiti) |
| 3 — Model | Llama 3.x, Mistral / Mixtral, Phi-3, Qwen, DeepSeek, Gemma 4 | GPT-5, Claude, Gemini API | Gemma 4 31B Dense bf16 |
| 3 — Model (inference engine) | vLLM, NVIDIA Triton | TensorRT-LLM, NVIDIA NIM | vLLM 0.19.0, TP=4 across 4× L4 |
| 2 — Platform / MLOps | OTel + Jaeger + Prometheus + Grafana, Langfuse, MLflow, W&B | Arize AI, Datadog LLM | OTel + Jaeger + Prometheus + Grafana |
| 1 — Infrastructure | MinIO, Ceph, Postgres + pgvector, Proxmox VE, libvirt + KVM | Snowflake, Databricks, Weka, VMware vSphere | Dell PowerEdge XR7620 + 4× NVIDIA L4 + libvirt + Rocky 9 + OpenTofu |
A few notes on the choices that aren't obvious from the table:
- L3 model serving without a proxy. gemma-forge calls vLLM's /v1/chat/completions directly, no LiteLLM or commercial gateway — a direct consequence of the March 2026 supply-chain incident documented in journey/03.5.
- L1 infrastructure is deliberately unexciting. Rocky 9 + libvirt + OpenTofu are well-understood, boring components. The interesting work lives at L3 and L4; L1 stays out of the way.
- No vector store. The V2 memory system uses rule-prefix similarity and Graphiti knowledge-graph queries rather than embedding search. See ADR-0016 for the rationale.
Architectural Patterns
Patterns are reusable design ideas that travel independent of which vendor or open-source alternative you pick. The table maps each one to the layer where most of its logic lives, any secondary layer it touches, and where the mechanism is implemented.
| Pattern | Primary | Secondary | Where it lives |
|---|---|---|---|
| skill-authoring | L5 | L4 | Five Protocol interfaces (WorkQueue, Executor, Evaluator, Checkpoint, SkillRuntime) plus optional EvaluatorMetadata, DeferredItemOutcome, and EmitEvent. See adding-a-skill. |
| reflexion-loop | L4 | — | Outer retry loop, architect re-engagement, content-set plateau detection. See Failure Modes §5. |
| tool-calling | L4 | L3 | Per-turn action budget defaulting to 1. The harness defines the contract; the model's native tool-call support determines what's possible. See Failure Modes §1. |
| context-management | L4 | L3 | Deterministic prompt-budget assembly with distilled episodic memory. The harness assembles the prompt; the inference engine enforces the window. See Failure Modes §6. |
| snapshot-revert | L4 | L1 | Decision policy at L4, mechanism at L1. Hypervisor-level snapshots defeat anything the executor can break. See Failure Modes §2. |
| deferred-verification | L4 | — | deferrable_failure_modes + resolve_deferred + DeferredItemOutcome + EmitEvent. Items that can't be verified in the moment (reboots, propagation waits) batch to a post-loop phase. See Failure Modes §7 and journey/37. |
| parallelism | L3 | — | TP/PP choice, NVLink vs PCIe, multi-GPU bandwidth. Gemma 4 31B Dense at TP=4 on L4s without NVLink. See journey/10. |
| quantization | L3 | — | NVFP4 vs bf16 tradeoffs, the VRAM math, when quantization helps vs hurts throughput. See journey/02 and gotchas/nvfp4-vram-math. |
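The parallelism and quantization rows rest on simple weight-footprint arithmetic, which the sketch below works through for the 31B model at TP=4. It is a back-of-the-envelope calculation under explicit assumptions: 24 GB of VRAM per L4 (taken from NVIDIA's published spec, not from this page), weights sharded evenly across GPUs, 1 GB = 1e9 bytes, and no accounting for KV cache, activations, or quantization scale factors. The helper names are hypothetical.

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes); weights only."""
    return params_billion * bytes_per_param


def per_gpu_gb(total_gb: float, tensor_parallel: int) -> float:
    """Per-GPU share, assuming tensor parallelism shards weights evenly."""
    return total_gb / tensor_parallel


L4_VRAM_GB = 24.0  # assumption: NVIDIA L4 spec, not stated on this page

# bf16 stores 2 bytes per parameter.
bf16_total = weight_gb(31, 2.0)
bf16_per_gpu = per_gpu_gb(bf16_total, 4)
bf16_headroom = L4_VRAM_GB - bf16_per_gpu   # left for KV cache, activations

# A 4-bit format stores 0.5 bytes per parameter before scale overhead.
fp4_total = weight_gb(31, 0.5)

print(f"bf16 weights: {bf16_total} GB total, {bf16_per_gpu} GB per GPU")
print(f"bf16 headroom per L4: {bf16_headroom} GB")
print(f"4-bit weights (no scales): {fp4_total} GB total")
```

Under these assumptions the bf16 weights do not fit on a single 24 GB card but do fit when split four ways, which is the constraint that the TP=4 choice resolves; whether quantization then helps or hurts throughput is a separate question that this arithmetic does not answer.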
Further reading
- Failure Modes in Reflexive Agent Harnesses — the project-agnostic taxonomy of seven failure modes with prescribed harness mechanisms for each.
- Adding a Skill — how to implement the five Protocol interfaces and wire into the harness's optional extension points.
- Developer Journal — 37 chronological field notes on how this was built. For the current state of the architecture: journey/33 — The Second Skill, journey/34 — Run 6, and journey/37 — Per-Family Reboot Batching Lands.
- Gotchas — atomic "X breaks Y because Z" lessons, each tagged to a specific layer.