
GemmaForge: Architecture Brief

An exploration of Ralph loop architecture and Gemma 4 at the edge — building your own agentic harness, from scratch.

By Ken Rollins, Chief AI Technology Strategist at Dell Federal.

Repository: github.com/kenrollins/gemma-forge · Site: kenrollins.github.io/gemma-forge

Personal Exploration

This is a personal project. It is not a Dell product, reference architecture, or supported offering. Views and findings are the author's own.


What this is

GemmaForge demonstrates that a smaller open-weights model on commodity edge hardware, paired with the right harness architecture, can autonomously solve complex multi-step problems — learning from every failure, improving across runs, and producing a decision audit trail that meets emerging Federal AI standards.

The architecture combines two patterns:

  • Ralph loop persistence — the agent doesn't stop when it fails. It diagnoses, reverts, reflects, and retries until the problem is solved or the time budget expires.
  • Reflexion-style self-improvement — each failure produces a distilled lesson that prevents the same mistake on the next attempt. Lessons persist across runs via a SQLite memory store, so the system gets smarter over time without any code changes.

DISA STIG remediation on Rocky Linux 9 is the anchor use case because it exercises every interesting property of the architecture — persistence, revert-on-failure, verifiable outcomes, and real target-system side effects. But STIG is the witness, not the point. The harness is skill-agnostic: adding a new use case is a folder-per-skill exercise with no harness modifications.


The model: Gemma 4 31B Dense

Google's Gemma 4 (released April 2, 2026) is the first open-weights model family with native function calling and Day-0 vLLM support. GemmaForge uses the 31B Dense Instruct variant in bf16 full precision — no quantization, no compromises on reasoning quality.

Why bf16 over quantized variants? We tested four configurations on the same hardware. The bf16 full-precision configuration at TP=4 delivered the best balance of throughput and reasoning quality. The NVFP4 quantized variant promised a smaller memory footprint (the naive estimate was 15.5 GB) but the real footprint was 22 GB because attention layers stay in bf16 — and the reasoning quality degradation wasn't worth the modest VRAM savings.
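The footprint arithmetic behind that comparison is simple enough to sketch as a sanity check (the observed 22 GB exceeds the naive NVFP4 estimate precisely because attention layers stay in bf16, which this naive formula ignores):

```python
def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    """Naive weight memory: parameter count × bytes per parameter.
    With params in billions and GB = 1e9 bytes, the 1e9 factors cancel."""
    return params_billions * bytes_per_param

# NVFP4: 4-bit weights → 0.5 bytes/param → naive 15.5 GB for 31B params
naive_nvfp4 = weight_footprint_gb(31, 0.5)

# bf16: 2 bytes/param → 62 GB total, i.e. ~15.5 GB per GPU at TP=4
bf16_total = weight_footprint_gb(31, 2.0)
```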

Key model characteristics:

| Parameter | Value |
|---|---|
| Architecture | 31B Dense, bf16 full precision |
| Context window | 128K tokens (native) |
| Function calling | Native (Gemma 4 tool-call format) |
| Parallelism | Tensor Parallel = 4 across all 4 GPUs |
| KV cache | ~6.5 GB across 4 GPUs at bf16 (TP=4 shards the KV heads) |
| Throughput | ~14 tok/s sustained (TP=4 on 4× L4, no NVLink) |

The model serves all three agent roles (Architect, Worker, Reflector) through a single vLLM instance. This simplifies operations and keeps the supply chain to one model weight file.


The inference engine: vLLM

vLLM 0.19.0 provides the OpenAI-compatible REST interface. Key architectural decisions:

  • Direct REST, no proxy. No LiteLLM, no commercial API gateway. The harness talks directly to vLLM's /v1/chat/completions endpoint. This decision was driven by a March 2026 supply chain incident in the LiteLLM ecosystem.
  • Tensor Parallelism = 4 across all four L4 GPUs. This is determined by model architecture, not operator preference — the 31B Dense model's attention head count divides evenly across 4 GPUs.
  • --tool-call-parser gemma4 required. Without this flag, vLLM rejects Gemma 4's native tool-call format with a 400 error.
  • Continuous batching handles concurrent agent requests natively, which enables future parallel worker execution.
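The direct-REST pattern needs nothing beyond the standard library. A minimal sketch (the port and served-model name here are assumptions for illustration, not taken from the repo):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed default vLLM port

def build_request(messages: list[dict], tools: list[dict]) -> dict:
    """Build an OpenAI-style chat-completion body with native tool calling."""
    return {
        "model": "gemma-4-31b-it",   # illustrative served-model name
        "messages": messages,
        "tools": tools,
        "temperature": 0.2,
    }

def call_vllm(body: dict) -> dict:
    """POST straight to vLLM — no LiteLLM, no gateway in between."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```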

The hardware: Dell PowerEdge XR7620

| Component | Specification |
|---|---|
| Platform | Dell PowerEdge XR7620 (short-depth rugged edge server) |
| CPUs | 2× Intel Xeon Gold 6442Y (48 cores / 96 threads total) |
| Memory | 256 GB DDR5 |
| GPUs | 4× NVIDIA L4 24 GB (no NVLink between cards) |
| GPU Driver | NVIDIA 580 |
| OS | Ubuntu 24.04 LTS |
| Interconnect | PCIe Gen4 ×16 per GPU (no NVLink — each GPU is independent) |

The XR7620 is a 2U short-depth server designed for tactical edge deployment — data centers, forward operating bases, retail locations, or anywhere that needs GPU compute in a rugged, portable form factor. The same architecture applies to any Dell edge platform with NVIDIA GPUs: PowerEdge R760xa, XE9680, or the XR8620.

The no-NVLink constraint is deliberate: the L4 is a single-slot inference GPU without NVLink bridges. Tensor parallelism works over PCIe, but the bandwidth math is different — PCIe Gen4 ×16 provides ~32 GB/s per direction vs. NVLink's 600+ GB/s. This means TP=4 on L4s has higher inter-GPU latency than on A100/H100, but at the edge this is the trade-off: four affordable inference GPUs in a server you can carry under one arm.


The harness: Ralph loop with reflexion

The harness is ~4,000 lines of Python built on Google ADK for per-agent-turn machinery, with a Python-driven outer reflexion loop. The harness makes all structural decisions (retry policy, evaluation, revert, termination); the model makes all reasoning decisions (which item to work on, what approach to try, why it failed).

The loop

OUTER: Architect selects a work item from the task graph
INNER (time-budgeted per item):
  1. Worker generates a fix/change
  2. HARNESS evaluates deterministically (no LLM — real scanner)
  3. If PASS → checkpoint progress, advance to next item
  4. If FAIL → classify failure mode, revert, Reflector analyzes
  5. Reflector distills a one-sentence lesson → episodic memory
  6. Architect re-engages periodically: CONTINUE / PIVOT / ESCALATE
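The loop above can be sketched in Python. The interface names (`select_item`, `generate_fix`, and so on) are illustrative, not the repo's actual API:

```python
import time

def ralph_loop(architect, worker, evaluator, reflector, memory, budget_s: float):
    """Time-budgeted Ralph loop: keep retrying each item until it passes,
    the budget expires, or the architect escalates."""
    while (item := architect.select_item()) is not None:
        deadline = time.monotonic() + budget_s
        while time.monotonic() < deadline:
            fix = worker.generate_fix(item, memory.lessons_for(item))
            result = evaluator.evaluate(item, fix)      # deterministic, no LLM
            if result.passed:
                memory.checkpoint(item)                 # lock in progress
                break
            evaluator.revert(item)                      # undo the failed change
            lesson = reflector.distill(item, fix, result)
            memory.add_lesson(item, lesson)             # episodic memory
            if architect.decide(item) == "ESCALATE":
                break
```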

Three agent roles

| Role | Responsibility | Tools |
|---|---|---|
| Architect | Selects work items, plans approaches, decides when to pivot or escalate | Scan tool (skill-provided) |
| Worker | Generates and applies fixes/changes to the target | Apply tool (skill-provided) |
| Reflector | Analyzes failures, distills lessons, recommends bans | None (pure reasoning) |

Memory tiers

  • Working memory — per-attempt conversation. Cleared each turn via fresh ADK sessions to prevent context pollution.
  • Episodic memory — per-item attempt history. Distilled lessons (not raw text) keep the context compact.
  • Semantic memory — cross-item banned patterns, preferred approaches, and strategic lessons. Persists for the entire run.
  • Persistent memory — cross-run knowledge stored in SQLite. Lessons accumulate with learned weights. The harness starts smarter on Run 2+ with no code changes.
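The persistent tier needs nothing beyond stdlib `sqlite3`. A minimal sketch, with an illustrative schema (the repo's actual tables may differ):

```python
import sqlite3

def open_memory(path: str = "memory.db") -> sqlite3.Connection:
    """Cross-run lesson store; WAL mode allows concurrent readers."""
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")
    db.execute("""CREATE TABLE IF NOT EXISTS lessons (
        category TEXT, lesson TEXT, weight REAL DEFAULT 1.0,
        UNIQUE(category, lesson))""")
    return db

def record_lesson(db, category: str, lesson: str) -> None:
    # Re-learning the same lesson bumps its weight instead of duplicating it
    db.execute("""INSERT INTO lessons (category, lesson) VALUES (?, ?)
                  ON CONFLICT(category, lesson) DO UPDATE SET weight = weight + 1""",
               (category, lesson))
    db.commit()

def top_lessons(db, category: str, n: int = 5) -> list[str]:
    """Highest-weight lessons for a category, injected into Run 2+ prompts."""
    return [row[0] for row in db.execute(
        "SELECT lesson FROM lessons WHERE category=? ORDER BY weight DESC LIMIT ?",
        (category, n))]
```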

Evaluation triage

The evaluator classifies every failure into one of four modes, and the harness responds differently to each:

| Mode | Meaning | Response |
|---|---|---|
| Health failure | The fix broke the target | Immediate revert |
| Evaluator gap | Target healthy but evaluator says fail | Counts toward scanner-gap early escalation |
| False negative | Evaluator passed but noise triggered revert | Accept the fix (don't revert good work) |
| Clean failure | Normal failure | Revert + reflect + retry |
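The branching shape of that triage can be sketched as a single function. The real evaluator uses richer signals than three booleans; this is a simplification:

```python
def triage(evaluator_pass: bool, target_healthy: bool, noise_suspected: bool) -> str:
    """Simplified mapping of evaluator signals to the four failure modes."""
    if not target_healthy:
        return "health_failure"     # fix broke the target → immediate revert
    if evaluator_pass:
        # a pass accompanied by revert-triggering noise is a false negative
        return "false_negative" if noise_suspected else "pass"
    if noise_suspected:
        return "evaluator_gap"      # counts toward scanner-gap early escalation
    return "clean_failure"          # revert + reflect + retry
```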

Adaptive concurrency (the clutch)

The harness learns per-category difficulty from prior runs and sets worker concurrency accordingly:

  • Categories with >90% historical success → up to 3 parallel workers
  • Categories with 50-90% → 2 workers
  • Categories with <50% → serial (avoid wasting GPU on doomed items)
  • First run (no data) → serial by default

The "clutch" metaphor: transfer power from the engine (GPU) to the wheels (workers) based on road conditions (learned difficulty).
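The clutch reduces to a small lookup over learned success rates. A sketch (function and parameter names are illustrative):

```python
def workers_for(category: str, success_rates: dict[str, float]) -> int:
    """Set worker concurrency from per-category success rates learned in prior runs."""
    rate = success_rates.get(category)
    if rate is None:
        return 1        # first run: no history, stay serial
    if rate > 0.90:
        return 3        # easy road: up to 3 parallel workers
    if rate >= 0.50:
        return 2
    return 1            # hard category: don't waste GPU on doomed items
```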


Skill-agnostic architecture

The harness operates on five abstract interfaces. Skills implement them for their domain:

| Interface | Purpose | STIG implementation |
|---|---|---|
| WorkQueue | Produce work items | OpenSCAP scan |
| Executor | Apply changes | SSH to VM |
| Evaluator | Check results | OpenSCAP + health checks |
| Checkpoint | Save/restore state | libvirt VM snapshots |
| SkillRuntime | Bundle the above | STIG-specific wiring |

Adding a new skill: create a skills/<name>/ folder with a manifest, prompts, and a runtime.py implementing the five interfaces. No harness code changes. The same task graph, parallelism, evaluation triage, and cross-run memory work for any skill.
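The five interfaces can be expressed as structural types. A sketch using `typing.Protocol` (method names here are illustrative, not the repo's actual signatures):

```python
from typing import Any, Protocol

class WorkQueue(Protocol):
    def pending_items(self) -> list[dict]: ...        # e.g. parse an OpenSCAP scan

class Executor(Protocol):
    def apply(self, item: dict, change: str) -> None: ...   # e.g. run over SSH

class Evaluator(Protocol):
    def evaluate(self, item: dict) -> bool: ...       # deterministic pass/fail

class Checkpoint(Protocol):
    def save(self) -> Any: ...                        # e.g. take a VM snapshot
    def restore(self, snapshot: Any) -> None: ...

class SkillRuntime(Protocol):
    """A skill bundles concrete implementations of the four interfaces above."""
    work_queue: WorkQueue
    executor: Executor
    evaluator: Evaluator
    checkpoint: Checkpoint
```

With structural typing, a `skills/<name>/runtime.py` satisfies the harness simply by providing matching methods; it never imports or subclasses harness classes.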

Example future skills:

  • Whitepaper generation — work items are sections; evaluator checks formatting + coherence; checkpoint is git commits
  • Code refactoring — work items are modules; evaluator runs tests; checkpoint is git branches
  • Certificate rotation — work items are certs; evaluator checks TLS handshake; checkpoint is cert store backup

Observability

| Component | Role |
|---|---|
| OpenTelemetry | Instrumentation standard — spans emitted once, consumed by multiple backends |
| Jaeger | Distributed tracing — per-request trace visualization |
| Prometheus | Metrics collection — throughput, latency, GPU utilization |
| Grafana | Dashboards — operational monitoring |
| GemmaForge Dashboard | Live task graph heatmap, agent activity, event stream |

The dashboard renders a waffle-chart heatmap of all work items, color-coded by state (green = completed, cyan = active, amber = escalated, gray = queued). Categories are visually grouped so the audience can see progress patterns at a glance. An interactive React Flow DAG view provides zoom/pan/click-to-inspect for dependency exploration.


Results (v5, first complete run in progress)

| Metric | Value |
|---|---|
| Rules scanned | 270 |
| Remediated | 80+ (run in progress) |
| Escalated | 38 |
| Throughput | 21.1 rules/hour |
| First-try success rate | ~79% |
| Context overflow errors | 1 (down from 8 in v3) |
| Scanner-gap early escalations | 105 (new in v4/v5) |
| Total completion tokens | 230K+ |
| Cross-run lessons persisted | 180 |

Architecture evolution

| Version | Capability | Throughput |
|---|---|---|
| v1 | Basic retry loop | — |
| v2 | Reflexion (reflect on failure) | 2.8/hr |
| v3 | Episodic memory + architect re-engagement | 12.5/hr |
| v4 | Skill-agnostic interfaces + task graph + evaluation triage | 22.4/hr |
| v5 | Cross-run memory (SQLite) + adaptive concurrency clutch | 21.1/hr (first run, serial) |

Each version is documented in the developer journal with honest failures, pivots, and discoveries.


Decision provenance and Federal AI compliance

Every autonomous action the harness takes is captured in a structured JSONL event stream with full provenance:

  • What was attempted and why
  • What the evaluator found
  • Why the reflector said it failed
  • What the architect decided (CONTINUE / PIVOT / ESCALATE)
  • What lessons were distilled for future attempts
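Appending one of these records is a one-liner per decision. A sketch of the JSONL emitter (the field names and the STIG rule ID in the comment are illustrative):

```python
import json
import time

def emit_event(stream, kind: str, **fields) -> dict:
    """Append one structured provenance record to the JSONL event stream."""
    record = {"ts": time.time(), "kind": kind, **fields}
    stream.write(json.dumps(record) + "\n")
    return record

# e.g. emit_event(log, "architect_decision", item="RHEL-09-xxxxxx",
#                 decision="PIVOT", reason="same lesson twice in a row")
```

One line per event keeps the stream append-only and trivially greppable, which is exactly what an after-the-fact audit needs.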

This aligns with the NIST AI Agent Standards Initiative (February 2026) requirements for chain-of-custody logging, prompt provenance, and audit trails for autonomous agent actions. The decision trace is not a bolt-on compliance feature — it is how the architecture works.


Technology stack summary

| Layer | Component | Why |
|---|---|---|
| Model | Gemma 4 31B Dense bf16 | Open weights, native tool calling, Day-0 vLLM support |
| Inference | vLLM 0.19.0, TP=4 | Direct OpenAI-compatible REST, continuous batching |
| Harness | Python + Google ADK | Ralph loop + reflexion, skill-agnostic interfaces |
| Memory | SQLite (stdlib) | Zero-dependency cross-run persistence, WAL for concurrency |
| Target | libvirt VM + virsh snapshots | Two-tier revert safety (script + full-state snapshot) |
| Observability | OTel + Jaeger + Prometheus + Grafana | Federal-credible, no vendor lock-in |
| Frontend | Next.js + React Flow | Live heatmap + interactive DAG + activity ticker |
| Hardware | Dell PowerEdge XR7620, 4× L4 | Rugged edge, no NVLink, air-gappable |

How to learn more

Start here:

If you have 15 minutes:

If you want the full story:

If you're building something similar:

  • Gotchas — 13 atomic "X breaks Y because Z" lessons that cost hours to discover
  • Adding a Skill — how to author a new skill for the harness