
gemma-forge

An exploration of Ralph loop architecture and Gemma 4 at the edge — building your own agentic harness, from scratch.

By Ken Rollins, Chief AI Technology Strategist in Dell Federal.


What this is

When Google released Gemma 4 in April 2026 with native function calling and Day-0 vLLM support, I saw an opportunity to explore a question that had been nagging me: can a smaller open-weights model at the tactical edge solve real problems autonomously if you give it the right harness?

The bet: a smaller model can punch above its weight if the harness around it is doing the right work. Specifically, the harness combines two architectures I hadn't seen used together before: Ralph loop persistence, where an agent doesn't quit when it fails but keeps grinding, using external state to persist across context boundaries, and Reflexion-style self-improvement, where each failure produces a self-critique that makes the next attempt smarter. I wanted to build that combined harness from scratch, understand every design decision firsthand, and run it on a Dell PowerEdge XR7620 with four NVIDIA L4 GPUs. No cloud dependency. No internet required. Everything local.
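To make the combination concrete, here is a minimal sketch of the two ideas working together. It is not the gemma-forge code: the function names, the JSON state file, and the toy "model" are all assumptions for illustration. The loop keeps retrying (Ralph-style persistence), writes its state to disk after every attempt so a fresh context can resume, and turns each failure into a lesson that the next attempt can read (Reflexion-style).

```python
import json
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read persisted state, or start fresh if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"attempts": 0, "lessons": [], "solved": False}


def run_loop(path, attempt, evaluate, reflect, max_attempts=10):
    """Retry until the evaluator passes or the attempt budget runs out.

    State lives on disk, so a new process (or a new model context)
    can pick up where the previous one stopped.
    """
    state = load_state(path)
    while not state["solved"] and state["attempts"] < max_attempts:
        candidate = attempt(state["lessons"])   # a model call in a real harness
        ok, feedback = evaluate(candidate)      # deterministic pass/fail check
        state["attempts"] += 1
        if ok:
            state["solved"] = True
        else:
            # Reflexion step: distill the failure into a reusable lesson.
            state["lessons"].append(reflect(candidate, feedback))
        path.write_text(json.dumps(state))      # persist after every attempt
    return state


# Toy stand-ins (hypothetical): the "model" succeeds once it has two lessons.
def toy_attempt(lessons):
    return len(lessons)


def toy_evaluate(candidate):
    return candidate >= 2, f"got {candidate}, need at least 2"


def toy_reflect(candidate, feedback):
    return f"attempt {candidate} failed: {feedback}"


state_file = Path(tempfile.mkdtemp()) / "state.json"
final = run_loop(state_file, toy_attempt, toy_evaluate, toy_reflect)
print(final["attempts"], final["solved"])  # 3 True
```

Calling `run_loop` again with the same file returns immediately, because the persisted state already records the item as solved.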

Why "gemma-forge" as a project name? Gemma, obviously, because this is built around Google's Gemma 4 model. And Forge because of what the system represents — a controlled environment where raw material gets heated, shaped, and refined through repeated cycles until it becomes something useful. Raw model output goes in; the reflexion loop hammers it against a deterministic evaluator; failures get reflected on and fed back; and what comes out is a refined solution, or an honest explanation of why the problem can't be solved yet. Each run leaves the forge smarter than the last.

Why build my own harness? We are just at the start of the agentic era. There are a lot of interesting ideas being proposed in the agentic orchestration space, and ultimately there isn't going to be a single one that we land on. What I wanted to do was go deep on one interesting architecture and build the harness from the ground up, learning how it manages memory, deals with failures, leverages tool use, and decides when to persist versus when to escalate. The end goal was not just a fully functional harness; it was the journey. Taking the time to build something that works end to end, using the tools available to any individual, is a great way to learn new concepts and understand the landscape as a whole.

Lastly, I designed the harness as an extensible skill system — a skill-agnostic core with abstract interfaces that any use case can implement. DISA STIG remediation on Rocky Linux 9 is the anchor use case because it pushes every part of the architecture — persistence across many retries, real side effects on a live system, a deterministic evaluator with no ambiguity, and the need for safe revert when things go wrong. Any individual fix across its 270 rules can break SSH, sudo, or the mission application.
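The "safe revert" requirement can be sketched as a snapshot taken before each change and rolled back if anything goes wrong. The sketch below assumes the target is a libvirt VM managed with `virsh` (the infrastructure layer lists virsh snapshots); the domain name, snapshot name, and the `guarded_change` helper are hypothetical, and the demo uses a fake runner so it does not need a real hypervisor.

```python
import subprocess
from contextlib import contextmanager


def snapshot_cmd(domain: str, name: str) -> list[str]:
    """Command line that creates a named snapshot of a libvirt domain."""
    return ["virsh", "snapshot-create-as", domain, name]


def revert_cmd(domain: str, name: str) -> list[str]:
    """Command line that rolls the domain back to a named snapshot."""
    return ["virsh", "snapshot-revert", domain, name]


@contextmanager
def guarded_change(domain, name, run=subprocess.run):
    """Snapshot the VM first, then revert if the wrapped block raises."""
    run(snapshot_cmd(domain, name), check=True)
    try:
        yield
    except Exception:
        run(revert_cmd(domain, name), check=True)
        raise


# Demo with a fake runner (hypothetical) that only records the commands.
calls = []


def fake_run(cmd, check=True):
    calls.append(cmd)


try:
    with guarded_change("target-vm", "pre-fix", run=fake_run):
        raise RuntimeError("health check failed: SSH unreachable")
except RuntimeError:
    pass

print(calls)
```

The exception is re-raised after the revert so the caller still sees the failure and can record it, rather than having the rollback hide it.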

To validate the skill-agnostic thesis, I added CVE Response as a second skill — autonomous advisory remediation driven by Vuls (scan) and dnf advisory (apply), with per-package-family reboot batching and snapshot rollback per family. The two skills run on the same harness, the same Gemma 4 deployment, and the same four-agent reflexion loop. The harness itself doesn't know which workflow it's doing: it processes work items through interfaces, and adding a new skill is a folder and five small Python classes.
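As a rough illustration of what "processes work items through interfaces" can look like, here is a sketch with one abstract class and a toy skill. The names (`WorkItem`, `Skill`, `run_skill`) are assumptions, and the five classes mentioned above are collapsed into a single interface for brevity; the real class layout is described in the skill authoring guide, not here.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    """One unit of work, e.g. a STIG rule or a CVE advisory (hypothetical shape)."""
    item_id: str
    description: str


class Skill(ABC):
    """Hypothetical skill interface the harness core depends on."""

    @abstractmethod
    def list_work_items(self) -> list[WorkItem]: ...

    @abstractmethod
    def apply(self, item: WorkItem) -> None: ...

    @abstractmethod
    def evaluate(self, item: WorkItem) -> bool: ...


def run_skill(skill: Skill) -> dict[str, bool]:
    """Skill-agnostic driver: it never inspects what kind of work it is doing."""
    results = {}
    for item in skill.list_work_items():
        skill.apply(item)
        results[item.item_id] = skill.evaluate(item)
    return results


class ToySkill(Skill):
    """Stand-in skill whose fix for 'rule-2' deliberately does not take."""

    def __init__(self):
        self.applied = set()

    def list_work_items(self):
        return [WorkItem("rule-1", "toy rule"), WorkItem("rule-2", "toy rule")]

    def apply(self, item):
        if item.item_id != "rule-2":
            self.applied.add(item.item_id)

    def evaluate(self, item):
        return item.item_id in self.applied


print(run_skill(ToySkill()))  # {'rule-1': True, 'rule-2': False}
```

Because `run_skill` only touches the abstract methods, swapping in a different skill requires no change to the driver.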


See it run

The harness live, during a STIG remediation run — Architect picks a rule, Worker applies the fix, Evaluator scans, Reflector distills a lesson on failure, repeat:

Cross-run memory working in the background — V2 tip retrieval, rule-prefix similarity firing, per-(tip, rule) utility updating as evaluations land:
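One plausible way to keep a per-(tip, rule) utility is a smoothed success rate that is updated each time an evaluation lands. This is a guess at the general shape, not the V2 implementation: the class name, the smoothing choice, and the ranking method are assumptions, and rule-prefix similarity is not modeled.

```python
class TipUtility:
    """Smoothed success rate per (tip, rule) pair; all names are hypothetical."""

    def __init__(self):
        self.stats = {}  # (tip, rule) -> (successes, trials)

    def record(self, tip: str, rule: str, passed: bool) -> None:
        """Update the counts when an evaluation result arrives."""
        wins, trials = self.stats.get((tip, rule), (0, 0))
        self.stats[(tip, rule)] = (wins + int(passed), trials + 1)

    def utility(self, tip: str, rule: str) -> float:
        """Laplace-smoothed success rate; unseen pairs score 0.5."""
        wins, trials = self.stats.get((tip, rule), (0, 0))
        return (wins + 1) / (trials + 2)

    def rank(self, tips: list[str], rule: str) -> list[str]:
        """Order candidate tips for a rule, most useful first."""
        return sorted(tips, key=lambda t: self.utility(t, rule), reverse=True)


memory = TipUtility()
memory.record("t1", "r1", True)
memory.record("t1", "r1", True)
memory.record("t2", "r1", False)
print(memory.rank(["t2", "t3", "t1"], "r1"))  # ['t1', 't3', 't2']
```

The smoothing keeps a tip with one lucky success from outranking a tip with a long track record, and gives never-tried tips a neutral score.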

Why all this documentation?

I built this project using an agentic coding workflow — a human and an AI coding partner building together at speed. Beyond sharing the source code, I wanted to capture the full process: the insights, the gotchas, the dead ends, and the moments where something finally clicked. Originally the notes were just for my own learning, but looking back at them, I think there's real value in making them public.

For this project, I had my agentic coding partner record the critical insights, decisions, successes, and failures in a journal as they happened. For me, this effort was as much about the journey as the destination. So if you have time, explore the journal entries. Every failure mode is documented. Every pivot is explained. Every architectural decision has an entry showing what was tried, what broke, and what I landed on instead. If you haven't yet tried building your own project with an agentic coding system, I hope this gives you some insight into the process and encourages you to try. It's one of the most engaging and rewarding ways to learn: the velocity is real, the collaboration is genuine, and the results will surprise you.

I hope what I learned helps other presales engineers, SI partners, and technical evaluators build similar systems faster on their own hardware.


Explore the site

  • Architecture Brief


    The one-document overview. Covers the model, the harness, the hardware, the results, and the reading guide. Start here if you have 10 minutes.

    Read the brief

  • Architecture


    The 5-layer enterprise AI stack map with gemma-forge's components at each layer, industry alternatives (open-source and enterprise), and the six failure modes in reflexive agent harnesses.

    View the architecture

  • Journey


    Chronological field notes of how this was built. Honest, specific, and written as I went — failures included. Start at the origin, jump to the overnight run that changed everything, or skip to the CVE pivot and per-family reboot batching for the latest work.

    Read the journey

  • Improvements


    Engineering specs for each architectural fix — the v3 and v5 harness improvements, each with problem statement, mechanism, and verification criteria.

    View improvements

  • Gotchas


    Atomic "X breaks Y because Z" lessons that cost hours to discover. If you're building something similar, start here to save yourself the pain.

    Browse the gotchas

  • Reference


    ADRs for every non-obvious technical choice, plus the skill authoring guide for adding your own use case to the harness.

    View reference


The 5-Layer Enterprise AI Stack

⑤ Layer 5 — Application

STIG Remediation Skill · CVE Response Skill · gemma-forge Dashboard · This Documentation Site

Where the user sees results. Two skills ship today — STIG hardening and CVE remediation — running on the same harness. Adding a third is a folder and five Python classes.

④ Layer 4 — Orchestration

Ralph Loop Harness · Google ADK · Skills System · Cross-run Memory (Postgres + Neo4j/Graphiti) · V2 Structured Tips

Where agents reason, reflect, and persist. The harness makes structural decisions; the model makes reasoning decisions.

③ Layer 3 — Model

Gemma 4 31B Dense bf16 · vLLM 0.19.0 · Tensor Parallel = 4

Where inference happens. Full precision across all four GPUs, ~14 tok/s sustained, no NVLink required.
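A quick back-of-the-envelope check shows why full-precision weights fit across the four cards. It reads "31B" as 31 billion parameters, counts weights only, and assumes an even split under tensor parallelism; KV cache, activations, and runtime overhead are ignored, so real headroom is smaller.

```python
PARAMS = 31e9          # "31B" read as 31 billion parameters (assumption)
BYTES_PER_PARAM = 2    # bf16 stores each parameter in 16 bits
GPUS = 4               # tensor parallel degree
GPU_MEM_GB = 24        # memory per NVIDIA L4

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # total weight footprint
per_gpu_gb = weights_gb / GPUS                # even shard per GPU (idealized)
headroom_gb = GPU_MEM_GB - per_gpu_gb         # left for KV cache, activations

print(weights_gb, per_gpu_gb, headroom_gb)    # 62.0 15.5 8.5
```

The weights alone (about 62 GB) exceed any single 24 GB card, which is why the model is sharded across all four.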

② Layer 2 — Platform / MLOps

OpenTelemetry · Jaeger · Prometheus · Grafana · Structured JSONL Run Logger

Where you observe and measure. Federal-credible standards, no vendor lock-in.

① Layer 1 — Infrastructure

Dell PowerEdge XR7620 · 4x NVIDIA L4 24 GB · libvirt + virsh snapshots · Rocky Linux 9

The foundation. A rugged 2U short-depth edge server — no cloud, air-gappable, built for the tactical edge.

View the full architecture with industry alternatives at each layer →


Who this is for

  • Dell presales engineers and SEs who need to understand edge AI well enough to have credible technical conversations with customers.
  • Federal technical evaluators looking at what real-world agentic deployment looks like — including the parts that don't work.
  • SI partners and reseller teams who want reference material to build their own demos and solutions.
  • Engineers building agent harnesses — the failure modes piece is deliberately project-agnostic and applies to any reflexion-loop system.

Personal Exploration

This is a personal project by Ken Rollins. It is not a Dell product, reference architecture, or supported offering. Read the full disclaimer.