GemmaForge

An exploration of Ralph loop architecture and Gemma 4 at the edge — building your own agentic harness, from scratch.

By Ken Rollins, Chief AI Technology Strategist at Dell Federal.


What this is

When Google released Gemma 4 in April 2026 with native function calling and Day-0 vLLM support, I saw an opportunity to explore a question that had been nagging me: can a smaller open-weights model at the tactical edge solve real problems autonomously if you give it the right harness?

Not by throwing a bigger model at it. Not by calling a cloud API. By combining two architectures I hadn't really seen used together before: Ralph loop persistence — where an agent doesn't quit when it fails but keeps grinding, using external state to persist across context boundaries — with Reflexion-style self-improvement, where each failure produces a self-critique that makes the next attempt smarter. I wanted to build that combined harness from scratch, understand every design decision firsthand, and run it on a Dell PowerEdge XR7620 with four NVIDIA L4 GPUs. No cloud dependency. No internet required. Everything local.

Why "GemmaForge" as a project name? Gemma, obviously, because this is built around Google's Gemma 4 model. And Forge because of what the system represents — a controlled environment where raw material gets heated, shaped, and refined through repeated cycles until it becomes something useful. Raw model output goes in; the reflexion loop hammers it against a deterministic evaluator; failures get reflected on and fed back; and what comes out is a refined solution, or an honest explanation of why the problem can't be solved yet. Each run leaves the forge smarter than the last.
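The cycle described above — attempt, evaluate, reflect, persist, retry — can be sketched in a few lines. This is an illustrative toy, not GemmaForge's actual code: the names (`ralph_loop`, `STATE_FILE`, the callback signatures) are hypothetical, and the real system calls a model where this sketch calls a plain function.

```python
import json
from pathlib import Path

# Illustrative sketch of the attempt/evaluate/reflect cycle. All names here
# are hypothetical; the real harness invokes a model, not a plain callback.

STATE_FILE = Path("forge_state.json")  # external state survives context resets

def load_state():
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"attempt": 0, "reflections": []}

def save_state(state):
    STATE_FILE.write_text(json.dumps(state, indent=2))

def ralph_loop(generate, evaluate, reflect, max_attempts=10):
    """Keep grinding: each failure yields a critique fed into the next attempt."""
    state = load_state()
    while state["attempt"] < max_attempts:
        state["attempt"] += 1
        candidate = generate(state["reflections"])  # model call in the real system
        ok, feedback = evaluate(candidate)          # deterministic evaluator
        if ok:
            save_state(state)
            return candidate
        # A failure produces a self-critique that makes the next attempt smarter.
        state["reflections"].append(reflect(candidate, feedback))
        save_state(state)                           # persist across context boundaries
    return None  # honest failure after exhausting the attempt budget
```

The two architectural ideas live in different places: Ralph-loop persistence is the `while` plus the on-disk state file, and Reflexion is the `reflections` list that each new generation gets to read.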

Why build my own harness? Agent harnesses have become a central topic in AI architecture recently — there's a growing recognition that the orchestration layer around a model matters as much as the model itself. How the harness manages memory, handles failures, controls tool use, decides when to persist versus when to escalate: these are the engineering decisions that separate a demo from a deployable system. I wanted to understand those decisions by making them myself, not by inheriting them from a framework.

Lastly, I designed the harness as an extensible skill system — a skill-agnostic core with abstract interfaces that any use case can implement. To stress-test what the architecture could handle, I needed a use case that would push every part of it: persistence across many retries, real side effects on a live system, a deterministic evaluator with no ambiguity, and the need for safe revert when things go wrong. DISA STIG remediation on Rocky Linux 9 turned out to be a perfect fit — hardening a live VM against 270 security rules, where any individual fix can break SSH, sudo, or the mission application, exercises the harness in ways that a text-generation task never would. But the harness itself doesn't know it's doing STIG. It processes work items through interfaces, and adding a new skill is a folder and five small Python classes.
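To make the "folder and five small Python classes" claim concrete, here is one plausible shape for a skill-agnostic core. These interface names are my invention for illustration — the actual GemmaForge interfaces aren't shown on this page and may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

# Hypothetical sketch of a skill-agnostic harness core. The class names are
# illustrative only; GemmaForge's real interfaces may be named differently.

@dataclass
class WorkItem:
    item_id: str          # e.g. a STIG rule ID, or any other unit of work
    payload: dict = field(default_factory=dict)

class WorkProvider(ABC):
    """Enumerates the work items for a skill (e.g. the 270 STIG rules)."""
    @abstractmethod
    def work_items(self) -> list[WorkItem]: ...

class Applier(ABC):
    """Applies a candidate solution, with real side effects on a live system."""
    @abstractmethod
    def apply(self, item: WorkItem, solution: str) -> None: ...

class Evaluator(ABC):
    """Deterministic pass/fail check with no ambiguity."""
    @abstractmethod
    def passes(self, item: WorkItem) -> bool: ...

class Reverter(ABC):
    """Safe revert when an applied change breaks something."""
    @abstractmethod
    def revert(self, item: WorkItem) -> None: ...

class Reporter(ABC):
    """Records the outcome for cross-run memory and dashboards."""
    @abstractmethod
    def report(self, item: WorkItem, passed: bool) -> None: ...
```

The point of the shape is that the harness loop only ever talks to these abstractions, so a new use case is a folder of concrete subclasses and nothing in the core changes.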

Why all this documentation?

I built this project using an agentic coding workflow — a human and an AI coding partner building together at speed. Beyond sharing the source code, I wanted to capture the full process: the insights, the gotchas, the dead ends, and the moments where something finally clicked. Originally the notes were just for my own learning, but looking back at them, I think there's real value in making them public.

For this project, I had my agentic coding partner capture the critical insights, decisions, successes, and failures in a journal as they happened. This effort was as much about the journey as the destination. So if you have time, explore the journal entries. Every failure mode is documented. Every pivot is explained. Every architectural decision has an entry showing what was tried, what broke, and what I landed on instead. If you haven't yet tried building your own project with an agentic coding system, I hope this gives you some insight into the process and encourages you to try. It's one of the most engaging and rewarding ways to learn — the velocity is real, the collaboration is genuine, and the results will surprise you.

I hope what I learned helps other presales engineers, SI partners, and technical evaluators build similar systems faster on their own hardware.


Explore the site

  • Architecture Brief


    The one-document overview. Covers the model, the harness, the hardware, the results, and the reading guide. Start here if you have 10 minutes.

    Read the brief

  • Architecture


The 5-layer enterprise AI stack map with GemmaForge's components at each layer, industry alternatives (open-source and enterprise), and the six failure modes in reflexion-style agent harnesses.

    View the architecture

  • Journey


    22 chronological field notes of how this was built. Honest, specific, and written as I went — failures included. Start at the origin or jump to the overnight run that changed everything.

    Read the journey

  • Improvements


    Engineering specs for each architectural fix — the v3 and v5 harness improvements, each with problem statement, mechanism, and verification criteria.

    View improvements

  • Gotchas


    13 atomic "X breaks Y because Z" lessons that cost hours to discover. If you're building something similar, start here to save yourself the pain.

    Browse the gotchas

  • Reference


    ADRs for every non-obvious technical choice, plus the skill authoring guide for adding your own use case to the harness.

    View reference


The 5-Layer Enterprise AI Stack

⑤ Layer 5 — Application

STIG Remediation Skill · GemmaForge Dashboard · This Documentation Site

Where the user sees results. Skills are pluggable — STIG is the first, not the only.

④ Layer 4 — Orchestration

Ralph Loop Harness · Google ADK · Skills System · Cross-run SQLite Memory · Adaptive Concurrency Clutch

Where agents reason, reflect, and persist. The harness makes structural decisions; the model makes reasoning decisions.

③ Layer 3 — Model

Gemma 4 31B Dense bf16 · vLLM 0.19.0 · Tensor Parallel = 4

Where inference happens. Unquantized bf16 weights sharded across all four GPUs, ~14 tok/s sustained, no NVLink required.

② Layer 2 — Platform / MLOps

OpenTelemetry · Jaeger · Prometheus · Grafana · Structured JSONL Run Logger

Where you observe and measure. Federal-credible standards, no vendor lock-in.

① Layer 1 — Infrastructure

Dell PowerEdge XR7620 · 4x NVIDIA L4 24 GB · libvirt + virsh snapshots · Rocky Linux 9

The foundation. A rugged 2U short-depth edge server — no cloud, air-gappable, built for the tactical edge.
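The libvirt/virsh snapshots at Layer 1 are what make "safe revert when things go wrong" possible: checkpoint the VM before each risky fix, and roll back if the health check fails. The `virsh snapshot-create-as` and `snapshot-revert` subcommands are real libvirt CLI; the wrapper below is an untested illustrative sketch, and the function names and health-check hook are my own.

```python
# Sketch of the snapshot-guarded apply pattern. The virsh subcommands are
# real libvirt CLI; this wrapper is illustrative, not GemmaForge's code.

def snapshot_cmd(domain: str, name: str) -> list[str]:
    """Command line to checkpoint a VM before a risky change."""
    return ["virsh", "snapshot-create-as", domain, name]

def revert_cmd(domain: str, name: str) -> list[str]:
    """Command line to roll the VM back to a named snapshot."""
    return ["virsh", "snapshot-revert", domain, name]

def guarded_apply(domain, snap_name, apply_fix, healthy, run):
    """Snapshot, apply the fix, verify health; revert on any failure.

    `run` executes a command list (e.g. subprocess.run in production);
    `healthy` is the post-fix check (SSH still up, sudo still works, etc.).
    """
    run(snapshot_cmd(domain, snap_name))
    try:
        apply_fix()
        if not healthy():
            raise RuntimeError("post-fix health check failed")
    except Exception:
        run(revert_cmd(domain, snap_name))  # restore the pre-fix state
        raise
```

Injecting `run` keeps the pattern testable without a hypervisor; in production it would shell out via `subprocess.run` against the actual Rocky Linux 9 guest.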

View the full architecture with industry alternatives at each layer →


Who this is for

  • Dell presales engineers and SEs who need to understand edge AI well enough to have credible technical conversations with customers.
  • Federal technical evaluators looking at what real-world agentic deployment looks like — including the parts that don't work.
  • SI partners and reseller teams who want reference material to build their own demos and solutions.
  • Engineers building agent harnesses — the failure modes piece is deliberately project-agnostic and applies to any reflexion-loop system.

Personal Exploration

This is a personal project by Ken Rollins. It is not a Dell product, reference architecture, or supported offering. Read the full disclaimer.