System Architecture¶
This architecture was not designed top-down. It emerged from building the system, running it, watching it fail, and fixing what broke. The layer map below reflects where we ended up after 23 journal entries of iteration — not where we planned to be on day one. Every component choice has a story behind it, and most of those stories involve trying something else first.
An overview of GemmaForge as a system, mapped onto the 5-Layer Enterprise AI Partner Map. Each layer block shows (1) industry examples — both open-source and enterprise-grade — so readers can see what alternatives exist at each layer, (2) the components GemmaForge uses at that layer, and (3) the architectural patterns that live primarily at that layer with links to their deep treatments.
The patterns below apply regardless of which vendor or open-source alternative you choose — they are properties of the layer, not of any specific implementation. That is the transfer value of this project: the ideas travel, even when the tool choices don't.
How to read this page¶
- Layer tells you where in the stack a thing sits. It answers "what kind of component is this?"
- Pattern tells you what reusable design idea the layer
demonstrates. Patterns are the "click-down" concepts within each layer
and each one has its own dedicated page under
architecture/. - Industry lists common alternatives so a reader can map the implementation back to their own environment. If you can't use what we used, the entries here are where you'd look instead.
- Components lists what GemmaForge actually runs at that layer.
5 — Application¶
Vertical SaaS AI, end-user applications, domain-specific solutions.
Industry examples - Harvey — legal research and drafting (enterprise) - Veeva AI — pharmaceutical and life sciences (enterprise) - Glean — enterprise search and knowledge assistants (enterprise) - Open WebUI — self-hosted chat-style application (open source)
GemmaForge components - STIG Remediation Skill — declarative skill manifest plus per-role prompts that instruct the harness how to approach DISA STIG compliance on a Rocky Linux 9 target. This is the end-user mission the project solves for. - GemmaForge Dashboard — Next.js live and replay UI that renders the Ralph loop event stream in real time, with a pipeline view, a current- step panel, and a scrollable event log. - GemmaForge Journal Site — this static site, served by GitHub Pages, built from the project's engineering notes.
Patterns at this layer - (None declared yet.) L5 content in GemmaForge is primarily implementation, not architectural patterns. When a second skill is added in the future and a "skill-authoring pattern" emerges, it will live here.
4 — Orchestration¶
RAG pipelines, agents, vector databases, LLM frameworks.
This is where the Ralph loop itself lives, and where most of GemmaForge's interesting architecture is concentrated. The four key patterns below all live at this layer.
Industry examples - LangChain / LangGraph — the most comprehensive OSS ecosystem with a strong commercial tier; LangGraph is the recommended agent surface for any workflow that needs loops, conditionals, or state persistence (open source + commercial) - Microsoft Agent Framework — the merged successor to AutoGen and Semantic Kernel, GA targeted for early 2026 (open source core, enterprise support) - LlamaIndex — data-centric retrieval and indexing, strong for RAG-heavy workloads (open source + commercial) - CrewAI — role-based multi-agent collaboration, good for team- oriented workflows (open source + commercial) - Google ADK — the agent development kit used in this project (open source, Apache 2.0) - Vector stores — Pinecone, Qdrant, Weaviate, Milvus (mixed OSS and enterprise; GemmaForge does not currently use a vector store)
GemmaForge components
- Google ADK (Agent Development Kit) — pre-1.0 but stable enough for
LoopAgent and FunctionTool use. Provides the agent turn abstraction
and the tool-calling machinery.
- Ralph Loop Harness — the project's own implementation in
gemma_forge/harness/ralph.py. Wraps ADK with an outer reflexion loop,
per-rule retry logic, wall-clock time budgeting, architect re-engagement,
and integrated diagnostics.
- Four agent roles — Architect (plans), Worker (executes), Reflector
(analyzes failures), and Eval (deterministic, non-LLM verdicts). All
three LLM roles currently run on Gemma 4 31B bf16.
- Skills system — skills/*/skill.yaml plus prompts; the harness
loads and runs any skill that provides the expected role prompts and
tool manifest. STIG remediation is the first; others are scaffolded.
- Episodic + semantic memory — per-rule distilled lessons plus
cross-rule banned patterns and strategic lessons, all token-budgeted
and assembled by an explicit prompt assembler.
- Run logger — structured JSONL event stream, one record per event,
suitable for replay and post-run analysis.
Patterns at this layer
- reflexion-loop — persistence, retry-with-learning, plateau
detection, architect re-engagement. See
01-reflexive-agent-harness-failure-modes.
- tool-calling — agent action budgets, agent-turn discipline, the
model↔tool contract. Worker caps at one tool call per turn by default.
- context-management — deterministic token-budget-aware prompt
assembly, distilled episodic memory, capped semantic memory.
- snapshot-revert — the decision layer for target recovery. Diagnoses
first, then restores a libvirt snapshot. The mechanism lives in L1;
the policy lives here.
3 — Model¶
Foundation models, LLMs, specialized AI models, and the inference engines that run them.
Industry examples - Gemma 4 family — Google's open-weights line, used in this project (open source, Apache 2.0) - Llama 3.x — Meta's open-weights flagship, dominant in enterprise open-source deployments (open source, Llama license) - Mistral / Mixtral — European open-weights with strong cost/performance, including a commercial tier (open source + commercial) - GPT-5, Claude, Gemini API — frontier proprietary models from OpenAI, Anthropic, and Google (enterprise, proprietary) - Phi-3, Qwen, DeepSeek — leading small-language-models suited for edge deployment (open source)
Inference engines - vLLM — the high-throughput OSS inference engine used in this project (open source, Apache 2.0) - NVIDIA Triton Inference Server — NVIDIA's production serving framework, currently awaiting a vLLM-backend version gap before it can run Gemma 4 (open source, NVIDIA-supported) - TensorRT-LLM — NVIDIA's lowest-latency runtime for Blackwell and Ada-class GPUs (open source + NVIDIA-enterprise) - NVIDIA NIM — NVIDIA's microservices packaging of the above, license-gated (enterprise)
GemmaForge components
- Gemma 4 31B bf16 — the sole LLM used for Architect, Worker, and
Reflector roles. Full precision, no quantization.
- vLLM 0.19.0 — OpenAI-compatible REST API, direct calls (no proxy,
no LiteLLM per the supply-chain decision in
journey/03-observability).
- gemma-forge/vllm:latest — a small custom image derived from
vllm/vllm-openai with transformers>=4.58 baked in to recognize the
gemma4 model type.
- Tensor-parallel 4-way — the 31B model split across all 4 NVIDIA
L4 GPUs with --tensor-parallel-size 4 --dtype bfloat16 --enforce-eager.
Patterns at this layer
- parallelism — TP/PP choice, NVLink vs PCIe, multi-GPU bandwidth.
See journey/10-the-parallelism-maze.
- quantization — NVFP4 vs bf16 tradeoffs, the VRAM math, when
quantization helps versus hurts throughput. See
journey/02-model-strategy and
gotchas/nvfp4-vram-math.
2 — Platform / MLOps¶
Training pipelines, experiment tracking, model monitoring, feature stores, and LLM observability.
Industry examples - Langfuse — self-hosted LLM observability, prompt management, and evaluations; 21k+ GitHub stars, MIT license (open source + commercial cloud) - Arize AI — enterprise ML and LLM observability; used at production scale by Uber, PepsiCo, Tripadvisor (enterprise) - Datadog LLM Observability — extends an existing Datadog footprint with LLM-specific tracing (enterprise) - OpenTelemetry + Jaeger + Prometheus + Grafana — standards-based observability, used in this project (open source, CNCF) - MLflow / Weights & Biases — experiment tracking and model registry (open source / enterprise tiers)
GemmaForge components
- OpenTelemetry collector — ingests spans, metrics, and logs from the
harness and the vLLM services
- Jaeger — distributed tracing backend; the run-by-run trace view
- Prometheus — metrics TSDB; GPU telemetry, run rates, token counts
- Grafana — dashboards and alerts over the above
- Structured run logger — JSONL event stream in runs/run-*.jsonl;
the authoritative per-event record, independent of the OTel stack
- bin/forge — lifecycle management script for vLLM, FastAPI, and
the Ralph loop process; single-command up/down/status
Patterns at this layer - (None declared yet.) Observability at this layer is mostly infrastructure and tool selection rather than reusable design patterns. When a future pattern — e.g., "how to instrument a reflexive agent harness for post-hoc replay" — crystallizes, it lives here.
1 — Data / Infrastructure¶
Storage, data lakes, compute infrastructure, hypervisors, and the hardware that underlies everything above.
Industry examples - Snowflake — cloud data warehouse (enterprise) - Databricks — unified data and AI platform (enterprise) - Weka — high-performance AI storage fabric (enterprise) - MinIO / Ceph — S3-compatible object storage (open source) - PostgreSQL + pgvector — relational DB with vector extension (open source) - Proxmox VE / libvirt + KVM — open-source virtualization (open source) - VMware vSphere — enterprise virtualization (enterprise)
GemmaForge components
- Dell PowerEdge XR7620 — the reference host. 2× Intel Xeon Gold
6442Y (96 threads), 256 GB RAM, 4× NVIDIA L4 (24 GB each), no NVLink.
The XR7620 is the lab environment; the techniques apply to any
comparable platform.
- Ubuntu 24.04 — the host OS
- Docker — used for the vLLM container serving, alongside existing
production Docker workloads on the same box
- libvirt + KVM — the target VM virtualization layer
- Rocky Linux 9 — the target VM, a binary-compatible RHEL 9
stand-in for development. The same playbook drops into a real RHEL 9
fleet.
- OpenTofu + dmacvicar/libvirt — target VM infrastructure as code,
Linux Foundation governance, Apache 2.0
- libvirt internal snapshots — the authoritative recovery mechanism
(baseline and rolling progress), the mechanism half of the
snapshot-revert pattern declared at L4
- /data/gemma-forge/ — host storage layout for VM state, weights,
and logs, isolated from other workloads on the same box
Patterns at this layer - (None declared yet.) Infrastructure in this project is deliberately unexciting — the target VM is Rocky 9, the hypervisor is libvirt, the IaC is OpenTofu. No new patterns discovered here because the goal was to use boring, well-understood components so the interesting work could happen at L3 and L4.
Cross-layer patterns¶
Some patterns legitimately span multiple layers. The table below maps each pattern to its primary layer (where most of its logic lives) and any secondary layers it touches.
| Pattern | Primary | Secondary | Notes |
|---|---|---|---|
reflexion-loop |
L4-orchestration | — | The outer retry loop, architect re-engagement, plateau detection. |
tool-calling |
L4-orchestration | L3-model | The harness defines the contract, but the model's native tool-call support determines what's possible. |
context-management |
L4-orchestration | L3-model | The harness assembles and budgets the prompt; the inference engine enforces the window. |
snapshot-revert |
L4-orchestration | L1-data-infrastructure | Decision policy at L4, mechanism at L1. |
parallelism |
L3-model | — | TP/PP, multi-GPU, bandwidth. Choice is driven by model architecture. |
quantization |
L3-model | — | Precision vs memory tradeoffs, VRAM math, throughput impact. |
Further reading¶
For the architectural contribution that came out of this project — a
taxonomy of six failure modes in reflexive agent harnesses with
prescribed harness mechanisms for each — see
01-reflexive-agent-harness-failure-modes.
For the narrative of how each layer was built, pick a layer from above and follow the journey entries it links to. The journey entries are first-person field notes with enough detail to reproduce the decisions.
For the gotchas — the small, atomic "X breaks Y because Z" lessons that
cost hours to discover — see the gotchas/ directory. Each one is
scoped to a single layer and tagged accordingly.