ADR-0007: OpenTelemetry + Jaeger + Prometheus + Grafana for observability (no Langfuse)
- Status: Accepted
- Date: 2026-04-09
- Deciders: Ken Rollins
- Related: ADR-0014
Context
The Ralph loop's value depends on its audit trail. Every prompt, every completion, every tool invocation, and every revert event has to be captured, queryable, and presentable to a Federal evaluator who wants to answer questions like:
- "Show me everything Architect said in step 4 of run abc123."
- "How many tokens did the Worker model burn this month, broken down by skill?"
- "How long did the Auditor take to validate the mission app on each iteration of the last run?"
- "Did the harness ever exceed our token budget for a single run?"
We also have two non-negotiable constraints:
- Federal-credible. The observability stack has to be the same stack a Federal evaluator already runs. Anything that requires them to learn a new product is a credibility tax.
- Air-gappable, locally-hostable, no SaaS. Same constraint as ADR-0001 / ADR-0014: zero phone-home, zero vendor activation, runs 100% on the XR7620 with the network cable pulled.
The original PRD specified Langfuse for LLM observability, and an
earlier draft of this decision (in docs/adr/0007-* planning) had
"OpenTelemetry primary, Langfuse secondary" with Langfuse there for
its LLM-native UI. Two facts changed the calculus:
- The OpenTelemetry GenAI semantic conventions (gen_ai.* attributes and events) are ratified and widely adopted as of 2025–2026. Token usage, prompt/completion content, model identity, and agent/workflow context are all first-class span attributes / span events under the official OpenTelemetry spec. Anything Langfuse exposes about an LLM call has a vendor-neutral equivalent in OTel.
- Langfuse has had documented security issues that make it a problematic dependency for a Federal-leaning reference build, and the host operator (Ken) is independently considering migrating off Langfuse for unrelated workloads. Building GemmaForge to depend on a product the host is moving away from is a strategic mistake regardless of whether the security issues are user-impacting today.
Decision
GemmaForge adopts an OpenTelemetry-pure observability stack with no
Langfuse dependency. The full stack runs locally inside the GemmaForge
docker-compose.yml:
| Layer | Component | Role |
|---|---|---|
| Instrumentation | OpenTelemetry Python SDK in gemma_forge.observability | Emits OTLP spans and metrics from the harness using OpenTelemetry GenAI semantic conventions (gen_ai.* attributes and events) |
| Collector | otel/opentelemetry-collector-contrib | Receives OTLP from the harness; fans out to Jaeger (traces) and Prometheus (metrics); extracts gen_ai.usage.* attributes into Prometheus counters |
| Trace storage + UI | Jaeger v2 (OTLP-native) | Trace storage and the human-readable trace browser. Each LLM call's prompt and completion are visible inline as span events. |
| Metrics storage | Prometheus v3 | Stores time-series metrics including token counters per {model, role, skill, run_id} |
| Dashboards | Grafana v11 | Token-accounting dashboards, latency views, mission-app uptime, GPU memory, all via PromQL against Prometheus and trace queries against Jaeger |
Token accounting via OTel GenAI semantic conventions
Every LLM call from the harness emits an OTel span tagged using the ratified OpenTelemetry GenAI semantic conventions:
    span: gemma_forge.architect.generate
      attributes:
        gen_ai.system                  = "vllm"
        gen_ai.request.model           = "gemma4-31b-it"
        gen_ai.request.max_tokens      = 4096
        gen_ai.response.id             = "..."
        gen_ai.response.model          = "gemma4-31b-it"
        gen_ai.response.finish_reasons = ["stop"]
        gen_ai.usage.input_tokens      = 1234
        gen_ai.usage.output_tokens     = 567
        forge.skill                    = "stig-rhel9"
        forge.role                     = "architect"
        forge.run_id                   = "run-abc123"
        forge.iteration                = 4
      events:
        gen_ai.prompt:     <full system prompt + user message>
        gen_ai.completion: <full model response, including any <thought> tokens>
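As a minimal sketch of how the harness might assemble that span payload before handing it to the OpenTelemetry SDK, the helper below builds the name, attributes, and events as plain Python data. The function name `build_genai_span`, the shape of the `response` dict, and the `content` key on the events are assumptions made for illustration; they are not taken from gemma_forge.observability. The attribute names are copied from the example above and should be checked against the semantic-conventions version the project pins.

```python
from typing import Any


def build_genai_span(role: str, skill: str, run_id: str, iteration: int,
                     request_model: str, max_tokens: int,
                     response: dict[str, Any], prompt: str) -> dict[str, Any]:
    """Assemble span name, attributes, and events for one LLM call.

    `response` is a hypothetical dict with the keys used below; a real
    harness would map these from the vLLM client's response object.
    """
    attributes = {
        "gen_ai.system": "vllm",
        "gen_ai.request.model": request_model,
        "gen_ai.request.max_tokens": max_tokens,
        "gen_ai.response.id": response["id"],
        "gen_ai.response.model": response["model"],
        "gen_ai.response.finish_reasons": list(response["finish_reasons"]),
        "gen_ai.usage.input_tokens": response["input_tokens"],
        "gen_ai.usage.output_tokens": response["output_tokens"],
        # forge.* attributes are project-specific, as in the example above.
        "forge.skill": skill,
        "forge.role": role,
        "forge.run_id": run_id,
        "forge.iteration": iteration,
    }
    # Event payload key "content" is an assumption for this sketch.
    events = [
        ("gen_ai.prompt", {"content": prompt}),
        ("gen_ai.completion", {"content": response["text"]}),
    ]
    return {
        "name": f"gemma_forge.{role}.generate",
        "attributes": attributes,
        "events": events,
    }
```

With the OpenTelemetry Python SDK, the returned values would typically be passed to `tracer.start_as_current_span(name, attributes=...)` and then to `span.add_event(event_name, attributes=...)` for each event. Keeping the payload assembly separate from the SDK calls makes the attribute mapping testable without a running collector.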
The OTel collector forwards the spans whole to Jaeger (so a human can
read the prompt/completion content for any selected call) and extracts
the gen_ai.usage.* attributes into Prometheus counters with labels:
- gen_ai_input_tokens_total{model, role, skill, run_id}
- gen_ai_output_tokens_total{model, role, skill, run_id}
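To make the span-to-counter step concrete, here is a small illustration of the aggregation the collector is expected to perform. It is not the collector's implementation or configuration, only the arithmetic: token usage is summed per {model, role, skill, run_id} label set. The helper names and the choice of gen_ai.response.model as the source of the model label are assumptions for this sketch.

```python
from collections import Counter
from typing import Any, Iterable

# Label tuple order mirrors {model, role, skill, run_id}.
LabelKey = tuple[str, str, str, str]


def label_key(attrs: dict[str, Any]) -> LabelKey:
    """Derive the counter labels from one span's attributes."""
    return (
        attrs["gen_ai.response.model"],  # assumption: label uses the response model
        attrs["forge.role"],
        attrs["forge.skill"],
        attrs["forge.run_id"],
    )


def accumulate_token_counters(
    spans: Iterable[dict[str, Any]],
) -> tuple[Counter, Counter]:
    """Sum input and output tokens per label set, like two monotonic counters."""
    input_total: Counter = Counter()
    output_total: Counter = Counter()
    for attrs in spans:
        key = label_key(attrs)
        input_total[key] += attrs.get("gen_ai.usage.input_tokens", 0)
        output_total[key] += attrs.get("gen_ai.usage.output_tokens", 0)
    return input_total, output_total


def runs_over_budget(input_total: Counter, output_total: Counter,
                     budget: int) -> list[str]:
    """Return run_ids whose combined input + output tokens exceed `budget`."""
    per_run: Counter = Counter()
    for counter in (input_total, output_total):
        for (_model, _role, _skill, run_id), tokens in counter.items():
            per_run[run_id] += tokens
    return sorted(run for run, tokens in per_run.items() if tokens > budget)
```

The second helper corresponds to the evaluator question about whether a single run ever exceeded its token budget; in the deployed stack that question would be answered from the Prometheus counters rather than in Python.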
Grafana queries those counters via PromQL and renders dashboards like: Total tokens per skill per day, Token spend per Ralph-loop run, Tokens by role, Token blowup detection. The same metrics power alerts and quotas without writing GemmaForge-specific code.
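The dashboards named above could be backed by PromQL expressions along the following lines. This is a sketch that only builds query strings; the helper names, the time windows, and the budget value are illustrative assumptions, and the actual panel queries live in the provisioned Grafana dashboards.

```python
def tokens_by(label: str, window: str, direction: str = "input") -> str:
    """PromQL for token growth over `window`, grouped by one label."""
    if direction not in ("input", "output"):
        raise ValueError("direction must be 'input' or 'output'")
    metric = f"gen_ai_{direction}_tokens_total"
    return f"sum by ({label}) (increase({metric}[{window}]))"


def total_tokens_by(label: str, window: str) -> str:
    """PromQL for input + output tokens over `window`, grouped by one label."""
    return (f"{tokens_by(label, window, 'input')} + "
            f"{tokens_by(label, window, 'output')}")


def over_budget(window: str, budget: int) -> str:
    """PromQL that returns only the runs whose total tokens exceed `budget`."""
    return f"({total_tokens_by('run_id', window)}) > {budget}"


# Hypothetical panel queries for the dashboards listed in the text.
PANELS = {
    "Total tokens per skill per day": total_tokens_by("skill", "1d"),
    "Token spend per Ralph-loop run": total_tokens_by("run_id", "30d"),
    "Tokens by role": total_tokens_by("role", "1d"),
    "Token blowup detection": over_budget("1h", 200000),
}
```

The comparison expression in `over_budget` has the same shape an alerting rule would use, which is how the same counters can drive both dashboards and budget alerts.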
Alternatives considered
- Langfuse (the original PRD choice): Excellent LLM-native UI for trace browsing and token accounting. Rejected for three reasons: (a) the OTel GenAI semantic conventions reached parity with Langfuse's data model in 2025, eliminating the LLM-specific capability gap that was Langfuse's original justification; (b) Langfuse has had documented security issues that make it a problematic dependency for a Federal-leaning reference build; (c) the host operator is migrating off Langfuse for unrelated workloads, so building GemmaForge to depend on it now is a strategic mistake. The OTel-pure path is more Federal-credible and avoids the maintenance/security lifecycle of a third-party product entirely.
- Grafana LGTM stack (Loki / Grafana / Tempo / Mimir): The modern Grafana observability stack. Excellent fit for production fleet observability and fully open source (the Grafana Labs components are AGPLv3-licensed, not Apache-2.0). Considered seriously. Rejected because it's heavier than we need for a single-host demo: Tempo + Mimir + Loki are three separate storage backends versus Jaeger + Prometheus's two, and the operational complexity isn't justified at our scale. We may revisit if a Federal customer wants to deploy GemmaForge into an existing LGTM environment. At that point the OTel collector simply gets a different exporter target, which is exactly the portability win the OTel-pure architecture buys us.
- Honeycomb / DataDog / New Relic / Lightstep: Excellent SaaS observability backends. Rejected on the same air-gap / no-SaaS constraint that drove ADR-0001 and ADR-0014. None of them are Federal-deployable in classified or air-gapped environments.
- Just write trace events to a SQLite file: Considered as a zero-dependency fallback. Rejected because it doesn't support the "we use the same observability stack you do" story for Federal evaluators, and because Jaeger + Prometheus are essentially zero-dependency themselves (one container each, no external storage required).
- OTel collector → SQLite or DuckDB: Would let us drop Jaeger and Prometheus in favor of a single embedded analytics database. Tempting for simplicity, but Jaeger's UI is the actual value add for the demo (operators want to click on a span and see the prompt/completion). Reinventing that UI is more work than running Jaeger.
Consequences
Positive
- Federal-credible by construction. Every Federal observability team already runs OTel + Prometheus + Grafana, frequently with Jaeger as the trace backend. GemmaForge's traces are immediately legible in their existing tools without translation.
- Token accounting is queryable. PromQL on gen_ai_*_tokens_total{...} answers any question about token spend by any dimension we care to label. No proprietary query language to learn.
- Vendor-neutral by standard. The harness emits standard OTel GenAI spans. Swapping any backend (e.g., to Grafana Tempo, to Honeycomb, to a customer's existing collector) is one config change, not a code change.
- No SaaS dependency, no phone-home, no license activation. Extends the air-gap-clean property of the inference layer to the observability layer.
- One fewer security maintenance lifecycle to track. Dropping Langfuse means we don't carry its CVEs, patch cadence, or upstream-product risk into a Federal reference build.
- Forward-compatible with the eventual whitepaper. The "we use open standards end-to-end" story is much cleaner than "we use open standards plus this one third-party product."
Negative / accepted trade-offs
- Jaeger's trace browser is more generic than Langfuse's LLM-native UI. A user who wants to see a prompt and its completion side by side has to expand the span and read the events. Mitigated by Phase 6's GemmaForge dashboard, which can deep-link to a Jaeger trace by ID and (optionally) render a prettier prompt/completion view inline using the OTel events.
- Five observability containers (otel-collector, jaeger, prometheus, grafana, plus one Grafana provisioning sidecar if needed) vs Langfuse's six. Net headcount goes down, but the components are more numerous in concept. Mitigated by clear scoping under the observability Compose profile and a single make obs-up Makefile target.
- No built-in "user feedback" or "annotation" UI (which Langfuse provides for human-in-the-loop labeling of LLM responses). Acceptable: GemmaForge is a fully autonomous Ralph loop, not a human-in-the-loop chatbot. If we ever need human labeling, we add it as a separate skill rather than coupling it into the observability stack.
- Custom Grafana dashboards have to be authored. Mitigated by shipping an infra/observability/grafana/dashboards/ directory with provisioned dashboards for the standard views (token spend, Ralph loop iterations, agent latency).
References
- OpenTelemetry GenAI semantic conventions
- Jaeger v2 (OTel-native)
- Prometheus
- Grafana
- OpenTelemetry Collector Contrib
- ADR-0014: Triton-managed vLLM director (shared host service)
- ADR-0012: /data/<service>/ host layout convention