Journey: Observability — From Langfuse to OTel-Pure
The story in one sentence
The original PRD specified Langfuse; the plan then evolved to "OTel primary, Langfuse secondary"; finally, during Phase 0.5, I discovered that Langfuse was already running on the host and had security concerns, which pushed me to drop it entirely in favor of an OTel-pure stack that is more Federal-credible anyway.
What I planned
The original PRD said Langfuse for tracing. During the interview phase, I picked "Both" on Langfuse vs OTel — meaning OpenTelemetry as the instrumentation standard with Langfuse as the LLM-friendly UI on top.
What changed
Discovery 1: Langfuse already running on the host
During Phase 0.5 host prep, docker ps revealed 29 containers
already running on the XR7620, including langfuse-web and
langfuse-worker (6 days uptime). Also: litellm, a full Supabase
stack, Qdrant, ClickHouse, MinIO, Redis, Mattermost, Traefik, and
Unstructured.
This triggered the "shared host service" insight: if Langfuse is already running, GemmaForge should connect to it as a client, not spin up its own copy.
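The shared-host-service idea can be sketched as a small check that runs before anything is started. This is a minimal sketch, not the project's actual tooling: the function names `running_containers` and `plan_services` and the "reuse vs start" policy are assumptions made for illustration. The only external call is the read-only `docker ps --format "{{.Names}}"`.

```python
from __future__ import annotations

import subprocess


def running_containers() -> set[str]:
    """Return the names of running containers using a read-only `docker ps`."""
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}


def plan_services(required: list[str], running: set[str]) -> dict[str, str]:
    """Hypothetical policy: reuse a service that is already up, start it otherwise."""
    return {name: ("reuse" if name in running else "start") for name in required}


# Example with a hand-written inventory (no Docker call needed):
# plan_services(["langfuse-web", "otel-collector"], {"langfuse-web", "litellm"})
```

Keeping the policy in a pure function means the decision can be reviewed against a saved inventory without touching the existing containers.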
Discovery 2: Langfuse has security issues
Langfuse had surfaced security issues that made migrating off it attractive to the host operator, who wanted something more Federal-credible that would run only on this host.
This was the tipping point. Building GemmaForge to depend on a product the host operator is migrating away from is a strategic mistake.
The pivot: OTel-pure
The OpenTelemetry GenAI semantic conventions (gen_ai.* attributes)
reached parity with Langfuse's data model in 2025. Everything Langfuse
shows about an LLM call has a vendor-neutral OTel equivalent:
| What | Langfuse | OTel GenAI equivalent |
|---|---|---|
| Token usage | Built-in dashboard | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens → Prometheus counters |
| Prompt/completion content | Trace browser | gen_ai.prompt and gen_ai.completion span events → Jaeger |
| Cost tracking | Token × price | PromQL on token counters × configured price |
| Model identity | Trace metadata | gen_ai.request.model, gen_ai.response.model span attributes |
| Session/user grouping | Sessions tab | Custom span attributes (forge.run_id, forge.skill, forge.role) |
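To make the mapping in the table concrete, here is a minimal sketch of the attribute set one LLM call might carry, plus the "tokens × configured price" arithmetic from the cost row. The helper names and the per-1k-token prices are assumptions for illustration, and the cost function mirrors in Python what the table describes doing in PromQL. The attribute keys are the ones listed in the table.

```python
from __future__ import annotations


def genai_span_attributes(
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    run_id: str,
    skill: str,
    role: str,
) -> dict[str, str | int]:
    """Build the vendor-neutral attributes for one LLM call (hypothetical helper)."""
    return {
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Custom grouping attributes, standing in for Langfuse sessions.
        "forge.run_id": run_id,
        "forge.skill": skill,
        "forge.role": role,
    }


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> float:
    """Token count times configured price; the prices are assumed inputs."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k
```

With the OpenTelemetry Python API installed, a dictionary like this could be passed to a span's `set_attributes` method; the sketch stays dependency-free so the mapping itself is easy to inspect.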
The replacement stack:
- OTel Collector: receives OTLP from the harness and fans out
- Jaeger v2: trace storage and UI (prompt/completion visible as span events)
- Prometheus v3: metrics (token counters with labels)
- Grafana v11: dashboards (token spend, latency, mission-app uptime)
Four containers vs Langfuse's six. The net container count actually goes down.
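On the harness side, pointing at the collector can be done entirely through the standard OpenTelemetry SDK environment variables. The sketch below only assembles those variables. The helper name `otlp_env`, the hostname `otel-collector`, and the service name are assumptions; 4317 is the conventional OTLP gRPC port.

```python
from __future__ import annotations


def otlp_env(
    collector_endpoint: str,
    service_name: str,
    resource_attributes: dict[str, str] | None = None,
) -> dict[str, str]:
    """Assemble standard OTel SDK environment variables (hypothetical helper)."""
    env = {
        "OTEL_EXPORTER_OTLP_ENDPOINT": collector_endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
        "OTEL_SERVICE_NAME": service_name,
    }
    if resource_attributes:
        # The spec format is a comma-separated list of key=value pairs.
        env["OTEL_RESOURCE_ATTRIBUTES"] = ",".join(
            f"{key}={value}" for key, value in sorted(resource_attributes.items())
        )
    return env


# Example (assumed names):
# otlp_env("http://otel-collector:4317", "gemmaforge-harness", {"host.name": "xr7620"})
```

Because the harness only knows the collector endpoint, swapping or adding backends behind the collector would not require changes to the harness.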
The Federal-credibility argument
Every Federal observability team already runs OTel + Prometheus + Grafana. GemmaForge's traces are immediately legible in their existing tools without translation. "We use the same observability stack you do" is a stronger answer than "we use this third-party LLM-specific product you haven't heard of."
Key artifacts
- ADR-0007: the full decision record
- docker-compose.yml: the OTel-pure stack (otel-collector, jaeger, prometheus, grafana)
- Memory: project_existing_host_services.md (inventory of what's already running)
- Memory: feedback_dont_touch_docker.md (the "don't disturb existing containers" rule that the shared-host-service pattern codifies)