The LiteLLM Decision and the OTel-Pure Choice

LiteLLM was supposed to be the default. It's an open-source proxy that sits between your app and any LLM backend, gives you one unified API, and handles trace export to tools like Langfuse. The PRD mentioned it. Early sketches assumed it. I would have had it running in a week.

Then, in early March 2026, LiteLLM versions 1.82.7 and 1.82.8 landed on PyPI with malicious code baked in. A dependency of one of LiteLLM's own build-time tools — ironically, a security scanner — had been compromised upstream. The bad versions were live for hours before detection and removal.

The LiteLLM team responded fast: pulled the packages, shipped fixes, disclosed honestly. No complaints there. And to be clear, this can happen to any package ecosystem. Python, npm, Rust crates, Go modules — all have had supply chain incidents. LiteLLM's bad luck was being a high-value target because it sits in the critical path of every LLM call.

But that's exactly the point.

The decision wasn't "LiteLLM is bad." The decision was "having a proxy layer in the critical path of every LLM call is a risk that buys less than I thought, and the project can avoid that risk entirely by talking to vLLM directly."

So I did. The harness talks to vLLM's OpenAI-compatible REST endpoint via httpx. No proxy, no abstraction wrapper, no extra process. For observability, the harness is instrumented directly with the OpenTelemetry Python SDK — spans around each LLM call, each tool invocation, each Ralph loop iteration, decorated with GenAI semantic conventions. The collector exports to Jaeger, Prometheus, and Grafana. Alongside that, the harness writes a structured JSONL event log per run as a redundant record (see journey/12.5-structured-run-logger).
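A minimal sketch of what that direct path looks like, assuming a local vLLM server at `http://localhost:8000/v1` — the URL, model name, and event fields here are illustrative, not the harness's actual values:

```python
import json
import time

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local endpoint


def build_payload(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build an OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def jsonl_event(kind: str, **fields) -> str:
    """Serialize one structured run-log event as a single JSONL line."""
    return json.dumps({"ts": time.time(), "event": kind, **fields})


def call_vllm(payload: dict) -> dict:
    """POST directly to vLLM's OpenAI-compatible endpoint. No proxy in between."""
    # httpx is imported lazily here so the module stays importable without it;
    # in the harness, each call site is also wrapped in an OpenTelemetry span
    # carrying GenAI semantic-convention attributes (e.g. gen_ai.request.model).
    import httpx

    resp = httpx.post(VLLM_URL, json=payload, timeout=120.0)
    resp.raise_for_status()
    return resp.json()
```

In the real harness the span comes from `tracer.start_as_current_span(...)` via the OpenTelemetry SDK, and each `jsonl_event` line is appended to the per-run log file as the redundant record mentioned above.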

The whole observability stack is CNCF-standard, open source, and well-understood in production environments. "Unexciting" is the highest compliment an infrastructure reviewer can give a software stack.

There are trades. Multi-backend portability — swapping vLLM for a cloud endpoint would need a small adapter. Langfuse's nice pre-built prompt-analysis UI — the dashboards here are custom, tailored to reflexion loops. Community knowledge — fewer Stack Overflow answers for "direct vLLM + custom OTel" than for the LiteLLM + Langfuse combo. For a project whose entire point is local-first, air-gap-capable edge inference, these are trades worth making.

The part that lands hardest when I explain this architecture to someone: the LiteLLM decision is a concrete, dated, verifiable event — not a hypothetical risk. "This choice came out of a real supply chain incident in March 2026" hits differently than "fewer dependencies in principle."