The Structured Run Logger: JSONL as the Event Format

Sol 12.5 — The question I couldn't answer

2:47 AM. The v2 overnight run just died after six hours and 180 rules. Something went wrong around rule 140 — the remediation rate cratered and never recovered. I know this because I can see it in the final summary. What I don't know is why.

So I open the log file.

It's 38,000 lines of interleaved Python logging.info() calls. Timestamps, sure. But half of them say things like Starting attempt 3 for V-230264 and the other half say GPU memory: 21149 MiB. There's no structure. No way to ask "show me every reflection that fired between rule 130 and rule 150" without writing a regex that accounts for six different log formats that evolved over three weeks of development.

I spent forty minutes grepping. I found part of the answer — something about a banned pattern list that grew too aggressive around iteration 12. But I found it by accident, scrolling past it while looking for something else. That's not analysis. That's archaeology.

At 3:30 AM I closed the terminal and wrote one line in my notes: "I cannot answer questions about my own runs."

The shape of the fix

The problem wasn't missing data. The harness was logging plenty. The problem was that logging.info(f"Reflection: {text}") produces a string that looks like information but is actually write-only. You can print it. You can't query it.

What I needed was one structured record per event. Not one per turn, not one per attempt — one per event, because the interesting questions cut across those boundaries. And each record had to be self-contained: timestamp, elapsed seconds since run start, event type, which agent emitted it, iteration number, a freeform data payload, and — crucially — an inline GPU state snapshot on events that care about hardware.
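Concretely, a record with those fields might look like this. The field names here are my reconstruction from the description above, not the harness's actual schema:

```python
import json
from datetime import datetime, timezone

# A hypothetical event record with the fields described above.
# Names like "ts" and "mem_used_mib" are illustrative guesses.
event = {
    "ts": datetime.now(timezone.utc).isoformat(),  # wall-clock timestamp
    "elapsed_s": 342.7,           # seconds since run start
    "event_type": "reflection",   # what happened
    "agent": "reflector",         # which agent emitted it
    "iteration": 12,              # iteration number
    "data": {"text": "try a different service name", "attempt": 3},
    "gpu": {"mem_used_mib": 21149},  # inline hardware snapshot
}
print(json.dumps(event))  # one self-contained line in the JSONL file
```

Everything a consumer needs to interpret the event travels on the line itself.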

The format had to be JSONL. Not a JSON array, because you can't tail a JSON array — you'd have to wait for the closing bracket. Not CSV, because nested structures. Not protobuf, because I need to be able to cat the file and see what happened. JSONL gives you one parseable record per line. You can tail -f it. You can stream it over SSE. You can iterate it line-by-line without loading a six-hour run into memory.
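The streaming property falls out of the format: each line is independently parseable, so a reader never needs the whole file. A minimal sketch (the function name and the skip-partial-line behavior are my choices, not the harness's):

```python
import json

def iter_events(path):
    """Lazily yield one parsed event per JSONL line.

    A truncated final line (a write still in progress when we read)
    fails to parse and is skipped instead of crashing the reader.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # partial line from an in-flight write
```

A JSON array can't do this: the document isn't valid until the closing bracket, so nothing downstream can start reading mid-run.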

I also made a rule that stuck: never emit an event you can't name a consumer for. Every event_type in the schema has something that reads it — the dashboard, the replay system, the test harness, or the postmortem scripts. No speculative telemetry. If nobody's asking the question, don't log the answer.

I had the logger written and integrated by noon. Forty-two lines of Python. The emit() method takes a dict, stamps it with time and GPU state, and appends one line to the JSONL file. That's it. That's the whole thing.
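This isn't the actual forty-two lines, but the shape the paragraph describes is roughly this. The class name and the gpu_probe hook are my inventions; only emit() taking a dict, stamping it, and appending a line is from the description:

```python
import json
import time
from datetime import datetime, timezone

class RunLogger:
    """Append-only JSONL event logger -- a sketch, not the real code."""

    def __init__(self, path, gpu_probe=None):
        self.path = path
        self.start = time.monotonic()
        self.gpu_probe = gpu_probe  # e.g. a callable wrapping nvidia-smi

    def emit(self, event: dict) -> None:
        record = dict(event)  # don't mutate the caller's dict
        record["ts"] = datetime.now(timezone.utc).isoformat()
        record["elapsed_s"] = round(time.monotonic() - self.start, 3)
        if self.gpu_probe is not None:
            record["gpu"] = self.gpu_probe()
        # One JSON object per line, appended and closed immediately,
        # so tail -f and any SSE tailer see the event right away.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
```

Open-append-close per event is deliberately boring: no buffering to lose on a crash, and a run that dies at hour six leaves every event it ever emitted on disk.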

The first payoff

That evening I kicked off another run. This time, when it stalled around iteration 8, I didn't open the log and start grepping. I opened a Python REPL and typed:

import json

with open("runs/latest/events.jsonl") as f:
    reflections = [e for e in map(json.loads, f)
                   if e["event_type"] == "reflection"]

Twelve seconds later I was looking at every reflection the system had ever produced, in order, with timestamps, attempt numbers, and the full text of what the reflector said. The plateau was obvious — three consecutive reflections all saying "try a different service name" when the actual problem was a file permission issue. The reflector was stuck in a loop, and I could see the exact moment it started repeating itself.

The whole analysis tier is with open(...) as f: for line in f: event = json.loads(line).

That's not a simplification for the journal. That's the actual code. The overnight postmortem in journey/14 was conducted entirely with variations of that one-liner. The Tier 5 integration test asserts properties of the JSONL stream — event ordering, required fields, outcome codes — without knowing anything about harness internals. The live dashboard tails the file over SSE. The replay mode reads the same file with a speed multiplier. Same format, same code path, different iterator.
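One such variation answers the question from the 2:47 AM grep session directly. The rule_index field inside the payload is my guess at a field name, not the harness's actual schema:

```python
import json

def reflections_in_range(path, lo, hi):
    """Every reflection event whose rule index falls in [lo, hi].

    "rule_index" is a hypothetical payload field; substitute
    whatever the real schema calls it.
    """
    with open(path) as f:
        return [e for e in map(json.loads, f)
                if e["event_type"] == "reflection"
                and lo <= e.get("data", {}).get("rule_index", -1) <= hi]

# e.g. reflections_in_range("runs/latest/events.jsonl", 130, 150)
```

The forty-minute regex archaeology becomes a three-line filter because the structure was paid for at write time.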

Including elapsed_s alongside the ISO timestamp turned out to be one of those cheap decisions that pays for itself constantly. Timestamps let you correlate with wall-clock events. Elapsed seconds let you read the run as a story — "the system spent 340 seconds on this rule before escalating" — without doing date math in your head.

The inline GPU state was the other non-obvious win. Yes, it makes records bigger. But it means every event is self-contained for replay. The dashboard can show what the GPUs looked like at any moment without joining against a separate time series. Self-contained records beat normalized records, every time, when your consumer count is growing.
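A plausible probe for that inline snapshot shells out to nvidia-smi in CSV mode. This is a sketch of one way to do it, not the harness's actual probe; the returned field names are mine:

```python
import subprocess

def gpu_snapshot():
    """Query nvidia-smi for per-GPU memory and utilization.

    One possible implementation of an inline GPU probe; the real
    harness may gather its snapshot differently.
    """
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

def parse_gpu_csv(text):
    """Parse nvidia-smi CSV rows like '0, 21149, 87' into dicts."""
    gpus = []
    for row in text.strip().splitlines():
        idx, mem, util = (col.strip() for col in row.split(","))
        gpus.append({"index": int(idx),
                     "mem_used_mib": int(mem),
                     "util_pct": int(util)})
    return gpus
```

Calling this on every event costs a process spawn, which is why caching or sampling the probe is a reasonable refinement once event volume grows.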

I keep coming back to the same thought: the most important piece of infrastructure in this project is a forty-two-line append-only logger that writes one JSON object per line. Not the LLM orchestration. Not the snapshot-and-revert system. The logger. Because every downstream capability — dashboard, replay, tests, postmortems, the demo itself — assumes that event stream exists. Without it, each of those systems would have had to invent its own ad-hoc logging, and none of them would agree on the format.

Boring decisions are load-bearing decisions. Write them down.