The First Complete Run

The story in one sentence

The v5 harness ran overnight, processed every single STIG rule without crashing, and the result was simultaneously a triumph (it finished!) and a wake-up call (the learning system I built isn't actually learning).

The overnight result

Run 1 kicked off at 5:38 PM on Saturday. By the Sunday morning check at 7:07 AM — 13.5 hours later — it had finished. Every rule processed. No crashes, no hangs, no unrecoverable states. After weeks of runs that died mid-way through from context overflows, broken checkpoints, and infinite retry loops, just finishing felt like a milestone.

The numbers:

Metric       Count
Remediated   85  (31.5%)
Escalated    157 (58.1%)
Skipped      28  (10.4%)

11,518 events. 26.7 MB of structured JSONL. The biggest log file this project has produced by an order of magnitude.

The RPM cascade

The category breakdown told the real story:

Category         Fix Rate       Notes
authentication   100% (23/23)   Clean sweep
kernel           68%  (28/41)   Strongest non-trivial category
audit            6%   (4/65)    Something is very wrong
ssh              6%   (1/18)    Same problem

Authentication rules? Perfect. Kernel sysctl rules? Mostly good. But audit and SSH — categories where the fixes were probably correct — cratered at 6%.

I dug into the lessons table. The SQLite memory store had captured 644 lessons during the run. And 397 of them — sixty-two percent — mentioned the RPM database.

The pattern was obvious once it surfaced: somewhere early in the run, a remediation broke the RPM database on the target VM. After that, oscap (the STIG evaluator) couldn't verify anything that required package metadata. The fixes were going in correctly — the health checks passed, the config files were right — but the evaluator kept returning FAIL because it couldn't open the RPM DB to cross-reference package state.
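Once you know to look for it, this failure mode is cheap to detect in the evaluator's output itself. As an illustration only (the marker strings below are my guesses at what RPM DB errors look like, not taken from the run logs), a classifier separating "the remediation is wrong" from "the evaluator cannot verify" might look like:

```python
# Hypothetical sketch: classify an oscap FAIL as an environment fault
# (unreadable RPM database) rather than a genuinely incorrect fix.
# Marker strings are illustrative assumptions, not from the run.
RPM_DB_MARKERS = (
    "rpmdb",
    "rpm database",
    "cannot open packages database",
)

def is_environment_fault(evaluator_output: str) -> bool:
    """True if a FAIL looks like RPM DB corruption, not a bad remediation."""
    text = evaluator_output.lower()
    return any(marker in text for marker in RPM_DB_MARKERS)
```

A check like this is how a harness could stop charging FAILs against the remediation when the real culprit is the verification environment.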

The Reflector saw it. Over and over:

"The remediation is technically successful, but the evaluation tool (oscap) is failing due to a corrupted or inaccessible RPM database."

"Stop modifying sysctl and prioritize repairing the RPM database to enable verification."

The Reflector knew. It said "stop trying, fix the RPM DB" in 406 separate reflections. But the harness kept grinding because the knowledge wasn't actionable — it lived in the lessons list but never reached the Worker's hands in a way that could change behavior.

The real problem: the learning loop was open

This is where the morning turned from "analyze the run results" to "audit the entire cross-run architecture." I'd built a beautiful memory system — SQLite with WAL mode, four tables, lesson weights, category stats, the whole context graph from entry 22. But tracing the actual data flow from storage through hydration to prompt injection surfaced five gaps:

1. Lesson weights never changed. Every single lesson in the database had weight 0.5 — the default. The update_lesson_weight() method existed, had clean code, had proper up/down logic. Nobody called it. Success didn't boost lessons. Failure didn't decay them. The ranking system was a no-op.

2. Only 3 lessons reached the prompt. The semantic memory summary — the text that actually gets injected into the Architect's and Worker's prompts — showed the last 3 lessons. Not the best 3. Not the most relevant 3. The last 3, by insertion order. With 644 lessons stored and 20 loaded at hydration, the agents saw... 3.

3. No per-item cross-run memory. I built query_prior_attempts(item_id) specifically so Run 2 could ask "what happened to this exact rule last time?" It was never called. When the Worker started on a rule that had been tried 3 times in Run 1 and escalated due to RPM corruption, it had no idea.

4. No category-specific lessons. I built load_lessons(category) so a Worker tackling a kernel rule would see kernel-specific insights. Never called. The Worker saw the same 3 generic lessons regardless of what it was working on.

5. No diversity in lesson selection. With all weights at 0.5, load_all_lessons(min_weight=0.3, limit=20) returned 20 arbitrary lessons. In practice, this meant 20 AIDE lessons about integrity monitoring — because they happened to be first in the table. The RPM DB lessons, which were arguably the most important finding of the entire run, might not even appear.
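Gaps 2, 4, and 5 all come down to the retrieval query, and the difference between what ran and what was intended fits in a few lines of SQL. A minimal sketch (the table shape and column names are assumptions based on the descriptions above, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lessons ("
    "  id INTEGER PRIMARY KEY,"
    "  category TEXT,"
    "  weight REAL DEFAULT 0.5,"
    "  text TEXT)"
)
conn.executemany(
    "INSERT INTO lessons (category, weight, text) VALUES (?, ?, ?)",
    [
        ("aide", 0.5, "AIDE lesson 1"),
        ("aide", 0.5, "AIDE lesson 2"),
        ("audit", 0.9, "Repair the RPM DB before audit rules"),
        ("ssh", 0.5, "sshd_config needs a service reload"),
    ],
)

# What the summary effectively did: the last 3 lessons by insertion order.
last3 = [r[0] for r in conn.execute(
    "SELECT text FROM lessons ORDER BY id DESC LIMIT 3")]

# What the gaps call for: ranked, category-aware retrieval.
ranked = [r[0] for r in conn.execute(
    "SELECT text FROM lessons WHERE category = ? AND weight >= 0.3 "
    "ORDER BY weight DESC LIMIT 5", ("audit",))]
```

With every weight frozen at 0.5, the ORDER BY weight clause in the second query is meaningless anyway, which is why the weight gap and the retrieval gaps compound each other.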

Five gaps. Every one was a case where the storage side was implemented correctly and the retrieval-and-injection side was either missing or truncated. The database was learning. The agents weren't reading.
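Gap 1 in particular is just missing call sites. A hedged sketch of the reinforcement rule (the delta, the clamp bounds, and the function shape here are my assumptions; only the idea of boosting on success and decaying on escalation comes from the design):

```python
# Hypothetical reinforcement rule: boost a category's lessons on
# success, decay them on escalation. Delta and clamps are assumptions.
def reinforce(weights: dict, category: str,
              succeeded: bool, delta: float = 0.1) -> None:
    """weights maps lesson id -> weight; ids are '<category>:<n>' here."""
    for lesson_id in weights:
        if lesson_id.startswith(category + ":"):
            change = delta if succeeded else -delta
            weights[lesson_id] = min(1.0, max(0.0, weights[lesson_id] + change))
```

Starting from the stuck default of 0.5, a single success and a single escalation are already enough to separate lessons into "float up" and "sink down."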

The meta-question

There was a moment of pause: is this cheating?

The RPM DB problem was visible. I could hardcode an rpm --rebuilddb recovery step. I could add a pre-flight check. But that would be me fixing a problem the system discovered — human intervention masquerading as machine learning.

The better question: does the system have the mechanism to act on what it knows? If Run 2's Architect, armed with 397 lessons about RPM DB corruption, independently decides to run rpm --rebuilddb before attempting audit rules — that's the demo money shot. That's cross-run learning working as designed.

But it can only do that if the lessons actually reach the prompt.

The five fixes

All five changes are to ralph.py and touch the general harness, not the STIG skill. Any future skill benefits from the same improvements.

1. Lesson weight reinforcement. On success: boost all lessons in that category (they were available when the agent succeeded, so they're probably helpful). On escalation: decay them (they were available and didn't prevent failure). Over runs, good lessons float up, bad ones sink.

2. Show 8 lessons, not 3. Prioritize prior-run lessons (tagged [prior run]) over within-run lessons, since cross-run knowledge is the whole point of the memory system.

3. Per-item cross-run history. On first attempt at any rule, query the memory store for prior attempts against that exact rule. Inject the approach, outcome, and lesson from each prior attempt. The Worker sees "this was tried 3 times before, all failed because of RPM DB errors" before it writes a single line of bash.

4. Category-specific lessons. Load the top 5 lessons for the current rule's category and inject them at priority 5 (below semantic memory, above the final directive). When working on an audit rule, the Worker sees audit-specific lessons.

5. Diverse hydration. Load 40 lessons instead of 20, deduplicate by first 80 characters, cap at 3 per category to ensure diversity, then take the top 30. The RPM DB lessons now compete fairly with AIDE lessons instead of being crowded out.
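Fix 5 as a sketch, assuming the lessons arrive pre-ranked (best first) and using the thresholds from the description — a 40-lesson pool, an 80-character dedupe prefix, 3 per category, top 30 kept. Names here are illustrative, not the actual ralph.py code:

```python
from collections import defaultdict

def diverse_hydration(lessons, pool=40, prefix=80, per_cat_cap=3, keep=30):
    """lessons: iterable of (category, text) pairs, best-ranked first.

    Dedupe near-identical lessons by their first `prefix` characters and
    cap each category so one noisy category can't crowd out the rest.
    """
    seen = set()
    per_cat = defaultdict(int)
    selected = []
    for category, text in list(lessons)[:pool]:
        key = text[:prefix]
        if key in seen or per_cat[category] >= per_cat_cap:
            continue
        seen.add(key)
        per_cat[category] += 1
        selected.append((category, text))
        if len(selected) == keep:
            break
    return selected
```

The per-category cap is what matters: even if 20 AIDE lessons sit at the front of the pool, at most 3 survive, leaving room for the RPM DB lessons behind them.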

What happens next

Run 2. The same 270 rules, the same target VM (reset to baseline), the same model. The only difference is what's in the memory store and the five fixes that let the harness actually use it.

If the fix rate improves — especially in audit and SSH categories where RPM corruption was the bottleneck — that's the cross-run learning thesis validated. The harness gets smarter by running, not by being reprogrammed.

If it doesn't improve, that's an equally important finding: maybe 644 text lessons aren't the right representation, maybe the lessons are too specific to generalize, maybe the model can't act on injected historical context at this scale. Either way, that's a real learning.

Honest truth: I'm nervous. Not about whether the harness will crash — it proved it can finish. About whether the learning is real. There's a difference between a system that stores what it learned and a system that uses what it learned. Run 2 is the test.


Related entries

  • journey/22 — the architectural decision that created the memory system we just audited.
  • journey/14 — the first overnight run that proved the inner loop works but revealed the missing Architect re-engagement.
  • journey/18 — the second overnight run that proved re-engagement and checkpoint-restore work.