The Second Overnight Run: 93 Rules, and What the Other 26 Teach
The story in one sentence
The v3 harness ran 9.5 hours on the same STIG workload that v2 barely dented, autonomously remediating 93 of 120 rules (78%), but the 26 escalated rules consumed 69% of wall time — and the patterns in those failures point to three harness-level architectural improvements that aren't about STIG at all.
Why this is its own entry
The first overnight run
(journey/14) found the flaws that
produced the v3 fix pass. This second run validates that those fixes
worked — and then reveals the next layer of architectural questions.
The pattern of "ship, observe, analyze, improve" is the Ralph loop
applied to the harness itself.
What the numbers say
The scoreboard
| Metric | v2 Run | v3 Run | Delta |
|---|---|---|---|
| Duration | 10h | 9.5h | — |
| Rules attempted | 28 | 120 | +330% |
| Remediated | 2 (7%) | 93 (78%) | +4,550% |
| Escalated | 26 (93%) | 26 (22%) | −71pp |
| Throughput | 2.8/hr | 12.5/hr | +346% |
The v3 fixes didn't just improve the fix rate — they changed the character of the run. v2 was a harness that mostly failed. v3 is a harness that mostly succeeds and fails informatively on the rest.
The efficiency story
First-try success rate: 79%. 74 of 94 remediations needed exactly one attempt. The model knows how to fix most STIG rules when given a clean prompt and the right tools. Median time-to-remediation: 34 seconds.
The distribution has a long tail: after the 74 one-shot fixes, there's a cliff to 5 rules at 2 attempts, then a smattering of hard cases out to 19 attempts. The hard-but-eventually-successful rules include:
- file_permission_user_init_files_root: 19 attempts, 1159s
- rsyslog_encrypt_offload_defaultnetstreamdriver: 17 attempts, 1199s
- rsyslog_remote_access_monitoring: 13 attempts, 1004s
- rsyslog_encrypt_offload_actionsendstreamdrivermode: 12 attempts, 864s
These four barely scraped in under the time budget. They represent "eventually correct" rules where the model needed many pivots to find the right incantation.
The time-waste ratio
This is the headline finding:
| Outcome | Rules | Wall Time | Time/Rule |
|---|---|---|---|
| Remediated | 94 (78%) | 2.7h (31%) | 34s median |
| Escalated | 26 (22%) | 6.1h (69%) | 959s median |
69% of the run was spent on rules that ultimately failed. The harness is fast when it works and slow when it doesn't, and it doesn't know the difference early enough.
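The split can be checked from the table's own figures. The sketch below uses only the two wall-time values reported above; the variable names are illustrative. Note that the two rows sum to 8.8h, so the 31% and 69% shares appear to be fractions of that accounted time rather than of the full 9.5h run.

```python
# Wall-time figures copied from the table above, in hours.
remediated_hours = 2.7
escalated_hours = 6.1

# Shares are computed against the time accounted for by the two rows.
total_hours = remediated_hours + escalated_hours
escalated_share = escalated_hours / total_hours
remediated_share = remediated_hours / total_hours

print(f"accounted wall time: {total_hours:.1f}h")
print(f"escalated share:  {escalated_share:.0%}")
print(f"remediated share: {remediated_share:.0%}")
```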
Category performance
| Category | Fix Rate | Avg Time | Notes |
|---|---|---|---|
| authentication | 100% | 46s | PAM faillock, passwords — the sweet spot |
| service-config | 100% | 17s | Trivial one-liners |
| cryptography | 100% | 21s | Package installs + config |
| kernel | 89% | 94s | sysctl params; 4 failures are impossible at runtime |
| package-management | 88% | 132s | Mostly dnf install |
| logging | 73% | 509s | rsyslog config is hard |
| filesystem | 71% | 181s | Permissions mostly fine; partitioning impossible |
| integrity-monitoring | 29% | 824s | AIDE dependency chain |
| user-account | 14% | 1009s | Scanner semantic gap |
| banner | 0% | 1209s | Scanner semantic gap |
What the v3 fixes did
Each of the five v3 fixes is visible in the data:
- Worker single-action enforcement: 0 tool-call-cap events. The prompt-driven approach worked without the hard cap ever firing. The model is voluntarily constraining itself to one tool call per turn.
- Context budget assembler: No sections were ever truncated or dropped. Maximum utilization was 54% for rule selection. The budgets are generous enough.
- Semantic plateau detection: Not directly visible (0 explicit plateau events), but the architect is escalating based on pattern recognition before the time budget runs out: 19 of 26 escalations were architect_preemptive rather than time_budget.
- Architect re-engagement: 181 re-engagements across the run. The architect is actively managing the loop: 89% PIVOTs, 10.5% ESCALATEs, 0.5% CONTINUEs. ESCALATE accuracy is 100%.
- Snapshot-based revert + diagnostics: 430 reverts, all cleanly executed. 430 post-mortems, each with structured diagnostic capture. The revert-on-failure mechanism is the backbone of the loop.
Three architectural findings
Finding 1: Conversation history overflow
The overflow problem that v3 fixed at the prompt level reappears at the conversation level.
On high-attempt rules (13+ attempts), accumulated tool call/result
pairs push the vLLM context past the 16K token limit. 8 errors, all
identical: PromptTooLongError. The prompt budget assembler controls
the instruction portion, but the conversation history — SSH
commands and their multi-line output — grows unbounded within a rule.
This is the same class of problem as the episodic memory distillation, applied to the within-rule conversation. The harness needs a sliding window or summarization mechanism: keep the last N turns verbatim, compress earlier turns to a one-line summary.
Impact: Only 8 errors (low), but those errors hit rules that were already hard cases, making them harder. More importantly, this is a ticking bomb — any rule that reaches 15+ attempts will hit it.
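The sliding-window idea described above can be sketched in a few lines. This is a minimal illustration, not the harness's code: the `Turn` type, `summarize`, `compact_history`, and the `keep_last` parameter are all hypothetical names, and the "summary" here is simply the command plus the first line of its output. A real version would likely budget by tokens against the 16K limit rather than by turn count.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One tool call/result pair (hypothetical shape)."""
    command: str  # the SSH command the worker issued
    output: str   # raw, possibly multi-line, output


def summarize(turn: Turn, max_chars: int = 80) -> str:
    """Compress a turn to one line: the command plus the first output line."""
    lines = turn.output.strip().splitlines()
    first_line = lines[0] if lines else "(no output)"
    return f"$ {turn.command} -> {first_line}"[:max_chars]


def compact_history(turns: list[Turn], keep_last: int = 3) -> list[str]:
    """Keep the last `keep_last` turns verbatim; summarize everything earlier."""
    split = max(len(turns) - keep_last, 0)
    older, recent = turns[:split], turns[split:]
    summaries = [summarize(t) for t in older]
    verbatim = [f"$ {t.command}\n{t.output}" for t in recent]
    return summaries + verbatim


# Usage: five turns, keeping only the last two verbatim.
history = [Turn(f"cmd{i}", f"line one of {i}\nline two") for i in range(5)]
compacted = compact_history(history, keep_last=2)
for entry in compacted:
    print(entry)
```

Summarizing by first output line is the crudest possible choice; it keeps the sketch deterministic, where a model-generated summary would preserve more of why an attempt failed.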
Finding 2: Evaluation should triage, not just pass/fail
The current evaluation logic is binary: either the rule passes and the mission app is healthy, or we revert. The data reveals three distinct failure modes that should drive different responses:
Mode A — Health failure (2.3% of reverts): The fix broke something. Revert immediately. This is correct today.
Mode B — Scanner gap (88% of reverts): Health is fine, but the scanner says the rule still fails. The model writes technically correct config that the scanner doesn't recognize. After 3+ clean attempts with different approaches that all pass health but fail the scanner, the harness should recognize this as a knowledge gap and escalate early rather than grinding to 15+ attempts.
Mode C — False-negative revert (2.1% of reverts): The rule actually passed but journal noise (warnings, non-fatal errors) caused the harness to revert a working fix. 9 reverts threw away good work. The harness then had to re-discover the same fix on a later attempt.
Journal noise on a passing rule check should not trigger a revert. A passing scanner result should be authoritative.
Impact: Early scanner-gap detection alone would save ~4 hours of wall time. Eliminating false-negative reverts would save the 9 wasted re-discovery cycles.
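One way to express the three modes as a decision function is sketched below. The names (`Verdict`, `triage`, `prior_scanner_gaps`, `gap_threshold`) are hypothetical, and the sketch makes two simplifying assumptions: the health check is a separate signal from journal warnings, and the harness already counts earlier attempts that passed health but failed the scanner. It does not verify that those attempts used different approaches.

```python
from enum import Enum


class Verdict(Enum):
    KEEP = "keep"          # the fix stands; the rule is remediated
    REVERT = "revert"      # roll back and let the loop try again
    ESCALATE = "escalate"  # stop retrying; treat as a knowledge gap


def triage(scanner_passed: bool, health_ok: bool,
           prior_scanner_gaps: int, gap_threshold: int = 3) -> Verdict:
    """Classify one attempt. Journal warnings are deliberately not an input."""
    # Mode A: the fix broke the mission app, so revert immediately.
    if not health_ok:
        return Verdict.REVERT
    # A passing scanner result is authoritative (avoids Mode C reverts).
    if scanner_passed:
        return Verdict.KEEP
    # Mode B: healthy but the scanner still fails. Count this attempt too.
    if prior_scanner_gaps + 1 >= gap_threshold:
        return Verdict.ESCALATE
    return Verdict.REVERT


# Usage: the third clean-but-unrecognized attempt triggers early escalation.
first_gap = triage(scanner_passed=False, health_ok=True, prior_scanner_gaps=0)
third_gap = triage(scanner_passed=False, health_ok=True, prior_scanner_gaps=2)
print(first_gap.value, third_gap.value)
```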
Finding 3: Rule dependency awareness
Five AIDE rules all depend on having a working AIDE database. The architect treats them as independent, so each one independently discovers and fails on the same prerequisite. That's 5 × ~1000s = 83 minutes wasted on a problem that should have been solved once.
The architect doesn't need a full dependency graph. Even a simple heuristic — "if 2+ rules fail for the same root cause mentioned in their post-mortems, try the most fundamental one first" — would capture this pattern.
Impact: Not large in absolute time (83 minutes), but large in principle. Any skill with prerequisite chains will hit this same pattern. It's a harness-level concern, not a STIG concern.
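The "2+ rules, same root cause" heuristic could start as a simple grouping step. The sketch below assumes each post-mortem has already been reduced to a normalized root-cause label, which is a significant assumption; the function name, rule IDs, and labels are placeholders, not real identifiers from the run. It only detects the shared prerequisite; deciding which rule is "most fundamental" is left to the architect.

```python
from collections import defaultdict


def shared_root_causes(post_mortems: dict[str, str],
                       min_rules: int = 2) -> dict[str, list[str]]:
    """Group failed rules by root-cause label; keep labels shared by >= min_rules rules."""
    by_cause: dict[str, list[str]] = defaultdict(list)
    for rule_id, cause in post_mortems.items():
        by_cause[cause].append(rule_id)
    return {
        cause: sorted(rules)
        for cause, rules in by_cause.items()
        if len(rules) >= min_rules
    }


# Usage with placeholder rule IDs and root-cause labels.
post_mortems = {
    "aide_rule_a": "aide_database_missing",
    "aide_rule_b": "aide_database_missing",
    "banner_rule": "scanner_semantic_gap",
}
blocked = shared_root_causes(post_mortems)
print(blocked)
```

Run after the second failure in a group, a check like this would let the architect pause the remaining dependents instead of letting each one spend ~1000s rediscovering the same prerequisite.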
What the v3 fixes didn't fix (and shouldn't)
Five rules are architecturally impossible at runtime:
partition_for_var (disk layout), grub2_disable_interactive_boot
(read-only boot partition), sysctl_kernel_kexec_load_disabled
(kernel compile-time), kernel_module_atm_disabled (compiled into
kernel), installed_OS_is_vendor_supported (vendor subscription).
The architect correctly identifies these in 2–5 attempts and
escalates preemptively. This is the right behavior — fast fail on
impossible tasks.
The Ralph loop observation
The harness applied the Ralph loop to 120 STIG rules. I'm now applying the Ralph loop to the harness itself:
- Fail — v2 ran overnight and remediated 2 rules.
- Diagnose — the overnight postmortem found 5 architectural flaws.
- Fix — v3 implemented all 5 fixes.
- Observe — v3 ran overnight and remediated 93 rules.
- Diagnose — the analysis above found 3 more architectural patterns.
- Next — v4 will address conversation management, evaluation triage, and dependency awareness.
This is the meta-pattern. The same persistence-and-reflection discipline that makes the harness work on STIG rules also makes the harness itself improvable, on the same cadence.
Related
- journey/14: the first overnight run postmortem that produced the v3 fix pass.
- journey/17: the five fixes, in sequence.
- architecture/01: the failure-mode taxonomy. Finding 1 is a new instance of FM-3 (context overflow). Finding 2 extends FM-1 (misdiagnosis). Finding 3 is a new failure mode: prerequisite blindness.