
Run 4: When the Dream Pass Passes the Wrong Test

The verdict

Fix rate 56.18% — 3.4 percentage points below Run 3. Win-to-regression ratio collapsed from 1.4:1 to 0.47:1, the first time regressions have outpaced wins in this project. The audit_rules_dac_modification_* family — which in Run 3 succeeded on the first attempt — regressed to 4–7 attempts and escalated. Same model. Same VM baseline. Same skill. The only thing that changed between Run 3 and Run 4 was the cross-run lesson context. Strictly more accumulated knowledge made the system temporarily dumber on problems it had already solved.

This is exactly the failure mode any agent system that persists learned behavior as text has to guard against: aggregating signal at the wrong granularity makes the persistence layer actively misdirect the agent. We built a working version of that failure mode and then ran it as our V1.

The dream pass works as plumbing. The algorithm's concept — outcome-driven credit assignment — is right. The V1 granularity is too coarse: category-level credit lumps together rule families with wildly different difficulty profiles, and the interaction with NULL-confidence new lessons and within-run feedback produces net-negative aggregate gain. The rest of this entry is the mechanism, and the V2 granularity that answers it.

Why this is its own entry

Entry 27 was the build. Entry 28 is the verdict. The V1 algorithm's coarseness has measurable, traceable consequences in the data. Documenting that pattern now, while the evidence is fresh, is more valuable than letting the result fade into a single number on a chart.

What we expected

Going in, the hypothesis was clean: the dream pass would deprioritize lessons from low-outcome categories (audit, banner) and promote lessons from high-outcome categories (service-config, authentication, kernel). The Worker's prompt would carry better-targeted context. First-try success would rise, escalations would shrink, fix rate would climb past 60%.

The hypothesis was about the aggregate. The actual mechanism turned out to be more subtle than the aggregate could capture.

The numbers

A quick operational note: at the 9-hour mark Run 4 read 70% fix rate, but that was a stage-of-run artifact — the easy front (service-config, authentication, kernel) runs first; the audit tail (74 of the remaining 82 rules) historically converts at 33%. Comparing fix rates mid-run is meaningless if the run hasn't reached the same point in the difficulty distribution. Final numbers only.

The honest per-rule comparison against Run 3, all 251 rules attempted by both runs:

Metric                                                    Run 3     Run 4 (final)   Δ
Aggregate fix rate (completed / (completed+escalated))    59.52%    56.18%          −3.4pp
First-try success rate                                    51.2%     49.0%           −2.2pp
Avg attempts on completed                                 1.33      1.30            −0.03
Avg attempts on escalated                                 4.73      4.39            −0.34
Wall time per completed                                   55.4s     54.9s           ≈0
Wall time per escalated                                   426s      392s            −8%

Only one of these is real: escalation attempts down 0.34 (−8% wall time). The Worker is giving up on dead ends faster, which suggests the lessons it has are routing it away from doomed approaches more efficiently. That is the dream pass doing its job, narrowly. Aggregate fix rate and first-try success both regressed — the narrow efficiency win did not translate into more rules getting fixed.
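For reference, the metrics in the table can be recomputed from per-rule result records. A minimal sketch, assuming a hypothetical record schema with status and attempts fields (not the harness's actual format):

```python
# Sketch: recomputing the per-run comparison metrics from per-rule
# records. The record schema here is an assumption for illustration.

def summarize(records):
    completed = [r for r in records if r["status"] == "completed"]
    escalated = [r for r in records if r["status"] == "escalated"]
    attempted = len(completed) + len(escalated)
    return {
        # completed / (completed + escalated), as in the table above
        "fix_rate": len(completed) / attempted,
        # first-try success over all attempted rules
        "first_try_rate": sum(1 for r in completed if r["attempts"] == 1) / attempted,
        "avg_attempts_completed": sum(r["attempts"] for r in completed) / len(completed),
        "avg_attempts_escalated": sum(r["attempts"] for r in escalated) / len(escalated),
    }
```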

The wins, and what they actually show

Eight rules went from escalated in Run 3 to remediated in Run 4. The dramatic ones:

  • aide_build_database: 8 attempts and 564s in Run 3 → 1 attempt and 72s in Run 4.
  • kernel_module_sctp_disabled: 3 → 1, 242s → 32s.
  • sshd_enable_warning_banner: 4 → 2, 418s → 103s.
  • audit_rules_privileged_commands_chage: 8 → 3, 651s → 146s.
  • networkmanager_dns_mode: 7 → 4, 584s → 226s.
  • audit_rules_file_deletion_events_unlink: 5 → 3, 400s → 180s.
  • use_pam_wheel_for_su: 4 → 3, 361s → 153s.
  • sudo_remove_nopasswd: 5 → 8, 527s → 689s. (Won, but slower.)

A wins-decomposition diagnostic against the lessons available at each rule's firing time finds the picture more nuanced than "the dream pass promoted the right lessons":

  • Three wins (use_pam_wheel_for_su, kernel_module_sctp_disabled, aide_build_database) are clearly knowledge-driven — the top-ranked retrievable lessons were directly applicable, often from the same rule or an analogous one.
  • Two wins (the audit unlink and chage) are best explained by within-run lesson accumulation — the harness had just been hammering setxattr, and the augenrules/rules.d insight transferred to other audit rules.
  • Two wins (networkmanager_dns_mode, sshd_enable_warning_banner) look accidental: the top-ranked lessons in their categories were about completely unrelated rules (chronyd config, RPM database corruption).
  • One is mixed: sudo_remove_nopasswd had relevant knowledge but still took more attempts than Run 3.

The mechanism produced wins, but only some of them by design.

The mechanism: audit_rules_dac_modification

Seventeen rules went from remediated in Run 3 to escalated in Run 4. Eight of them are audit rules, and three are specifically audit_rules_dac_modification_* — a family that previously succeeded at 1–4 attempts.

  • audit_rules_dac_modification_fchmod: 1 attempt, 44s, completed → 6 attempts, 567s, escalated.
  • audit_rules_dac_modification_fchmodat: 4 attempts, 214s, completed → 7 attempts, 552s, escalated.
  • audit_rules_dac_modification_fchown: 1 attempt, 41s, completed → 4 attempts, 348s, escalated.

Several audit_rules_unsuccessful_file_modification_* rules also regressed, fitting the same pattern of audit-subfamily-specific knowledge being washed out by the dream-pass-induced category penalty.

Reading the JSONL trace for audit_rules_dac_modification_fchmod makes the mechanism plain. In Run 3, the Worker hit /etc/audit/rules.d/audit.rules with a heredoc on attempt 1 and the rule passed. In Run 4, the Worker spent attempts 1–2 hammering /etc/audit/audit.rules (which augenrules regenerates from rules.d/) and the Reflector — not the prompt — had to teach it about the rules.d mechanism on attempt 2. The only thing that changed between Run 3 and Run 4 was the cross-run lesson context the Worker received.
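This kind of trace forensics is mechanical. A sketch that tallies per-rule attempt counts from a JSONL trace; the field names ("event", "rule") are assumptions about the trace schema, not the harness's actual format:

```python
import json
from collections import defaultdict

# Sketch: per-rule attempt counts from a JSONL trace.
# Field names are illustrative assumptions.

def attempts_per_rule(jsonl_lines):
    counts = defaultdict(int)
    for line in jsonl_lines:
        record = json.loads(line)
        if record.get("event") == "attempt":
            counts[record["rule"]] += 1
    return dict(counts)
```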

The audit_rules_immutable cascade is not the cause: that rule had not yet been processed in Run 4 at the time of the dac_modification regressions. The VM was in the same state both runs. The damage came from the lessons table, not the system state.

Three compounding architectural failures

All three are traceable in the data:

  1. Category-level credit is too coarse. Audit got -0.35 confidence because of the immutable cascade and other noisy audit subfamilies (privileged_commands, etc.). The proven rules.d/heredoc lesson from Run 3, despite being load-bearing for the dac_modification family, got demoted along with the bad audit lessons. Composite score for a weight-1.0 audit lesson dropped from 1.0 (pre-dream) to 0.325 (post-dream).

  2. NULL-confidence new lessons outrank dream-penalized old lessons. New lessons saved during Run 4 itself carry NULL confidence, which the composite ranking treats as a neutral 0.5 multiplier. A NULL-confidence weight-0.55 lesson scores 0.275; a dream-penalized weight-1.0 audit lesson scores 0.325 — close enough that fresh failure-derived lessons from Run 4 itself can flood the per-category top-5 by sheer count. At the moment the dac_modification rules fired, all top 15 audit lessons by composite score were from Run 4, all NULL-confidence, all describing Run 4's own current failures: "augenrules failed, rules.d failed, auditctl failed."

  3. Within-run feedback loop has no damping. The harness saves new lessons during a run from failed attempts. Those new lessons enter the prompt context for subsequent rules in the same run. When the dream pass demotes the prior run's success lessons, the within-run negative lessons drown them out — turning the prompt into a "here is what doesn't work" list with no surviving "here is what does." The Worker's first attempt is now actively misdirected by the run's own struggles.

Each one of the three is a real bug. Together they produced the dac_modification regressions.
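The arithmetic in failures 1 and 2 follows from a single scoring rule. A minimal sketch, assuming the composite score is weight times a confidence multiplier, with confidence in [-1, 1] mapped to [0, 1] via (1 + c)/2 and NULL treated as a neutral 0.5; this mapping reproduces both numbers quoted above, though the harness's exact formula may differ in detail:

```python
# Sketch of the composite ranking described above. Assumption: score =
# weight * multiplier, where confidence c in [-1, 1] maps to (1 + c) / 2
# and NULL confidence is treated as a neutral 0.5.

def composite_score(weight, confidence):
    multiplier = 0.5 if confidence is None else (1 + confidence) / 2
    return weight * multiplier

# Dream-penalized weight-1.0 audit lesson (category confidence -0.35):
# scores 0.325. Fresh NULL-confidence weight-0.55 lesson from the
# current run: scores 0.275. Within 0.05 of each other, so sheer count
# of fresh failure-derived lessons can crowd the per-category top-5.
```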

The trajectory

Comparison               Wins   Regressions   Ratio
Run 1 → Run 2            59     1             59:1
Run 2 → Run 3            14     10            1.4:1
Run 3 → Run 4 (final)    8      17            0.47:1

Cross-run learning's win:regression ratio is collapsing run over run. Run 4 is the first run where regressions outpace wins by more than 2-to-1, and the regressions track to a specific architectural choice made between runs. The dream pass V1 produced a net aggregate loss — and in doing so, it taught us something specific about what V2 has to do.

What V2 needs

The path forward is concrete now in a way it wasn't before Run 4:

  1. Per-rule lesson attribution (logging). The harness must log which specific lesson IDs were assembled into each rule's Worker prompt. We were able to reason about the dac_modification regression because the JSONL trace preserved the Worker's actual fix scripts; we could not have proven the lesson-displacement mechanism without that. Per-prompt lesson logging would let the dream pass do real per-lesson credit assignment in V2. Highest priority.

  2. Fix the NULL-vs-penalty ranking flaw. A NULL-confidence lesson is treated as neutral (multiplier 0.5), which makes new lessons rank above dream-penalized older ones. Two viable fixes: (a) treat NULL as "use the source category's average confidence" rather than 0.5, so new audit lessons inherit the audit penalty until proven otherwise; or (b) cap NULL at the multiplier of the lowest-scoring category (so new lessons can never outrank dream-curated ones). Either closes the displacement loophole.

  3. Damp within-run lesson creation OR exclude same-run lessons from prompt context. A run's own failed-attempt lessons should not be promoted to outrank prior-run successful lessons. Two options: (a) exclude source_run_id == current_run_id from the prompt-time lesson load (only cross-run lessons are loaded); or (b) save new lessons with confidence=-0.3 (slightly negative) by default, since they were derived from failed attempts. Option (a) is cleaner.

  4. Per-rule (or per-rule-family) credit, not category-level. When a lesson from category C is loaded into rule R's prompt and R succeeds, the lesson should accrue confidence specifically — independent of whether other rules in C succeeded. Audit category contains both dac_modification_* (solvable) and rules_immutable (cascade trigger); penalizing them together is incorrect. Per-rule-family categorization (subdividing audit into audit_dac, audit_immutable, audit_privileged_cmd, etc.) would give the dream pass a meaningful unit to score against.

  5. Don't suppress; demote. The current composite ranking makes a confidence −1 lesson score zero. Too aggressive. A floor of, say, 0.1 (instead of 0.0) keeps demonstrated-useful lessons in the candidate pool even when their category is troubled.

  6. Honest reflection in the journal. When V1 of an architectural change produces a result like Run 4's, the move is not to defend the design — it is to say what it taught and what V2 needs. That is what this entry is for.
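Fixes 2(a), 3(a), and 5 compose into a single ranking change. A sketch under assumed names and schema (illustrative, not the harness's actual code):

```python
# Illustrative sketch combining V2 fixes 2(a), 3(a), and 5.
# The lesson schema and function names are assumptions.

def rank_lessons(lessons, current_run_id, category_avg_conf, top_k=5):
    ranked = []
    for lesson in lessons:
        # (3a) a run's own failure-derived lessons never enter the prompt
        if lesson["source_run_id"] == current_run_id:
            continue
        conf = lesson["confidence"]
        if conf is None:
            # (2a) NULL inherits the category's average confidence, so new
            # audit lessons carry the audit penalty until proven otherwise
            conf = category_avg_conf.get(lesson["category"], 0.0)
        score = lesson["weight"] * (1 + conf) / 2
        # (5) floor at 0.1: demote troubled-category lessons, don't erase them
        ranked.append((max(score, 0.1), lesson))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [lesson for _, lesson in ranked[:top_k]]
```

With this ranking, a proven Run 3 lesson at confidence −0.35 still outranks a NULL-confidence newcomer that inherits the same category penalty, and same-run failure lessons are excluded outright.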

What this means for the broader story

Run 4 didn't plateau at Run 3 — it dropped 3.4 percentage points below it. That is the empirical confirmation that memory accumulation alone, even with a curation layer, can actively misdirect the system without per-instance attribution. The whitepaper said this might happen and named it as the limitation that production systems would have to solve. Run 4 confirmed it experimentally, on the harness we built, with the dream pass we shipped — and produced a negative aggregate result, not a flat one.

The thesis that the agentic harness shapes outcomes still holds. The thesis that memory curation is necessary still holds. What Run 4 added is a sharper claim: memory curation done at category granularity, with no per-(rule, lesson) causal accounting, can produce negative aggregate gain on workloads with mixed-difficulty subfamilies inside a category. The granularity has to match the causal structure of the work, and on STIG that means at least at the rule-family level — and probably at the per-(rule, lesson) level for any meaningful credit assignment.

This is the story to take into the whitepaper update. Not "dream pass works → fix rate jumps." The actual story: dream pass V1 works as plumbing, V1 granularity is empirically not just too coarse but actively misdirecting, V2 is now empirically scoped. The honesty is the strength.


Related entries

  • journey/27 — the build that produced the V1 algorithm.
  • journey/26 — the architectural decision behind the dream pass.
  • journey/25 — the run that motivated the dream pass; topped out at 60%.
  • adr/0016 — the memory architecture decision.