Run 6: The Ordering Constraint Works, The Runtime Doesn't¶

Fix rate 61.9% against Run 5's 56.3% — a 5.6-point lift, more than enough to call the architectural shipping from entry 32 a success. The audit_rules_immutable cascade that dragged Run 5 down is gone. audit_rules_immutable was picked at position 84 of 84 audit rules this time, at t=19.07h, with zero audit rules scheduled after it. In Run 5 it was picked at position 11 of 83 at t=8.62h; 72 audit rules ran with the kernel locked. The skill-declared category_nearly_complete constraint did exactly what entry 32 predicted.

What happened after that is where Run 6 started to tell a different story.

The scorecard¶

Metric	Run 5	Run 6	Δ
Total rules	254	247	—
Remediated	143	153	+10
Escalated	111	94	−17
Skipped	15	23	+8
Fix rate	56.3%	61.9%	+5.6 pp
First-try of wins	87%	86%	flat
`rules_deferred` events	—	269	(new mechanism)
Mechanism-field acceptance	n/a	100% (781/781)	—
Scanner gaps	205	391	+186
PIVOT reengagements	120	327	+207
ESCALATE reengagements	110	93	−17
Tips emitted	612	781	+169
Auto-consolidation	—	dream ✓ + 356 evicted	(new)
Wall time	14.3h	19.1h	+4.8h

4 of 4 shipped changes landed clean. The last row is the problem.

The cascade is closed¶

Run 5's audit category was a graveyard: audit_rules_immutable locked the kernel at 8.6h and 58 of the remaining 83 audit rules failed through no fault of their own. Run 6's ordering constraint deferred immutable until the audit category had ≤1 rule remaining, meaning immutable itself was that last rule. It ran at 19.07h against a fully-processed audit pool.

Run 5 had 72 audit rules scheduled after immutable. Run 6 had zero. The cascade isn't reduced — it's gone.

The payoff is visible in the dac_modification family specifically. In Run 5, 12 of 14 rules cascade-failed at 2 attempts each (scanner_gap_detected with a locked kernel). In Run 6, those same 14 rules got their real shot:

audit_rules_dac_modification_fchmodat: R5 cascade-escalated at 2 → R6 first-try win
audit_rules_dac_modification_umount2: R5 cascade-escalated at 2 → R6 first-try win
audit_rules_file_deletion_events_rename: R5 cascade-escalated at 2 → R6 first-try win
audit_rules_file_deletion_events_renameat: R5 cascade-escalated at 2 → R6 first-try win
audit_rules_dac_modification_fchmod: R5 cascade-escalated at 2 → R6 won at attempt 7 (real grinding)

And the ones that stayed hard:

audit_rules_dac_modification_fchown: R5 first-try → R6 escalated at attempt 5
audit_rules_dac_modification_fchownat: R5 first-try → R6 escalated at attempt 5

That pair is a real regression — two rules Run 5 won first-try, Run 6 couldn't close in five attempts. The ordering constraint isn't blame here (immutable didn't run until t=19h). Something else is making these rules hard. A question for Run 7's prompt work.

Cross-run aggregate: 17 progressions vs 8 regressions → +9 rules net. Audit category specifically: 7 progressions, 3 regressions → net +4.

The mechanism field did what we wanted¶

781 tips emitted across the run. 100% had the mechanism field populated — zero parser drops. The Reflector learned the requirement by attempt 1 and held it through 20 hours of inference.

Tip-type distribution flipped exactly as entry 32 predicted:

Type	Run 5	Run 6
strategy	504 (82%)	729 (93%)
warning	106 (17%)	34 (4%)
recovery	0	18 (2%)
optimization	2	0

The mechanism-field requirement filtered out the "X failed" no-mechanism noise that Run 5's recovery tips suffered from. What remains is mostly positive-outcome strategy tips (first-try wins → success-mode Reflector), with a small share of warnings and mechanism-bearing recovery tips. The tip corpus coming out of Run 6 is higher signal per emission than Run 5's was.

One unforced data point: the live smoke of the CVE skill (entry 35, not yet written) emitted one success-mode tip on first try, with a valid mechanism. The field's generalizes outside STIG's linguistic register on first exposure.

Auto-consolidation works, and it's closing the loop¶

The consolidation_complete event fired cleanly at run-end:

Dream pass: 14 categories analyzed, 3,472 lessons updated, 8 positive-credit / 5 negative / 1 neutral. Environment-tagged baseline-20260419.
Eviction: 3,249 active tips at start, 487 had ≥3 outcomes, 356 retired (74% of eligible). Active count down to 2,893.

Run 5 ended with 2,973 active tips and zero eviction ever run. Run 6 ended with 2,893 active tips and 356 newly retired. The consolidation loop is closing itself now.

Five tip_retired events surfaced within Run 6 itself from a prior manual sweep, which is the dashboard's "here's what was retired since last run" signal. That's the auto-consolidation feeding the telemetry layer.

The runtime problem confirmed¶

Wall time went from 14.3h (Run 5) to 19.1h (Run 6). +4.8 hours for +10 remediated rules. Per-hour rule throughput:

Run 5: 17.8 rules/h · 10.0 remediated/h
Run 6: 12.9 rules/h · 8.0 remediated/h

Throughput dropped 28%. Let the hourly curve tell the rest:

Hours	Remediations
0–4	94 (avg 18.8/h)
5–9	40 (avg 8.0/h)
10–14	9 (avg 1.8/h)
15–19	10 (avg 2.0/h)

The first five hours produced 61% of all wins. Hours 10–19 — nearly half the runtime — produced 12% of wins. Marginal returns collapse after the easy categories are done, and the loop keeps spending 20-minute wall budgets on rules that are never going to remediate.

Contributing factors:

Architect reengagement up 173% (120 PIVOT → 327 PIVOT). More active architect is good for quality, costly for wall time.
Scanner gaps up 91% (205 → 391). More often the loop hits "many approaches failing with healthy target," which triggers reengagement, which runs the Architect again.
Hard audit tail (enabled by ordering constraint — the last half of audit category runs serially now rather than cascade-failing fast).

The honest read: the ordering constraint traded "fast cascade failure" for "slow real grinding." The fix rate gain is real, but it cost hours of wall time on rules that still failed. Not every audit rule is cascade-doomed; some are just hard. The loop doesn't distinguish.

What Run 6 tells us about what to fix next¶

Three things surfaced:

Per-rule budget should adapt to historical outcome. A rule that has escalated N times in prior runs probably doesn't deserve a fresh 20-minute budget. unsuccessful_file_modification_* is 0/6 across Run 5 and Run 6 — it should get a 2-minute budget (or an explicit skip) on Run 7. Harness-level policy with skill-declared category hints. This is Track A from entry 33's runtime discussion, but refined: the budget mechanism is skill-agnostic, the category-specific numbers are skill-declared.
Tip decay on failing rules. Run 5's post-mortem already identified this: if a tip has been retrieved N times with zero outcome, drop it. Wasn't implemented for Run 6 because it's skill-agnostic architecture work that benefits more from two-skill evidence. Run 7 is the right time.
Scanner-gap threshold may be tuned too tight. 391 gap events in Run 6 is almost double Run 5. If the threshold is triggering too eagerly, the Architect is getting re-engaged on rules that might have resolved in another attempt or two. Worth an ablation: what if scanner_gap_threshold goes from 3 distinct approaches to 4? How does that shift the PIVOT rate and the wall time?

None of these are CVE-blockers. The CVE MVP landed clean today; its runtime behavior won't be known until a full CVE run produces similar hourly-throughput data. The discipline from entry 33 still holds: don't design Track A against one-skill evidence.

Entry 33 claims, now scored¶

The four claims entry 33 put on the record about what a second skill would test:

#	Claim	Status
1	Harness generalizes — zero edits to `gemma_forge/harness/*.py` to add CVE	Held. Added one ordering predicate (wildcard dispatch), one FailureMode enum value, and a skill-dir map entry. All mechanism work, no CVE-specific logic in the loop.
2	Ordering-constraint mechanism extends cleanly	Held. `deferrable_reboot` is a sibling predicate under the existing `defer_until` schema. Wildcard `rule_id: "*"` was added to the filter dispatch, not a schema change.
3	Tip quality generalizes	Preliminary held. 1/1 CVE tips emitted had valid mechanism (smoke run). Full run will tell.
4	Autonomous reboot-verify loop is novel	Held. Deferrable-reboot predicate + CveEvaluator's NEEDS_REBOOT verdict route the pattern. Verified in runtime wiring; will exercise on a full CVE run with kernel RLSAs in the pool.

The bet I most wanted to win — fix rate lift — landed. Not the 65% unlock some part of me was hoping for; not the neutral result I feared. Six points up, with a provable architectural cause.

The narrower post-mortem question¶

Ken asked if Run 6's post-mortem would surface anything that should change the CVE skill direction before we push through. The honest answer is no. Every architectural claim Run 6 tested landed. The CVE skill inherits those mechanisms and the same plumbing that worked in Run 6 worked in the CVE MVP smoke three hours ago. The runtime concern is real but orthogonal — it applies to both skills and the fix is the same skill-agnostic Track A work either way.

What Run 6 gave CVE is the confidence that mechanism-field, ordering-constraints, and auto-consolidation are stable primitives. Building the CVE skill on those foundations was the right call; the post-mortem validates it retrospectively.

journey/33 — the pivot that happened alongside this run. Claims made there, scored here.
journey/32 — where the ordering constraint, mechanism field, and auto-consolidation shipped.
journey/31 — Run 5's post-mortem; the baseline against which Run 6 is measured.
docs/research/cve-agent-landscape-2026-04.md — landscape research captured today. Nothing in it changes based on Run 6.
deferred.md — Track A tunings for runtime (per-rule budget adaptation, tip decay, scanner-gap threshold) are the natural follow-on. Not landed as a DEF yet because their design needs the CVE-run data before it's meaningful.

Run 6: The Ordering Constraint Works, The Runtime Doesn't¶

The scorecard¶

The cascade is closed¶

The mechanism field did what we wanted¶

Auto-consolidation works, and it's closing the loop¶

The runtime problem confirmed¶

What Run 6 tells us about what to fix next¶

Entry 33 claims, now scored¶

The narrower post-mortem question¶

Related¶