The Second Skill: CVE Response, and What the Research Told Us
Six runs. One skill. Every architectural conclusion we've drawn from this project — V2 memory works, ordering constraints close cascades, mechanism-field filtering helps, reflexion loops pay off — rests on a single data distribution. DISA STIG on Rocky 9. When the skill-agnostic harness thesis goes on the record in the whitepaper, it goes on one data point.
That was fine for the first six runs; the bet was "let's prove one skill works end-to-end before we complicate the picture." It is emphatically not fine anymore. Every "Track A" runtime optimization we were about to ship this week was going to be tuned against STIG's failure modes, then published as if it were a harness improvement.
The pivot: CVE response becomes the second skill, starting today. Not because the roadmap says so. Because there's no other way to know which of our improvements are architectural and which are skill-specific shortcuts.
The thing we almost did
The original Monday plan was "per-category wall-time budgets." Easy categories get short budgets, hard categories get long ones. `unsuccessful_file_modification` has gone 0/6 across two runs, so give it 2 minutes instead of 20. Projected savings: 3-5 hours of wall-clock per run. Real win.
`unsuccessful_file_modification` is a STIG category name. So is `audit_rules_dac_modification_*`. So is `service_kdump_disabled`. We were an afternoon from baking STIG's rule-family vocabulary into the harness's budget policy, calling it "generic runtime tuning," and shipping it.
Every Track A optimization we had was skill-tuning dressed as architecture work. The second skill isn't a roadmap phase. It's the constraint that stops us from tuning ourselves into a corner.
The right shape is harness-level budget learning over skill-declared categories. The harness says "this category's p95 attempt count has been 4 in prior runs — cap it." CVE declares its own categories (kernel, library, web-service, kernel-reboot-required). The math runs unchanged. That's architectural. The values are skill-local.
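A minimal sketch of what "the math runs unchanged, the values are skill-local" could look like. The function names, the `min_samples` threshold, and the `default_cap` fallback are illustrative assumptions, not the harness's actual policy; the point is only that the code never inspects a category name.

```python
from collections import defaultdict

def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of integers."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based rank
    return ordered[rank - 1]

def learn_attempt_caps(history, min_samples=5, default_cap=10):
    """Derive a per-category attempt cap from prior runs.

    history: iterable of (category, attempts_used) pairs.
    Categories with fewer than min_samples observations keep default_cap
    (both numbers are hypothetical placeholders).
    """
    by_category = defaultdict(list)
    for category, attempts in history:
        by_category[category].append(attempts)
    return {
        category: p95(counts) if len(counts) >= min_samples else default_cap
        for category, counts in by_category.items()
    }

# Two skills feed the same policy; the category strings are opaque to it.
stig_history = [("audit_rules_dac_modification", n) for n in (1, 2, 2, 3, 4)]
cve_history = [("kernel-reboot-required", n) for n in (1, 1, 2)]
caps = learn_attempt_caps(stig_history + cve_history)
```

With this made-up history, the STIG category gets a learned cap and the CVE category falls back to the default until it has enough samples of its own.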
But we can't design that policy without seeing it operate on two skills. Running it against one is just another round of STIG-specific overfit. So the second skill lands first, then the runtime work.
What 90 minutes of research changed
The prior-art search we spun up this morning was expected to turn up scattered papers and commercial tools. It found ATLANTIS.
ATLANTIS — Team Atlanta won the DARPA AIxCC Finals at DEF CON 33 in August 2025. $4M prize. The Cyber Reasoning System is open source under Apache, paper at arXiv 2509.14589, repo at github.com/Team-Atlanta/aixcc-afc-atlantis. Their architecture is close to ours: LLM orchestration plus per-language specialist agents plus program analysis plus K8s task isolation. They solved a different problem — vulnerability discovery and patch generation in source code — but the architectural patterns generalize. We can read their paper before we design, converge on the parts that match, and explicitly diverge on the parts that don't. Reference architecture without the cost of inventing one.
CVE-Bench — 509 real CVEs, NAACL 2025, aclanthology.org/2025.naacl-long.212. A deeper read says: 100% source-code patch benchmark across Python / Java / JS / PHP repos. Zero kernel, glibc, openssl, httpd, openssh, systemd — zero RPM-delivered components. The 21% SWE-agent baseline is actually against 15 Python CVEs in a subset. The benchmark doesn't fit our shape. That is not a setback, it's a positioning statement: prior work addresses the application-source-patch regime; we address the orthogonal operator regime of advisory-driven package remediation on running hosts. We'll build our own reproducible corpus (Vuls-before vs Vuls-after on Rocky 9 with a known RHSA set) and cite CVE-Bench as adjacent but different.
Vuls — vuls.io. Agentless SSH. Rocky / RHEL / Ubuntu / SUSE / Debian / Amazon / Oracle. Pulls NVD + OVAL + RHSA. Outputs JSON. The scan tool we were about to build already exists. The multi-distro dynamic scanning we were considering "too much for MVP" is a config block in Vuls. The entire scanner work we scoped at 2-3 hours collapses to "consume JSON."
NVD rate limits — 5 requests per 30 seconds unkeyed. Would absolutely have bitten us on Run 3 if we'd rolled our own scanner. Vuls handles this transparently via a local mirror. Another reason not to build what already exists.
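To put a number on "would have bitten us," here is a back-of-envelope lower bound on wall time under the unkeyed limit. It treats the limit as a fixed window (a simplifying assumption), and the 1,000-lookup scan size is a hypothetical figure, not a measurement from our runs.

```python
import math

UNKEYED_REQUESTS_PER_WINDOW = 5  # the unkeyed limit quoted above
WINDOW_SECONDS = 30

def min_wall_seconds(total_requests,
                     per_window=UNKEYED_REQUESTS_PER_WINDOW,
                     window=WINDOW_SECONDS):
    """Lower bound on the time needed to issue total_requests.

    The first window's requests go out immediately; each further full or
    partial window adds another `window` seconds of waiting.
    """
    if total_requests <= 0:
        return 0
    windows_needed = math.ceil(total_requests / per_window)
    return (windows_needed - 1) * window

# Hypothetical scan that needs 1,000 per-CVE lookups.
seconds = min_wall_seconds(1000)
minutes = seconds / 60
```

At that assumed size, the rate limit alone accounts for well over an hour before any parsing or remediation happens.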
Commercial landscape — GitHub Copilot Autofix and Snyk Agent Fix are the serious entrants. Both generate source-code patches. Neither closes the apply-and-verify loop on a running host. The agentic host-state remediation niche is untouched.
The claims we're about to try to make good on
Four of them, each falsifiable:
- The harness generalizes. A skill with a different rule vocabulary (`CVE-2025-12345` rather than `xccdf_org.ssgproject...`), a different scanner (Vuls JSON rather than oscap XML), and a different remediation tool (`dnf upgrade --advisory=X` rather than a shell script) runs through the existing Ralph loop with manifest edits, a runtime class, and prompts, with no harness code changes beyond optional extension points. If this claim fails, it fails visibly: the harness code gets edited to accommodate CVE-specific logic, and the thesis is wrong.
- The ordering-constraint mechanism extends cleanly. STIG's case used `category_nearly_complete`. CVE needs `deferrable_reboot`: "apply all reboot-required advisories last, then reboot once, then re-verify." That's a sibling predicate under the existing `defer_until` schema. If the schema has to change, the mechanism wasn't really general; it was STIG-shaped and we didn't notice.
- Tip quality generalizes. The mechanism-field change landed 100% parser acceptance across 716+ STIG tips. If CVE lands meaningfully lower (say, 70%), the fix was helping Gemma write system-admin mechanism text, not causal mechanism text. We'll find out on the first CVE run that emits enough tips.
- The autonomous reboot-verify loop is novel. The research found no published pattern for "agent applies patch → detects reboot-required → issues reboot → re-scans → verifies." MDR/XDR products treat reboot as operator action. Ansible has the mechanics (`needs_restarting`, `reboot`, `wait_for_connection`) but not the autonomous loop. If we build a clean version of this, it's a whitepaper-worthy claim. If the implementation is ugly or unreliable, it isn't.
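The deferred-reboot ordering and the reboot-verify loop from the claims above can be sketched together. This is not the harness's `defer_until` schema. `Advisory`, `order_reboots_last`, and the injected `apply` / `reboot_and_wait` / `rescan` callables are hypothetical stand-ins, and the `ADV-n` identifiers are placeholders. Host access is injected so the control flow can be exercised without a VM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set

@dataclass(frozen=True)
class Advisory:
    advisory_id: str
    needs_reboot: bool

def order_reboots_last(advisories: Iterable[Advisory]) -> List[Advisory]:
    """Stable sort: reboot-free advisories first, reboot-required ones last."""
    return sorted(advisories, key=lambda a: a.needs_reboot)

def remediate_with_single_reboot(
    advisories: Iterable[Advisory],
    apply: Callable[[Advisory], None],
    reboot_and_wait: Callable[[], None],
    rescan: Callable[[], Set[str]],
) -> Set[str]:
    """Apply every advisory, reboot at most once, then re-scan.

    Returns the advisory IDs from the plan that the final scan still reports.
    """
    plan = order_reboots_last(advisories)
    for advisory in plan:
        apply(advisory)
    if any(a.needs_reboot for a in plan):
        reboot_and_wait()
    still_open = rescan()
    return still_open & {a.advisory_id for a in plan}

# Fake host: applying an advisory simply clears it from the pending set.
events: List[str] = []
pending = {"ADV-1", "ADV-2", "ADV-3"}

def fake_apply(advisory: Advisory) -> None:
    events.append(f"apply {advisory.advisory_id}")
    pending.discard(advisory.advisory_id)

def fake_reboot() -> None:
    events.append("reboot")

remaining = remediate_with_single_reboot(
    [Advisory("ADV-1", True), Advisory("ADV-2", False), Advisory("ADV-3", True)],
    apply=fake_apply,
    reboot_and_wait=fake_reboot,
    rescan=lambda: set(pending),
)
```

A real version would need failure handling between apply and reboot (snapshot revert, a timeout on reconnect), which this sketch leaves out on purpose.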
The defensibility question we had to answer before committing
We ran a rigorous check of the "nobody's doing this" claim before the build starts, because if the claim doesn't survive a devil's-advocate pass, we shouldn't waste the week. The landscape we surveyed, commercial and academic:
| Vendor class | What they market | What they actually ship |
|---|---|---|
| Qualys Agent Val, Tanium Closed-Loop | "autonomous remediation" | Validate + prioritize + human-approves, then deterministic policy execution |
| CrowdStrike Charlotte, Microsoft Security Copilot | "AI agents" | Triage + investigation, not patch application |
| Automox / Action1 / Ivanti Neurons / NinjaOne | "AI-assisted patching" | Policy-based scheduled execution; LLM is not in the decision path |
| SOAR platforms (XSOAR, Splunk, Torq) | "agentic playbooks" | Ticketize + orchestrate existing patch tools |
| MDR/XDR (SentinelOne, Cortex XDR) | "autonomous response" | Isolate/kill/quarantine, not package patching |
| Academic (CVE-Bench, AutoPatch, CodeMender, ATLANTIS) | Various | All source-code or IaC patching, not host-state remediation |
The most telling find: Red Hat Lightspeed MCP is explicitly read-only per their own blog — "to avoid introducing additional risk with autonomous AI decision-making." The vendor of the distro we're remediating has publicly chosen not to build this. Their rationale applies to their customer base: enterprise Fortune 500 with dense human staffing and support-contract liability exposure. That rationale doesn't translate to every customer.
The gap in the literature isn't "nobody thought of this." It's "serious players looked at this and stepped back from the execution layer." That's a different, stronger positioning for us: the territory is unclaimed not because nobody noticed it, but because the players who did notice chose to leave it alone. Which means the bar is: produce something defensible enough that the reasons those players stepped back don't apply to the regime we target.
The regime where this is defensible, and the regime where it isn't
In scope — Federal edge deployments: forward operating base, ruggedized/SCIF, classified enclave, edge compute at sensor sites. Shared characteristics: remote, under-staffed, infrequent expert access, real CVE exposure across the gaps. Current operational reality: Critical CVEs sit unpatched for weeks because the alternative to "autonomous remediation" isn't "human-driven remediation" — it's "no remediation until the next planned touch." CISA's KEV catalog exists precisely because this happens.
Out of scope — mainstream enterprise fleets: Fortune 500 SOCs with dense staffing, predictable change-management windows, and existing rule-driven patch automation. Red Hat's audience, not ours.
Operator-role division: human at policy level (approves the run scope, the advisory set, the maintenance window), agent at execution level (per-CVE decisions within the approved scope). Humans don't lose authority — they lose tedium.
What makes the architecture defensible where prior vendors chose not to ship: every decision has provenance back to specific prior evidence (tip retrieval + source run + outcome); every action has revert (snapshot rollback, tested across 6 runs); every outcome has attribution back to the tips that informed it. Red Hat Lightspeed went read-only to avoid the risk of autonomous AI decision-making. Our answer to that risk is that every decision can be explained and reverted after the fact. That isn't "we're smarter"; it's "we shipped the provenance layer, they didn't."
Honest concession: the 31B ceiling
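One way to picture the provenance, revert, and attribution triple is as a single record per decision. The field names and the `explain` format below are illustrative assumptions rather than the harness's actual schema, and the identifiers in the example are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass(frozen=True)
class TipEvidence:
    tip_id: str
    source_run: str
    prior_outcome: str  # what happened in the run that produced the tip

@dataclass
class DecisionRecord:
    target: str            # e.g. a CVE or rule identifier
    action: str            # what the agent did
    revert_snapshot: str   # snapshot to roll back to if the action fails
    evidence: List[TipEvidence] = field(default_factory=list)
    outcome: str = "pending"

    def explain(self) -> str:
        """Human-readable, after-the-fact account of the decision."""
        cited = ", ".join(
            f"{e.tip_id} (run {e.source_run}, {e.prior_outcome})"
            for e in self.evidence
        ) or "no prior tips"
        return (f"{self.action} on {self.target}: informed by {cited}; "
                f"revert via {self.revert_snapshot}; outcome {self.outcome}")

record = DecisionRecord(
    target="CVE-2025-12345",
    action="dnf upgrade --advisory=X",
    revert_snapshot="snap-before-0001",
    evidence=[TipEvidence("tip-17", "run-5", "fixed")],
)
record.outcome = "fixed"
serialized = json.dumps(asdict(record))  # nested dataclasses become plain dicts
```

Keeping the evidence list on the record itself is what lets attribution run in both directions: from a decision to its tips, and from a tip to every outcome it informed.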
Gemma 4 31B is capable, not frontier. For high-stakes CVE decisions (complex dependency cascades, unusual configurations, contested rollback judgment) it's not going to match a 70B+ model. The architecture doesn't pretend otherwise — it supports escalation to a larger model or human-in-loop where the environment permits it. True edge deployments stay at 31B and accept bounded scope. Hybrid-edge deployments route hardest calls upward. Enterprise-adjacent deployments route to humans above that.
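The three deployment tiers can be read as a routing table. The sketch below is one interpretation under stated assumptions: the `difficulty` score, both thresholds, and the choice to mark hard calls on a true-edge host as out of scope (rather than attempt them locally) are hypothetical and not something the harness defines today.

```python
from enum import Enum

class Deployment(Enum):
    TRUE_EDGE = "true-edge"
    HYBRID_EDGE = "hybrid-edge"
    ENTERPRISE_ADJACENT = "enterprise-adjacent"

class Route(Enum):
    LOCAL_31B = "local-31b"
    LARGER_MODEL = "larger-model"
    HUMAN = "human"
    OUT_OF_SCOPE = "out-of-scope"

def route_decision(deployment, difficulty, hard=0.7, contested=0.9):
    """Pick who makes a decision.

    difficulty is a hypothetical 0..1 score; hard and contested are
    placeholder thresholds.
    """
    if difficulty < hard:
        return Route.LOCAL_31B          # routine call, every tier handles it locally
    if deployment is Deployment.TRUE_EDGE:
        return Route.OUT_OF_SCOPE       # bounded scope: skip and report
    if deployment is Deployment.ENTERPRISE_ADJACENT and difficulty >= contested:
        return Route.HUMAN              # hardest calls go to a person
    return Route.LARGER_MODEL           # otherwise escalate to a bigger model
```

The routine path is identical across tiers, which is the property the bounded-scope claim depends on.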
The harness doesn't assume a model scale. That's a claim we're making in the whitepaper, and DEF-16 in the registry captures the follow-on project that validates it: same harness, same memory architecture, progressively larger models (via build.nvidia.com API or local 2× A6000 hardware), measure which improvements are harness-contributed vs model-contributed. It's a separate project; this one's baseline becomes its starting point.
For the Gemma 4 thesis in this project: we claim the pattern is implementable and auditable at 31B for a bounded, honestly-scoped set of decisions. We don't claim 31B is enough for every high-stakes call.
What we're explicitly not doing
- Multi-distro on day one. Vuls makes the scan side free; the remediation side still needs per-distro knowledge (dnf vs apt vs zypper). MVP is Rocky 9 only. The code shape keeps the OS-detection and remediation-backend separable so Ubuntu slots in later.
- Source-code patching. That's Copilot Autofix's lane. We remediate running hosts via package management, not by writing diffs against source. Different problem, different audience.
- CVE-Bench SOTA chase. The benchmark turned out to be source-code-patch only (see above) so there's nothing to chase. Our scoring is Vuls-before / Vuls-after on a reproducible Rocky 9 corpus we build ourselves. That's a defensible methodology for a regime prior work hasn't covered.
- Live internet CVE data. Vuls mirrors NVD/OVAL locally. Our runs are reproducible; "network was flaky today" never becomes a postmortem variable.
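The Vuls-before / Vuls-after scoring mentioned above reduces to set arithmetic over two scan results. This sketch assumes each result exposes a top-level `scannedCves` object keyed by CVE ID, which is our reading of the Vuls JSON output and should be checked against a real result file. The `score` function and its field names are hypothetical, and the CVE IDs are placeholders.

```python
import json

def open_cves(scan_result: dict) -> set:
    """CVE IDs reported by one scan (assumes a 'scannedCves' object keyed by ID)."""
    return set(scan_result.get("scannedCves", {}))

def score(before: dict, after: dict) -> dict:
    """Compare two scans of the same host taken before and after a run."""
    b, a = open_cves(before), open_cves(after)
    closed = b - a
    return {
        "closed": sorted(closed),
        "remaining": sorted(b & a),
        "introduced": sorted(a - b),  # regressions, e.g. from a dependency bump
        "closure_rate": len(closed) / len(b) if b else 1.0,  # nothing to close
    }

before = json.loads(
    '{"scannedCves": {"CVE-2025-0001": {}, "CVE-2025-0002": {},'
    ' "CVE-2025-0003": {}, "CVE-2025-0004": {}}}'
)
after = json.loads('{"scannedCves": {"CVE-2025-0004": {}, "CVE-2025-0005": {}}}')
result = score(before, after)
```

Tracking `introduced` separately keeps a run from looking better than it was when an upgrade closes three advisories and opens a new one.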
The demo story we're building toward
The MVP starts against a fresh Rocky 9 baseline because determinism at build-time beats narrative. But the demo capstone is chained: a STIG run brings the VM to DISA-compliant state over ~14 hours; a CVE run takes that hardened state and closes the outstanding security advisories in another ~4 hours. Federal audit language for "this box is production-ready."
That's why the Run 6 snapshot hook was armed before we wrote this entry. Run 6 is the STIG-hardened starting state the demo needs. Capturing it at completion costs us one atomic libvirt snapshot and zero minutes of engineering work, and the capture won't be possible again once Run 6 ends and the VM gets reverted for the next experiment.
What I'll be watching for
Three diagnostic questions once the first CVE runs land:
- Did we need to edit `gemma_forge/harness/*.py` to make CVE work? Zero edits is the strong claim. Extension points added are OK. Hacks inside the loop are the signal that the thesis has holes.
- How many rounds of prompt iteration before the Architect picks CVEs sensibly? STIG took ~5 runs for the prompts to stabilize. If CVE stabilizes in 1-2, the skill-scaffolding is doing real work. If it takes 5 again, the thesis of "skill-specific is just prompts + runtime class" is weaker than claimed.
- Does the tip corpus stay clean? STIG accumulated 2,900+ tips before we ran eviction. If CVE accumulates at a similar rate and the mechanism field stays at ~100% acceptance, the V2 memory architecture scales. If either breaks, we learn something worth writing down.
Entry 34 is Run 6's post-mortem, whenever it finishes — probably today. Entry 35 will be the CVE MVP build, whenever that lands — probably today if the research holds up. Entry 36 will be the first CVE run results against Run 6's STIG-hardened snapshot, the demo moment we've actually been building toward since entry 0.
Don't want to lose a toe on the research, and we won't — the preconditions are all there. Time to build.
Related
- journey/32: the Run 6 prep that set up ordering constraints + mechanism field, the tools CVE inherits for free.
- deferred.md: DEF-02's "prompt guidance is not enforcement" pattern gets its second test when CVE's `deferrable_reboot` predicate lands; DEF-01 clutch still waits on UI work regardless.
- ATLANTIS paper (arXiv 2509.14589): reference architecture for autonomous vulnerability agents; read before the CVE skill design.
- CVE-Bench (NAACL 2025): 509-CVE source-code patch benchmark; adjacent to our regime, not a scoring target once we read the details.
- Vuls: the CVE scanner we're not going to build because it already exists.