CVE remediation agent landscape — April 2026

Reference synthesis assembled during the CVE-skill pivot (entry 33). Answers: "what's already shipping in this space, what's being published, and what benchmarks exist?" Citations are URL-primary so any claim here is re-verifiable by reading the source.

Three sub-questions, three sections: commercial, academic, benchmarks. Each ends with "what we adopt / differ from / leave alone" for the GemmaForge CVE-response skill.


Commercial: enterprise vulnerability + patch management

Surveyed April 19, 2026. Every vendor uses "autonomous," "agentic," or "closed-loop" in marketing. The substantive classification underneath:

1. ML risk scoring / exploit prediction (most common category)

  • ExPRT.AI (CrowdStrike) — Falcon prioritizes CVEs by observed exploitability + environmental risk. ML scoring, not agent decision.
  • TruRisk (Qualys) — same shape, Qualys VMDR platform.
  • Kenna Security (Cisco) — CVE prioritization via exploit intelligence. No remediation execution.
  • What this actually does: sorts a queue for humans. It does not apply patches.

2. Exploit validation agents (narrow autonomy)

3. Agentic NL interfaces over existing patch tools

4. Rule-driven automation with AI-assisted scheduling

5. SOAR playbook orchestration

  • Palo Alto XSOAR (Demisto), Splunk SOAR (Phantom), Swimlane, Torq, Tines, Google SecOps SOAR (Siemplify), IBM QRadar SOAR — 2026 "AI-driven investigation agents" do root-cause and correlation. CVE playbooks exist but ticketize and orchestrate existing patch tools. An LLM doesn't execute dnf/apt on a host.

6. MDR/XDR autonomous response

  • CrowdStrike Falcon, SentinelOne Singularity, Cortex XDR, Microsoft Defender for Endpoint — autonomous response means isolate / kill / quarantine, not package-manager remediation. SentinelOne and Automox explicitly partner because XDR doesn't patch.

7. Cloud-native patch orchestration

  • AWS Systems Manager Patch Manager, Azure Update Manager, GCP OS Config — no agentic LLM layer announced through April 2026. Baselines + maintenance windows.

8. Startups + niche 2024-2026

9. Most telling data point: Red Hat

The vendor of the distro we remediate publicly decided autonomous agentic package operations are too risky to ship. That's domain expertise we take seriously. It also explains why the gap exists: serious players considered the execution layer and chose not to build it.


Academic: LLM + vulnerability remediation

All 2024-2026 published work falls into one of three buckets:

1. Source-code vulnerability patching

2. Vulnerability discovery / pentesting

3. Operational CVE remediation on running hosts

  • None. Not a single published paper or open-source project executes an autonomous scan → decide → apply → verify → iterate loop on a running host's package manager.

ATLANTIS (AIxCC 1st place, most relevant prior art)

Team Atlanta won DEF CON 33 AIxCC Finals ($4M prize, August 2025). Open source under Apache. Paper at arxiv.org/abs/2509.14589, repo at github.com/Team-Atlanta/aixcc-afc-atlantis. Architecture (relevant subset — §7 ATLANTIS-Patching):

  • Ensemble of agents over a single agent (§7.1.2). They ran 5 diverse agent designs in parallel, took the first valid patch. "No single agent consistently outperformed all others; the best-performing agent varied depending on the task."
  • Two-level policy enforcement (§7.1.2). Policies written into each agent's prompt AND re-validated rule-based after return. Prevents "plausible but incorrect patches" — e.g., removing functionality entirely or suppressing errors through a large try-catch block.
  • CRETE (§7.3) — shared harness library. Environment / Evaluator / CodeRetriever / FaultLocalizer; agents contribute strategy only. Our SkillRuntime Protocol is the same pattern.
  • State-machine workflow (§7.6 Algorithm 5 MULTIRETRIEVAL) — single agent state machine EVALUATE → ANALYZE_ISSUE → RETRIEVE → EVALUATE → DONE, capped at max_n_evals = 10. Our Ralph loop is the same shape with different action names.
  • Verdict vocabulary (§7.6.2): PLAUSIBLE / UNCOMPILABLE / VULNERABLE. Our CVE equivalent: APPLIED / STILL_VULNERABLE / NEEDS_REBOOT / RPM_CONFLICT / POLICY_VIOLATION / HEALTH_FAILURE.
  • PRISM multi-team shape (§7.7) — Supervisor + Analysis + Patch + Evaluation teams. Same shape as our Architect + Worker + Auditor + Reflector, independently converged.
  • EvaluationReporter (§7.7.3) — "generates comprehensive reports for failed patches, analyzing failure patterns and providing actionable feedback." This is our Reflector.
  • Stack-trace dedup (§3.2, §7.2.1) — dedup by structured fingerprint, not LLM similarity. Explicitly rejected LLM-as-judge and self-consistency.
  • Property analysis step (§7.8.5 VINCENT) — before patching, LLM enumerates invariants the program must preserve. "LLMs could infer properties without any specialized tools or a detailed prompt that explains what a property is." We adopt this in the Architect prompt.
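The bounded evaluate-and-revise shape described in the bullets above can be sketched as follows. This is a minimal illustration, not ATLANTIS code and not the GemmaForge implementation: the verdict names come from this document, the default cap of 10 mirrors max_n_evals, and `run_loop`, `evaluate`, and `revise` are hypothetical names introduced only for the example.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    # Verdict names taken from this document's CVE vocabulary.
    APPLIED = "APPLIED"
    STILL_VULNERABLE = "STILL_VULNERABLE"
    NEEDS_REBOOT = "NEEDS_REBOOT"
    RPM_CONFLICT = "RPM_CONFLICT"
    POLICY_VIOLATION = "POLICY_VIOLATION"
    HEALTH_FAILURE = "HEALTH_FAILURE"


def run_loop(
    plan: str,
    evaluate: Callable[[str], Verdict],
    revise: Callable[[str, Verdict], str],
    max_evals: int = 10,
) -> Tuple[Verdict, List[Tuple[str, Verdict]]]:
    """Evaluate a plan, revise it on failure, stop at APPLIED or at the cap.

    `evaluate` stands in for a deterministic judge (hypothetical here);
    `revise` stands in for the analyze-and-retrieve step that produces
    the next plan from the last verdict.
    """
    history: List[Tuple[str, Verdict]] = []
    verdict = Verdict.STILL_VULNERABLE  # returned unchanged if max_evals is 0
    for _ in range(max_evals):
        verdict = evaluate(plan)
        history.append((plan, verdict))
        if verdict is Verdict.APPLIED:
            break
        plan = revise(plan, verdict)
    return verdict, history
```

The loop returns the full (plan, verdict) history, so a reporting step can analyze failed attempts without re-running anything.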

What ATLANTIS doesn't do that we do: any long-term memory or vector store. §7 has no such component. Our dream pass + V2 tips + mechanism field architecture is actually novel relative to the $4M AIxCC winner. Worth stating in the whitepaper.

What ATLANTIS does that we don't (and won't): symbolic/concolic execution (§5.7, §6.4), directed fuzzing (§4.6, §5.5), FDP encoders (§6.9.3), PoV generation (§5.8), LSP/tree-sitter code indexing (§6.9, §7.8.3), SARIF assessment (§9), custom GRPO-trained retrieval LLM (§8). All "vulnerability discovery" work. We consume Vuls output; Vuls does the discovery.

Trail of Bits post-mortem (finalist 2nd place): blog.trailofbits.com/2025/08/07/aixcc-finals-tale-of-the-tape/ Useful operational commentary on what broke during the competition.


Benchmarks: fit-checked for our shape

CVE-Bench (Wang et al., NAACL 2025)

509 CVEs across Python / Java / JavaScript / PHP. aclanthology.org/2025.naacl-long.212/. Not applicable to us. 100% source-code patch benchmark. Zero kernel, glibc, openssl, httpd, openssh, systemd — zero RPM-delivered components. SWE-agent 21% ceiling is against 15 Python CVEs in a subset. Harness expects diff patches applied via git, not advisory IDs applied via dnf.

Not to be confused with UIUC CVE-Bench (arXiv 2503.17332, ICML 2025) — different benchmark, exploit-oriented, also doesn't fit.

AutoPenBench (Gioacchini et al., EMNLP Industry 2025)

33 vulnerable-system tasks on Dockerized VMs. github.com/lucagioacchini/auto-pen-bench. Not applicable. Scores attackers (CTF-style flag capture), not defenders. Harness expects pentesting tool calls (ExecuteBash / SSHConnect), not advisory remediation. Repurposing would be a rewrite, not an integration.

VulnRepairEval (arXiv 2509.03331, September 2025)

23 Python CVEs. Not applicable. Python source patches against package source trees. Not Linux-distro-package level. No public GitHub repo as of April 2026.

SEC-bench (arXiv 2506.11791)

Closer to "validated security patches" but still source-level.

ZeroDayBench (arXiv 2603.02297)

Production codebases, source-level patches.

LiveCVEBench (livecvebench.github.io)

Contamination-free continuous benchmark; still source-code.

The positioning

The benchmark landscape sorts into three buckets:

Bucket | Examples | Our overlap
Source-code patch | CVE-Bench, VulnRepairEval, SEC-bench, ZeroDayBench, LiveCVEBench | Zero
Exploit generation | AutoPenBench, UIUC CVE-Bench (ICML) | Zero (wrong role)
Autonomous host remediation via package manager | nothing published | That's us

Our contribution: build our own reproducible corpus (Vuls-before / Vuls-after on Rocky 9 with a known RHSA/RLSA set) and publish evaluation methodology. Cite CVE-Bench et al. in related-work as adjacent regimes.
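One way the before/after comparison could be scored is sketched below. This is an assumption about the methodology, not the published one: it works on plain sets of CVE IDs (however they are extracted from the scanner output), and `RemediationScore` and `score_run` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class RemediationScore:
    fixed: FrozenSet[str]       # in scope, present before, absent after
    remaining: FrozenSet[str]   # in scope, present before and after
    introduced: FrozenSet[str]  # absent before, present after (regressions)

    @property
    def fix_rate(self) -> float:
        """Fraction of targeted CVEs that the run removed."""
        targeted = len(self.fixed) + len(self.remaining)
        return len(self.fixed) / targeted if targeted else 0.0


def score_run(before: Set[str], after: Set[str], in_scope: Set[str]) -> RemediationScore:
    """Compare two scans of one host against the advisory scope of the run."""
    targeted = before & in_scope  # only CVEs the run was asked to fix
    return RemediationScore(
        fixed=frozenset(targeted - after),
        remaining=frozenset(targeted & after),
        introduced=frozenset(after - before),
    )
```

Out-of-scope CVEs that happen to disappear are deliberately not credited, so the fix rate reflects only what the run was asked to do.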


Public CVE data programmatic access

Validated working against our setup, April 2026:

  • NVD API 2.0 — rate limit 5 req/30s unkeyed, 50/30s keyed. NIST recommends refreshing bulk data no more than once every 2 hours. Don't do per-run / per-host pulls from live NVD. Mirror locally via go-cve-dictionary (which we do). nvd.nist.gov/developers/request-an-api-key
  • Red Hat Security Data — CSAF per RHSA + VEX per CVE since 2024-07-10. REST: https://access.redhat.com/hydra/rest/securitydata/csaf.json. VEX tree: https://access.redhat.com/security/data/csaf/v2/vex/. Signed + hashed; no auth required for read. redhat.com/en/blog/csaf-vex-documents-now-generally-available, docs.redhat.com/en/documentation/red_hat_security_data_api
  • OSV.dev — free REST + bulk zips, 30+ ecosystem feeds including distro packages. No auth.
  • MITRE CVE Services API 2.x — free. API key required for submissions; read is open.
  • Rocky Linux — inherits Red Hat's advisory data but emits RLSA-prefixed IDs (RLSA-YYYY:NNNN, syntactically identical to RHSA). Available via dnf updateinfo directly on any Rocky host with the configured security repos.
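For any code that does talk to NVD directly, the 5-per-30s (unkeyed) or 50-per-30s (keyed) limit can be respected client-side with a sliding-window limiter. The sketch below is illustrative only: `SlidingWindowLimiter` is a hypothetical helper, not part of any NVD client or of go-cve-dictionary, and the clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_s`-second window."""

    def __init__(
        self,
        max_requests: int = 5,       # 5 unkeyed, 50 keyed per this document
        window_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self._clock = clock
        self._sleep = sleep
        self._stamps: Deque[float] = deque()

    def acquire(self) -> float:
        """Block until a request slot is free; return the seconds waited."""
        waited = 0.0
        while True:
            now = self._clock()
            # Drop timestamps that have aged out of the window.
            while self._stamps and now - self._stamps[0] >= self.window_s:
                self._stamps.popleft()
            if len(self._stamps) < self.max_requests:
                self._stamps.append(now)
                return waited
            # Wait until the oldest request leaves the window, then re-check.
            delay = self.window_s - (now - self._stamps[0])
            self._sleep(delay)
            waited += delay
```

Calling `acquire()` before each request keeps a client under the limit; a keyed client would construct it with `max_requests=50`.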

Tooling we consume (not build)

  • Vuls (vuls.io, github.com/future-architect/vuls) — agentless SSH CVE scanner. Supports Rocky / Alma / RHEL / Ubuntu / SUSE / Debian / Amazon / Oracle. Outputs JSON. Handles NVD + OVAL + RHSA data ingestion, rate-limited NVD refresh, multi-distro normalization. We consume the JSON output.
  • Trivy, Grype + OSV-Scanner — container-focused equivalents. Not relevant to our host-remediation regime.
  • Ansible needs-restarting, reboot, wait_for_connection — the mechanics of reboot-verify patterns already exist in Ansible. The autonomous loop composing them isn't published.

What we adopt, differ from, and leave alone

Summary relevant to the GemmaForge CVE skill as of today:

Adopt from ATLANTIS

  1. Verdict vocabulary (PLAUSIBLE / UNCOMPILABLE / VULNERABLE) → extended to our RPM domain (APPLIED / STILL_VULNERABLE / NEEDS_REBOOT / RPM_CONFLICT / POLICY_VIOLATION / HEALTH_FAILURE).
  2. Two-level policy enforcement (prompt + programmatic check) — prevents dnf remove / systemctl mask as fake-fixes.
  3. Property analysis prompt step (VINCENT) — Architect enumerates invariants before apply.
  4. Per-(package, advisory) dedup with retry cap ~10.
  5. Deterministic-evaluator first-valid-wins — Vuls re-scan is the judge; don't add LLM-as-judge on top.
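The programmatic half of the two-level policy check (item 2) could look roughly like the sketch below. It is an assumption-laden illustration: the deny table holds only the two fake-fixes named above, `DENIED_SUBCOMMANDS` and `check_command` are hypothetical names, and the advisory ID in the usage note is a placeholder. The scan deliberately over-blocks, since a false rejection costs a retry while a missed `dnf remove` costs a service.

```python
import shlex
from typing import Dict, Optional, Set

# Illustrative deny table: program name -> subcommands treated as fake-fixes.
DENIED_SUBCOMMANDS: Dict[str, Set[str]] = {
    "dnf": {"remove"},
    "systemctl": {"mask"},
}


def check_command(command: str) -> Optional[str]:
    """Return a violation reason, or None if the command passes the policy."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar: refuse rather than guess.
        return "command could not be parsed"
    for i, token in enumerate(tokens):
        program = token.rsplit("/", 1)[-1]  # treat /usr/bin/dnf as dnf
        denied = DENIED_SUBCOMMANDS.get(program)
        if not denied:
            continue
        # Scan every later token so that options placed before the
        # subcommand (for example `dnf -y remove`) do not hide it.
        for arg in tokens[i + 1:]:
            if arg in denied:
                return f"{program} {arg} is not an allowed remediation"
    return None
```

Under these rules `check_command("dnf upgrade -y --advisory=RLSA-0000:0000")` passes, while `check_command("sudo dnf -y remove httpd")` returns a reason that can be reported as a POLICY_VIOLATION verdict.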

Adopt from commercial (by negation)

  1. Scope restriction to bounded advisory sets (not "upgrade everything" — Automox does that, we don't).
  2. Package-inventory diff as policy check (motivated by ATLANTIS, reinforced by Red Hat's read-only stance).
  3. Operator approves run scope, agent handles execution — same division of labor as Ivanti but with LLM reasoning in the execution layer that Ivanti doesn't have.
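The package-inventory diff in item 2 can be sketched as a comparison of two name-to-version maps captured before and after a run. This is a hedged illustration rather than the actual check: it assumes the inventories were already collected (for example from `rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE}\n'`), that `allowed` is the set of package names the approved advisories may touch, and it ignores packages installed in multiple versions, such as kernels. `inventory_violations` is a hypothetical name.

```python
from typing import Dict, List, Set


def inventory_violations(
    before: Dict[str, str],
    after: Dict[str, str],
    allowed: Set[str],
) -> List[str]:
    """List package changes that fall outside the approved run scope."""
    violations: List[str] = []
    # Any removal is a violation: removing a package is a fake-fix.
    for name in sorted(before.keys() - after.keys()):
        violations.append(f"removed: {name}")
    # New packages are accepted only if the scope names them.
    for name in sorted(after.keys() - before.keys()):
        if name not in allowed:
            violations.append(f"unexpected install: {name}")
    # Version changes are accepted only for in-scope packages.
    for name in sorted(before.keys() & after.keys()):
        if before[name] != after[name] and name not in allowed:
            violations.append(
                f"out-of-scope change: {name} {before[name]} -> {after[name]}"
            )
    return violations
```

An empty list means the run stayed within scope. In practice, dependencies pulled in by an advisory update would need to be added to `allowed`, which this sketch leaves to the caller.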

Differ deliberately

  1. Long-term memory layer (dream pass + V2 tips + mechanism field) — ATLANTIS has no such component. Novel contribution.
  2. Autonomous apply → reboot → verify loop — no published pattern anywhere. Novel contribution.
  3. Consume Vuls output rather than build our own scanner. Different from ATLANTIS (which built custom static analysis) because our problem doesn't need discovery, only remediation.

Leave alone (explicitly)

  1. Source-code patching (Copilot Autofix's lane).
  2. Symbolic execution / fuzzing / PoV generation (ATLANTIS's lane, vulnerability discovery not remediation).
  3. 5-agent parallel ensemble (requires DEF-01 clutch + UI).
  4. Custom-trained retrieval model (overkill for our domain).
  5. Chasing CVE-Bench SOTA (different regime — we build our own corpus).

Citations summary

Everything referenced above, grouped:

ATLANTIS:

  • Paper, arXiv 2509.14589
  • Repo, github.com/Team-Atlanta/aixcc-afc-atlantis
  • DARPA results announcement
  • Trail of Bits post-mortem

Academic CVE work:

  • AutoPatch (arXiv 2505.04195)
  • CVE-Bench NAACL 2025
  • UIUC CVE-Bench ICML 2025
  • SecureFixAgent (arXiv 2509.16275)
  • CodeMender (DeepMind blog)
  • AutoPenBench EMNLP Industry 2025
  • AutoPentest (arXiv 2505.10321)
  • VulnRepairEval (arXiv 2509.03331)
  • LiveCVEBench

CVE data sources:

  • NVD API key policy
  • Red Hat CSAF/VEX GA
  • Red Hat Security Data API docs
  • Vuls documentation

Commercial references cited:

  • Qualys Agent Val
  • Tanium Autonomous Patch Management
  • Tanium Closed-Loop (RSAC 2026)
  • Ivanti Neurons (2026)
  • Ivanti patch management docs
  • CrowdStrike Charlotte AI
  • Falcon for IT
  • Microsoft Security Copilot in Defender
  • Red Hat Lightspeed MCP
  • Red Hat Insights CVE advisories
  • Automox
  • Action1 RSAC 2026
  • Vicarius vIntelligence
  • Seemplicity agentic
  • Sevii Level 5 ADR
  • Snyk Agent Fix
  • GitHub Copilot Autofix agent mode

Recorded: 2026-04-19 during the CVE-skill pivot (entry 33). Revisit this document when a new vendor announces autonomous host-remediation: the claim in section 9 (Red Hat's read-only stance) is the load-bearing assertion and the one that dates fastest.