CVE remediation agent landscape — April 2026¶
Reference synthesis assembled during the CVE-skill pivot (entry 33). Answers: "what's already shipping in this space, what's being published, and what benchmarks exist?" Citations are URL-primary so any claim here is re-verifiable by reading the source.
Three sub-questions, three sections: commercial, academic, benchmarks. Each ends with "what we adopt / differ from / leave alone" for the GemmaForge CVE-response skill.
Commercial: enterprise vulnerability + patch management¶
Surveyed April 19, 2026. Every vendor uses "autonomous," "agentic," or "closed-loop" in marketing. The substantive classification underneath:
1. ML risk scoring / exploit prediction (most common category)¶
- ExPRT.AI (CrowdStrike) — Falcon prioritizes CVEs by observed exploitability + environmental risk. ML scoring, not agent decision.
- TruRisk (Qualys) — same shape, Qualys VMDR platform.
- Kenna Security (Cisco) — CVE prioritization via exploit intelligence. No remediation execution.
- What this category actually does: sorts a queue for humans. It does not apply patches.
2. Exploit validation agents (narrow autonomy)¶
- Qualys Agent Val (GA March 2026) — "Industry's first AI agent for safe exploit validation and autonomous remediation." Press release: qualys.com/company/newsroom/news-releases/usa/qualys-debuts-industrys-first-ai-agent-for-safe-exploit-validation. What it actually does: validates exploitability via TruConfirm, prioritizes the queue, recommends mitigations, re-validates after a human takes action. Does NOT deploy patches. Does NOT reboot.
3. Agentic NL interfaces over existing patch tools¶
- CrowdStrike Charlotte AI + Falcon for IT — Charlotte explains why a CVE is prioritized, Falcon Fusion SOAR triggers playbooks. CrowdStrike explicitly calls this "bounded autonomy": crowdstrike.com/en-us/platform/charlotte-ai/. Not picking CVEs, not driving an apply/verify loop.
- Microsoft Security Copilot in Defender — Security Analyst Agent and Alert Triage Agent are investigation/triage agents. No autonomous Windows Update/WSUS loop. techcommunity.microsoft.com/blog/microsoftthreatprotectionblog/security-copilot-in-defender-empowering-the-soc-with-assistive-and-autonomous-ai
- Tanium ServiceNow AI Agent — chat interface recommending reboot/uninstall; operator clicks. Explicitly human-in-loop.
- Ivanti Neurons agentic AI — an ITSM/helpdesk persona. Not a patch decider.
4. Rule-driven automation with AI-assisted scheduling¶
- Tanium Autonomous Patch Management + Closed-Loop Exposure Remediation (RSAC 2026) — "kicks off OS/software patching workflows directly from the risk analysis interface." Humans still approve. Coverage: tanium.com/solutions/autonomous-patch-management/ and securityboulevard.com/2026/03/tanium-adds-ai-governance-ot-endpoint-support-and-closed-loop-remediation-at-rsac-2026/
- Ivanti Neurons (2026) — out-of-band patch deployment, auto-rescan after patch, reboot-when-required. The closest any vendor comes to apply+verify. But CVE selection is policy-driven, not LLM-reasoned. Docs: help.ivanti.com/ht/help/en_US/CLOUD/vNow/patch-management.htm
- Automox — "policy-driven, human-controlled automation." No LLM in CVE selection path. automox.com
- Action1, Syxsense, NinjaOne, ManageEngine, BigFix, Datto, ConnectWise, Lansweeper — policy-based scheduled patching. No LLM agent in the decision loop. Action1's RSAC 2026 integrations coverage explicitly names the gap: prnewswire.com/news-releases/action1-expands-enterprise-ecosystem-at-rsac-2026.
5. SOAR playbook orchestration¶
- Palo Alto XSOAR (Demisto), Splunk SOAR (Phantom), Swimlane, Torq, Tines, Google SecOps SOAR (Siemplify), IBM QRadar SOAR — 2026 "AI-driven investigation agents" do root-cause and correlation. CVE playbooks exist but ticketize and orchestrate existing patch tools. An LLM doesn't execute dnf/apt on a host.
6. MDR/XDR autonomous response¶
- CrowdStrike Falcon, SentinelOne Singularity, Cortex XDR, Microsoft Defender for Endpoint — autonomous response means isolate / kill / quarantine, not package-manager remediation. SentinelOne and Automox explicitly partner because XDR doesn't patch.
7. Cloud-native patch orchestration¶
- AWS Systems Manager Patch Manager, Azure Update Manager, GCP OS Config — no agentic LLM layer announced through April 2026. Baselines + maintenance windows.
8. Startups + niche 2024-2026¶
- Vicarius vIntelligence (RSAC 2026): "agent-based AI layer, natural-language queries, creation of validation logic, remediation recommendations under a human-in-the-loop governance model." Recommends, doesn't execute. securitybrief.asia/story/vicarius-unveils-vintelligence-for-continuous-validation
- Seemplicity — "Exposure Action Platform," finds owner and routes fix. Ticketing/orchestration layer, not execution. msspalert.com/news/seemplicity-unleashes-ai-agents-to-find-resolve-risk-exposures
- Sevii Level 5 ADR — autonomous defense/remediation but scoped to EDR/identity/cloud threat response, not CVE package patching. sevii.com/news-release/sevii-launches-agentic-ai-adr-platform
- GitHub Copilot Autofix + Snyk Agent Fix — source-code patch generation. Not host remediation. Different regime entirely.
9. Most telling data point: Red Hat¶
- Red Hat Lightspeed MCP is explicitly read-only per Red Hat's own rationale: "to avoid introducing additional risk with autonomous AI decision-making." developers.redhat.com/articles/2025/10/13/using-ai-agents-red-hat-insights. Event-Driven Ansible + Lightspeed can automate via deterministic playbooks, not LLM reasoning. redhat.com/en/blog/red-hat-insights-cve-advisories.
The vendor of the distro we remediate publicly decided autonomous agentic package operations are too risky to ship. That's domain expertise we take seriously. It also explains why the gap exists: serious players considered the execution layer and chose not to build it.
Academic: LLM + vulnerability remediation¶
All 2024-2026 published work falls into one of three buckets:
1. Source-code vulnerability patching¶
- AutoPatch (arXiv 2505.04195) — multi-agent CVE framework. arxiv.org/html/2505.04195v1
- CodeMender (DeepMind) — AI agent for code security. deepmind.google/blog/introducing-codemender-an-ai-agent-for-code-security/
- SecureFixAgent (arXiv 2509.16275) — source patches, iterative refinement.
- Aardvark (OpenAI) — autonomous vulnerability discovery + patching. Source-level.
- LLM Agentic Workflow for IaC (IEEE Xplore 2025, paper 10965635) — IaC files specifically.
2. Vulnerability discovery / pentesting¶
- AutoPenBench (EMNLP Industry 2025) — exploit generation. Scores attackers, not defenders. aclanthology.org/2025.emnlp-industry.114/, arxiv.org/abs/2410.03225, code at github.com/lucagioacchini/auto-pen-bench.
- AutoPentest (arXiv 2505.10321).
- LLM Agents for CVE Verification (CEUR 3920) — Plan-and-Execute for VEX justification, not remediation.
- DARPA AIxCC finals (Aug 2025) — see ATLANTIS section below.
3. Operational CVE remediation on running hosts¶
- None. Not a single published paper or open-source project executes an autonomous scan → decide → apply → verify → iterate loop on a running host's package manager.
ATLANTIS (AIxCC 1st place, most relevant prior art)¶
Team Atlanta won DEF CON 33 AIxCC Finals ($4M prize, August 2025). Open source under Apache. Paper at arxiv.org/abs/2509.14589, repo at github.com/Team-Atlanta/aixcc-afc-atlantis. Architecture (relevant subset — §7 ATLANTIS-Patching):
- Ensemble of agents over a single agent (§7.1.2). They ran 5 diverse agent designs in parallel, took the first valid patch. "No single agent consistently outperformed all others; the best-performing agent varied depending on the task."
- Two-level policy enforcement (§7.1.2). Policies written into each agent's prompt AND re-validated rule-based after return. Prevents "plausible but incorrect patches" — e.g., removing functionality entirely or suppressing errors through a large try-catch block.
- CRETE (§7.3): shared harness library. Environment / Evaluator / CodeRetriever / FaultLocalizer; agents contribute strategy only. Our SkillRuntimeProtocol is the same pattern.
- State-machine workflow (§7.6 Algorithm 5 MULTIRETRIEVAL): single agent state machine EVALUATE → ANALYZE_ISSUE → RETRIEVE → EVALUATE → DONE, capped at max_n_evals = 10. Our Ralph loop is the same shape with different action names.
- Verdict vocabulary (§7.6.2): PLAUSIBLE / UNCOMPILABLE / VULNERABLE. Our CVE equivalent: APPLIED / STILL_VULNERABLE / NEEDS_REBOOT / RPM_CONFLICT / POLICY_VIOLATION / HEALTH_FAILURE.
- PRISM multi-team shape (§7.7): Supervisor + Analysis + Patch + Evaluation teams. Same shape as our Architect + Worker + Auditor + Reflector, independently converged.
- EvaluationReporter (§7.7.3) — "generates comprehensive reports for failed patches, analyzing failure patterns and providing actionable feedback." This is our Reflector.
- Stack-trace dedup (§3.2, §7.2.1) — dedup by structured fingerprint, not LLM similarity. Explicitly rejected LLM-as-judge and self-consistency.
- Property analysis step (§7.8.5 VINCENT) — before patching, LLM enumerates invariants the program must preserve. "LLMs could infer properties without any specialized tools or a detailed prompt that explains what a property is." We adopt this in the Architect prompt.
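The capped evaluate-and-revise loop and the CVE verdict vocabulary described in the list above can be sketched as follows. This is a minimal illustration, not the ATLANTIS or GemmaForge implementation: `run_loop`, `evaluate`, and `revise` are hypothetical names, and `revise` stands in for the ANALYZE_ISSUE and RETRIEVE states combined.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    """CVE-domain verdicts listed in this document."""
    APPLIED = "APPLIED"
    STILL_VULNERABLE = "STILL_VULNERABLE"
    NEEDS_REBOOT = "NEEDS_REBOOT"
    RPM_CONFLICT = "RPM_CONFLICT"
    POLICY_VIOLATION = "POLICY_VIOLATION"
    HEALTH_FAILURE = "HEALTH_FAILURE"


MAX_EVALS = 10  # mirrors the max_n_evals = 10 cap cited from the paper


def run_loop(
    evaluate: Callable[[], Verdict],
    revise: Callable[[Verdict], None],
    max_evals: int = MAX_EVALS,
) -> list[Verdict]:
    """Evaluate, revise on failure, and stop on success or at the cap."""
    history: list[Verdict] = []
    for _ in range(max_evals):
        verdict = evaluate()          # EVALUATE
        history.append(verdict)
        if verdict is Verdict.APPLIED:
            break                     # DONE
        revise(verdict)               # stand-in for ANALYZE_ISSUE + RETRIEVE
    return history
```

Returning the full verdict history rather than only the final verdict keeps the failure trail available for whatever component summarizes failed attempts.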
What ATLANTIS doesn't do that we do: any long-term memory or vector store. §7 has no such component. Our dream pass + V2 tips + mechanism field architecture is actually novel relative to the $4M AIxCC winner. Worth stating in the whitepaper.
What ATLANTIS does that we don't (and won't): symbolic/concolic execution (§5.7, §6.4), directed fuzzing (§4.6, §5.5), FDP encoders (§6.9.3), PoV generation (§5.8), LSP/tree-sitter code indexing (§6.9, §7.8.3), SARIF assessment (§9), custom GRPO-trained retrieval LLM (§8). All "vulnerability discovery" work. We consume Vuls output; Vuls does the discovery.
Trail of Bits post-mortem (finalist 2nd place): blog.trailofbits.com/2025/08/07/aixcc-finals-tale-of-the-tape/ Useful operational commentary on what broke during the competition.
Benchmarks: fit-checked for our shape¶
CVE-Bench (Wang et al., NAACL 2025)¶
509 CVEs across Python / Java / JavaScript / PHP. aclanthology.org/2025.naacl-long.212/. Not applicable to us. 100% source-code patch benchmark. Zero kernel, glibc, openssl, httpd, openssh, systemd — zero RPM-delivered components. SWE-agent 21% ceiling is against 15 Python CVEs in a subset. Harness expects diff patches applied via git, not advisory IDs applied via dnf.
Not to be confused with UIUC CVE-Bench (arXiv 2503.17332, ICML 2025) — different benchmark, exploit-oriented, also doesn't fit.
AutoPenBench (Gioacchini et al., EMNLP Industry 2025)¶
33 vulnerable-system tasks on Dockerized VMs. github.com/lucagioacchini/auto-pen-bench. Not applicable. Scores attackers (CTF-style flag capture), not defenders. Harness expects pentesting tool calls (ExecuteBash / SSHConnect), not advisory remediation. Repurposing would be a rewrite, not an integration.
VulnRepairEval (arXiv 2509.03331, September 2025)¶
23 Python CVEs. Not applicable. Python source patches against package source trees. Not Linux-distro-package level. No public GitHub repo as of April 2026.
SEC-bench (arXiv 2506.11791)¶
Closer to "validated security patches" but still source-level.
ZeroDayBench (arXiv 2603.02297)¶
Production codebases, source-level patches.
LiveCVEBench (livecvebench.github.io)¶
Contamination-free continuous benchmark; still source-code.
The positioning¶
The benchmark landscape sorts into three buckets:
| Bucket | Examples | Our overlap |
|---|---|---|
| Source-code patch | CVE-Bench, VulnRepairEval, SEC-bench, ZeroDayBench, LiveCVEBench | Zero |
| Exploit generation | AutoPenBench, UIUC CVE-Bench (ICML) | Zero (wrong role) |
| Autonomous host remediation via package manager | nothing published | That's us |
Our contribution: build our own reproducible corpus (Vuls-before / Vuls-after on Rocky 9 with a known RHSA/RLSA set) and publish evaluation methodology. Cite CVE-Bench et al. in related-work as adjacent regimes.
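A Vuls-before / Vuls-after comparison reduces to set arithmetic over CVE IDs. The sketch below assumes each Vuls JSON result exposes its findings under a `scannedCves` mapping keyed by CVE ID; verify that key name against the Vuls version in use. The function names and the CVE IDs in the usage note are hypothetical.

```python
def cve_ids(vuls_result: dict) -> set[str]:
    """Collect CVE IDs from one parsed Vuls result.

    Assumption: findings live under a "scannedCves" mapping keyed by CVE ID.
    """
    return set(vuls_result.get("scannedCves", {}))


def score_run(before: dict, after: dict) -> dict[str, list[str]]:
    """Classify CVEs by comparing a pre-remediation and post-remediation scan."""
    pre, post = cve_ids(before), cve_ids(after)
    return {
        "fixed": sorted(pre - post),        # present before, gone after
        "remaining": sorted(pre & post),    # still reported after the run
        "introduced": sorted(post - pre),   # newly reported after the run
    }
```

For example, `score_run({"scannedCves": {"CVE-0000-0001": {}, "CVE-0000-0002": {}}}, {"scannedCves": {"CVE-0000-0002": {}}})` reports one fixed and one remaining CVE. A non-empty `introduced` list is worth treating as a failure signal in its own right, since a remediation run should not add findings.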
Public CVE data programmatic access¶
Validated working against our setup, April 2026:
- NVD API 2.0: rate limit 5 req/30s unkeyed, 50/30s keyed. NIST recommends bulk refresh ≤once per 2 hours. Don't do per-run / per-host pulls from live NVD. Mirror locally via go-cve-dictionary (which we do). nvd.nist.gov/developers/request-an-api-key
- Red Hat Security Data: CSAF per RHSA + VEX per CVE since 2024-07-10. REST: https://access.redhat.com/hydra/rest/securitydata/csaf.json. VEX tree: https://access.redhat.com/security/data/csaf/v2/vex/. Signed + hashed; no auth required for read. redhat.com/en/blog/csaf-vex-documents-now-generally-available, docs.redhat.com/en/documentation/red_hat_security_data_api
- OSV.dev: free REST + bulk zips, 30+ ecosystem feeds including distro packages. No auth.
- MITRE CVE Services API 2.x: free. API key required for submissions; read is open.
- Rocky Linux: inherits Red Hat's advisory data but emits RLSA-prefixed IDs (RLSA-YYYY:NNNN, syntactically identical to RHSA). Available via dnf updateinfo directly on any Rocky host with the configured security repos.
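For the mirror-refresh job that does talk to live NVD, the rolling-window limits above can be respected with a small client-side limiter. This is a sketch under stated assumptions: `RollingWindowLimiter` is a hypothetical helper, its defaults match the unkeyed figure of 5 requests per 30 seconds, and the injectable `clock` and `sleep` parameters exist only to make the behavior testable.

```python
import time
from collections import deque
from typing import Callable


class RollingWindowLimiter:
    """Block until a request fits inside a rolling time window."""

    def __init__(
        self,
        max_requests: int = 5,        # unkeyed NVD figure; use 50 with an API key
        window_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self._clock = clock
        self._sleep = sleep
        self._stamps: deque[float] = deque()

    def _evict(self, now: float) -> None:
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()

    def acquire(self) -> None:
        """Call once before each request; sleeps if the window is full."""
        now = self._clock()
        self._evict(now)
        while len(self._stamps) >= self.max_requests:
            # Wait until the oldest request leaves the window.
            self._sleep(self._stamps[0] + self.window_s - now)
            now = self._clock()
            self._evict(now)
        self._stamps.append(now)
```

Calling `limiter.acquire()` immediately before each HTTP request is enough; the limiter does not know or care which HTTP client is used.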
Tooling we consume (not build)¶
- Vuls (vuls.io, github.com/future-architect/vuls) — agentless SSH CVE scanner. Supports Rocky / Alma / RHEL / Ubuntu / SUSE / Debian / Amazon / Oracle. Outputs JSON. Handles NVD + OVAL + RHSA data ingestion, rate-limited NVD refresh, multi-distro normalization. We consume the JSON output.
- Trivy, Grype + OSV-Scanner — container-focused equivalents. Not relevant to our host-remediation regime.
- Ansible needs-restarting, reboot, wait_for_connection: the mechanics of reboot-verify patterns already exist in Ansible. The autonomous loop composing them isn't published.
What we adopt, differ from, and leave alone¶
Summary relevant to the GemmaForge CVE skill as of today:
Adopt from ATLANTIS¶
- Verdict vocabulary (PLAUSIBLE / UNCOMPILABLE / VULNERABLE) → extended to our RPM domain (APPLIED / STILL_VULNERABLE / NEEDS_REBOOT / RPM_CONFLICT / POLICY_VIOLATION / HEALTH_FAILURE).
- Two-level policy enforcement (prompt + programmatic check): prevents dnf remove / systemctl mask as fake-fixes.
- Property analysis prompt step (VINCENT): Architect enumerates invariants before apply.
- Per-(package, advisory) dedup with retry cap ~10.
- Deterministic-evaluator first-valid-wins — Vuls re-scan is the judge; don't add LLM-as-judge on top.
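The programmatic half of the two-level policy check can be as simple as a deny-list applied to every command the agent proposes, independent of what the prompt asked for. The sketch below is illustrative only: `violates_policy` and `FORBIDDEN_VERBS` are hypothetical names, the deny-list holds just the two fake-fix commands named in the list above, and the advisory ID in the usage note is made up. It deliberately errs toward false positives.

```python
import os
import shlex

# Illustrative deny-list: only the two fake-fixes named in this document.
# A real policy would be longer and would live in reviewed configuration.
FORBIDDEN_VERBS: dict[str, set[str]] = {
    "dnf": {"remove"},
    "systemctl": {"mask"},
}


def violates_policy(command: str) -> bool:
    """Return True if a proposed shell command contains a forbidden operation.

    Conservative on purpose: once a guarded program appears anywhere in the
    command, any later token equal to one of its forbidden verbs is flagged.
    """
    tokens = shlex.split(command)
    for index, token in enumerate(tokens):
        # basename() lets /usr/bin/dnf match the same rule as dnf.
        verbs = FORBIDDEN_VERBS.get(os.path.basename(token))
        if verbs and any(later in verbs for later in tokens[index + 1:]):
            return True
    return False
```

With this sketch, `violates_policy("dnf -y remove httpd")` is flagged while `violates_policy("dnf -y upgrade --advisory=RLSA-0000:0001")` passes. A command that is flagged maps naturally onto the POLICY_VIOLATION verdict.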
Adopt from commercial (by negation)¶
- Scope restriction to bounded advisory sets (not "upgrade everything" — Automox does that, we don't).
- Package-inventory diff as policy check (motivated by ATLANTIS, reinforced by Red Hat's read-only stance).
- Operator approves run scope, agent handles execution — same division of labor as Ivanti but with LLM reasoning in the execution layer that Ivanti doesn't have.
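The package-inventory diff mentioned above can be sketched as a comparison of two name-to-version snapshots taken before and after a run. Everything here is an assumption for illustration: the snapshot format, the `inventory_violations` name, and the idea that the operator-approved scope is expressed as a set of package names expected to change.

```python
def inventory_violations(
    before: dict[str, str],
    after: dict[str, str],
    allowed_changes: set[str],
) -> dict[str, list[str]]:
    """Compare package inventories (name -> version) around a remediation run.

    Flags packages that disappeared, and packages whose version changed
    without being inside the approved scope.
    """
    removed = sorted(set(before) - set(after))
    unexpected = sorted(
        name
        for name in set(before) & set(after)
        if before[name] != after[name] and name not in allowed_changes
    )
    return {"removed": removed, "unexpected_changes": unexpected}
```

An empty result for both keys means the run stayed inside its approved scope. Any entry under `removed` is the signature of the remove-the-package fake-fix that the policy check is meant to catch.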
Differ deliberately¶
- Long-term memory layer (dream pass + V2 tips + mechanism field) — ATLANTIS has no such component. Novel contribution.
- Autonomous apply → reboot → verify loop — no published pattern anywhere. Novel contribution.
- Consume Vuls output rather than build our own scanner. Different from ATLANTIS (which built custom static analysis) because our problem doesn't need discovery, only remediation.
Leave alone (explicitly)¶
- Source-code patching (Copilot Autofix's lane).
- Symbolic execution / fuzzing / PoV generation (ATLANTIS's lane, vulnerability discovery not remediation).
- 5-agent parallel ensemble (requires DEF-01 clutch + UI).
- Custom-trained retrieval model (overkill for our domain).
- Chasing CVE-Bench SOTA (different regime — we build our own corpus).
Citations summary¶
Everything referenced above, grouped:
ATLANTIS:
- Paper, arXiv 2509.14589
- Repo, github.com/Team-Atlanta/aixcc-afc-atlantis
- DARPA results announcement
- Trail of Bits post-mortem

Academic CVE work:
- AutoPatch (arXiv 2505.04195)
- CVE-Bench NAACL 2025
- UIUC CVE-Bench ICML 2025
- SecureFixAgent (arXiv 2509.16275)
- CodeMender (DeepMind blog)
- AutoPenBench EMNLP Industry 2025
- AutoPentest (arXiv 2505.10321)
- VulnRepairEval (arXiv 2509.03331)
- LiveCVEBench

CVE data sources:
- NVD API key policy
- Red Hat CSAF/VEX GA
- Red Hat Security Data API docs
- Vuls documentation

Commercial references cited:
- Qualys Agent Val
- Tanium Autonomous Patch Management
- Tanium Closed-Loop (RSAC 2026)
- Ivanti Neurons (2026)
- Ivanti patch management docs
- CrowdStrike Charlotte AI
- Falcon for IT
- Microsoft Security Copilot in Defender
- Red Hat Lightspeed MCP
- Red Hat Insights CVE advisories
- Automox
- Action1 RSAC 2026
- Vicarius vIntelligence
- Seemplicity agentic
- Sevii Level 5 ADR
- Snyk Agent Fix
- GitHub Copilot Autofix agent mode
Recorded: 2026-04-19 during the CVE-skill pivot (entry 33). Revisit this document when a new vendor announces autonomous host remediation; the claim in section 9 (Red Hat's read-only stance) is the load-bearing assertion and the one that will date fastest.