Building the CVE Skill in a Day

The CVE pivot was decided at 10:23 this morning. The ATLANTIS paper was open by 10:45. The CVE-Bench deep-read had ruled out scoring-against-the-literature by 11:15. The Vuls database was pulling by noon. A first scan returned 153 CVEs from the mission-app VM at 14:11. The MVP smoke remediated 3 advisories end-to-end at 14:30. This is the build-log entry — what landed in the codebase, why those specific choices, and what the first full CVE run is supposed to test.

Entry 34 covered Run 6's post-mortem and explained why CVE as the second skill was the right call before more runtime tuning. This entry is the predictions-before-outcomes counterpart: the full CVE run hasn't happened yet. What follows is the bet the architecture is about to try to make good on.

What changed in the harness

Three files. Specifically:

  • gemma_forge/harness/interfaces.py — added three values to the FailureMode enum: NEEDS_REBOOT, RPM_CONFLICT, POLICY_VIOLATION. The enum already had HEALTH_FAILURE / EVALUATOR_GAP / FALSE_NEGATIVE / CLEAN_FAILURE from STIG; the three new ones extend vocabulary for CVE-specific verdicts without touching any harness routing code. The ATLANTIS paper (§7.6.2) uses a PLAUSIBLE / UNCOMPILABLE / VULNERABLE triple; we extended ours in the same shape.

  • gemma_forge/harness/ordering.py — added the deferrable_reboot predicate and the wildcard rule_id: "*" dispatch in filter_deferred. STIG's ordering constraint targets one specific rule (audit_rules_immutable) with category_nearly_complete; CVE's targets any work item whose metadata says requires_reboot: true. The wildcard dispatch means specific and broadcast constraints can coexist without the filter becoming hairy. Seven existing tests still pass.

  • gemma_forge/harness/ralph.py — _build_skill_runtime dispatches to CveSkillRuntime when skill.skill_dir.name == "cve-response". Auto-consolidation's skill-dir map gained one entry ("cve" → "cve-response") after the smoke run crashed on the missing mapping. That map is a DEF: not clean architecture, but a one-line extension rather than a restructure.
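A minimal sketch of how the wildcard dispatch and the deferrable_reboot predicate could fit together. This is not the contents of ordering.py: the WorkItem and OrderingConstraint shapes, the PREDICATES registry, and the advisory IDs in the usage note are assumptions for illustration, and only the CVE predicate is shown.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WorkItem:
    rule_id: str
    metadata: dict = field(default_factory=dict)


@dataclass
class OrderingConstraint:
    rule_id: str    # one specific rule id, or "*" to apply to every work item
    predicate: str  # name of the predicate that decides deferral


def deferrable_reboot(item: WorkItem, pool: list[WorkItem]) -> bool:
    """Defer a reboot-required item while any non-reboot item is still pending."""
    if not item.metadata.get("requires_reboot", False):
        return False
    return any(not other.metadata.get("requires_reboot", False) for other in pool)


# Hypothetical registry; the STIG predicate would be registered the same way.
PREDICATES: dict[str, Callable[[WorkItem, list[WorkItem]], bool]] = {
    "deferrable_reboot": deferrable_reboot,
}


def filter_deferred(pool: list[WorkItem], constraints: list[OrderingConstraint]):
    """Split the pool into (pickable, deferred) items."""
    pickable, deferred = [], []
    for item in pool:
        # A constraint applies if it names this rule or is the "*" broadcast.
        applicable = [c for c in constraints if c.rule_id in ("*", item.rule_id)]
        if any(PREDICATES[c.predicate](item, pool) for c in applicable):
            deferred.append(item)
        else:
            pickable.append(item)
    return pickable, deferred
```

With a pool of one reboot-required item and one ordinary item, the reboot-required item is deferred. Once it is the only item left, the same call returns it as pickable, which is the "batched reboot pass at the end" behavior described above.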

The main Ralph loop itself — rule selection, prompt assembly, Architect / Worker / Evaluator / Reflector cycle, memory hydration, attempt bookkeeping, run-end consolidation — zero edits. That's the test the skill-agnostic thesis was built to pass.

What changed in the skill

Scaffolding, all expected:

  • skills/cve-response/skill.yaml — manifest with tool names, validators, ordering_constraints block declaring deferrable_reboot, cve config block (Vuls config path, scan mode, severity filter, package allowlist), and skill-specific UI labels.
  • skills/cve-response/runtime.py — CveSkillRuntime / CveWorkQueue / CveExecutor / CveEvaluator / CveCheckpoint mirroring the STIG shape, wired to the Vuls + dnf advisory tool wrappers.
  • skills/cve-response/prompts/ — four prompts (architect, worker, reflector, auditor) adapted from STIG with CVE-specific framing, ATLANTIS §7.8.5 property analysis step in the Architect, ATLANTIS §7.1.2 two-level policy rules in both Architect and Worker.
  • skills/cve-response/DESIGN.md — the build-notes document promoted out of gitignored drafts into tracked material living alongside the skill.
  • skills/cve-response/validators/mission_app.yaml — same healthcheck as STIG; mission mustn't break.
  • migrations/cve/000[1-5].sql — STIG's five migrations cloned with schema rename, applied cleanly to a fresh forge_cve role.

What the external tooling gave us

Two harness-level tool modules, both thin wrappers around well-maintained external tooling:

  • gemma_forge/harness/tools/vuls.py — wraps the Vuls Docker container for scan + report + JSON parse. Vuls handles OS detection, multi-source CVE data ingestion (NVD + OVAL), rate-limited NVD refresh, package-manager normalization. We consume its JSON. The wrapper's most material choice is the is_reboot_required_advisory heuristic — based on affected package names (kernel, glibc, systemd, dbus) — which feeds the deferrable_reboot predicate. That heuristic is skill-local and explicit; if it mis-classifies, the predicate still defers correctly on needs-restarting -r's runtime answer.

  • gemma_forge/harness/tools/dnf_advisory.py — wraps dnf upgrade --advisory=<ID> over SSH. Input validation against the R[LH]SA-YYYY:NNNN regex blocks shell-injection via malformed IDs. Parses dnf's output for upgraded-packages list, reboot hints, and the "unknown advisory" signal. Also provides list_pending_advisories, check_needs_reboot, installed_package_diff — each used in a specific Evaluator decision path.
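The two wrapper behaviors described above, ID validation and the package-name reboot heuristic, can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in dnf_advisory.py or vuls.py: the helper names validate_advisory_id and build_upgrade_command, the digit count after the colon, the -y flag, and the prefix-matching rule are all guesses.

```python
import re

# Advisory IDs of the form RLSA-YYYY:NNNN or RHSA-YYYY:NNNN.
# Allowing more than four digits after the colon is an assumption.
ADVISORY_ID_RE = re.compile(r"R[LH]SA-\d{4}:\d{4,}")

# Package families named in the text as implying a reboot.
REBOOT_PACKAGES = ("kernel", "glibc", "systemd", "dbus")


def validate_advisory_id(advisory_id: str) -> str:
    """Reject anything that is not exactly an advisory ID."""
    # fullmatch (rather than match) refuses trailing shell metacharacters.
    if not ADVISORY_ID_RE.fullmatch(advisory_id):
        raise ValueError(f"refusing malformed advisory id: {advisory_id!r}")
    return advisory_id


def build_upgrade_command(advisory_id: str) -> list[str]:
    """Build the argv for the remote dnf call; the SSH transport is omitted here."""
    return ["dnf", "upgrade", "-y", f"--advisory={validate_advisory_id(advisory_id)}"]


def is_reboot_required_advisory(package_names: list[str]) -> bool:
    """Heuristic: any affected package in a reboot-implying family."""
    return any(
        name == base or name.startswith(base + "-")
        for name in package_names
        for base in REBOOT_PACKAGES
    )
```

Building an argv list instead of a shell string is a second layer of defense: even if the pattern were too permissive, the ID would reach dnf as a single argument.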

The installed_package_diff function is the ATLANTIS §7.1.2 policy check: if a baseline package was removed (rather than upgraded), the Evaluator returns POLICY_VIOLATION regardless of whether the scanner is happy. That stops the functionality-suppression failure mode ATLANTIS explicitly called out, where a vulnerability "disappears" because the vulnerable functionality was deleted.
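A rough sketch of what such a package-inventory diff might look like, assuming inventories are plain name-to-version mappings. The return shape and the violates_policy helper are hypothetical, and the package names and versions in the usage note are made up.

```python
def installed_package_diff(baseline: dict[str, str], current: dict[str, str]) -> dict:
    """Compare two {package name: version} inventories taken before and after a fix."""
    removed = sorted(set(baseline) - set(current))
    added = sorted(set(current) - set(baseline))
    changed = sorted(
        name for name in baseline.keys() & current.keys()
        if baseline[name] != current[name]
    )
    return {"removed": removed, "added": added, "changed": changed}


def violates_policy(diff: dict) -> bool:
    """Any baseline package that vanished counts as a violation."""
    return bool(diff["removed"])
```

For example, a baseline of {"openssl": "3.0.7-1", "telnet": "0.17-1"} against a current inventory of {"openssl": "3.0.7-2"} reports telnet as removed, so the check flags it even though a scanner would see one fewer vulnerable package.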

The five patterns we stole from ATLANTIS

Entry 33 named them; worth being specific about how they landed:

  1. Verdict vocabulary (§7.6.2) — their PLAUSIBLE / UNCOMPILABLE / VULNERABLE became our APPLIED / STILL_VULNERABLE / NEEDS_REBOOT / RPM_CONFLICT / POLICY_VIOLATION / HEALTH_FAILURE. The CveEvaluator composes them from multiple signals (Vuls re-scan, mission health, needs-restarting -r, package diff).

  2. Two-level policy enforcement (§7.1.2) — rules written into the Architect prompt (Never use dnf remove, Never use systemctl mask, Never --nobest without acknowledged cause) AND re-validated programmatically in the Evaluator via the package-inventory diff. Two fences because the first one fails.

  3. Property analysis step (§7.8.5 VINCENT) — Architect prompt requires enumerating package invariants before picking an advisory. "openssh-server must continue to accept key-based auth for the adm-forge user" is the shape. The ATLANTIS finding was "LLMs can infer properties without specialized tools or a detailed prompt explaining what a property is"; we're testing whether that holds for the Rocky system-administration domain.

  4. Per-(package, advisory-ID) dedup with retry cap (§3.2) — the retry budget is ~10 per advisory (same primitive STIG uses per rule). Dedup is by advisory ID, not LLM similarity. Deterministic structural fingerprint, matching their pattern.

  5. Prompt-archetype rotation on repeat failure (§7.11.3) — our existing architect reengagement mechanism (PIVOT / CONTINUE / ESCALATE) implements this. The ATLANTIS finding says "no single agent consistently outperforms; rotate on failure"; our Architect can signal PIVOT to change the Worker's strategy mid-rule, which is the same pattern compressed into one LLM with changing context rather than N LLMs with distinct designs.
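To make item 1 concrete, here is one way a verdict could be composed from the signals listed there. The Signals fields and, in particular, the precedence order are assumptions; the text only fixes that a removed package yields POLICY_VIOLATION regardless of the scanner result. The dnf_conflict field stands in for whatever the dnf output parser reports.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    APPLIED = "applied"
    STILL_VULNERABLE = "still_vulnerable"
    NEEDS_REBOOT = "needs_reboot"
    RPM_CONFLICT = "rpm_conflict"
    POLICY_VIOLATION = "policy_violation"
    HEALTH_FAILURE = "health_failure"


@dataclass(frozen=True)
class Signals:
    dnf_conflict: bool                 # dnf reported a dependency conflict (assumed signal)
    removed_packages: tuple[str, ...]  # from the package-inventory diff
    mission_healthy: bool              # mission-app healthcheck result
    reboot_pending: bool               # needs-restarting -r says a reboot is required
    still_reported: bool               # Vuls re-scan still lists the advisory's CVEs


def compose_verdict(s: Signals) -> Verdict:
    """Assumed precedence: hard failures first, then partial success, then the scanner."""
    if s.dnf_conflict:
        return Verdict.RPM_CONFLICT
    if s.removed_packages:
        return Verdict.POLICY_VIOLATION  # wins even if the scanner is happy
    if not s.mission_healthy:
        return Verdict.HEALTH_FAILURE
    if s.reboot_pending:
        return Verdict.NEEDS_REBOOT      # installed, but not active until reboot
    if s.still_reported:
        return Verdict.STILL_VULNERABLE
    return Verdict.APPLIED
```

Checking reboot_pending before still_reported reflects a guess that a scanner may keep reporting a kernel CVE until the new kernel is actually running; the real evaluator may order these differently.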

The six patterns we deliberately skipped

Also worth being specific:

  1. 5-agent parallel ensemble — requires the clutch (DEF-01) which requires UI work. Deferred properly.
  2. Symbolic / concolic execution — vulnerability discovery, not remediation. Vuls does our discovery.
  3. Directed fuzzing — same.
  4. Custom GRPO-trained retrieval model — overkill for our domain; Gemma's context window fits the errata text.
  5. Per-language specialist agents — we have one substrate (Rocky 9 RPMs); no need to split.
  6. K8s orchestration on Azure — we have one VM target.

The deliberate skips matter as much as the adoptions. They're the scope-bounding that makes "one day of work" plausible. If we'd tried to adopt the symbolic execution stack we'd be at week three.

The MVP smoke did not validate the thing that matters

The smoke remediated 3 of 3 advisories on attempt 1. Duration ~3 minutes. Zero escalations, zero reboot-required advisories picked, zero scanner gaps, zero architect re-engagements, zero Reflector failure analyses. One success-mode tip emitted with a valid mechanism field.

That result validated the plumbing — the scan → apply → verify → log path works end-to-end against a live VM. It did not validate the harness's architectural value. The harness exists to handle the failure modes; when everything succeeds first try, the reflexion loop is doing nothing.

The harness is engineered for the hard 10%. On the easy 90%, the harness is indistinguishable from ansible-playbook. The hard 10% is what the full run is for.

What the full CVE run is supposed to test

Four predictions, each falsifiable:

  1. The deferrable_reboot predicate fires in anger. The widened corpus (Critical + Important + Moderate) will have 15-25 reboot-required advisories depending on Vuls's severity classification. The Architect should see them filtered out of the candidate pool until the last non-reboot advisory is done, then a batched reboot pass applies them together. If we see kernel RLSAs getting picked early and causing cascade-style failures, the predicate has a bug we haven't caught.

  2. At least some advisories fail, and the Reflector produces useful mechanism tips. If 100% succeed first-try on 40+ advisories, either Vuls is over-reporting or the corpus isn't representative enough to test the harness. Expected failure modes: dependency conflicts, GPG key issues, reboot-required verification timing. Each should produce a Reflector analysis with a mechanism that would help a future run.

  3. Cross-run memory starts empty and accumulates cleanly. CVE schema has zero prior-run tips. The V2 retrieval should gracefully return empty on every first-attempt prompt, the skill should function without tip retrieval, and by end of run some tips should have at least 1 outcome recorded. Starting the second CVE run, we should see non-trivial tip retrieval.

  4. Wall-time stays under 5 hours. CVE advisories are fundamentally faster than STIG rules (a dnf upgrade takes seconds; an oscap-remediate chain takes longer). 44 advisories × an estimated 30-120s each, plus some reboot overhead, shouldn't push past 5h. If it does, we have a different runtime story than the STIG one, and that is worth understanding.
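The arithmetic behind prediction 4, worked through as a quick check. It assumes exactly one attempt per advisory; retries and the reboot pass come out of the remaining headroom. The function name is illustrative only.

```python
ADVISORIES = 44
PER_ADVISORY_SECONDS = (30, 120)  # estimated typical range from the prediction
BUDGET_SECONDS = 5 * 3600         # the 5-hour ceiling


def apply_time_bounds(count: int, low_s: int, high_s: int) -> tuple[int, int]:
    """Total seconds if every advisory takes the low or the high estimate."""
    return count * low_s, count * high_s


low, high = apply_time_bounds(ADVISORIES, *PER_ADVISORY_SECONDS)
headroom = BUDGET_SECONDS - high

print(f"first-attempt apply time: {low / 60:.0f} to {high / 60:.0f} min")
print(f"headroom for retries and reboots: {headroom / 60:.0f} min")
```

That is 22 to 88 minutes of first-attempt apply time, leaving at least 212 minutes of the budget. Blowing through 5 hours would therefore take a large amount of retry or reboot time, which is why an overrun would point to a different runtime story.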

The bet I most want to be right about is #1. The ordering-constraint mechanism's generalization from category_nearly_complete (STIG) to deferrable_reboot (CVE) is the load-bearing skill-agnostic claim. If the cascade pattern survives the generalization, Track A can be designed against two skills' data with confidence. If it breaks, DEF-02's pattern was STIG-shaped and we didn't catch it.

Things I'll be watching for on dashboards + logs

Specific signals during the run:

  • rules_deferred events should fire every iteration while non-reboot advisories remain in the pool, deferring all the reboot-required ones. When the pool crosses the "only reboot-required left" threshold, deferrals should stop and the reboot-required advisories should become pickable.
  • consolidation_complete at run-end with dream pass + eviction results. First-ever CVE dream pass; expect category-level credit assignment. Eviction will likely retire nothing on Run 1 (not enough tip-retrieval history accumulated).
  • POLICY_VIOLATION verdict if it ever fires — the ATLANTIS two-level enforcement check. If the Worker ever tries dnf remove to clear an advisory, the evaluator should catch it.
  • NEEDS_REBOOT verdict on kernel advisories — partial-success (value=0.5, confidence=0.8) in the OutcomeSignal. Tip utility accrues differently for partial successes.
  • Mechanism-field acceptance — if it drops below 95%, something about the Reflector prompt's CVE framing isn't holding up.

Related

  • journey/34 — the Run 6 post-mortem immediately preceding this build. The architectural claims this entry puts on the record stand on Run 6's validation of the mechanisms they're built on.
  • journey/33 — the pivot + research that chose CVE as the second skill. This entry is the concrete answer to that entry's "what we're about to make good on."
  • docs/research/cve-agent-landscape-2026-04.md — the ATLANTIS / commercial / benchmark landscape the design decisions referenced here were measured against.
  • skills/cve-response/DESIGN.md — the specific technical design decisions (evaluator verdict semantics, Vuls integration, evaluator re-scan cost trade-offs).
  • deferred.md — Track A runtime tunings stay deferred until the full CVE run gives us two-skill data to inform them. DEF-01 (clutch) stays gated on UI work regardless.