The Virsh Console Fallback: Out-of-Band When SSH Fails¶
The story in one sentence¶
A STIG fix can harden SSH, modify PAM, or break sudo — in other words, it can disable the primary control channel at exactly the moment the harness needs to undo the damage. I added a virsh console fallback as an out-of-band recovery path, and later discovered that the fallback itself had subtle bugs, which became its own lesson in layered recovery.
The problem¶
The harness drives the target VM via SSH. The Worker's apply_fix
tool runs as sudo bash -c '<script>' over an asyncssh
connection. This works great right up until the moment a fix does
something to the SSH daemon or to the sudo / PAM configuration
that breaks the next SSH-as-root call.
A partial list of things a STIG fix can legitimately do that break SSH:
- FIPS mode enablement — strict FIPS mode rejects several cipher suites that OpenSSH might otherwise negotiate. If the client's cipher list doesn't overlap with the new server restrictions, the SSH connection fails.
- Firewall hardening — a STIG rule that tightens `firewalld` can accidentally drop the administrative SSH port.
- `sshd_config` restrictions — `MaxAuthTries`, `AllowUsers`, disabling password auth, changing the listen port. Any of these can lock out the harness account if the fix doesn't preserve the admin exception.
- Modifying `/etc/pam.d/sudo` — a STIG rule about PAM configuration can break non-interactive sudo, which is what the harness uses. This is exactly what happened with `sudo_remove_nopasswd` in the overnight run (journey/14, Finding 3).
- Changing `/etc/sudoers` or `/etc/sudoers.d/*` permissions — making the file group-writable or world-writable causes `sudo` to refuse to run with a safety abort.
In every one of these cases, the in-band channel (SSH+sudo) is exactly the thing the fix just disabled, and the harness can't use the same channel to recover.
The solution¶
libvirt provides virsh console — a direct attachment to the
VM's virtual serial console from the hypervisor. This is an
out-of-band channel: it operates below the VM's network
stack, authentication, and userland. Nothing the guest OS does
can break this path, because the path is implemented at the
hypervisor layer on the host, to which the guest has no access.
I added run_via_console() in gemma_forge/harness/tools/console.py:
- Spawn `sudo virsh console <domain-name> --force` as a subprocess
- Write the command to its stdin, wrapped in start/end marker strings so we can parse the output
- Capture stdout until we see the end marker
- Extract the command's stdout and exit code from between the markers
- Return `(stdout, stderr, returncode)` in the same shape as the SSH-based runner
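The marker protocol in the steps above can be sketched in isolation. The marker strings and helper names here are illustrative placeholders, not the harness's actual code:

```python
import re

# Illustrative marker strings — the real harness's markers may differ.
START, END = "__GF_START__", "__GF_END__"

def wrap_command(cmd: str) -> str:
    """Wrap a shell command so its output and exit code can be parsed
    back out of the noisy serial-console stream."""
    return f"echo {START}; {cmd}; echo {END} $?\n"

def parse_console_output(raw: str) -> tuple[str, int]:
    """Extract (stdout, returncode) from between the markers."""
    match = re.search(rf"{START}\r?\n(.*){END} (\d+)", raw, re.DOTALL)
    if match is None:
        raise ValueError("end marker never appeared in console output")
    return match.group(1).strip(), int(match.group(2))

raw = "login banner noise\n__GF_START__\nhello\n__GF_END__ 0\n"
out, rc = parse_console_output(raw)
```

Echoing `$?` next to the end marker is what lets the exit code survive a channel that otherwise carries only an undifferentiated byte stream.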
The intent was that _run_ssh() would try SSH first, and on
failure (connection error, auth error, timeout) would
automatically fall back to _run_via_console(). Same interface,
different channel, transparent to the callers.
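The intended fallback shape can be sketched with stand-in helpers. The exception type and both runner bodies below are placeholders for the real asyncssh and console paths, not the harness's implementation:

```python
import asyncio

class SSHChannelError(Exception):
    """Stand-in for asyncssh connection/auth/timeout failures."""

async def _run_ssh(cmd: str) -> tuple[str, str, int]:
    # Placeholder primary path: simulate the channel the fix just broke.
    raise SSHChannelError("connection refused")

async def _run_via_console(cmd: str) -> tuple[str, str, int]:
    # Placeholder out-of-band path; same (stdout, stderr, rc) shape.
    return (f"via console: {cmd}", "", 0)

async def run_command(cmd: str) -> tuple[str, str, int]:
    """Try SSH first; on connection/auth/timeout failure, fall back
    to the console path. Callers never see which channel ran."""
    try:
        return await _run_ssh(cmd)
    except (SSHChannelError, TimeoutError, OSError):
        return await _run_via_console(cmd)

stdout, stderr, rc = asyncio.run(run_command("systemctl is-active sshd"))
```

Because both runners return the same tuple shape, the fallback stays invisible above this function, which is what keeps the abstraction from leaking.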
Why the fallback exists at the infrastructure layer¶
A few architectural notes about why this lives where it does:
- It's at L1 because libvirt is the infrastructure layer. The fallback is a property of the hypervisor, not of the harness. Even a totally different harness running against a libvirt-managed VM could use the same fallback.
- The harness code in L4 just sees "a run_ssh function that might fall back." The abstraction doesn't leak. The outer loop doesn't know or care which channel executed a given command.
- The snapshot restore uses the same out-of-band principle. `virsh snapshot-revert` also operates at the hypervisor level and is immune to guest-level damage. The console fallback and the snapshot restore are both expressions of the same idea: when the in-band control plane is damaged, use a control plane that lives below it. See improvements/04-snapshot-based-revert for the snapshot half of the story.
The twist: the console fallback itself has a bug¶
This is the part worth writing down honestly, because it's a good reminder that defensive fallbacks are only as good as their implementation.
During the Tier 3 test pass (see
journey/15), I
deliberately broke sudo on the target VM and verified that the
diagnostic gather still produced accurate forensics. The primary
SSH-via-sudo path correctly detected that sudo was broken. The
code then attempted to fall back to the virsh console path — and
the console fallback crashed with "Connection lost" during
the subprocess communication.
What's likely happening: the virsh console subprocess protocol
uses a PTY-style interaction that doesn't cleanly map to Python's
asyncio.create_subprocess_exec with stdin/stdout pipes. The
console session expects an interactive terminal, and the pipe-
based approach loses the connection partway through the command
wrapper. A more robust implementation would use pexpect-style
PTY handling, or a completely different channel like the QEMU
guest agent (virsh qemu-agent-command), or a hypervisor-side
sidecar that writes diagnostics to a shared volume.
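The PTY direction can be sketched with only the standard library. This runs `cat` rather than `virsh console`, so it is an illustration of the technique under that assumption, not the harness's code:

```python
import os
import pty
import select
import subprocess

def run_on_pty(argv: list[str], input_bytes: bytes,
               timeout: float = 2.0) -> bytes:
    """Run a subprocess attached to a real PTY instead of pipes, which
    is what console-style tools expect for interactive sessions."""
    master, slave = pty.openpty()
    proc = subprocess.Popen(argv, stdin=slave, stdout=slave,
                            stderr=slave, close_fds=True)
    os.close(slave)              # only the child holds the slave end now
    os.write(master, input_bytes)
    chunks = []
    try:
        while True:
            ready, _, _ = select.select([master], [], [], timeout)
            if not ready:
                break            # child went quiet
            try:
                data = os.read(master, 1024)
            except OSError:      # EIO: child closed its side of the PTY
                break
            if not data:
                break
            chunks.append(data)
    finally:
        os.close(master)
        proc.terminate()
        proc.wait()
    return b"".join(chunks)

# Ctrl-D (\x04) ends cat's input so the demo exits promptly.
output = run_on_pty(["cat"], b"hello\n\x04")
```

A library like pexpect layers expect-style pattern matching on top of exactly this kind of PTY plumbing, which is why it handles interactive tools that break under plain pipes.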
Why I left it as a known limitation¶
The console fallback could have been fixed properly in the same pass that discovered the bug. I deliberately didn't, for three reasons:
- The primary recovery path is still sound. The snapshot restore at the libvirt level works regardless of whether the console fallback works, because `virsh snapshot-revert` doesn't need to communicate with the guest at all. So even with a broken console fallback, the harness can recover fully from any guest-level damage.
- The diagnostic signal is still accurate where it matters most. When sudo is broken, the primary SSH-via-sudo probe reports `sudo_ok: False` with very high confidence (because the sudo failure comes through as rc=1 with "a password is required" in stderr). The Reflector gets enough to reason about. The richer forensics that would come through the console path (nginx status, postgres status, journal errors) are nice-to-have, not load-bearing.
- The gap is documented honestly. This is the important part. The limitation is called out explicitly in architecture/01-reflexive-agent-harness-failure-modes under "Known limitations." A reader of the failure-modes piece knows exactly where this implementation is incomplete and what the cleaner alternatives are (QEMU guest agent, hypervisor sidecar).
The discipline here is: fix what is load-bearing, document what is not, don't pretend completeness you don't have. A defensive fallback that is documented as partial is much more useful than a defensive fallback that is silently broken.
What I learned¶
- Any fix that can damage the control channel needs an out-of-band recovery plane. This is true whether you are remediating STIG rules, running chaos engineering against a web service, or deploying config to a network device. If the thing you are automating can break the thing you are automating with, you need a second channel at a lower layer.
- "Layered" recovery means recovery paths that exist at multiple layers of the stack. SSH+sudo is in the guest (L1 of the guest stack). Virsh console is in the hypervisor (L0 relative to the guest). Virsh snapshot-revert is also in the hypervisor. The harness can attempt recovery at L1 first and escalate to L0 when L1 fails. If L0 also failed, there would be no recovery path on the box at all, and the right response would be to halt the run and alert the operator.
- Documenting known limitations is a discipline, not a weakness. Listing "the console fallback is broken, here is the cleaner alternative" in the failure-modes doc is more credible than pretending the fallback works. Anyone who's reviewed architecture docs has seen overclaiming. Honest "here is what we built, here is what doesn't work yet, here is what we'd do instead" wins trust.
Related entries¶
- journey/04-vm-provisioning — the libvirt setup that makes virsh console available
- journey/14-overnight-run-findings — the overnight run that demonstrated why in-band-only recovery fails under adversarial conditions
- improvements/04-snapshot-based-revert — the snapshot restore mechanism that is the other out-of-band recovery path
- architecture/01-reflexive-agent-harness-failure-modes — where this fallback's known limitations are documented formally