The Virsh Console Fallback: Out-of-Band When SSH Fails

The story in one sentence

A STIG fix can harden SSH, modify PAM, or break sudo — in other words, it can disable the primary control channel at exactly the moment the harness needs to undo the damage. I added a virsh console fallback as an out-of-band recovery path, and later discovered that the fallback itself had subtle bugs, which became its own lesson in layered recovery.

The problem

The harness drives the target VM via SSH. The Worker's apply_fix tool runs as sudo bash -c '<script>' over an asyncssh connection. This works great right up until the moment a fix does something to the SSH daemon or to the sudo / PAM configuration that breaks the next SSH-as-root call.

A partial list of things a STIG fix can legitimately do that break SSH:

  • FIPS mode enablement — strict FIPS mode rejects several cipher suites that OpenSSH might otherwise negotiate. If the client's cipher list doesn't overlap with the new server restrictions, the SSH connection fails.
  • Firewall hardening — a STIG rule that tightens firewalld can accidentally drop the administrative SSH port.
  • sshd_config restrictions — MaxAuthTries, AllowUsers, disabling password auth, changing the listen port. Any of these can lock out the harness account if the fix doesn't preserve the admin exception.
  • Modifying /etc/pam.d/sudo — a STIG rule about PAM configuration can break non-interactive sudo, which is what the harness uses. This is exactly what happened with sudo_remove_nopasswd in the overnight run (journey/14 Finding 3).
  • Changing /etc/sudoers or /etc/sudoers.d/* permissions — making the file group-writable or world-writable causes sudo to refuse to run with a safety abort.

In every one of these cases, the in-band channel (SSH+sudo) is exactly the thing the fix just disabled, and the harness can't use the same channel to recover.

The solution

libvirt provides virsh console — a direct attachment to the VM's virtual serial console from the hypervisor. This is an out-of-band channel: it operates below the VM's network stack, authentication, and userland. Nothing the guest OS does can break this path, because the path is implemented at the hypervisor layer, and the hypervisor runs on the host, which the guest has no access to.

I added run_via_console() in gemma_forge/harness/tools/console.py:

  1. Spawn sudo virsh console <domain-name> --force as a subprocess
  2. Write the command to its stdin, wrapped in start/end marker strings so we can parse the output
  3. Capture stdout until we see the end marker
  4. Extract the command's stdout and exit code from between the markers
  5. Return (stdout, stderr, returncode) in the same shape as the SSH-based runner
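
The marker protocol in steps 2–4 can be sketched as a pair of small helpers. This is a minimal sketch: the marker strings and function names are illustrative, not the actual ones in console.py.

```python
import re

# Illustrative marker strings; the real ones in console.py may differ.
START = "__HARNESS_BEGIN__"
END = "__HARNESS_END__"

def wrap_command(cmd: str) -> str:
    """Wrap a shell command so its output and exit code can be fished
    back out of the noisy console stream."""
    return f"echo {START}; {cmd}; echo {END}:$?\n"

def parse_console_output(raw: str):
    """Return (stdout, returncode) found between the markers, or None
    if the end marker never arrived (e.g. the session dropped)."""
    m = re.search(
        re.escape(START) + r"\r?\n(.*?)" + re.escape(END) + r":(\d+)",
        raw,
        re.DOTALL,
    )
    if m is None:
        return None
    return m.group(1), int(m.group(2))
```

Appending $? to the end marker is what lets step 4 recover the exit code without a second round trip over the console.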

The intent was that _run_ssh() would try SSH first, and on failure (connection error, auth error, timeout) would automatically fall back to _run_via_console(). Same interface, different channel, transparent to the callers.
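
The routing policy can be sketched with the two channel runners injected as parameters. The names, timeout, and exception list are assumptions; asyncssh raises its own error types, which the real code would catch specifically.

```python
import asyncio

async def run_with_fallback(cmd, ssh_runner, console_runner, timeout=30):
    """Try the in-band SSH channel first; on connection-level failure,
    escalate to the out-of-band virsh console channel. Both runners are
    expected to return (stdout, stderr, returncode)."""
    try:
        return await asyncio.wait_for(ssh_runner(cmd), timeout=timeout)
    except (ConnectionError, OSError, asyncio.TimeoutError):
        # The in-band control plane is down; use the one below it.
        return await console_runner(cmd)
```

Because the fallback is hidden behind the same (stdout, stderr, returncode) shape, callers never learn which channel ran their command.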

Why the fallback exists at the infrastructure layer

A few architectural notes about why this lives where it does:

  • It's at L1 because libvirt is the infrastructure layer. The fallback is a property of the hypervisor, not of the harness. Even a totally different harness running against a libvirt-managed VM could use the same fallback.
  • The harness code in L4 just sees "a run_ssh function that might fall back." The abstraction doesn't leak. The outer loop doesn't know or care which channel executed a given command.
  • The snapshot restore uses the same out-of-band principle. virsh snapshot-revert also operates at the hypervisor level and is immune to guest-level damage. The console fallback and the snapshot restore are both expressions of the same idea: when the in-band control plane is damaged, use a control plane that lives below it. See improvements/04-snapshot-based-revert for the snapshot half of the story.
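
As a concrete illustration of the hypervisor-side principle, the snapshot restore is a single virsh invocation on the host. The domain and snapshot names below are hypothetical, and the runner is injected so the sketch stands alone:

```python
import subprocess

def revert_snapshot(domain: str, snapshot: str, run=subprocess.run) -> bool:
    """Revert a libvirt domain to a named snapshot from the host.
    virsh talks to libvirtd, not to the guest, so guest-level damage
    (broken SSH, sudo, PAM) cannot block this path."""
    cmd = ["sudo", "virsh", "snapshot-revert", domain, snapshot, "--running"]
    result = run(cmd, capture_output=True, text=True)
    return result.returncode == 0

# Hypothetical names for illustration:
# revert_snapshot("stig-target", "clean-baseline")
```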

The twist: the console fallback itself has a bug

This is the part worth writing down honestly, because it's a good reminder that defensive fallbacks are only as good as their implementation.

During the Tier 3 test pass (see journey/15), I deliberately broke sudo on the target VM and verified that the diagnostic gather still produced accurate forensics. The primary SSH-via-sudo path correctly detected that sudo was broken. The code then attempted to fall back to the virsh console path — and the console fallback crashed with "Connection lost" during the subprocess communication.

What's likely happening: the virsh console subprocess protocol uses a PTY-style interaction that doesn't cleanly map to Python's asyncio.create_subprocess_exec with stdin/stdout pipes. The console session expects an interactive terminal, and the pipe-based approach loses the connection partway through the command wrapper. A more robust implementation would use pexpect-style PTY handling, or a completely different channel like the QEMU guest agent (virsh qemu-agent-command), or a hypervisor-side sidecar that writes diagnostics to a shared volume.
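
One way to see why pipes fail where a PTY works: many interactive tools check isatty() on their stdio and abandon the session when it is false. A minimal stdlib sketch of attaching a child to a real PTY, using Python's pty module (pexpect layers expect/sendline pattern matching on top of this same mechanism):

```python
import os
import pty
import subprocess
import sys

def spawn_on_pty(cmd):
    """Run a command with stdio attached to a PTY slave, so the child
    believes it has an interactive terminal. Returns (process, master_fd);
    the caller reads the session from master_fd."""
    master, slave = pty.openpty()
    proc = subprocess.Popen(
        cmd, stdin=slave, stdout=slave, stderr=slave, close_fds=True
    )
    os.close(slave)  # the child keeps its own copy of the slave end
    return proc, master
```

A child spawned this way sees isatty() == True, which is the property that tools like virsh console depend on and that plain asyncio pipes cannot provide.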

Why I left it as a known limitation

The console fallback could have been fixed properly in the same pass that discovered the bug. I deliberately didn't, for three reasons:

  1. The primary recovery path is still sound. The snapshot restore at the libvirt level works regardless of whether the console fallback works, because virsh snapshot-revert doesn't need to communicate with the guest at all. So even with a broken console fallback, the harness can recover fully from any guest-level damage.

  2. The diagnostic signal is still accurate where it matters most. When sudo is broken, the primary SSH-via-sudo probe reports sudo_ok: False with very high confidence (because the sudo failure comes through as rc=1 with "a password is required" in stderr). The Reflector gets enough to reason about. The richer forensics that would come through the console path (nginx status, postgres status, journal errors) are nice-to-have, not load-bearing.
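
The probe interpretation described above might look like the following. Field names and the stderr markers are illustrative, not the harness's actual schema:

```python
def classify_sudo_failure(rc: int, stderr: str) -> dict:
    """Map a non-interactive sudo probe result onto the kind of
    diagnostic fields the Reflector reasons about."""
    broken_pam = "a password is required" in stderr
    unsafe_sudoers = (
        "is world writable" in stderr or "is group writable" in stderr
    )
    if rc == 0:
        cause = None
    elif broken_pam:
        cause = "pam_or_nopasswd"
    elif unsafe_sudoers:
        cause = "sudoers_permissions"
    else:
        cause = "unknown"
    return {"sudo_ok": rc == 0, "likely_cause": cause}
```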

  3. The gap is documented honestly. This is the important part. The limitation is called out explicitly in architecture/01-reflexive-agent-harness-failure-modes under "Known limitations." A reader of the failure-modes piece knows exactly where this implementation is incomplete and what the cleaner alternatives are (QEMU guest agent, hypervisor sidecar).

The discipline here is: fix what is load-bearing, document what is not, don't pretend completeness you don't have. A defensive fallback that is documented as partial is much more useful than a defensive fallback that is silently broken.

What I learned

  1. Any fix that can damage the control channel needs an out-of-band recovery plane. This is true whether you are remediating STIG rules, running chaos engineering against a web service, or deploying config to a network device. If the thing you are automating can break the thing you are automating with, you need a second channel at a lower layer.

  2. "Layered" recovery means recovery paths that exist at multiple layers of the stack. SSH+sudo is in the guest (L1 of the guest stack). Virsh console is in the hypervisor (L0 relative to the guest). Virsh snapshot-revert is also in the hypervisor. The harness can attempt recovery at L1 first and escalate to L0 when L1 fails. If L0 also failed, there would be no recovery path on the box at all, and the right response would be to halt the run and alert the operator.

  3. Documenting known limitations is a discipline, not a weakness. Listing "the console fallback is broken, here is the cleaner alternative" in the failure-modes doc is more credible than pretending the fallback works. Anyone who's reviewed architecture docs has seen overclaiming. Honest "here is what we built, here is what doesn't work yet, here is what we'd do instead" wins trust.