The Test Pass in Practice: Running 99 Property Tests Across 7 Tiers¶
The story in one sentence¶
After v3's five fixes were implemented, I ran an explicit 7-tier test pass — each tier testing an abstract property of the harness, each tier checkpointed before moving to the next. 99 tests ended up passing; several real bugs were caught along the way, and two architectural gaps were surfaced and documented honestly rather than hidden.
Relationship to journey/15¶
This is the run entry;
journey/15 is the
discipline entry. Journey 15 describes the reframe — "test the
abstract harness properties, not the specific bugs" — and the
checkpoint discipline that came out of it. This entry describes
what actually happened when the test pass ran under that
discipline. Same event, two angles: the epistemology in 15, the
receipts here.
The 7 tiers¶
Tier names are deliberate: each one is a property statement, not a description of an action.
| Tier | Property | Tests | Wall time | Dependencies |
|---|---|---|---|---|
| 1 | Pure helpers have bounded, correct behavior | 47 | 4s | none |
| 2 | Architect verdict parser is robust to real LLM output | 24 | 4s | none |
| 3 | Target state is observable and recoverable | 16 | 3m50s | VM |
| 4 | Agent turns are bounded in actions | 7 | 55s | LLM |
| 5 | The full inner loop is compositionally correct | 1 | 1m36s | LLM + VM |
| 6 | Fault paths degrade gracefully | 4 | 13s | mocks |
| 7 | Frontend smoke test against new event types | 1 | 12s | UI |
| Total | 100 | ~7 min |
Property-statement tier names were the key discipline change. The previous-iteration tier names looked like "test the v3 fixes" — action-oriented, bound to specific bugs, not really testable as properties. Changing to "test that agent turns are bounded in actions" made the test names falsifiable and the assertions much sharper.
What the tiers actually caught¶
Tier 1 — pure helpers¶
47 tests against assemble_prompt, est_tokens, detect_plateau,
EpisodicMemory, RunState.summary_for_architect, plus the
smaller helpers (_keyword_set, is_similar, categorize_rule,
reflection_first_sentence).
Caught: one real mismatch between the test data and the
plateau detection algorithm. I had written a test asserting that
three reflections sharing keywords like aide and configuration
should plateau, but the actual algorithm also requires `min_shared
= 3` content keywords, and the test data only had 2 truly shared words after stopword filtering and plural collapsing. The test was claiming more than the algorithm guarantees.
Fix: rewrote the test data to actually share 3+ content keywords. Also added a TODO to the plateau detector about future stem normalization (config / configuration) that would make it more permissive. Didn't fix the algorithm in this pass — that's a calibration improvement, not a correctness bug.
Takeaway: writing tests forces you to state your assumptions about the algorithm precisely. Assertions that felt reasonable turned out to claim something the algorithm doesn't guarantee. The tests are smarter than casual reading.
Tier 2 — architect verdict parser¶
24 tests against parse_architect_verdict, covering every
plausible format a real LLM might emit: clean, markdown-wrapped,
lowercase, extra whitespace, reordered fields, missing
NEW_PLAN, prefixed with explanatory prose, etc.
Caught: nothing — this tier passed on the first real run.
What it did do was force a small refactor before testing:
parse_architect_verdict was originally inline in the inner
loop, which made it un-testable in isolation. Writing these
tests required extracting it into a module-level function. The
extraction took 5 minutes; the function is now testable, named,
and reusable. This is a direct example of the test discipline:
the test demanded a refactor, and the refactor was the right
shape anyway.
Takeaway: when a test forces you to extract a function, don't resist. The function wants to be extracted. The test pressure is revealing structure the code already wanted.
Tier 3 — target layer (VM-only)¶
16 tests against gather_environment_diagnostics,
snapshot_save_progress, snapshot_restore_progress, and the
interaction between them. The tests deliberately break the VM
in multiple distinct ways (stop nginx, stop postgres, remove
NOPASSWD, corrupt /etc/hosts) and verify that the diagnostic
gather correctly identifies each broken state AND that snapshot
restore recovers from each.
Caught two real bugs:
-
The substring bug in
mission_healthy. The diagnostic parser computedmission_healthy = "HEALTHY" in hc_text— and "UNHEALTHY" contains "HEALTHY" as a substring, so a broken target was incorrectly reported as healthy. This is a meta-instance of failure mode #3 (diagnostic blindness): the diagnostic layer itself had diagnostic blindness. Fixed by requiring"HEALTHY:"at the start of a line AND"UNHEALTHY"absent anywhere. -
The sudo-break technique was wrong. The first attempt to break sudo in the test was
Defaults:adm-forge targetpw, which I thought would require the target user's password for sudo. It doesn't override per-userNOPASSWD:directives, which is what the cloud-init sudoers file actually has. The test wasn't breaking sudo at all. Fixed by switching tosed -i 's/NOPASSWD://g' /etc/sudoers.d/90-cloud-init-userswhich actually removes the NOPASSWD directive.
Also surfaced: the virsh console fallback is broken
("Connection lost" from the subprocess protocol). The test
exposed this because when sudo is broken and the diagnostic
gather falls back to console, the fallback doesn't work.
Documented as a known limitation in
architecture/01-reflexive-agent-harness-failure-modes.
The snapshot restore path at the libvirt level still works, so
the primary recovery is intact; the console fallback is a
degraded but not fatal gap.
Takeaway: testing against deliberate real-world breakage catches things that happy-path testing misses. The substring bug would have shipped to the overnight run unnoticed otherwise.
Tier 4 — agent behavior (real LLM)¶
7 tests against the Worker tool-call cap using a synthetic
always_fail_tool / always_succeed_tool pair — not against
the real apply_fix tool, because the point is that the cap
works for any tool, not specifically for apply_fix. Plus
two tests for the Reflector's DISTILLED: field production
against real LLM output.
Caught: from __future__ import annotations at the top of
the test file broke the ADK FunctionTool parser. This is the
third time this exact gotcha has bit the project (documented
as gotchas/adk-future-annotations).
Fixed by removing the import; added an explicit # do not add
this import comment to the top of the file so it stops
happening. The third occurrence is the one where you turn it
into a written rule.
Everything else passed. The loose-prompt test (Worker with an
instruction to retry on failure) correctly triggered the cap on
the second tool call. The strict-prompt test (Worker with the
new production prompt that says "call it EXACTLY ONCE") showed
the Worker voluntarily stopping after one call, with the cap
never firing — which is exactly the desired
defense-in-depth behavior. The Reflector reliably produced the
DISTILLED: field in structured output.
Takeaway: test the generic invariant, not the specific case. The cap is a property of any agent with any tool, not specifically the Worker with apply_fix. Using synthetic tools enforces this.
Tier 5 — integration on one rule (real LLM + VM)¶
A single integration test that launches the full Ralph loop
against package_aide_installed (the simplest STIG rule in our
test corpus) and asserts the full expected event sequence:
snapshot_preflight → scan_complete → iteration_start →
rule_selected → attempt_start → prompt_assembled → agent_response
→ tool_call → tool_result → evaluation → remediated →
rule_complete.
Caught: nothing — the test passed on the first real run.
package_aide_installed was remediated on attempt 1 in 40
seconds, the Worker made exactly 1 tool call, and every expected
event fired in the expected order with the expected fields
populated. This was the single biggest "this works" moment of
the v3 pass — it's the first end-to-end integration verification
of all five fixes running together against real infrastructure.
Takeaway: when every component has been tested in isolation, one integration test catches the interaction bugs that component tests can't. A failure here would have flagged a composition issue somewhere; passing here confirmed the system was coherent.
Tier 6 — fault injection¶
4 tests covering the failure preconditions: baseline snapshot missing (should halt with clear error), diagnostic gather on a nonexistent host (should return structured all-false, not raise), progress snapshot missing (should fall back to baseline), vLLM unreachable (should fail cleanly on first call).
Caught: nothing — all fault paths behaved correctly. This is reassuring but not the interesting part of the pass. Fault paths will happen in production; having tests for them means when they happen, the loop handles them instead of crashing.
Takeaway: cheap defense. These tests run in 13 seconds and protect against the entire class of "a dependency isn't where we expected it" failures.
Tier 7 — frontend smoke¶
A single Playwright test that loads the dashboard against the Tier 5 integration run log and verifies the UI renders without console errors. Deliberately minimal — if the backend produces an event type the frontend doesn't know about, the frontend's generic fallback branch should handle it gracefully.
Caught: nothing. The dashboard rendered all the new v3
event types (prompt_assembled, rule_selected, attempt_start,
rule_complete, post_mortem, architect_reengaged, etc.)
without crashing, even though the dashboard components weren't
updated to display them specifically. They fell through to the
generic event log, which is the intended graceful degradation.
Takeaway: forward compatibility comes from generic fallback branches. The frontend doesn't need to know about every event type to be useful; it needs to handle unknown event types without crashing.
Running totals and what they mean¶
99 / 99 tests passing.
The number sounds cleaner than it is. Here's the honest breakdown:
- 95 tests passed on first run. 4 failed and needed fixes. Of those 4:
- 1 was a test data mismatch (Tier 1, plateau)
- 1 was a real harness bug (Tier 3, mission_healthy substring)
- 1 was a wrong break technique in the test itself (Tier 3, sudo break didn't actually break sudo)
-
1 was the ADK gotcha (Tier 4, future annotations)
-
0 tests were flaky. Every test that passes, passes every time. Every test that failed, failed every time until the real issue was fixed. This is partly because the tests are property-oriented (assertions about invariants, not about specific run sequences) and partly because the harness is instrumented well enough to produce deterministic output.
-
Two architectural gaps surfaced and documented, not hidden. The virsh console fallback bug and the plateau detection stem-normalization gap are both acknowledged limitations in the failure-modes document. Neither is a regression; both are "the implementation is imperfect but the property still holds via a different path." This kind of honest documentation is more useful than pretending completeness.
What this test pass says about the architecture¶
Three observations worth writing down:
-
The v3 fixes compose correctly. All five changes (Worker single-action, context budget, plateau detection, architect re-engagement, snapshot-based revert) work together without interactions. The integration test at Tier 5 is the proof.
-
Property-style tests are more durable than implementation-matching tests. Nothing in the test suite reads "assert
_run_agent_turncontains a specific for-loop." Every test asserts an externally-observable property. That means the tests survive refactors — when the v4 harness is written, most of these tests will still apply unchanged because the properties don't change even when the implementations do. -
The event stream is the test harness. Tier 5's integration test doesn't patch internal state — it launches the loop, reads the resulting JSONL, and asserts properties of the event stream. This pattern makes tests decoupled from implementation and means the same assertions can be replayed against any historical run log. It's worth more than it looks.
What came out of this pass¶
After the test pass completed, the deliverables were:
- A working v3 harness with five architectural changes verified
- A test suite that reads as a specification of the system's properties
- Two documented known limitations
- A failure modes document
- The discipline to do this again for v4 when the time comes
And, crucially for the narrative: a clean "before and after" story.
The overnight run at
journey/14 is the "before" —
a system with subtle flaws producing a lot of failure. The run
executing as these entries are being written is the "after" —
a system that's been tested and understood, grinding through the
same 270 STIG rules with all the v3 fixes active. Whatever
happens in that run, it's fresh data against fresh code, and
it's the story worth telling.
Related entries¶
journey/14-overnight-run-findings— the "before" run that produced the data these fixes addressjourney/15-the-test-as-architecture-discovery— the discipline conversation that framed this passjourney/17-v3-fix-pass— the narrative of the five fixes themselvesarchitecture/01-reflexive-agent-harness-failure-modes— the generalized taxonomy of what the tests verify