Improvement #5: Conversation History Management¶
Problem¶
On high-attempt rules (13+ attempts), accumulated tool call/result
pairs push past vLLM's 16K context limit, causing PromptTooLongError.
The prompt budget assembler (Improvement #3) controls instruction
tokens, but conversation history — SSH commands and their multi-line
output — grows unbounded within a rule attempt sequence.
8 errors in the second overnight run, all on rules at attempt 11+.
install_smartcard_packages hit 5 consecutive context errors in
5 minutes, making attempts 11–16 completely wasted.
Root cause¶
Each worker turn produces a tool call (the SSH command) and a tool
result (the command output). Tool results from ssh_apply can be
50–200 tokens (command output, error messages, multi-line config
diffs). After 12 turns, the accumulated history can reach 3,000–5,000
tokens — enough to push the total prompt past 16K when combined with
the system instruction and the current turn.
Proposed mechanism¶
Sliding window with compression. Before each worker turn:
- Count total conversation tokens (system + history + current turn).
- If under budget (e.g., 12K of 16K), proceed normally.
- If over budget, compress the oldest turns:
- Keep the last 3 turns verbatim (the model needs recent context to avoid repeating itself).
- Replace earlier turns with a one-line summary:
"[Turn 4: ran 'sed -i ...' on /etc/aide.conf — failed, config parse error]" - This is the same distillation pattern as episodic memory, but applied to within-rule conversation.
What this is not¶
This is not about increasing max_model_len — that costs VRAM and
doesn't solve the underlying problem of unbounded history growth.
It's also not about the prompt assembler, which already works
correctly for its scope.
Verification¶
- Inject a rule that requires 20+ attempts in a test.
- Confirm no
PromptTooLongErrorafter turn 15. - Confirm the model can still avoid banned patterns (it has access to the ban list in the instruction, not in conversation history).