Improvement: Deterministic Context Budget per Prompt
Status: Proposed, not implemented
Surfaced: 2026-04-11, analyzing overnight run findings
Priority: HIGH — deploy with the Worker single-action fix
Related: journey/14-overnight-run-findings.md Finding 4
The problem
Right now the harness assembles agent prompts by concatenating whatever state is relevant — system prompt, rule context, architect plan, episodic memory, semantic memory — and ships it to the LLM without any token budget enforcement. If the total prompt exceeds the model's context window (16K for our current Gemma 4 deployment), the LLM call crashes with HTTP 400.
In the overnight run we hit this 9 times, all in the deeper attempts of long-running rules. Each overflow was an entirely wasted LLM call.
With max_wall_time_per_rule_s: 1200 and an unbounded Worker retry loop,
the overflows were inevitable. Even after fixing the retry loop
(improvements/02), we'll want a hard guarantee that no prompt will ever
exceed the budget, so we can add new state to prompts (architect
re-engagement, richer run context, more distilled lessons) without fear.
The design
Phase 1: Rough token estimation
We don't need exact tokenization, which would require the model's tokenizer. A rough estimate of 4 characters per token is within roughly 20% for English and code, which is accurate enough for budget decisions. The estimator, est_tokens, is a one-line function that divides the character count by four.
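A minimal sketch of that estimator follows. The name est_tokens matches the call in the Phase 2 assembler; rounding up rather than down is an assumption made here so that the estimate errs on the conservative side.

```python
def est_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token.

    This is a heuristic, not a tokenizer. Rounding up (an assumption of this
    sketch) means short strings never count as zero tokens.
    """
    return (len(text) + 3) // 4
```

If the deployed model's tokenizer turns out to be much denser or sparser than 4 characters per token, only the divisor needs to change.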
Phase 2: Prompt assembler with priority order
Replace the current ad-hoc string concatenation in the inner loop with an
assembler function that takes a list of (priority, label, content) tuples
and builds the final prompt up to a token budget:

    def assemble_prompt(
        sections: list[tuple[int, str, str]],
        budget_tokens: int,
    ) -> str:
        """Assemble a prompt from prioritized sections within a token budget.

        Sections are given as (priority, label, content). Lower priority numbers
        are more essential, so they are included first. If the budget is tight,
        higher-priority-number sections are dropped or truncated.

        Returns the assembled prompt as a single string.
        """
        sorted_sections = sorted(sections, key=lambda s: s[0])
        included = []
        used = 0
        for prio, label, content in sorted_sections:
            est = est_tokens(content)  # Phase 1 estimator
            if used + est <= budget_tokens:
                included.append((prio, label, content))
                used += est
            else:
                # Try to truncate this section to fit
                remaining = budget_tokens - used
                if remaining > 100:
                    # 4 chars per token, minus 50 chars of slack for the marker
                    truncated = content[: remaining * 4 - 50] + "\n[... truncated for budget ...]"
                    included.append((prio, label, truncated))
                    used += est_tokens(truncated)
                # Stop at the first section that does not fit whole;
                # every lower-priority section after it is dropped
                break
        included.sort(key=lambda s: s[0])
        return "\n\n".join(content for _, _, content in included)

Phase 3: Priority order for each agent
For the Worker's attempt turn (the current context bomb):
| Priority | Section | Typical size |
|---|---|---|
| 0 | Current rule id + title | 100 chars |
| 1 | Directive: "call apply_fix once and return" | 200 chars |
| 2 | Architect's plan (first 400 chars) | 400 chars |
| 3 | Last 3 distilled lessons from episodic memory (this rule) | 600 chars |
| 4 | Top 5 banned patterns from semantic memory | 500 chars |
| 5 | Top 3 preferred approaches from semantic memory | 400 chars |
| 6 | Last 2 strategic lessons from semantic memory | 300 chars |
Total target: ~2500 chars ≈ 625 tokens for the user message. Adding the system prompt (~1500 tokens), the tool schema (~500 tokens), and a generous margin for the in-turn tool round-trip (~2000 tokens) gives roughly 4.6K tokens per turn. Target budget: 6–8K tokens max, which leaves headroom and stays well under the 16K limit.
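The Worker table can be turned into assembler input along the following lines. This is a sketch only: the function name worker_sections, its parameters, the section labels, and the bullet formatting are all hypothetical, and the input lists are assumed to be ordered already (oldest lesson first, highest-ranked pattern first).

```python
def worker_sections(
    rule_id: str,
    rule_title: str,
    architect_plan: str,
    episodic_lessons: list[str],
    banned_patterns: list[str],
    preferred_approaches: list[str],
    strategic_lessons: list[str],
) -> list[tuple[int, str, str]]:
    """Build (priority, label, content) tuples in the Worker table's order."""

    def bullets(items: list[str]) -> str:
        # One item per line; hypothetical formatting choice.
        return "\n".join(f"- {item}" for item in items)

    return [
        (0, "rule", f"Rule {rule_id}: {rule_title}"),
        (1, "directive", "Call apply_fix once and return."),
        (2, "architect_plan", architect_plan[:400]),             # first 400 chars
        (3, "episodic_lessons", bullets(episodic_lessons[-3:])),  # last 3 lessons
        (4, "banned_patterns", bullets(banned_patterns[:5])),     # top 5
        (5, "preferred_approaches", bullets(preferred_approaches[:3])),  # top 3
        (6, "strategic_lessons", bullets(strategic_lessons[-2:])),       # last 2
    ]
```

Capping each section at construction time keeps the typical prompt near the 2500-character target, so the assembler's budget check only has to act in unusual cases.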
For the Architect's rule-selection turn:
| Priority | Section |
|---|---|
| 0 | Current run summary: fixed/escalated/skipped counts |
| 1 | Top 10 remaining failing rules |
| 2 | Top 5 banned patterns |
| 3 | Top 3 lessons |
| 4 | Last 5 remediated rules (for pattern reference) |
| 5 | Full escalated rules list |
| 6 | Full remaining rules list (beyond top 10) |
Target: ~3000 tokens for the user message. Architect system prompt is bigger than Worker's (~2000 tokens). Total budget for architect turn: ~7K tokens.
For the Reflector's analysis turn:
| Priority | Section |
|---|---|
| 0 | Current rule id + title |
| 1 | Latest attempt's approach + result (truncated) |
| 2 | Prior 3 attempts' distilled lessons |
| 3 | Directive: structured output format |
| 4 | Full episodic history (if space allows) |
Target: ~2000 tokens user message. Reflector system prompt ~1500. Budget: ~5K tokens.
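The per-agent targets above can be kept in one place and sanity-checked against the context window. The sketch below is illustrative: TurnBudget, TURN_BUDGETS, and the field names are hypothetical, the Worker budget uses the low end (6K) of the stated 6–8K range, and 16K is treated as 16,000 tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnBudget:
    user_tokens: int       # target size of the assembled user message
    system_tokens: int     # approximate system prompt size
    budget_tokens: int     # total budget for the turn
    other_tokens: int = 0  # tool schema plus in-turn round-trip margin

    @property
    def planned_tokens(self) -> int:
        return self.user_tokens + self.system_tokens + self.other_tokens

    @property
    def headroom_tokens(self) -> int:
        return self.budget_tokens - self.planned_tokens


CONTEXT_WINDOW_TOKENS = 16_000  # assumption: "16K" read as 16,000

TURN_BUDGETS = {
    # Worker: 625 user + 1500 system + (500 tool schema + 2000 round-trip margin)
    "worker": TurnBudget(625, 1500, 6000, other_tokens=500 + 2000),
    "architect": TurnBudget(3000, 2000, 7000),
    "reflector": TurnBudget(2000, 1500, 5000),
}

for name, budget in TURN_BUDGETS.items():
    # Every plan must fit its budget, and every budget must fit the window.
    assert budget.planned_tokens <= budget.budget_tokens <= CONTEXT_WINDOW_TOKENS
    print(f"{name}: planned={budget.planned_tokens} headroom={budget.headroom_tokens}")
```

With these figures the planned totals are 4625, 5000, and 3500 tokens, so each agent keeps between 1375 and 2000 tokens of headroom inside its own budget.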
Phase 4: Distillation for episodic memory
Instead of storing raw approach (200 chars) + result (80 chars) +
reflection (120 chars) per attempt in episodic memory, distill each failed
attempt into a single one-line lesson (~100 chars) produced by the
Reflector:
Attempt 3: Tried to edit /etc/aide.conf directly; failed because the AIDE
db needs --init after config changes. Reflector: use `aide --init --verbose`.
This is generated by adding a line to the Reflector's output requirement that asks for a one-line distilled lesson per attempt.
The distilled lesson is what's stored in episodic memory going forward, not the full raw text. The raw text is still in the event log for post-run analysis, but it doesn't pollute subsequent prompts.
With 3 distilled lessons × 100 chars = 300 chars, the full episodic context costs ~75 tokens. Compare that to the overnight run's episodic memory, which was 15–20 attempts × ~400 chars = 6–8K chars = 1.5–2K tokens. That is a reduction of roughly 20× or more.
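One way the parsing and storage side could look is sketched below. The `LESSON:` line prefix is a hypothetical output format, not the Reflector's actual requirement, and EpisodicMemory, extract_lesson, and the two constants are illustrative names. The sketch keeps only the last 3 lessons per rule and caps each at about 100 characters, matching the figures above.

```python
import re
from collections import deque
from typing import Optional

# Hypothetical format: the Reflector emits one line starting with "LESSON:".
LESSON_RE = re.compile(r"^LESSON:[ \t]*(.+)$", re.MULTILINE)
MAX_LESSON_CHARS = 100
MAX_LESSONS_PER_RULE = 3


def extract_lesson(reflector_output: str) -> Optional[str]:
    """Return the distilled lesson as one capped line, or None if absent."""
    match = LESSON_RE.search(reflector_output)
    if match is None:
        return None
    lesson = " ".join(match.group(1).split())  # collapse internal whitespace
    return lesson[:MAX_LESSON_CHARS]


class EpisodicMemory:
    """Stores only distilled lessons; raw text stays in the event log."""

    def __init__(self) -> None:
        self._lessons: dict[str, deque[str]] = {}

    def record(self, rule_id: str, attempt: int, reflector_output: str) -> None:
        lesson = extract_lesson(reflector_output)
        if lesson is None:
            return  # nothing to store for this attempt
        per_rule = self._lessons.setdefault(
            rule_id, deque(maxlen=MAX_LESSONS_PER_RULE)
        )
        per_rule.append(f"Attempt {attempt}: {lesson}")  # oldest entry falls off

    def lessons(self, rule_id: str) -> list[str]:
        return list(self._lessons.get(rule_id, ()))
```

A bounded deque makes the episodic contribution to the prompt constant in size no matter how many attempts a rule accumulates.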
Instrumentation
Every prompt assembly emits a structured event:

    run_log.log("prompt_assembled", agent.name, {
        "budget_tokens": budget,
        "used_tokens": used,
        "sections_included": [label for _, label, _ in included],
        "sections_dropped": [label for _, label, _ in sorted_sections[len(included):]],
        "rule_id": current_rule_id,
    })

This lets us measure, after each run:
- How often the budget was approached
- Which sections got dropped when it was tight
- Whether the fix is working (no more HTTP 400s)
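The event fields above need values that live inside the assembler, so one option is to have the assembler return them alongside the prompt. The sketch below is an assumption-laden variant: AssemblyReport and assemble_with_report are hypothetical names, est_tokens is a local copy of the 4-characters-per-token heuristic (rounded up), and truncation of the first overflowing section is omitted to keep the example short.

```python
from dataclasses import dataclass, field


def est_tokens(text: str) -> int:
    # Same rough heuristic as Phase 1: about 4 characters per token.
    return (len(text) + 3) // 4


@dataclass
class AssemblyReport:
    prompt: str
    budget_tokens: int
    used_tokens: int
    sections_included: list[str] = field(default_factory=list)
    sections_dropped: list[str] = field(default_factory=list)

    def as_event(self, rule_id: str) -> dict:
        """Payload shaped like the prompt_assembled event (prompt text excluded)."""
        return {
            "budget_tokens": self.budget_tokens,
            "used_tokens": self.used_tokens,
            "sections_included": self.sections_included,
            "sections_dropped": self.sections_dropped,
            "rule_id": rule_id,
        }


def assemble_with_report(
    sections: list[tuple[int, str, str]], budget_tokens: int
) -> AssemblyReport:
    ordered = sorted(sections, key=lambda s: s[0])
    kept: list[tuple[str, str]] = []
    used = 0
    for _, label, content in ordered:
        cost = est_tokens(content)
        if used + cost > budget_tokens:
            break  # stop at the first section that does not fit
        kept.append((label, content))
        used += cost
    return AssemblyReport(
        prompt="\n\n".join(content for _, content in kept),
        budget_tokens=budget_tokens,
        used_tokens=used,
        sections_included=[label for label, _ in kept],
        sections_dropped=[label for _, label, _ in ordered[len(kept):]],
    )
```

The caller can then pass `report.as_event(current_rule_id)` to the run log without reaching into the assembler's local variables.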
Open question: shrink or split?
If a rule's episodic history is legitimately long and valuable, do we:
- (A) Truncate / summarize older entries (current proposal), or
- (B) Split the inner loop into two LLM calls: one to summarize history, one to propose the next action?
Option A is simpler but loses information. Option B preserves information but doubles the LLM cost per attempt. In the current setup, where attempts take 30–60s each and every rule is already time-budgeted, doubling the LLM calls per attempt is a significant cost. Start with Option A and evaluate.
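As a rough illustration of Option A, older entries can be collapsed into a single placeholder line while the most recent ones are kept verbatim. The function name compress_history, the default of 3 retained entries, and the placeholder wording are all assumptions of this sketch.

```python
def compress_history(lessons: list[str], keep_recent: int = 3) -> list[str]:
    """Keep the newest lessons verbatim and replace older ones with a count."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(lessons) <= keep_recent:
        return list(lessons)
    omitted = len(lessons) - keep_recent
    # The placeholder tells the model that history exists without spending tokens on it.
    return [f"[{omitted} earlier attempts omitted]"] + lessons[-keep_recent:]
```

The information lost this way remains recoverable from the event log, which is the trade-off Option A accepts.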
Testing the fix
- Synthetic oversized input. Construct a fake episodic memory with 30 attempts worth of text, run assembly with budget 8K, verify truncation behaves deterministically and the output is under budget.
- Regression test against overnight run. Replay the overnight run's state at each overflow point, verify the new assembler produces prompts under 10K tokens for each.
- Empirical run. After Layers 1 and 2 of improvements/02 and this fix are deployed, run for 2 hours and confirm zero HTTP 400 errors in the run log.
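The synthetic oversized-input check could be sketched as below. The assembler here is a compact copy of the Phase 2 logic so that the example runs standalone, est_tokens rounds up (an assumption), and the 2000 characters per attempt is a made-up figure chosen so that 30 attempts actually overflow an 8K budget.

```python
MARKER = "\n[... truncated for budget ...]"


def est_tokens(text: str) -> int:
    return (len(text) + 3) // 4  # rough heuristic, about 4 chars per token


def assemble_prompt(sections: list[tuple[int, str, str]], budget_tokens: int) -> str:
    """Compact copy of the Phase 2 assembler (labels unused here)."""
    parts: list[str] = []
    used = 0
    for _, _, content in sorted(sections, key=lambda s: s[0]):
        cost = est_tokens(content)
        if used + cost <= budget_tokens:
            parts.append(content)
            used += cost
            continue
        remaining = budget_tokens - used
        if remaining > 100:
            parts.append(content[: remaining * 4 - 50] + MARKER)
        break  # everything after the first overflowing section is dropped
    return "\n\n".join(parts)


def make_oversized_sections(
    attempts: int = 30, chars_per_attempt: int = 2000
) -> list[tuple[int, str, str]]:
    # Fake episodic history, deliberately far larger than the budget.
    history = "\n".join(
        f"Attempt {i}: " + "x" * chars_per_attempt for i in range(1, attempts + 1)
    )
    return [
        (2, "episodic_history", history),
        (0, "rule", "Rule R-1: example rule"),
        (1, "directive", "Call apply_fix once and return."),
    ]


def test_oversized_input_stays_under_budget() -> None:
    sections = make_oversized_sections()
    first = assemble_prompt(sections, 8000)
    second = assemble_prompt(sections, 8000)
    assert first == second                # deterministic
    assert est_tokens(first) <= 8000      # final string fits the budget
    assert first.startswith("Rule R-1")   # highest priority comes first
    assert first.endswith(MARKER)         # history was truncated, not dropped


test_oversized_input_stays_under_budget()
```

The check measures the final joined string rather than the running total, because the separators between sections are not counted during assembly.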
Estimated effort
- est_tokens + assemble_prompt functions: 30 minutes
- Distilled lesson Reflector output + parsing: 30 minutes
- Rewiring Worker/Architect/Reflector prompt assembly to use the new assembler: 1 hour
- Instrumentation + validation: 30 minutes
Total: ~2.5 hours.
Relationship to other improvements
This fix composes cleanly with improvements/02-worker-single-action-enforcement.md
and improvements/01-architect-reengagement.md. Both of those fixes
reduce the need for aggressive context budgeting (fewer internal retries =
less in-turn accumulation; architect re-engagement means fewer deep inner
loops). The budget enforcer is the belt to their suspenders: even if those
fixes partially regress or new agents get added, the budget will catch
overflows before they become HTTP 400s.