Context Graphs and the Memory Question¶

The research spiral started with a podcast episode and ended with SQLite. In between: Foundation Capital's trillion-dollar thesis about decision provenance, a November 2025 paper where a 4B model with graph memory outperformed an 8B model without, a NIST policy initiative that reframed audit logs as compliance artifacts, and the eventual realization that none of it actually determined the storage substrate. What did was a feature I hadn't built yet.

I almost added PostgreSQL. I almost went with plain JSON files. I ended up at SQLite — not because of the context graph, but because the adaptive concurrency controller I hadn't built yet required concurrent read/write that only a proper database can provide safely. When a theoretical argument and a practical requirement converge on the same simple answer, you're probably in the right place.

The spark: context graphs¶

The insight came from an AI Daily Brief podcast episode where Nathaniel Whittemore discussed a concept that stuck: in the future, the knowledge about why a decision was made — the reasoning chain, the information provenance, the alternatives considered — may be more valuable than the output itself.

This connected to Foundation Capital's "Context Graphs: AI's Trillion-Dollar Opportunity" (December 2025), which argues that the next trillion-dollar platforms won't capture data — they'll capture decision traces:

"Not the model's chain-of-thought, but a living record of decision traces stitched across entities and time so precedent becomes searchable."

The thesis: agents don't just need rules (what should happen). They need decision traces showing how rules were applied in past cases, where exceptions were granted, how conflicts were resolved. That's the context that makes autonomous agents trustworthy and improvable.

The research validates the instinct¶

Convergent evidence from multiple directions:

Academic: Trainable Graph Memory (arxiv 2511.07800)¶

A paper from November 2025 that builds a three-layer graph for agent memory: 1. Queries (what was asked) 2. Canonical decision paths (how it was solved) 3. Meta-cognitions (strategic principles from contrasting successes and failures)

The edges have learned weights via REINFORCE — the graph literally learns which strategies are most valuable. The headline result: a 4B model with this graph memory outperformed a baseline 8B model.

A 4B model with graph memory outperformed an 8B model without it. The memory is worth more than the extra parameters.

That's this project's thesis in a paper. Small edge model + right harness + accumulated knowledge = effective solutions.

Regulatory: NIST AI Agent Standards Initiative (Feb 2026)¶

NIST launched the first US government program for autonomous AI agents. They require: - Chain-of-custody logging for autonomous operations - Provenance of prompts and data sources - Audit trails linking agent actions to their reasoning

OMB Memoranda M-25-21 and M-25-22 classify agentic AI as "High-Impact AI" subject to governance requirements. SP 800-53 control overlays for agentic systems (COSAiS) are in development.

For a Federal-facing demo, decision provenance isn't a nice-to-have. It's becoming a compliance requirement.

Industry: The Memory Landscape in 2026¶

Mem0's State of AI Agent Memory 2026 confirms graph memory is emerging but not yet dominant. Graph approaches achieve marginally better accuracy (68.4% vs 66.9%) on complex reasoning but at 2x latency. The industry is transitioning from vector-first to graph-aware architectures.

ICLR 2026 hosted a dedicated MemAgents Workshop on memory for agent systems. This is now a recognized research area, not a fringe interest.

The LangChain "Your Harness, Your Memory" article argues that agent harnesses are permanent infrastructure, not scaffolding. Memory is intrinsic to harness design. Whoever controls the harness controls the memory, and that's the real lock-in.

The database question: three rounds¶

Round 1: PostgreSQL¶

The first instinct was PostgreSQL via docker-compose. Clean schema, proper foreign keys, SQL queries for cross-run analysis. A four-table schema would model the full context graph:

runs          -- one row per harness execution
work_items    -- nodes in the task graph, linked to runs
attempts      -- decision traces: approach, evaluation, reflection
lessons       -- cross-run meta-cognitions with learned weights

The problem: PostgreSQL adds a real dependency. Anyone who clones the repo needs Docker, needs to wait for the container, needs to handle "what if the DB is down." For a single-user exploration project on a home lab, that's heavy.

Round 2: JSON files¶

The devil's advocate: does this even need a database? The research says file-based memory works at this scale:

What's needed	File approach	Scale threshold
Cross-run lessons	`memory/lessons.json`	Works under ~1,000 entries
Banned patterns	`memory/banned.json`	Works indefinitely
Difficulty model	`memory/difficulty.json`	Works fine
Audit trail	JSONL event logs (already exist)	Works at any scale
Prior attempt queries	Scan run JSONL files	Works under ~50 runs

Industry analysis confirms: Manus, OpenClaw, and Claude Code all converge on "memory as documentation" — files in the workspace. The consensus threshold: files work under ~5MB of memory data. The distilled memory here (lessons, bans, difficulty model) would be well under 1MB.

The problem: the adaptive concurrency controller.

Round 3: The clutch changes everything¶

The "clutch" mechanism — adaptive concurrency based on learned difficulty and observed GPU utilization — needs to:

Read difficulty estimates while workers are running (to decide whether to spawn another worker)
Write updated estimates as each worker completes (the success or failure changes the model)
Both at the same time — worker 1 finishes and writes while worker 2 reads to decide if worker 3 should start

That's concurrent read/write. JSON files can't do this safely — one write clobbers another. PostgreSQL can, but it's overkill.

SQLite with WAL mode is the answer: - Zero dependencies — sqlite3 is in Python's standard library - Zero configuration — no server, no container, no docker-compose - One file — memory/gemma_forge.db - Proper concurrent access with Write-Ahead Logging - SQL queries for aggregation and cross-run analysis - Portable — clone the repo, the DB creates itself on first run

The SQLite decision didn't come from the context graph. It came from the clutch.

The v5 architecture¶

What I'm building¶

Decision graph schema — node/edge types as Python dataclasses, serialized to SQLite. The graph structure maps the existing JSONL event types into persistent, queryable relationships.
SQLite persistence — memory/gemma_forge.db with tables for runs, work items, attempts, and strategic lessons. Created automatically on first run. Survives across runs.
Cross-run retrieval — before each attempt, the harness queries: "What approaches were tried for items like this in prior runs? What worked? What was banned?" Prior knowledge is injected into the architect's context.
Adaptive concurrency (the clutch) — the harness reads learned difficulty estimates to set concurrency. Easy categories (95% first-try success) get parallel workers. Hard categories (low success rate) stay serial. GPU utilization gaps inform how many workers to spawn. The clutch adjusts mid-run as new data arrives.
NIST-aligned audit export — render the decision graph as a compliance artifact. Every autonomous action has provenance back through the reasoning chain to the evidence that informed it.

The interface boundary¶

class MemoryStore(Protocol):
    """Abstract memory persistence — SQLite now, upgradeable later."""
    def load_lessons(self, category: str) -> list[Lesson]: ...
    def save_lesson(self, lesson: Lesson) -> None: ...
    def load_difficulty_model(self) -> dict: ...
    def update_difficulty(self, category: str, outcome: str): ...
    def query_prior_attempts(self, item_id: str) -> list[Attempt]: ...

Same pattern as the skill interface extraction: abstract the concern, implement with the simplest viable backend, upgrade only when forced. If someone deploys this at enterprise scale with 10,000 runs, they swap SQLite for PostgreSQL. The harness never knows.

Looking back at the arc¶

It's worth stepping back and noticing what happened over the past week. The harness started as a retry loop. Then it learned to reflect. Then it learned to remember within a run. Then it became skill-agnostic. And now it learns across runs — each execution leaving the next one smarter, with every decision traceable.

None of this was planned from the start. Each version emerged from running the previous one, honestly analyzing the failures, and asking "what would make this better?" That's the Ralph loop applied to itself.

Sources consulted¶

journey/19 — the first research pass that validated the v3→v4 choices.
journey/20 — the interface pattern being extended to memory persistence.
journey/21 — the task graph that the context graph enriches with decision provenance.

Context Graphs and the Memory Question¶

The spark: context graphs¶

The research validates the instinct¶

Academic: Trainable Graph Memory (arxiv 2511.07800)¶

Regulatory: NIST AI Agent Standards Initiative (Feb 2026)¶

Industry: The Memory Landscape in 2026¶

The database question: three rounds¶

Round 1: PostgreSQL¶

Round 2: JSON files¶

Round 3: The clutch changes everything¶

The v5 architecture¶

What I'm building¶

The interface boundary¶

Looking back at the arc¶

Sources consulted¶

Related¶