Memory Is Not Retrieval: Designing Long-Running AI Agents
Since the introduction of OpenClaw, the use of local AI agents has expanded rapidly. The ability to install an agent locally, give it access to personal context, and let it act on your behalf is powerful.
As these agents evolved, more use cases were added. More skills. More memory. More injected context. The prevailing assumption has been that the more context the model sees, the better it will perform.
That assumption holds for short interactions.
For long-lived agents, it eventually breaks.
Most system prompts inject information from multiple sources in order to give the model as much context as possible. The idea is sound when interactions are brief. But when the model is expected to operate across extended sessions, the same strategy creates instability.
Before tools like OpenClaw, our interaction with LLMs was largely confined to chat windows. Those sessions were not persistent. They did not accumulate operational state over weeks.
Persistent agents change the nature of the problem.
The failure mode is no longer missing context or weak retrieval. It is uncontrolled accumulation.
The Actual Problem With Long-Lived Agents
There is a fundamental difference between a chatbot session and an agent that persists across days or weeks.
A chatbot conversation is transient. A question is asked, an answer is given, the session ends. Context management is trivial because context never accumulates long enough to degrade.
A persistent agent is different.
As context fills with prior messages, loaded documentation, tool definitions, and accumulated state, the model’s ability to focus on the current task degrades. Not because the window is full, but because the signal-to-noise ratio shifts.
The relevant information is still present, but it competes with everything that no longer matters.
This is not a model intelligence problem.
It is an attention allocation problem.
A half-empty context full of signal will outperform a full context full of noise.
Large context windows delay the issue but do not eliminate it. For local models running with 8k–32k tokens, context discipline is not optional. It determines whether the agent remains coherent or begins to drift.
If an agent is meant to persist across sessions and accumulate knowledge, what lives in its context must be actively controlled.
That is the constraint this system was built around.
Two Symptoms of the Same Cause
Running a persistent agent across dozens of sessions revealed two recurring patterns.
The first was context bloat. Every session loaded identity files, skill definitions, documentation, and memory. Most of it was irrelevant to the current task. The context window filled with material that had once been useful but was no longer necessary.
Meanwhile, the space available for actual reasoning shrank.
The second was memory fragmentation. Knowledge accumulated without structure. Infrastructure credentials lived next to transient debugging notes. Related insights were split across dates. Nothing expired. Over time, the memory store itself became noise.
These are not exotic edge cases. They are structural consequences of unbounded accumulation.
And they compound: fragmented memory increases context bloat, and bloated context makes retrieval less reliable.
The underlying issue is not retrieval sophistication.
It is accumulation without maintenance.
The Ideal Mode of Operation: Near-Zero Overhead
The design target is simple:
The always-loaded context overhead should approach zero.
Not zero total context, but zero unnecessary context.
If identity, skills, and memory can be represented in a way that costs almost nothing unless actively needed, then most of the window can be reserved for reasoning and task execution.
Achieving this is not a summarization exercise. It requires treating each injected category differently:
- Stable identity instructions
- Semi-stable skills
- Volatile memory
Each has different lifecycle properties. Each requires a different lever.
Identity and Instruction Files
Files like IDENTITY.md and SOUL.md are stable. They rarely change across sessions.
The mistake is not loading them. It is loading them verbatim.
Consider a typical identity file. It might contain paragraphs like:
```
You are a personal assistant named Jarvis. You should always respond in a professional but friendly tone. When the user asks you to perform a task, you should first confirm your understanding of the task before proceeding. You value efficiency and clarity in your responses…
```
An LLM does not need this phrasing. It needs the constraints. The same behavioral directives can be expressed as:
```
Name: Jarvis. Tone: professional, friendly. On task request: confirm understanding before executing. Priorities: efficiency, clarity.
```
The first version costs roughly four times the tokens. Both produce the same behavior from the model.
A structural compression pass applies this transformation systematically: collapse prose to declarative statements, remove redundant phrasing, preserve all concrete rules and prohibitions. An 800-token identity file reduces to 300 tokens without losing any behavioral constraint the model would act on.
Compression here is sufficient because the information is stable and always relevant. There is no need for selective loading or expiration. The file is always needed, so the only lever is making it cheaper.
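As a rough illustration, the first step of such a pass can be implemented with plain pattern substitution. This is a minimal sketch, not the actual system's implementation: the filler phrases and the `compress_identity` name are illustrative, and a real pass would also collapse full sentences into declarative `key: value` statements.

```python
import re

# Illustrative filler phrases commonly found in prose-style identity files.
# A real structural pass would also rewrite sentences as declaratives.
FILLER_PATTERNS = [
    r"\byou should (always |first )?",
    r"\bplease\b",
    r"\bin your responses\b",
]

def compress_identity(text: str) -> str:
    """Strip redundant instructional phrasing while preserving concrete rules."""
    for pattern in FILLER_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"[ \t]+", " ", text).strip()
```

The key property is that names, rules, and prohibitions survive verbatim; only the instructional scaffolding around them is removed.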
Skills: Semi-Stable, Selectively Loaded
Skill definitions are different. They change infrequently, but only a subset is relevant in any given session.
An agent with ten skills might have skill documents totaling 8,000 tokens. If only one or two skills are relevant to the current task, the other 6,000 tokens are pure overhead.
Loading all skills every time is wasteful. But loading none means the agent cannot discover its own capabilities.
The solution is twofold.
First, compress each skill document to strip human scaffolding. A 1,200-token skill file that includes three code examples, a paragraph of explanation per command, and an ASCII architecture diagram can be reduced to 500 tokens by keeping one example per section, removing the diagrams, and collapsing the prose to single-line descriptions. The model loses nothing it would act on. It loses the pedagogical framing that a human developer would need but an LLM does not.
Second, load skills selectively. Each skill declares trigger keywords in a manifest. When the user’s request or the current task context contains those keywords, the matching skill is loaded. A request mentioning “pod” or “kubectl” triggers the kubernetes skill. A request about “invoice” or “bank” triggers the finance skill. If no trigger matches, the skill stays out of context entirely.
The manifest itself is small. A line per skill with its name and triggers costs a fraction of loading even one full skill document. The agent sees what capabilities exist and can request a skill explicitly if needed, but the default is to load only what is relevant.
Compression reduces per-item cost. Selective loading reduces item count. Together, an 8,000-token skill inventory might cost 1,500 tokens in practice for a typical session.
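A trigger manifest of this kind fits in a few lines. The sketch below is illustrative, not the system's actual manifest format; the skill names and keywords are assumptions.

```python
# Hypothetical manifest: each skill declares its trigger keywords.
SKILL_TRIGGERS = {
    "kubernetes": {"pod", "kubectl", "namespace", "deployment"},
    "finance": {"invoice", "bank", "payment"},
}

def select_skills(request: str) -> list[str]:
    """Return the skills whose triggers appear in the request; default is none."""
    words = set(request.lower().split())
    return [name for name, triggers in SKILL_TRIGGERS.items() if words & triggers]

select_skills("restart the failing pod with kubectl")  # -> ["kubernetes"]
select_skills("summarize my notes")                    # -> [] (nothing loaded)
```

The manifest itself is the only part that is always in context; the skill documents behind it are loaded on demand.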
Memory: The Volatile Component
Memory is where the real complexity lies.
It grows every session. New facts are added. Old ones become stale. Related entries fragment across dates.
If every interaction is recorded and blindly injected into subsequent prompts, memory becomes the dominant source of context inflation.
The solution is a structured store with explicit write-time filtering and an auto-generated index.
The store is a SQLite database. Each memory entry contains a unique key, a topic, a one-line summary, the full content, and optional expiration metadata. The schema also includes an access log table that records every time an entry is retrieved, which becomes important later for signal density tracking.
From this store, the system generates a topic-grouped index file. The index looks like this:
```
## infrastructure
- ha-access: Home Assistant access credentials and SSH config
- k8s-cluster: Production cluster endpoints and namespaces
- db-creds: MySQL connection strings for ECM databases

## workflow
- agent-behavior-rules: Communication style, cost awareness, delegation policy
- pr-review-checklist: Required checks before merging to main

## finance
- fixed-costs: Monthly recurring charges and payment dates
- bcp-access: Bank portal credentials and MFA setup
```
Each line costs a few tokens. At a few hundred entries, the entire index costs roughly 500 tokens. That cost becomes predictable and bounded.
This index is loaded every session. The agent sees the full inventory of what it knows. When it needs the details of a specific entry, it calls a tool to retrieve it by key. The full content is only loaded into context when explicitly requested.
The filtering happens at write-time, not retrieval-time.
A Traditional Alternative: Vectors
The default answer to persistent memory is a vector store: embed everything, retrieve by similarity.
That approach excels when dealing with thousands of heterogeneous documents where relevance cannot be anticipated.
But in this case, the memory store contained a few hundred structured entries. Each was written deliberately with a descriptive key, a topic, and a summary.
Under those conditions, LIKE search in SQLite is transparent and sufficient. There is no embedding model to version, no similarity threshold to tune, and no ambiguity about why a given entry was retrieved.
The tradeoff is clear: keyword search is weaker at fuzzy recall.
But when memory is curated at write-time, fuzzy recall becomes less important than precision and predictability.
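In code, that lookup amounts to very little. A sketch, with the table reduced to the searched columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memory (key TEXT, summary TEXT, content TEXT, active INTEGER)")
conn.execute("INSERT INTO memory VALUES ('db-creds', 'MySQL connection strings', 'host details', 1)")

def search_memory(conn: sqlite3.Connection, term: str) -> list[tuple]:
    """Plain LIKE search: transparent, no thresholds, trivially debuggable."""
    like = f"%{term}%"
    return conn.execute(
        "SELECT key, summary FROM memory WHERE active = 1 "
        "AND (key LIKE ? OR summary LIKE ? OR content LIKE ?)",
        (like, like, like),
    ).fetchall()

# SQLite LIKE is case-insensitive for ASCII by default.
search_memory(conn, "mysql")  # -> [('db-creds', 'MySQL connection strings')]
```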
Write-Time Filtering
This is the most consequential design decision.
Instead of recording everything and relying on retrieval to sort it out later, the system applies judgment at the point of writing. The agent itself decides what deserves persistence.
During a session, the agent might check the status of a Kubernetes cluster, debug a failing service, read several log files, and discover that a configuration value needs to change permanently. Of all that activity, only the last part is durable.
What gets stored is the configuration change: a new entry with a descriptive key, a topic, and the specific values that matter for future sessions.
This distinction is not enforced by the storage layer. It is enforced by the agent’s own judgment about what constitutes a durable fact versus a transient observation.
When a memory is written, the index is regenerated automatically. The agent always sees the curated inventory on its next session.
This inverts the usual retrieval pattern.
In a typical RAG setup, everything is recorded and the retrieval system decides what is relevant at query time. Here, recording itself is selective, and the agent decides what to load from the index at session time.
The agent holds the task context. The infrastructure does not.
The decision about relevance belongs where the context lives.
Back Annotation and Consolidation
Even curated memory drifts over time.
An agent writing memories across weeks will inevitably misclassify some entries. Daily session notes accumulate without being merged into durable insights. Related facts written on different days remain fragmented.
Left unaddressed, this drift degrades the index and lowers signal density.
To address this, a nightly consolidation process runs outside interactive sessions. I call it back annotation because it revisits and corrects the store after the fact.
It runs in four phases.
- Re-topic: scan entries classified as "general" and reassign them using configurable keyword rules defined in a TOML file.
- Auto-merge: consolidate multiple related entries that share the same topic and date.
- Expire: deactivate daily entries older than a configurable threshold (14 days by default).
- Metrics: snapshot the store state (entry counts, topic distribution, index size, and signal density) and append it to a JSONL log.
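As one concrete example, the expire phase reduces to a single UPDATE. This is a sketch under assumptions: the `created_at` column and the `daily` topic label are guesses at how such entries would be tagged, not confirmed details of the system.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memory (key TEXT, topic TEXT, created_at TEXT, active INTEGER)")
conn.executemany("INSERT INTO memory VALUES (?, 'daily', ?, 1)", [
    ("session-old", "2020-01-01"),
    ("session-new", datetime.now().strftime("%Y-%m-%d")),
])

def expire_daily(conn: sqlite3.Connection, max_age_days: int = 14) -> int:
    """Deactivate daily entries older than the threshold; returns how many."""
    cutoff = (datetime.now() - timedelta(days=max_age_days)).strftime("%Y-%m-%d")
    cur = conn.execute(
        "UPDATE memory SET active = 0 WHERE topic = 'daily' AND created_at < ?",
        (cutoff,),
    )
    return cur.rowcount

expire_daily(conn)  # -> 1 (only the old entry is deactivated)
```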
Memory becomes something maintained, not merely accumulated.
Signal Density
Signal density measures how many active memory entries were actually accessed in the last 24 hours relative to the total active store.
If 200 entries exist but only 15 are ever recalled, 92.5% of the index is dead weight.
Without usage tracking, memory stores grow monotonically. With it, inefficiency becomes measurable.
This metric creates a feedback loop. When signal density drops, the store is accumulating noise. Entries can be consolidated or expired accordingly.
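The metric itself is just a ratio of two counts. A sketch against the access-log idea described above (table layout is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memory (key TEXT, active INTEGER)")
conn.execute("CREATE TABLE access_log (key TEXT, accessed_at TEXT)")
conn.executemany("INSERT INTO memory VALUES (?, 1)", [(f"entry-{i}",) for i in range(200)])
# Fifteen distinct entries were retrieved within the last 24 hours.
conn.executemany(
    "INSERT INTO access_log VALUES (?, datetime('now'))",
    [(f"entry-{i}",) for i in range(15)],
)

def signal_density(conn: sqlite3.Connection) -> float:
    """Fraction of active entries accessed in the last 24 hours."""
    total = conn.execute("SELECT COUNT(*) FROM memory WHERE active = 1").fetchone()[0]
    recent = conn.execute(
        "SELECT COUNT(DISTINCT key) FROM access_log "
        "WHERE accessed_at >= datetime('now', '-24 hours')"
    ).fetchone()[0]
    return recent / total if total else 0.0

signal_density(conn)  # -> 0.075, i.e. 92.5% of the index is dead weight
```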
Memory systems without usage tracking optimize for completeness.
Persistent agents need to optimize for signal.
Compression: Budget Discipline
Memory management alone is not enough. Static documentation (skills, configuration guides, tool references) often becomes the second largest contributor to prompt inflation.
Skill documentation and configuration files are written for humans. They contain explanatory prose, repeated examples, ASCII diagrams, and YAML frontmatter.
An LLM does not require narrative scaffolding. It requires identifiers, commands, paths, thresholds, and structural contracts.
The compression pipeline offers three progressively aggressive strategies.
Strategy A is pure Python with no LLM dependency. It strips YAML frontmatter, keeps only the first code example per section, removes ASCII diagrams, collapses prose to structural summaries, and preserves identifiers exactly.
This consistently achieves 30–50% token reduction.
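A condensed sketch of what such a pass can look like, covering two of the listed steps (frontmatter stripping and one code example per section); prose collapsing and diagram removal are omitted for brevity, and the function name is illustrative:

```python
import re

def compress_doc(doc: str) -> str:
    """Sketch of two Strategy A steps: strip YAML frontmatter and keep only
    the first fenced code example per markdown section."""
    # Step 1: remove a leading YAML frontmatter block.
    doc = re.sub(r"\A---\n.*?\n---\n", "", doc, flags=re.DOTALL)

    out: list[str] = []
    in_fence = drop_fence = False
    fences_in_section = 0
    for line in doc.splitlines():
        if line.startswith("#") and not in_fence:
            fences_in_section = 0        # new section: one example allowed again
        if line.startswith("```"):
            if not in_fence:             # opening fence
                in_fence = True
                fences_in_section += 1
                drop_fence = fences_in_section > 1
            else:                        # closing fence
                in_fence = False
                if drop_fence:
                    drop_fence = False
                    continue             # drop the closing fence line too
            if drop_fence:
                continue
        elif in_fence and drop_fence:
            continue                     # drop lines inside a surplus example
        out.append(line)
    return "\n".join(out)
```

Because the pass is deterministic, identifiers, commands, and paths are preserved byte-for-byte, which is what makes it a safe fallback for the LLM strategies.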
Strategy B uses an LLM to rewrite the document at a target token budget while preserving all literal identifiers.
Strategy C applies a second LLM pass focused on token-level optimization, pushing compression toward 50–70% reduction.
To ensure nothing critical is lost, compressed documents are validated against test queries and required identifier checks.
If an LLM-compressed version fails validation, the system falls back to Strategy A.
Compression reduces per-item cost. Selective loading reduces item count. Back annotation prevents unbounded growth.
Together, they enforce context hygiene.
What This System Is — and Is Not
This is not a general-purpose RAG system.
It does not attempt to index thousands of arbitrary documents.
It addresses a narrower problem:
How do you keep a single persistent agent coherent across weeks of accumulated state?
Retrieval answers: “Given a query, what is relevant?”
Maintenance answers: “Given everything the agent knows, what is still worth keeping?”
Long-lived agents degrade not because they lack retrieval sophistication, but because they lack maintenance discipline.
Once context is treated as a constrained resource that requires active maintenance, coherence becomes sustainable.
That is the architectural shift behind this system.
What Remains: In-Session Compaction
Everything described so far operates between sessions. Compression, selective loading, and consolidation reduce what enters the context window at the start. But within a session, the context still grows.
Tool outputs accumulate. Conversation history lengthens. Intermediate reasoning fills space. Even with a lean starting context, a productive session will eventually approach the window boundary.
At that point, compaction becomes necessary. The system needs to summarize or discard earlier parts of the session to make room for continued work. This is a distinct problem from pre-session context management, and it introduces its own tradeoffs: what can be safely discarded mid-session without losing continuity? How does the agent preserve awareness of decisions made earlier in the conversation? How should this be communicated to the user?
This system does not yet solve in-session compaction. What it does is buy runway. A session that starts with 2,000 tokens of overhead instead of 12,000 has significantly more space before compaction is needed. In many cases, that is enough for the session to complete without hitting the boundary at all.
But for truly long sessions, compaction remains an open problem. It will need to be handled autonomously, with minimal disruption, and with enough transparency that the user understands when the agent’s working memory has been compressed mid-conversation. That is a design problem worth its own treatment.
The Missing Discipline: Context Maintenance
Most discussions about agent memory focus on retrieval.
How do we search a large body of information? How do we embed documents? How do we rank results?
Those are valid problems when the knowledge base is large and heterogeneous.
But a long-lived agent’s working memory is different. It is small, structured, and actively written by the system itself. The challenge is not discovering information in a large corpus. The challenge is preventing the system from accumulating context faster than it can remain useful.
In traditional systems engineering, long-running systems require maintenance loops.
Logs rotate. Caches expire. Databases compact themselves. Metrics reveal drift.
Without these processes, systems accumulate state until they become unstable.
Long-lived LLM agents are no different.
If every piece of context is preserved indefinitely, the agent eventually drowns in its own history. The model becomes less coherent not because it lacks intelligence, but because its working memory is no longer disciplined.
The architecture described here treats context as a constrained resource.
The result is not a smarter model.
It is a system that protects the model’s attention.
And in long-running agents, protecting attention is what keeps them useful.
The system, context-engine, is open-source and built on stdlib Python with zero required dependencies.