April 8, 2026

I Tried to Give AI a Brain. It Didn't Want One.

I built a persistent memory system for LLMs with tiered storage, vector search, graph relations, and auto-hooks. The infrastructure works. The AI just never used it. Here's what went wrong and what it taught me about LLM memory.

Sascha Becker
Author

14 min read

A few weeks ago I came across Google's Titans architecture. The core idea: transformers have attention, which works like short-term memory, but they lack anything resembling long-term memory. Titans adds a neural memory module (a deep MLP) that updates itself during inference based on how "surprised" it is by new input. Unexpected information gets written. Predictable stuff gets skipped. It scales to 2M+ tokens and beats GPT-4 on needle-in-haystack benchmarks despite having far fewer parameters.

That got me thinking. I use Claude Code and Kimi every day. Every single session starts blank. The AI doesn't know my name. Doesn't know I prefer TypeScript. Doesn't remember that we already tried and rejected Redis for this project last week. Every conversation is Groundhog Day.

What if I built an external memory system and wired it into these tools? Not by modifying the model (I obviously can't), but by giving it a persistent brain through hooks and APIs. One day. 28 commits. Two platforms. Let's see what happens.

The background

Human memory isn't one thing. Cognitive science has modeled it as multiple systems since the 1960s:

  • Working memory: small, fast, what you're actively thinking about
  • Episodic memory: personal experiences with timestamps ("last Tuesday's standup")
  • Semantic memory: long-term knowledge, facts you just know ("Python uses indentation")

The interesting part is how memories move between these systems. You remember your first Python tutorial (episodic). Eventually you just know Python (semantic). The specific episode fades. The knowledge stays. That process is called consolidation. And forgetting isn't a bug. It's what keeps the system from drowning in noise.

Titans mirrors this. It has persistent memory (frozen weights), contextual memory (the MLP that learns at test time), and core attention (the regular context window). The surprise metric decides what's worth writing to the contextual memory layer.

I wanted to build the external version of this for coding assistants. A Python library with three memory tiers, vector similarity search, importance scoring with exponential decay, typed graph relations between memories, and hooks that handle everything automatically.

What I built

The library is called llm-brain. SQLite with vector extensions for storage, an optional graph database for relations, and a clean Python API:

```python
from llm_brain import Brain

brain = Brain(vector_dimensions=128)

# Store
brain.memorize(text="User prefers TypeScript", importance=0.9, tier="working")

# Recall (embed() is whatever embedding function you supply)
results = brain.recall(query_vector=embed("language preferences"), top_k=5)

# Forget
brain.forget(memory_id)

# Connect memories
brain.relate(source_id, target_id, "supports", weight=0.8)
```

Each memory gets an importance score that decays over time following Ebbinghaus's forgetting curve from 1885. Memories can be promoted from working to episodic to semantic. There's LRU eviction for stale items. A consolidation cycle handles all of that automatically.
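
The decay-and-promotion logic is simple to sketch. Here's a minimal illustration of one consolidation pass, assuming a memory carries its importance and a last-access timestamp; the field names, half-life, and thresholds are illustrative, not llm-brain's actual schema:

```python
import math
import time

def decayed_importance(importance: float, last_access: float,
                       half_life_s: float = 86_400.0) -> float:
    """Exponential decay of importance since last access,
    in the spirit of the Ebbinghaus forgetting curve."""
    age = time.time() - last_access
    return importance * math.exp(-age * math.log(2) / half_life_s)

def consolidate(memories: list[dict]) -> list[dict]:
    """One consolidation pass: re-score every memory, promote
    strong working memories, evict anything decayed to noise."""
    survivors = []
    for m in memories:
        score = decayed_importance(m["importance"], m["last_access"])
        if score < 0.05:
            continue  # evict: effectively forgotten
        if m["tier"] == "working" and score > 0.7:
            m["tier"] = "episodic"  # promote still-important memories
        m["importance"] = score
        survivors.append(m)
    return survivors
```

A fresh high-importance memory gets promoted; one untouched for a month decays below the eviction floor and disappears, which is the "forgetting isn't a bug" behavior from the cognitive-science section.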

I also built a real-time dashboard with FastAPI and vanilla JS. Tier distribution bars, a force-directed graph of memory relations, live activity feed, search, delete. The full package.

All of it works. Tests pass. Dashboard looks great. You can literally watch memories flow in and out in real-time.

Except no memories ever flow in.

Five attempts at making the AI use it

Attempt 1: Kimi 2.5 with a skill file

Kimi supports custom skills through markdown files. I wrote one that was impossible to misunderstand:

```markdown
# MANDATORY AI MEMORY PROTOCOL

**THIS IS NOT OPTIONAL.** You MUST use your brain in every conversation.

## MANDATORY: At Session Start (FIRST THING YOU DO)

Before responding to the user's first message, you MUST:

[Python code to load memories]

## MANDATORY: During Conversation

After EVERY significant user message, you MUST:

1. RECALL relevant memories
2. STORE anything important
```

Bold text. All caps. Checklists. Copy-paste Python code blocks. The word "MANDATORY" appears six times in the file.

Kimi occasionally acknowledged the skill existed. It did not execute the brain operations.

Attempt 2: Even louder Kimi instructions

More aggressive language. Concrete examples. A full walkthrough of what a correct session looks like. Same result. The AI read the instructions, seemed to understand them, and then just... didn't follow them.

Attempt 3: Claude Code with automatic hooks

New strategy. Instead of asking the AI nicely, I used Claude Code's hook system to run Python code automatically:

  • SessionStart hook: calls brain.recall_important(top_k=5) and injects the results into context
  • UserPromptSubmit hook: pattern-matches user messages for keywords like "prefer", "decided", "my project" and auto-stores matches

The SessionStart hook worked great. Memories loaded, got injected, the AI could see them. But the UserPromptSubmit hook relied on keyword matching, which was brittle. "I prefer TypeScript" got stored. "Let's use TypeScript for this project" did not, because it doesn't contain the word "prefer." Most meaningful context slipped through.
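
The brittleness is easy to demonstrate. A hook along these lines (the trigger list is illustrative, not the hook's exact configuration) only fires when the literal keyword appears:

```python
import re

# Illustrative trigger list for the UserPromptSubmit hook
TRIGGERS = re.compile(r"\b(prefer|decided|my project)\b", re.IGNORECASE)

def should_store(user_message: str) -> bool:
    """Return True if the message matches a memory-worthy keyword."""
    return bool(TRIGGERS.search(user_message))

should_store("I prefer TypeScript")                    # → True
should_store("Let's use TypeScript for this project")  # → False
```

The second message carries exactly the same preference, but no trigger word, so it slips through. Scaling the keyword list never closes that gap; it just moves it.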

Attempt 4: Let the AI decide what matters

I ripped out the keyword matching and rewrote the CLAUDE.md to position the AI as "memory manager." No rigid rules. Just: you have a brain, use your judgment about what to store.

```markdown
## Your role as memory manager

You have a persistent brain available. Use it intelligently:

- **Store** facts worth remembering across sessions
- **Recall** when context would help you give a better answer
- **Update** memories when information changes
- **Connect** related memories with relations
- **Forget** outdated or incorrect memories
```

The AI acknowledged its role. During actual conversations? Stored nothing.

Attempt 5: The conversation where I'm writing this

I asked the AI to rework the dashboard. It rebuilt the entire thing. Fixed XSS vulnerabilities, added search, added a delete confirmation dialog, made the graph canvas responsive for HiDPI displays. Solid work.

Then I asked it to write a research paper on LLM memory. It produced a thorough document covering Titans, MemGPT, Mem0, cognitive science, and open research gaps. Again, solid.

Then I pointed out that it had spent hours working on the memory system, writing about memory, improving the memory dashboard, and had not once actually used the memory system to remember anything about our conversation.

That's when I called the experiment.

The numbers

| What I expected | What actually happened |
| --- | --- |
| Dozens of memories stored per session | About 5 total, all from keyword-matching hooks |
| AI recalls memories to inform its answers | Never (only auto-loaded at session start) |
| AI-initiated brain operations during tasks | Zero |
| AI updates memories when info changes | Only when I explicitly told it to |
| AI connects related memories with graph edges | Never |

The episodic and semantic tiers are empty. The graph has zero relations. The consolidation system never ran. The brain works perfectly. Nobody's home.

Why it failed

I spent a good chunk of time reading the research after this. Five things stood out.

1. Instructions are not motivations

When the AI reads "you MUST store important memories," it processes that in context. But there's no mechanism that makes it care about this on the next turn. Or the turn after. Every turn, the model re-derives its behavior from scratch. A system prompt instruction is a suggestion that competes with everything else in the context window. And "help the user refactor this component" will always win over "also remember to manage your memory."

This is exactly what Titans solves at the architecture level. The surprise-driven update isn't an instruction to memorize. It's a mechanism. The model doesn't choose to remember. The architecture makes it happen.

2. Background tasks get dropped

Research on multi-agent LLM systems found that models fail to follow task requirements about 12% of the time, even for simple, well-defined tasks. But "manage your memory" isn't a simple task. It's an ambient background responsibility that runs in parallel with whatever the user actually asked for.

Studies also show that rules at the start of a conversation fade as the context grows. Recent messages dominate. Adding constraints imposes a measurable performance penalty on the primary task. The model gets worse at coding when it's also trying to manage memory. So it stops managing memory.

3. Hooks can inject but can't extract

Claude Code hooks fire before the AI responds. The SessionStart hook can load context in. Great. But the UserPromptSubmit hook only sees the raw user message. It has no idea what the AI concluded, what decisions it made, what it learned during the conversation.

The missing piece is a PostResponse hook that processes the AI's output and the full conversation after each turn. Something that could say "the user just told the AI their name, and the AI acknowledged it, so let's store the name." That hook doesn't exist.
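
To be concrete about what's missing: no such hook exists in Claude Code today, but its contract would look something like this. The toy regex stands in for what would realistically be a small LLM extraction call; everything here is hypothetical:

```python
import re

def post_response_hook(user_message: str, ai_response: str, store) -> None:
    """Hypothetical post-turn hook. Unlike UserPromptSubmit, it sees
    BOTH sides of the exchange, so 'the user stated a fact and the AI
    acknowledged it' becomes detectable. `store` is whatever callable
    writes to the brain."""
    stated = re.search(r"\bmy name is (\w+)", user_message, re.IGNORECASE)
    if stated and stated.group(1).lower() in ai_response.lower():
        store(text=f"User's name is {stated.group(1)}", importance=0.9)
```

The key property isn't the extraction logic; it's the signature. Access to the AI's output after the turn is what the current hook system cannot provide.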

4. Memory needs to be in the agent loop

MemGPT (now called Letta) figured this out. The LLM manages memory through explicit function calls as part of its core agent loop. Memory operations sit at the same level as "read file" or "write code." They're not a side instruction in a markdown file. They're part of what the model does on every turn.

Claude Code's agent loop is: read request, think, use tools, respond. Memory management is not in that loop. I tried to bolt it on through instructions. That's like taping a sticky note to someone's monitor that says "remember to breathe." Breathing works because it's automatic. The sticky note doesn't help.

5. There's no feedback when it fails

When the AI doesn't store a memory, nothing bad happens. No error. No missed retrieval that would teach it to do better next time. No consequence at all.

Titans has an explicit learning signal: prediction error. Human memory has emotional valence, repetition effects, and the lived experience of forgetting (which drives future encoding behavior). The instruction-based approach has none of these feedback loops. The AI can fail to use memory in every conversation forever, and nothing ever pushes it to change.

The real mistake

Looking back, I made a category error. I tried to solve a mechanism problem with instructions.

| What was needed | What I built |
| --- | --- |
| Automatic encoding (like implicit memory in humans) | Instructions telling the AI to encode |
| Surprise-driven salience (like Titans) | Keyword pattern matching |
| Integrated memory management (like MemGPT) | Side-channel hooks that can't see AI reasoning |
| Feedback for encoding failures | Nothing. Silent failure. |
| Persistent motivation across turns | System prompt text that fades with context |

If you want to stretch the cognitive science analogy: I built the hippocampus (storage), but forgot about the amygdala (what makes something feel important), the prefrontal cortex (executive control over memory), and the dopamine system (reinforcement for successful recall).

What would actually work

Four things, in order of how much they'd actually help:

Memory in the agent loop. Not a skill. Not a hook. Not a CLAUDE.md file. Memory operations need to be first-class actions in the model's tool set, on the same level as "edit file." This is what Letta does. It's what Mem0 does at the platform layer. The industry is converging on this.
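
"First-class" means the memory operations live in the same registry the loop consults every turn, not in a document the model may or may not reread. A minimal sketch of that pattern, with illustrative tool names and shapes (not Letta's or Claude Code's actual API):

```python
from typing import Callable

def make_toolset(brain_store: list) -> dict[str, Callable]:
    """Register memory operations alongside ordinary coding tools,
    so the agent loop offers them on every turn (the MemGPT/Letta
    pattern). A real version would dispatch on the model's
    tool-call output instead of plain Python callables."""
    return {
        "edit_file": lambda path, content: f"wrote {len(content)} bytes to {path}",
        "memorize": lambda text, importance=0.5:
            brain_store.append({"text": text, "importance": importance}) or "stored",
        "recall": lambda query, top_k=5:
            [m["text"] for m in brain_store
             if query.lower() in m["text"].lower()][:top_k],
    }
```

Once "memorize" sits next to "edit_file", the model doesn't have to remember an ambient instruction; the option is presented structurally, every single turn.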

Post-response processing. If the architecture has to stay external, the minimum viable thing is a process that reads the AI's output after each turn and decides what to store. The current hook system can push context in, but it can't pull insights out. A PostResponse hook would change everything.

Retrieval-triggered encoding. Instead of hoping the AI stores things proactively, use failed retrievals as a signal. When the AI needs information it doesn't have ("I don't know the user's name"), that gap should trigger storage when the answer shows up. Flip the problem: instead of "remember to store," make it "notice what you're missing."
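
Sketched as code, with keyword matching standing in for a real semantic match (the class and its interface are illustrative, not part of llm-brain):

```python
class GapTracker:
    """Retrieval-triggered encoding: a failed recall registers an
    open question; a later message that answers it triggers storage.
    `store` is whatever callable writes to the brain."""

    def __init__(self, store):
        self.store = store
        self.open_gaps: list[str] = []  # topics we failed to recall

    def on_failed_recall(self, topic: str) -> None:
        """Called when the brain had nothing for a needed topic."""
        self.open_gaps.append(topic.lower())

    def on_message(self, text: str) -> None:
        """If a message plausibly answers an open gap, store it."""
        for gap in list(self.open_gaps):
            if gap in text.lower():
                self.store(text=text, importance=0.9)
                self.open_gaps.remove(gap)
```

The trigger is the model's own information need, not a standing instruction, which is exactly the flip from "remember to store" to "notice what you're missing."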

Real embeddings. I used hash-based vectors because it was quick to implement. But that means "I like Python" and "Python is my preferred language" produce completely different vectors. With actual semantic embeddings, retrieval could happen automatically based on conversational context. No explicit search needed. That's closer to how human memory works: you don't decide to recall, something just reminds you.
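
The failure mode is easy to show. A hash-based pseudo-embedding like the one below (a stand-in for my quick implementation, not its exact code) is perfectly deterministic for identical strings and completely blind to paraphrase:

```python
import hashlib
import math

def hash_vector(text: str, dims: int = 128) -> list[float]:
    """Deterministic pseudo-embedding from a hash digest: the same
    string always maps to the same unit vector, but a paraphrase
    maps somewhere unrelated."""
    digest = hashlib.sha256(text.encode()).digest()
    raw = [digest[i % len(digest)] / 255.0 * 2.0 - 1.0 for i in range(dims)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two unit vectors."""
    return sum(x * y for x, y in zip(a, b))

same = cosine(hash_vector("I like Python"), hash_vector("I like Python"))
para = cosine(hash_vector("I like Python"),
              hash_vector("Python is my preferred language"))
```

Exact repeats score a perfect 1.0 while the paraphrase lands near zero, so a recall query phrased even slightly differently from the stored memory finds nothing. Real semantic embeddings would put both sentences close together.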

Conclusion

The hypothesis was simple: can you give a coding assistant persistent, useful memory by building external infrastructure and telling it to use that infrastructure?

No. You can't.

The infrastructure is fine. The three-tier storage works. The decay math works. The graph relations work. The hooks fire on time. The dashboard is genuinely useful for debugging. The Python API is clean. All 15 tests pass.

But the AI won't use it. Not because the instructions are unclear. Not because the API is hard. But because current LLMs can't reliably do background tasks while they're focused on your actual request. "Manage your memory" competes with "help me with this code" and loses. Every time.

Memory encoding in humans is mostly unconscious. You don't decide to remember your friend's name. Emotional salience, surprise, and repetition handle that for you. Asking an LLM to consciously manage memory through text instructions is like asking someone to manually control their heartbeat. It works at a different level than conscious intent.

The path forward is architectural. Titans builds memory into the model weights. Letta builds it into the agent loop. Mem0 builds it into the platform. All three put memory where it needs to be: in the machinery, not in the prompt.

I built the storage side of the memory problem. The encoding side, the part that decides what to remember, when, and why, that's the hard problem. And you can't solve it with a markdown file.
