Vlad Pomogaev


How RAG + AI Can Mimic the Human Brain with the Help of Reservoir Computing

Feb 24th 2025

...if you squint your eyes and forget every piece of biology you know, the RAG-LLM combination looks awfully like a brain

RAG, or Retrieval Augmented Generation, is a good technique for providing large language models with the context they need to answer questions. However, RAG has some problems, primarily in how it's implemented and how it's integrated into the overall AI system. With some minor changes, I think that RAG could better model the cognitive processes that occur in our own brain, and lead us closer to AGI.

For starters, let's look at how RAG usually works. When you perform RAG, you typically take the question that the language model (LM) is supposed to answer, convert it into a vector using an embedding model, and perform a direct vector search or an approximate nearest neighbor search for relevant statements in your text corpus or database. You then take the best-matching text and reorganize it, adjusting the order of statements (possibly with an LM or some other method), and feed it into the context window of an LM tasked with answering the question. The thinking here is that the LM will now be given, very directly, the context it needs to answer a question or perform some task. We're trying to shove context down an LLM's metaphorical throat.
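Here's a minimal sketch of that pipeline. The `embed()` and `llm()` calls are stand-ins for whatever embedding model and chat model you use, and the brute-force cosine search stands in for a proper approximate nearest-neighbor index:

```python
import numpy as np

# embed() and llm() are hypothetical stand-ins for an embedding model and a chat model.

def retrieve(question, corpus, corpus_vecs, k=5):
    """Return the k passages whose embeddings are closest to the question."""
    q = embed(question)
    # Cosine similarity against every passage; a real system would use an ANN index.
    sims = corpus_vecs @ q / (np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [corpus[i] for i in np.argsort(-sims)[:k]]

def rag_answer(question, corpus, corpus_vecs):
    """Stuff the retrieved passages into the context window and ask the LM."""
    context = "\n".join(retrieve(question, corpus, corpus_vecs))
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```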

This is great in theory, and I think it highlights the real function LMs are supposed to perform. They are supposed to take some stored knowledge and simply perform a minor contextual expansion. Predict the next token, if you will. If factoids are fed into an LM as part of RAG and the LM is asked a question, the LM's job is simply to predict the next tokens, which should form an answer to the question given the factoids provided above.

I think about AI in general in a different way from other people because I didn't start researching AI as a technique for searching for information or generating the next token. I thought a lot about how digitizing the human brain is a surefire way to get us to AGI. One of the things I learned by studying the human brain is how the hippocampus works. One model, promoted by many but modelled most thoroughly by a mathematician named Eugene M. Izhikevich, is that the cortex, the sheet of neurons on the outside of the brain, performs two main duties: it holds the short-term, dynamic state of whatever you are currently thinking about, and it stores long-term knowledge in its synaptic connections.

Long story short, there's a theory that this cortex-striatum-thalamus loop is essentially performing the function of an 'if-else' statement within your brain. The thalamus receives input from the cortex through the striatum, and the striatum is kind of like the 'if' statement. Specifically, it's the synaptic connections between the cortex and the striatum that form the 'if' condition: when you are thinking of something, the striatum reads the state of the cortex to evaluate that 'if'. The connections from the striatum then make their way to the thalamus, forming the 'else' side of the statement. The thalamus performs a transformation on the state of the cortex, possibly by synchronizing or desynchronizing very specific neurons, causing you to think about something else or to combine the ideas you are currently thinking about. If enough similar transformations are applied to the state of the cortex, the cortex begins performing these transformations by itself via STDP (spike-timing-dependent plasticity). Only then have you truly remembered something and "engrained it" in your brain.

The 'if-else' model was also popularized by Chris Eliasmith and the folks from his lab. Their idea was that the state of the brain and all of its neurons can be represented by a vector or set of vectors, and you can essentially create a system of vector transformations that learns to perform cognitive tasks. I would say they have succeeded: their cognitive models can perform tasks like inductive learning, pattern continuation, and more. His book, How to Build a Brain, is not only a super interesting take on how the brain could work, but it's also loaded with ideas about how to apply knowledge of the brain towards constructing intelligent systems. Here's a book review. Their results are truly incredible and need to be talked about more. That being said, I think Izhikevich's thoughts more closely mirror reality when he models (just like a mathematician would) neurons as oscillators, and notes with Hoppensteadt that oscillators can synchronize. I think he is on the right track when he says that by synchronizing neurons in the cortex, the thalamus can selectively transfer information. This is basically a form of FM communication between neurons, and it closely matches many people's idea that neurons have evolved to perform the most information transfer and computation per molecule of ATP.

In other words, if you have some state in your cortex, the job of this cortex-striatum-thalamus loop is to adjust the state of the cortex in a systematic way. It performs the function of "thought" or "consciousness".

RAG in an LM system is a completely different paradigm from how the brain works. But, if you squint your eyes and forget every piece of biology you know, the RAG-LLM combination looks awfully like a brain. You could look at the LM as the striatum and the thalamus, RAG as long-term state (synapses in the cortex), and the context window as the short-term dynamics of the cortex. There are a few problems with this analogy, which I will expand upon later.

One thing to note in this analogy is that every time you generate a token, you're changing the state of your brain systematically. The problem is that LMs don't perform much thought before answering a question. They don't change their state smartly; they take a very greedy approach to answering. These systems basically store the question-answering capability inside the LM itself, and the LM is asked to answer immediately without much thought, which is difficult unless the information is well represented in its training dataset... or is basically handed to it in the RAG retrieval context, which implies that the LM+RAG system is already "thinking" of something very close to a solution.

One thing we've seen recently with deep reasoning models, like R1, is that the more time you can encourage an LM to spend thinking about a question, the better its reasoning capabilities get. A fair tradeoff, as long as it's scalable. And deep-reasoning models weren't the first to do this; manual chain-of-thought prompting techniques and multi-step/multi-hop question answering do the same thing. But CoT is very manual and task-specific, and can be viewed as a crutch for an LLM's reasoning abilities, while reasoning models do the same thing with the reasoning embedded into the model itself.

If there's one form of information that I think would be "fair game" to scrape from the internet and use to train an LLM, it's not the books and images and copyrighted media. It's the thought processes and transformations of ideas that are independent from the content itself. I think reasoning models trained with reinforcement learning get us closer to that.

Now, back to RAG. Where does it come into this equation? Retrieval augmented generation is meant to augment or fill in knowledge gaps in your LM. Regardless of how you choose to perform retrieval augmented generation, you need to put the output from your RAG system into the context window of an LM. You are adjusting the state, forcing the state of your LM to be a certain way at specific points during your AI pipeline.

At the start, you might have a blank context window with the question you need to answer at the top. RAG then fills in factoids or bits of information needed to answer it. It pulls from some other space, long-term storage, which is factual or considered factual depending on your information source, and provides it to the short-term state of your LM.

There are a couple of problems with this. First, the way you typically perform RAG doesn't instantly get you the information you need to answer the question. If your answer is well represented in the corpus, it's easy: you take the vector that represents your question, perform a search, and get the best matches. But often the query is phrased as a question while the RAG database contains declarative statements, so you're bound by your embedding model's ability to connect the dots between the question and its answer.

If you have a general corpus scraped from the internet, you don't have very nice factoids in your dataset. Your embedding model might choose information that seems relevant at first because it's similar in terms of text content, but that doesn't actually answer the question, or answers it incorrectly. We've seen this on Reddit, where people very confidently answer a question with the wrong answer for comedic effect. It's a joke answer, but the LM doesn't know that. The LM doesn't perform any thought before using a clearly false factoid or statement to answer the question. That's how Google ends up telling you to put weird stuff on your pizzas...

One way to get around this discrepancy between question and answer is to feed your corpus through an LM first. Augment your dataset by computing the questions your factoids can answer. If you have a graph representing all of your knowledge, you want to connect the knowledge together using the LM. In practice, you'd pass all the information in your corpus through an LM and ask it what questions each factoid answers. You rely on the LM's ability to generate questions from the given facts. Then you vectorize those questions and use them as the retrieval keys (rather than using the statements themselves as keys), with each generated question pointing back to the statement that answers it. This is a simple way of solving the issue.
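A rough sketch of that augmentation step, where `embed()` and `llm()` are again stand-ins for real models and the "three questions" prompt is just an illustrative choice:

```python
import numpy as np

def build_question_index(factoids, embed, llm):
    """Index each factoid under the questions it answers (generated by an LM),
    so retrieval matches question-to-question instead of question-to-statement."""
    keys, values = [], []
    for fact in factoids:
        generated = llm(f"List three questions that this statement answers:\n{fact}")
        for q in generated.splitlines():
            if q.strip():
                keys.append(embed(q.strip()))   # the generated question is the key...
                values.append(fact)             # ...and the original factoid is the value
    return np.stack(keys), values
```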

However, you'll still run into issues where your LM hasn't enumerated all the possible questions a statement can answer. Hallucinations. A user of your AI system might ask, "When was the Statue of Liberty built?" Google shows it as being from 1876 to 1886. An obvious wrinkle is that the statue was built in France first, shipped to New York City, and then assembled there; that's why there's ambiguity about the date. With the RAG technique, your LM might not pick up on this context and might just answer, "It was built in 18-whatever," because that's what your system matches first. If you retrieve more samples, maybe a smart reasoning model with enough built-in knowledge of the Statue of Liberty could provide the additional context.

Another problem for RAG is that AI models function as next-token predictors or generators. Reasoning models could be made cheaper if we didn't embed all the necessary knowledge in the model and instead taught them the transformations required for typical question answering, using RAG as a support. Practically, what I'm advocating for is much larger contexts, longer CoT processing, and systems trained with RL that have RAG built in. During training, the system would have the ability to "Google" things and be rewarded for Googling the right things (or things relevant to the right things), becoming adept at doing research and decomposing tasks into smaller subtasks. Longer context allows deeper CoT, and focusing training on the "research" aspect of the task helps keep the model small while we scale up the context window.
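As a rough illustration of the reward shaping I have in mind, here's a hypothetical reward for a single RL rollout in which the model may issue retrieval actions before answering. The function names, the notion of labeled relevant documents, and the 0.2 weighting are all assumptions of this sketch:

```python
def trajectory_reward(final_answer, reference_answer, retrieved_ids, relevant_ids):
    """Hypothetical reward for one rollout: most of the reward comes from the
    final answer, with a small bonus for retrieving relevant documents, so the
    policy learns to 'Google the right things' rather than memorize facts."""
    answer_reward = 1.0 if final_answer.strip() == reference_answer.strip() else 0.0
    recall = len(set(retrieved_ids) & set(relevant_ids)) / max(len(relevant_ids), 1)
    return answer_reward + 0.2 * recall   # the 0.2 weighting is an arbitrary assumption
```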

To summarize, the idea is that you want an agent-style AI system that answers questions only after considering them thoroughly and thoughtfully querying a RAG system from first principles. Another problem is that existing frameworks for building querying systems are very strict about the types of queries and prompt sequences you can use.

For example, there's a framework called DSPy, which I believe stands for Demonstrate-Search-Predict? Python? I'm still not sure. It is a prompt optimization and automation framework. It allows you to program your LM by automatically adjusting and fine-tuning prompts rather than doing that manually via prompt engineering. This framework allows a human to introduce a probabilistic bias in the thought process and techniques an AI uses to solve a problem. However, it defines these transformations or steps very formally.

In the DSPy framework, you define modules with well-defined interfaces that perform well-defined functions by prompting an LM. The selling point is that this is very similar to traditional programming, which people are already familiar with. The system can then optimize those prompts through few-shot learning or bootstrapped examples to get you the answers you need. There's another framework called TextGrad that does something similar, but it too breaks your AI's thought process into discrete steps.
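Roughly, a DSPy RAG module looks like the snippet below (adapted from memory of the library's introductory examples, so details may differ between versions; it assumes a retrieval model has already been configured via `dspy.settings`):

```python
import dspy

class GenerateAnswer(dspy.Signature):
    """Answer the question using the retrieved context."""
    context = dspy.InputField(desc="relevant facts from the corpus")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a short answer to the question")

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)            # pulls passages from the configured retriever
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)
```

DSPy can then optimize the prompts inside `ChainOfThought` against a metric, which is the "programming rather than prompt engineering" pitch.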

The first step might be to look at the RAG content and summarize the information. Next, think about the question in the context of the statements. What other questions could you be asking? Query your RAG system and repeat the previous steps until the answer is obvious. Then, finally, answer the question, formatted appropriately. While this deterministic approach can work, it doesn't let you expand beyond your text corpus meaningfully unless your chain of prompts is very long, self-checking, and able to move far away from the original idea-space of your corpus.
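That chain, written out as plain Python, might look something like this sketch, where `retrieve()` and `llm()` are stand-ins for your retriever and language model:

```python
def iterative_answer(question, retrieve, llm, max_rounds=4):
    """Sketch of the discrete prompt chain described above: retrieve, summarize,
    ask whether the context suffices, re-query with a follow-up question if not,
    and only then answer."""
    passages = list(retrieve(question))
    for _ in range(max_rounds):
        summary = llm("Summarize these passages:\n" + "\n".join(passages))
        check = llm(f"Summary:\n{summary}\n\nCan you answer '{question}' from this? "
                    "Reply YES, or reply with the follow-up question you still need answered.")
        if check.strip().upper().startswith("YES"):
            break
        passages += retrieve(check)          # widen the context with the follow-up query
    return llm("Context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}\nAnswer:")
```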

Even though AI agents can save state and call functions from classic software services, they remain rigidly defined. The human brain doesn't operate through sequential Python statements; it improvises, changing its long-term state as it learns new things and retaining whatever feels relevant.

We need to overcome this by creating a probabilistic RAG system that learns which content to surface as it is used. An LM can be leveraged to train that system, checking that it provides correct information and auto-querying until the LM is sure of the correct response. The process is like writing a proof in mathematics: state your assumptions, then perform transformations until you've convinced yourself the reasoning is correct.

Reservoir Networks

One way forward could be applying the reservoir network concept to retrieval-augmented generation. Reservoir computing, as a concept, isn't widely discussed online, or at least it is overshadowed by deep learning research.

Imagine a pond where the water is still. A grid of sensors measures the water's height across the pond, and a rock thrown into the pond creates ripples. Those ripples depend on factors like the rock's size, shape, density, spin, and the direction it was thrown: the rock perturbs the system, and the resulting ripples travel across the surface.

Observers analyze these ripples by recording the height measurements over time and extracting information about the rock. If your pond has highly variable depth or a complex underwater surface, those features introduce nonlinearities into the system, which make the sensor outputs more distinct from one another.

If you have a complex enough "pond", the rock's properties become roughly linearly separable in the sensor data, with different sensors pulling out non-overlapping components. It's similar to vectorizing an input, like using an embedding model to transform sentences into vectors that represent them.

In the pond analogy, the sensors turn their measurements after the rock lands into a vector, extracting whatever information they can. If enough sensors are used, the relevant information behaves approximately linearly (obeying the superposition principle), so a simple linear readout can be trained to perform regression or classification on the rock's properties.
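A tiny echo state network makes the analogy concrete: the fixed random reservoir plays the role of the pond, the input sequence is the rock, and only the linear readout is trained. Everything below (the toy "rock" data, the sizes, the ridge penalty) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: each "rock" is a short sequence of 3-dimensional impulses, and the
# readout target is a simple function of the rock's properties.
def make_rock():
    props = rng.uniform(-1, 1, 3)                          # e.g. size, density, spin
    seq = [props + 0.1 * rng.normal(size=3) for _ in range(20)]
    return seq, np.array([props.sum(), props[0] * props[1]])

n_in, n_res = 3, 200
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))            # spectral radius < 1: ripples die out

def run_reservoir(seq):
    """Drive the fixed random reservoir (the 'pond') and return its final state."""
    x = np.zeros(n_res)
    for u in seq:
        x = np.tanh(W_in @ u + W @ x)                      # nonlinear "ripple" update
    return x

# Only the linear readout is trained, here with ridge regression.
data = [make_rock() for _ in range(500)]
X = np.stack([run_reservoir(seq) for seq, _ in data])
Y = np.stack([y for _, y in data])
W_out = np.linalg.solve(X.T @ X + 1e-3 * np.eye(n_res), X.T @ Y)

test_seq, test_y = make_rock()
print("predicted:", run_reservoir(test_seq) @ W_out, "actual:", test_y)
```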

In the human brain, the cortex bears some resemblance to a reservoir computer, with differences. It receives input from the thalamus and linearizes the components of those impulses into individual neuron firings, with information carried in the firing times across disparate neurons.

As in the pond, overlapping waves of impulses change the system's state, contributing to working memory, planning, and other cognitive functions largely associated with the frontal cortex.

Both the brain and reservoir neural networks solve computing tasks creatively and cheaply; physical reservoir computers built from nanomaterials like carbon nanotubes offer energy-efficient alternatives to traditional computers.

The ability to linearize time-dependent signals isn't sufficient for a comprehensive AI system on its own; the cortex-striatum-thalamus loop is also essential. Merging RAG and an LM resembles pairing a reservoir computer with new impulse-generating capabilities, where the LM acts as the aforementioned cortical loop, the context window is the working memory, and RAG is long-term storage and retrieval. If the RAG system were "powered" by a reservoir network, it could be trained to perform recall more efficiently by identifying exactly which sections of text should be given to the LM. Plus, reservoir networks can be trained with RL, which would ease their adoption into reasoning models.

What would this look like in practice?

In practice, your training data first needs delimiters that separate it into manageable units like factoids, statements, or paragraphs. Each unit is assigned a node in a simple neuron model (as in a reservoir network), with weighted connections between nodes. Given that text is linear, maybe at the start of training every neuron would be connected to the "next" neuron, representing a linear reading of your corpus.
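A minimal sketch of that setup, assuming an `embed()` function and a hypothetical `split_corpus_into_factoids()` helper:

```python
import numpy as np

# Assumed: an embed() function from any embedding model, and a corpus that has
# already been split into factoid-sized units.
factoids = split_corpus_into_factoids(raw_text)     # hypothetical helper
E = np.stack([embed(f) for f in factoids])          # one embedding per factoid
n = len(factoids)

# One reservoir node per factoid; the initial weights simply chain each factoid
# to the next one, i.e. a "linear reading" of the corpus.
W = np.zeros((n, n))
for i in range(n - 1):
    W[i + 1, i] = 1.0
```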

In a RAG-style scenario, the LM would excite nodes, and the most active nodes would be selected as candidates for subsequent consideration. This is the LM providing input to the reservoir network, just like input being provided to the cortex. After a question is posed, rather than immediately generating an answer, the LM should think about possible components of the answer and about which questions to pose next.

The original question is vectorized, exciting similar nodes in the system. Factoids like the Statue of Liberty's construction in France or its reassembly on Liberty Island become active in proportion to their relevance. Eventually, with enough examples, RL could reinforce certain connections between nodes in this RAG-reservoir system, which is akin to synapses in the cortex enforcing certain firing orders.
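Continuing the sketch above, recall could look like exciting nodes by similarity to the question and letting activity propagate along the learned weights for a few steps before handing the most active factoids to the LM. The step count, leak rate, and top-k cutoff are arbitrary choices for illustration:

```python
def recall(question, steps=3, k=5, leak=0.5):
    """Excite nodes in proportion to their similarity to the question, let
    activity flow along the learned weights, then return the most active factoids."""
    q = embed(question)
    a = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-9)  # initial excitation
    for _ in range(steps):
        a = (1 - leak) * a + leak * np.tanh(W @ a)                      # reservoir-style propagation
    top = np.argsort(-a)[:k]
    return [factoids[i] for i in top]
```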

Focusing less on answering immediately, the LM should develop relevant questions and feed those into the RAG system. This iterative process refines its thoughts and inquiries toward fulfilling its information needs, matching active nodes against the vector of each new round of reasoning.

Implementing spike-timing-dependent plasticity means the weights between nodes are learned as the LM operates. Moreover, incorporating refractory periods (which prevent or limit a neuron from spiking too frequently) stops the system from feeding the LM identical information within recent token sequences.
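Continuing the same sketch, an STDP-flavored update and a refractory mask might look like this; the learning rate, refractory length, and the convention of updating `last_fired` whenever a node is handed to the LM are all assumptions:

```python
REFRACTORY = 3                              # steps a node stays suppressed after firing
last_fired = np.full(n, -REFRACTORY - 1)    # updated whenever a node is handed to the LM

def stdp_update(pre, post, lr=0.05):
    """Hebbian-flavored stand-in for STDP: if factoid `pre` was active just before
    factoid `post` turned out to be useful, strengthen the pre -> post connection."""
    W[post, pre] = min(W[post, pre] + lr, 1.0)   # keep weights bounded

def eligible(step):
    """Refractory mask: nodes that fired within the last REFRACTORY steps are
    excluded, so the same factoid isn't handed to the LM twice in a row."""
    return (step - last_fired) > REFRACTORY
```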

I think emphasizing neuroscience components in modern-day RAG systems can contribute toward AGI.

Additionally, LMs could be allowed to create new nodes (representing new facts) when an idea becomes distinct enough from the existing corpus and recurs often enough to be worth saving, so the LM doesn't have to reinvent the wheel; the repetition itself lends confidence to the fact's correctness. If an LM consistently "thinks of" ideas absent from both the RAG corpus and its training data, that might suggest those ideas are genuinely new pieces of information.