How the Recursive Language Model (RLM) Works
James H. Wade
2026-02-15
Source: vignettes/articles/how-rlm-works.Rmd
The Problem: Context Rot
Modern LLMs accept enormous context windows. GPT-5 handles a million tokens; Gemini stretches to two million. But bigger windows do not solve the fundamental problem.
As context grows, performance degrades: details get lost and answers go wrong. The MIT researchers who introduced Recursive Language Models call this context rot, the empirical observation that output quality deteriorates as prompts grow, even when the relevant information is technically within the window (see Zhang, Kraska, and Khattab 2025). The model misses what it needs with increasing frequency as input length grows.
And there is no adaptive retrieval. The model cannot decide to re-read section 14 after discovering something relevant in section 42. It processes the entire input in one pass and produces output from whatever signal survived.
The Insight: Context as Environment
The core idea behind RLMs is simple: don’t put the context in the prompt. Instead, store it as a variable in a programming environment and let the model write code to explore it.
A traditional call looks like this:
llm$chat(paste("Summarize this document:", huge_document))
An RLM inverts the relationship. The document lives outside the model as a variable in an R session, and the model generates code to interact with it. When you call run(), dsprrr provides a REPL: the model writes R code, dsprrr executes it in a subprocess, and the printed output feeds back into the next iteration. A typical exploration might look like this:
# The model generates and executes code like this:
intro <- peek(.context$document, 1, 2000)
findings <- search(.context$document, "\\b(conclusion|finding|result)\\b")
section_42 <- peek(.context$document, 85000, 90000)
SUBMIT(answer = "The document concludes that...")
The shift is from treating context as input to treating it as environment. The model reads what it needs, skips what it doesn’t, and revisits sections as its picture of the data develops.
In dsprrr’s API, the module receives input arguments (e.g.,
question), holds context variables (e.g.,
document) outside the prompt, and exposes
llm_query() so the model can delegate sub-questions to a
secondary model from generated code. In the paper’s notation (Zhang, Kraska, and Khattab 2025), these correspond to the query, the context, and a recursive tool call that spawns an isolated sub-instance with a new query and a transformed slice of the context.
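To make the mapping concrete, here is a minimal sketch of a call. It assumes run() accepts the signature’s fields as named arguments and uses huge_document as a placeholder; see the user-facing API section below and the tutorial vignette for the actual interface.
# Minimal sketch -- assumes run() takes the signature fields as named arguments
rlm <- rlm_module(
  signature = "document, question -> answer",
  runner = r_code_runner(timeout = 30)
)
result <- run(
  rlm,
  question = "What does the report conclude?",  # the query
  document = huge_document                      # the context, held outside the prompt
)
# llm_query() plays the role of the recursive tool call inside generated code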
Origin and Ecosystem
RLMs were introduced by Alex Zhang, Tim Kraska, and Omar Khattab at MIT (Zhang, Kraska, and Khattab 2025). On BrowseComp-Plus (a benchmark with 6–11 million token inputs), standard models scored 0% while an RLM powered by GPT-5 achieved 91.33%. That comparison is less “RLM beats prompting” than “RLM makes previously intractable tasks tractable”; inputs that large exceed every current model’s context window. The fairer apples-to-apples result is that their post-trained RLM-Qwen3-8B outperformed the base Qwen3-8B by 28.3% on average across long-context tasks.
The idea has since spread quickly. DSPy integrated RLMs as a
first-class module (dspy.RLM) in
version 3.1.2+, using a Pyodide WASM sandbox for code execution.
Google’s Agent Development Kit re-implemented
the pattern with Gemini models. The official rlm
Python package, a community
implementation, and Prime Intellect’s research
program round out the ecosystem. The comparison table below summarizes
the key differences.
dsprrr’s rlm_module() brings the same approach to R,
using R as the REPL language instead of Python, with subprocess
isolation via callr and
structured outputs via ellmer.
How dsprrr Implements RLM
The rest of this article walks through dsprrr’s implementation. For a
practical example, see
vignette("tutorial-rlm-dsprrr", package = "dsprrr"), which
uses rlm_module() to trace a theming bug across the bslib,
shiny, and brand.yml codebases.
The User-Facing API
Creating an RLM module requires two things: a signature declaring inputs and outputs, and a code runner for subprocess isolation:
library(dsprrr)
runner <- r_code_runner(timeout = 30)
rlm <- rlm_module(
signature = "document, question -> answer",
runner = runner
)
For tasks that benefit from recursive sub-queries, you can wire up a secondary model:
rlm <- rlm_module(
signature = "document, question -> answer",
runner = runner,
sub_lm = ellmer::chat_openai(model = "gpt-5-mini"),
max_llm_calls = 10
)
You can also inject custom R functions as tools available in the
REPL. The factory validates these against reserved names
(SUBMIT, peek, search,
llm_query, etc.) to prevent collisions.
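A sketch of what that might look like. The tools argument name is an assumption, not the confirmed interface; check the rlm_module() documentation for the actual parameter:
# Hypothetical -- the `tools` argument name is assumed
word_count <- function(x) {
  lengths(strsplit(paste(x, collapse = " "), "\\s+"))
}
rlm <- rlm_module(
  signature = "document, question -> answer",
  runner = runner,
  tools = list(word_count = word_count)  # must not collide with peek, search, SUBMIT, llm_query
)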
The REPL Loop
When you call run(rlm, ...), dsprrr dispatches to the
module’s internal forward() method. This is where the REPL
loop lives. A PredictModule’s forward() makes
a single call; the RLM module loops, up to max_iterations
times (default 20):
# From R/module-rlm.R (error handling, sub-query interception, and
# fallback extraction omitted -- see those sections below)
for (iter in seq_len(self$max_iterations)) {
# Build prompt including all previous iterations
prompt <- private$build_iteration_prompt(system_prompt, history, iter)
# Ask the model to generate R code
response <- private$get_code_response(llm, prompt)
# Execute code in isolated subprocess with RLM tools injected
exec_result <- private$execute_with_rlm_tools(
response$code,
inputs,
call_counter
)
# Record in history -- the model sees this on the next iteration
history[[iter]] <- list(
iteration = iter,
reasoning = response$reasoning,
code = response$code,
output = exec_result$formatted_output,
success = exec_result$success,
is_final = exec_result$is_final
)
# SUBMIT() terminates the loop
if (exec_result$is_final) {
final_answer <- exec_result$final_value
break
}
}
Each iteration produces a structured response with two fields:
reasoning (the model’s explanation of its plan for that
step) and code (R code to execute). This uses ellmer’s
structured output support:
# From R/module-rlm.R -- structured code generation
output_type <- ellmer::type_object(
reasoning = ellmer::type_string("Your thought process for this step"),
code = ellmer::type_string("R code to execute")
)
result <- llm$chat_structured(prompt, type = output_type)
The accumulated history gives the model a growing record of what it has tried. Failed executions are included. The model sees its own errors and can correct course.
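As a rough illustration of how that history could be rendered into the next prompt (a sketch only; the actual build_iteration_prompt() in R/module-rlm.R differs in its details):
# Illustrative sketch -- not the actual build_iteration_prompt()
render_history_sketch <- function(history) {
  blocks <- vapply(history, function(h) {
    paste0(
      "## Iteration ", h$iteration, "\n",
      "Reasoning: ", h$reasoning, "\n",
      "Code:\n", h$code, "\n",
      "Output (", if (h$success) "ok" else "error", "):\n", h$output
    )
  }, character(1))
  paste(blocks, collapse = "\n\n")
}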
REPL Tools
Before each execution, dsprrr injects a “prelude” that defines the
tools available in the subprocess. These are defined in
R/rlm-tools.R:
peek(var, start, end) views a slice of a variable. It
dispatches on the input type: for a character vector, start
and end are element indices; for a single string, they are
character positions:
# From R/rlm-tools.R
peek <- function(var, start = 1L, end = 1000L) {
if (!is.character(var)) {
var <- as.character(var)
}
if (length(var) > 1) {
return(var[max(1L, start):min(length(var), end)])
}
substr(var, max(1L, start), min(nchar(var), end))
}
search(var, pattern) runs a Perl-compatible regex
against a variable and returns all matches:
# From R/rlm-tools.R
search <- function(var, pattern, ignore_case = FALSE) {
if (!is.character(var)) {
var <- as.character(var)
}
if (length(var) > 1) {
var <- paste(var, collapse = "\n")
}
matches <- regmatches(
var,
gregexpr(pattern, var, ignore.case = ignore_case, perl = TRUE)
)
unlist(matches)
}
SUBMIT(...) terminates the loop and returns the final
answer. It validates that the provided values match the signature’s
output fields, supporting both positional
(SUBMIT("my answer")) and named
(SUBMIT(answer = "my answer")) arguments:
# From R/rlm-tools.R -- simplified; see source for full validation logic
SUBMIT <- function(...) {
args <- list(...)
arg_names <- names(args)
has_any_names <- any(nzchar(arg_names %||% ""))
if (!has_any_names) {
# Positional: match by order against signature output fields
names(args) <- .rlm_output_fields
} else {
# Named: validate that all required fields are present
missing <- setdiff(.rlm_output_fields, arg_names)
if (length(missing) > 0) {
stop("SUBMIT() missing outputs: ", paste(missing, collapse = ", "))
}
}
class(args) <- c("rlm_final", class(args))
args
}
The .rlm_output_fields variable is not magic. It is
injected into the subprocess by the same prelude that defines
peek(), search(), and SUBMIT()
itself. The prelude reads the signature’s output field names and writes
them as a character vector at the top of the execution script.
The rlm_final class is a sentinel: when the parent
process sees it in the subprocess result, it exits the loop and extracts
the answer.
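A rough sketch of that injection step (the helper name is hypothetical; the real prelude also defines peek(), search(), SUBMIT(), and llm_query() before the generated code runs):
# Hypothetical helper illustrating the prelude described above
build_prelude_sketch <- function(output_fields) {
  paste0(
    ".rlm_output_fields <- c(",
    paste0('"', output_fields, '"', collapse = ", "),
    ")"
  )
}
cat(build_prelude_sketch(c("answer", "confidence")))
#> .rlm_output_fields <- c("answer", "confidence")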
Subprocess Isolation
All model-generated code runs in an isolated R subprocess via
callr::r(). The RCodeRunner class in
R/r-code-runner.R handles this:
# From R/r-code-runner.R -- subprocess execution (simplified)
exec_result <- callr::r(
func = private$execution_wrapper,
args = list(code = code, context = context, ...),
timeout = self$timeout,
stdout = stdout_file,
stderr = stderr_file,
user_profile = FALSE
)
Inside the subprocess, a fresh environment is created with the
context available as .context. The wrapper overrides
library() and require() to enforce a package
allowlist, and a pattern scanner rejects calls to system(),
unlink(), quit(), and
download.file(). Each iteration spawns a new subprocess, so
each pays a cold-start cost (typically 200–400ms depending on
platform).
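A simplified sketch of those guardrails (names and structure are illustrative; the actual checks live in R/r-code-runner.R):
# Illustrative only -- defense-in-depth, not a security boundary
blocked <- c("system", "unlink", "quit", "download.file")
scan_code_sketch <- function(code) {
  pattern <- paste0(
    "\\b(", paste(gsub("\\.", "\\\\.", blocked), collapse = "|"), ")\\s*\\("
  )
  if (grepl(pattern, code, perl = TRUE)) {
    stop("generated code calls a blocked function")
  }
  invisible(TRUE)
}
scan_code_sketch('peek(.context$document, 1, 2000)')   # passes
# scan_code_sketch('unlink("~", recursive = TRUE)')    # would error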
Recursive Sub-Queries
This is where the “recursive” in RLM comes from. When
sub_lm is provided, the model can write
llm_query("What does section 3 say?", context_slice) in its
generated code. The function does not execute the sub-call inside the
subprocess, which would be a security problem. Instead, it returns a
marker object:
# From R/rlm-tools.R -- returns a marker, not a result
llm_query <- function(query, context_slice = NULL) {
structure(
list(query = query, context = context_slice, batch = FALSE),
class = "rlm_query_request"
)
}
The parent process intercepts this marker after execution, performs
the actual call, and feeds the result back on the next iteration. A
batched variant, llm_query_batched(), allows multiple
sub-questions at once, running concurrently via
ellmer::parallel_chat() when available.
A shared call counter tracks total calls across all iterations and
enforces the max_llm_calls budget, preventing runaway
recursion.
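A simplified sketch of the parent-side handling (field names mirror the marker above; the counter is an environment so it can be shared by reference across iterations; the real logic lives in the module):
# Illustrative parent-side interception of an rlm_query_request marker
call_counter <- new.env()
call_counter$used <- 0L

handle_value_sketch <- function(value, sub_lm, counter, max_llm_calls = 10L) {
  if (!inherits(value, "rlm_query_request")) {
    return(value)
  }
  if (counter$used >= max_llm_calls) {
    return("[llm_query budget exhausted]")
  }
  counter$used <- counter$used + 1L
  prompt <- paste(value$query, value$context, sep = "\n\n")
  sub_lm$chat(prompt)  # the result is fed back into the next iteration's history
}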
Fallback and Output Normalization
If the model exhausts all iterations without calling
SUBMIT(), the module performs fallback extraction: it feeds
the entire exploration trajectory back and asks for a synthesized answer
from what was discovered. This uses a two-phase approach, trying
structured output via chat_structured() first, then
unstructured chat() if that fails.
The final answer passes through output normalization, which coerces
whatever was produced into the signature’s declared output fields. This
handles named lists, positional lists, scalar values, and
case-insensitive enum matching (e.g., "Positive" is mapped
to "positive" if the signature declares
type_enum("positive", "negative")).
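A minimal sketch of that normalization step (helper name and structure are illustrative):
# Illustrative: coerce a raw value into the signature's declared fields
normalize_output_sketch <- function(value, output_fields, enum_levels = NULL) {
  if (!is.list(value)) {
    value <- stats::setNames(list(value), output_fields[1])  # scalar value
  } else if (is.null(names(value)) || !any(nzchar(names(value)))) {
    names(value) <- output_fields[seq_along(value)]           # positional list
  }
  if (!is.null(enum_levels) && is.character(value[[1]])) {
    hit <- match(tolower(value[[1]]), tolower(enum_levels))   # case-insensitive enum
    if (!is.na(hit)) value[[1]] <- enum_levels[hit]
  }
  value
}
normalize_output_sketch("Positive", "sentiment", c("positive", "negative"))
#> $sentiment
#> [1] "positive"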
Observability
Every RLM execution records its full REPL history: reasoning, code, output, success or failure, and timing for each iteration:
history <- rlm$get_repl_history()
last_run <- history[[length(history)]]
last_run$iterations_used
#> [1] 5
last_run$llm_calls_used
#> [1] 3
You can see what the model tried, where it went wrong, and how it recovered.
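You can also walk the per-iteration records directly (a sketch; it assumes the iteration list is stored under an $iterations element, which may not be the actual field name):
# Assumed field name: $iterations may differ in the actual history object
for (step in last_run$iterations) {
  cat("== iteration", step$iteration, if (step$success) "(ok)" else "(error)", "==\n")
  cat(step$code, "\n")
  cat(step$output, "\n\n")
}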
How dsprrr’s RLM Compares
The table below summarizes the key implementations:
| | dsprrr | DSPy | Official rlm | Google ADK |
|---|---|---|---|---|
| Language | R | Python | Python | Python |
| REPL | R via callr | Python via Pyodide/WASM | Python (isolated or not) | Python via ADK |
| Sandbox | Subprocess (callr) | Deno/WASM | Configurable | ADK orchestration |
| Structured output | ellmer types | DSPy signatures | Freeform | ADK tools |
| Recursive calls | llm_query() | Built-in | Built-in | Child agents |
| Optimization | Teleprompters, grid search | DSPy optimizers | Manual | Manual |
| Batched sub-calls | llm_query_batched() | – | – | – |
The “batched sub-calls” row refers specifically to issuing multiple
recursive sub-queries from a single REPL iteration and running them
concurrently. All implementations support multiple sequential sub-calls;
dsprrr additionally batches them via
ellmer::parallel_chat().
dsprrr is the only implementation that uses R as the REPL language, which matters when your context is R data: data frames, model objects, or package source code. It also inherits dsprrr’s full optimization infrastructure (teleprompters, grid search, evaluation metrics), so you can systematically improve RLM performance, not just run it.
When to Use RLMs (and When Not To)
In a long-context task, the hard part is either finding the right context or reasoning about it once found. RLMs help with the first problem; if the context is already short and well-scoped, simpler approaches are faster and cheaper. The table below compares the options.
| Approach | Best for | Latency | Context limit |
|---|---|---|---|
| PredictModule | Short, self-contained tasks | Low | Context window |
| chain_of_thought() | Complex reasoning, known context | Medium | Context window |
| rag_module() | Lookup in large corpora | Medium | Chunk size |
| rlm_module() | Exploration of large, interconnected data | High | Unlimited* |
*Bounded by max_iterations and
max_llm_calls, not by context window size.
Skip RLMs when the context is short. If the document
fits comfortably in the context window, a PredictModule or
chain_of_thought() will be faster and cheaper. The overhead
of multiple REPL round-trips is not justified when one call
suffices.
Skip RLMs when the task is well-defined. If you know what you are looking for (extracting a specific field from a known document format, say), a prompt-optimized module will outperform an RLM. RLMs spend iterations discovering a good exploration path. If you already know the path, skip the discovery.
Skip RLMs when cost-per-query matters. An RLM with 15 iterations makes at least 15 calls, plus any recursive sub-queries. For a production pipeline processing thousands of inputs, that multiplier adds up. If a single-call module with good prompting gets you 80% of the accuracy at 5% of the cost, the economics favor the simpler approach.
Skip RLMs when the context contains bad information.
RLMs gather more evidence than simpler approaches, which is usually
beneficial. But more evidence also means more surface area for
misleading content. If the context contains contradictions, outdated
facts, or adversarial content, a chain_of_thought() module
with curated context gives explicit control over what the model
sees.
Improvement Opportunities
dsprrr’s RLM implementation is functional but young. The items below are split into design constraints that affect deployment and API improvements that would make the module more ergonomic.
Design Constraints
Process-level isolation is not a security boundary. The package allowlist and pattern scanner are defense-in-depth, not a sandbox. A determined adversary could escape them. For untrusted inputs in production, OS-level containers (Docker, AppArmor, etc.) are the right layer. dsprrr should make it easy to plug those in.
No persistent REPL state across iterations. Each subprocess starts fresh. Variables created in iteration 3 are not available in iteration 4; the model has to re-derive them or read them from the printed output history. A persistent session (see below) would allow working state to accumulate across iterations, closer to how a human uses an interactive R session.
Subprocess cold-start overhead. Each REPL iteration
spawns a fresh callr::r() process, costing 200–400ms per
iteration. Over a 15-iteration run, that is 3–6 seconds of pure
subprocess overhead. Switching to a persistent
callr::r_session worker would eliminate cold starts and
enable persistent state, at the cost of weaker iteration-to-iteration
isolation.
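For reference, callr already ships the persistent-worker primitive this would build on (a sketch of the primitive itself, not of how dsprrr would integrate it):
# callr::r_session keeps one subprocess alive across calls
rs <- callr::r_session$new()
invisible(rs$run(function() {
  assign("doc_lines", rep("line", 1000L), envir = globalenv())
}))
rs$run(function() length(get("doc_lines", envir = globalenv())))
#> [1] 1000
rs$close()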
API Improvements
peek() dual-dispatch API.
peek() silently changes meaning depending on whether the
input is a single string (character positions) or a character vector
(element indices). The model has to infer which form
.context$document is in, and if it guesses wrong, the slice
is nonsensical. Splitting into peek_chars() and
peek_lines() (or adding a unit argument) would
make the contract explicit.
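A sketch of what that split could look like (a proposal, not current dsprrr API):
# Proposed, not implemented: explicit units instead of dual dispatch
peek_chars <- function(var, start = 1L, end = 1000L) {
  text <- paste(as.character(var), collapse = "\n")
  substr(text, max(1L, start), min(nchar(text), end))
}
peek_lines <- function(var, start = 1L, end = 50L) {
  lines <- as.character(var)
  lines[max(1L, start):min(length(lines), end)]
}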
search() returns raw matches, not
locations. The model gets the matched text but not the byte
offset or surrounding context, so it often has to follow up with a
peek() call to figure out where the match
occurred. Returning match positions or a concordance-style snippet would
save an iteration.
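A sketch of a location-aware variant (also a proposal, not current API):
# Proposed: return match positions plus surrounding text
search_located_sketch <- function(var, pattern, window = 60L) {
  text <- paste(as.character(var), collapse = "\n")
  m <- gregexpr(pattern, text, perl = TRUE)[[1]]
  if (m[1] == -1L) {
    return(data.frame(start = integer(), snippet = character()))
  }
  starts <- as.integer(m)
  data.frame(
    start = starts,
    snippet = substring(text, pmax(1L, starts - window), pmin(nchar(text), starts + window))
  )
}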
What’s Next
RLMs trade latency for reach. They can explore contexts that no model handles well in a single pass, with subprocess isolation, recursive sub-queries, and full optimization support through dsprrr’s teleprompters and grid search.
For a hands-on walkthrough, see
vignette("tutorial-rlm-dsprrr", package = "dsprrr"), which
uses rlm_module() to trace a real theming bug across bslib,
shiny, and brand.yml: nearly 4 million characters of source, explored in
under 15 iterations.