Dmitry Labintcev

πŸš€ Recursive Language Models: The Complete Guide to 10M+ Token Processing

🧠 RLM-Toolkit β€” The Next Paradigm After RAG
πŸ’‘ "While others wrap APIs in abstractions, we implement a new paradigm: Recursive Language Models.

πŸ“Š 10M+ tokens. πŸ’° 80% cost reduction. πŸ”’ Security-first."

description: "From theory to practice: how RLM works, how to implement it with any LLM, and why it changes everything for long-context AI applications."

series: "AI Architecture Deep Dives"

Recursive Language Models: The Complete Guide

From beginner implementation to PhD-level optimization β€” everything you need to know about the paradigm that scales LLMs to 10M+ tokens.


πŸ“– Table of Contents

  1. The Problem: Why LLMs Fail on Long Contexts
  2. The Solution: RLM Architecture Explained
  3. Hands-On: Implement RLM with Any LLM
  4. Use Cases: Where RLM Shines
  5. Model Comparison: GPT-5 vs Claude vs Qwen vs Open-Source
  6. Advanced: Optimization Techniques
  7. Security Considerations
  8. Future Directions

🎯 Who Is This For?

| Level | What You'll Learn |
| --- | --- |
| Beginner | Core concepts, first implementation |
| Intermediate | Production patterns, cost optimization |
| Advanced/Research | Formal theory, novel applications |

1. The Problem: Why LLMs Fail on Long Contexts

1.1 Context Rot: The Silent Killer

Every LLM has a context window β€” the maximum number of tokens it can process at once:

| Model | Context Window | Real-World Limit |
| --- | --- | --- |
| GPT-5.2 | 400K tokens | ~250K effective |
| Claude Opus 4.5 | 200K tokens | ~150K effective |
| Gemini 3 Pro | 1M tokens | ~800K effective |
| Llama 4 Scout | 10M tokens | ~8M effective |

Notice the gap between "advertised" and "effective"? That's context rot β€” quality degradation as context grows:

Quality(c) = Qβ‚€ Γ— e^(-Ξ»c)

where:
  Qβ‚€ = baseline quality
  Ξ»  = decay rate (model-specific)
  c  = context length
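
To make this concrete, here is a tiny numerical illustration of the decay curve; the decay rate below is an invented placeholder for demonstration, not a measured value for any model.

import math

def quality(context_tokens: int, q0: float = 1.0, decay: float = 3e-6) -> float:
    """Quality(c) = Q0 * exp(-lambda * c). The decay rate here is illustrative only."""
    return q0 * math.exp(-decay * context_tokens)

for c in (8_000, 128_000, 1_000_000):
    print(f"{c:>9,} tokens -> relative quality {quality(c):.2f}")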

1.2 The Evidence

The RLM paper (arxiv:2512.24601) reports GPT-5 performance on tasks of increasing complexity:

| Context Size | Simple Task (NIAH) | Complex Task (OOLONG-Pairs) |
| --- | --- | --- |
| 8K tokens | 98% | 72% |
| 128K tokens | 95% | 31% |
| 1M tokens | 89% | <0.1% 😱 |

Translation: For tasks requiring dense information processing (like comparing pairs across a million tokens), even GPT-5 becomes nearly useless.

1.3 Why Traditional Solutions Fail

Chunking:

# Traditional approach
chunks = split(document, size=100000)
results = [llm.analyze(chunk) for chunk in chunks]
final = merge(results)  # ❌ Loses cross-chunk context!

Summarization:

# Lossy compression
summary = llm.summarize(document)  # ❌ Details lost forever!
answer = llm.query(summary, question)

RAG (Retrieval):

# Only retrieves "similar" chunks
relevant = vectordb.search(query, k=10)  # ❌ Misses non-obvious connections!

2. The Solution: RLM Architecture Explained

2.1 The Core Insight

"Long prompts should not be fed into the neural network directly. They should be treated as part of the environment that the LLM can symbolically interact with."

β€” arxiv:2512.24601

2.2 The Paradigm Shift

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    TRADITIONAL LLM                          β”‚
β”‚                                                             β”‚
β”‚   [10M tokens] ──→ [Transformer] ──→ [Response]            β”‚
β”‚                         ↓                                   β”‚
β”‚                  ❌ CONTEXT ROT                             β”‚
β”‚                  ❌ MEMORY LIMIT                            β”‚
β”‚                  ❌ COST EXPLOSION                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

                         ⬇️ RLM REVOLUTION ⬇️

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  RECURSIVE LANGUAGE MODEL                    β”‚
β”‚                                                             β”‚
β”‚   [10M tokens] ──→ [REPL Variable]                         β”‚
β”‚                         ↓                                   β”‚
β”‚   [LLM writes Python code to analyze the variable]         β”‚
β”‚                         ↓                                   β”‚
β”‚   [llm_query() for recursive sub-LM calls]                 β”‚
β”‚                         ↓                                   β”‚
β”‚   [FINAL(answer)] ──→ [Response]                           β”‚
β”‚                                                             β”‚
β”‚                  βœ… NO CONTEXT ROT                          β”‚
β”‚                  βœ… SCALES TO 10M+                          β”‚
β”‚                  βœ… 80-90% COST REDUCTION                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2.3 The Three Components

1. REPL Environment

# The LLM operates in a Python REPL where:
context = "your 10M token document"  # Stored as variable
# LLM never "sees" all 10M tokens at once!

2. Symbolic Manipulation

# LLM writes code to explore the context:
first_1000_chars = context[:1000]
sections = context.split("---")
matching = [s for s in sections if "keyword" in s]

3. Recursive Sub-calls

# When semantic understanding is needed:
def llm_query(prompt):
    """Call a sub-LLM with up to 500K token capacity"""
    return sub_model.generate(prompt)

# Usage:
summary = llm_query(f"Summarize this section: {sections[0]}")

2.4 Formal Definition (For Researchers)

An RLM is a tuple (L, E, R, S) where:

  • L: Base language model (root LLM)
  • E: Execution environment (Python REPL)
  • R: Recursive mechanism (llm_query function)
  • S: State (context variable + accumulated variables)

State Machine:

Sβ‚€ = (context=P, vars={}, history=[], depth=0)
Transition: Sβ‚™ β†’ Sβ‚™β‚Šβ‚ via:
  - code_exec(code) β†’ updates vars, history
  - llm_query(p) β†’ depth++, adds result to vars
  - FINAL(x) β†’ terminate with output x
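
If you prefer running code to notation, here is a minimal sketch of that state machine as a Python dataclass; the field and method names are mine, not from the paper.

from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class RLMState:
    context: str                                   # P: the prompt stored as data
    vars: Dict[str, Any] = field(default_factory=dict)
    history: List[Tuple[str, str]] = field(default_factory=list)
    depth: int = 0

    def code_exec(self, code: str, output: str) -> None:
        # Transition: code execution appends to history (vars get updated by the REPL)
        self.history.append((code, output))

    def llm_query(self, name: str, result: str) -> None:
        # Transition: a recursive sub-call increments depth and stores its result
        self.depth += 1
        self.vars[name] = result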

3. Hands-On: Implement RLM with Any LLM

3.1 Minimal Implementation

import openai  # or anthropic, google.generativeai, etc.

class SimpleRLM:
    def __init__(self, model="gpt-4o"):
        self.model = model
        self.client = openai.OpenAI()

    def run(self, context: str, query: str) -> str:
        # Initialize REPL state
        repl_state = {"context": context}
        history = []

        system_prompt = f"""
You are operating in an RLM (Recursive Language Model) environment.

The variable `context` contains {len(context)} characters of text.
You can write Python code to analyze it.
Use `llm_query(prompt)` to ask semantic questions about chunks.
Return your final answer with FINAL(your_answer).

Query: {query}
"""

        while True:
            # Get next action from LLM
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    *history,
                ],
                max_tokens=4000
            )

            action = response.choices[0].message.content
            history.append({"role": "assistant", "content": action})

            # Check for final answer
            if "FINAL(" in action:
                return self._extract_final(action)

            # Execute code
            output = self._execute_code(action, repl_state)
            history.append({"role": "user", "content": f"Output:\n{output}"})

    def _execute_code(self, code: str, state: dict) -> str:
        # Extract code block
        if "```

python" in code:
            code = code.split("

```python")[1].split("```

")[0]
        elif "

```" in code:
            code = code.split("```

")[1].split("

```")[0]

        # Define llm_query for sub-calls
        def llm_query(prompt):
            resp = self.client.chat.completions.create(
                model="gpt-4o-mini",  # Cheaper for sub-calls
                messages=[{"role": "user", "content": prompt}],
                max_tokens=2000
            )
            return resp.choices[0].message.content

        state["llm_query"] = llm_query

        # Execute (⚠️ sandbox in production!)
        import io, sys
        old_stdout = sys.stdout
        sys.stdout = io.StringIO()

        try:
            exec(code, state)
            output = sys.stdout.getvalue()
        except Exception as e:
            output = f"Error: {e}"
        finally:
            sys.stdout = old_stdout

        return output[:5000]  # Truncate for context management

    def _extract_final(self, text: str) -> str:
        import re
        match = re.search(r'FINAL\((.*?)\)', text, re.DOTALL)
        return match.group(1) if match else text


# Usage
rlm = SimpleRLM()

# Load a massive document
with open("million_token_document.txt") as f:
    huge_doc = f.read()

answer = rlm.run(huge_doc, "What are the key themes across all chapters?")
print(answer)

3.2 Production-Ready Version (with any LLM)

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Dict, Any
import hashlib
import os


@dataclass
class RLMConfig:
    root_model: str
    sub_model: str
    max_depth: int = 2
    max_subcalls: int = 100
    max_cost: float = 10.0  # dollars
    timeout_seconds: int = 300


class LLMProvider(ABC):
    """Abstract base for any LLM provider"""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int) -> str:
        pass

    @abstractmethod
    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        pass


class OpenAIProvider(LLMProvider):
    """For GPT-5.2, GPT-5, GPT-4o (January 2026)"""

    def __init__(self, model: str = "gpt-5.2"):
        import openai
        self.client = openai.OpenAI()
        self.model = model

    def generate(self, prompt: str, max_tokens: int = 4000) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens
        )
        return response.choices[0].message.content

    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        # GPT-4o pricing (adjust as needed)
        return (input_tokens * 0.005 + output_tokens * 0.015) / 1000


class AnthropicProvider(LLMProvider):
    """For Claude Opus 4.5, Sonnet 4.5, Haiku 4.5 (January 2026)"""

    def __init__(self, model: str = "claude-opus-4.5-20251115"):
        import anthropic
        self.client = anthropic.Anthropic()
        self.model = model

    def generate(self, prompt: str, max_tokens: int = 4000) -> str:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text

    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * 0.003 + output_tokens * 0.015) / 1000


class QwenProvider(LLMProvider):
    """For Qwen3 via OpenAI-compatible API (January 2026)"""

    def __init__(self, model: str = "Qwen/Qwen3-235B-A22B-Instruct"):
        import openai
        self.client = openai.OpenAI(
            base_url="https://api.together.xyz/v1",  # or fireworks, hyperbolic
            api_key=os.environ["TOGETHER_API_KEY"]
        )
        self.model = model

    def generate(self, prompt: str, max_tokens: int = 4000) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens
        )
        return response.choices[0].message.content

    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens + output_tokens) * 0.0002 / 1000  # Approx


class OllamaProvider(LLMProvider):
    """For local models via Ollama β€” FREE! (Llama 4, Qwen3, Mistral 3)"""

    def __init__(self, model: str = "llama4-scout:70b"):
        import ollama
        self.model = model

    def generate(self, prompt: str, max_tokens: int = 4000) -> str:
        import ollama
        response = ollama.generate(
            model=self.model,
            prompt=prompt,
            options={"num_predict": max_tokens}
        )
        return response["response"]

    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        return 0.0  # Local = free!


class ProductionRLM:
    """Production-ready RLM with any LLM provider"""

    def __init__(self, 
                 root_provider: LLMProvider,
                 sub_provider: LLMProvider,
                 config: RLMConfig):
        self.root = root_provider
        self.sub = sub_provider
        self.config = config
        self.total_cost = 0.0
        self.subcall_count = 0

    def run(self, context: str, query: str) -> Dict[str, Any]:
        """Run RLM analysis with full telemetry."""

        repl_state = {
            "context": context,
            "context_hash": hashlib.sha256(context.encode()).hexdigest()[:16]
        }
        history = []
        iterations = 0
        max_iterations = 50

        system_prompt = self._build_system_prompt(context, query)

        while iterations < max_iterations:
            iterations += 1

            # Check budget
            if self.total_cost >= self.config.max_cost:
                return {"error": "Budget exceeded", "cost": self.total_cost}

            # Get next action
            full_prompt = system_prompt + self._format_history(history)
            action = self.root.generate(full_prompt, max_tokens=4000)
            history.append(("assistant", action))

            # Check for final
            if "FINAL(" in action or "FINAL_VAR(" in action:
                return {
                    "answer": self._extract_final(action, repl_state),
                    "iterations": iterations,
                    "subcalls": self.subcall_count,
                    "cost": self.total_cost
                }

            # Execute
            output = self._safe_execute(action, repl_state)
            history.append(("user", f"Execution output:\n{output}"))

        return {"error": "Max iterations reached", "iterations": iterations}

    def _safe_execute(self, code: str, state: dict) -> str:
        """Sandboxed code execution with sub-LM support."""

        # Extract code
        code = self._extract_code(code)

        # Define safe llm_query
        def llm_query(prompt: str) -> str:
            if self.subcall_count >= self.config.max_subcalls:
                return "[ERROR: Max subcalls reached]"

            self.subcall_count += 1
            result = self.sub.generate(prompt, max_tokens=2000)
            self.total_cost += self.sub.get_cost(len(prompt)//4, len(result)//4)
            return result

        # Sandbox with allowed builtins only
        safe_builtins = {
            'len': len, 'str': str, 'int': int, 'float': float,
            'list': list, 'dict': dict, 'set': set, 'tuple': tuple,
            'range': range, 'enumerate': enumerate, 'zip': zip,
            'sorted': sorted, 'reversed': reversed, 'sum': sum,
            'min': min, 'max': max, 'abs': abs, 'round': round,
            'print': print, 'isinstance': isinstance, 'type': type,
        }

        # Allow safe imports
        import re
        import json
        allowed_modules = {'re': re, 'json': json}

        namespace = {
            **state,
            "llm_query": llm_query,
            "__builtins__": safe_builtins,
            **allowed_modules
        }

        # Capture output
        import io, sys
        old_stdout = sys.stdout
        sys.stdout = buffer = io.StringIO()

        try:
            exec(code, namespace)
            output = buffer.getvalue()

            # Update state with new variables
            for k, v in namespace.items():
                if k not in ["__builtins__", "llm_query", "context"]:
                    if isinstance(v, (str, int, float, list, dict)):
                        state[k] = v

        except Exception as e:
            output = f"Error: {type(e).__name__}: {e}"
        finally:
            sys.stdout = old_stdout

        return output[:10000]  # Truncate

    def _build_system_prompt(self, context: str, query: str) -> str:
        return f"""# RLM Environment

You are a Recursive Language Model operating in a Python REPL.

## Available Resources
- `context`: string variable with {len(context):,} characters
- `llm_query(prompt)`: call sub-LLM for semantic analysis (max {self.config.max_subcalls} calls)
- Python code execution with: re, json, basic builtins

## Your Task
{query}

## Instructions
1. Explore the context using Python (slicing, regex, splitting)
2. Use llm_query() for semantic understanding of chunks
3. Build up your answer in variables
4. Return with FINAL(answer) or FINAL_VAR(variable_name)

## Example
```python
# Split into sections
sections = context.split("\n\n")
print(f"Found {{len(sections)}} sections")

# Analyze first section semantically
analysis = llm_query(f"What is the main topic? {{sections[0][:5000]}}")
print(analysis)
```

Begin now. Write Python code to start analyzing.
"""

    def _extract_code(self, text: str) -> str:
        if "```

python" in text:
            return text.split("

```python")[1].split("```

")[0]
        elif "

```" in text:
            return text.split("```

")[1].split("

```")[0]
        return text

    def _format_history(self, history: list) -> str:
        formatted = "\n\n---\n\n"
        for role, content in history[-10:]:  # Keep last 10 turns
            formatted += f"**{role.upper()}:**\n{content}\n\n"
        return formatted

    def _extract_final(self, text: str, state: dict) -> str:
        import re

        # FINAL_VAR(varname) β€” return variable content
        var_match = re.search(r'FINAL_VAR\((\w+)\)', text)
        if var_match:
            var_name = var_match.group(1)
            return str(state.get(var_name, f"[Variable '{var_name}' not found]"))

        # FINAL(content) β€” return content directly
        match = re.search(r'FINAL\((.*?)\)', text, re.DOTALL)
        return match.group(1) if match else text


# ============================================
# USAGE EXAMPLES WITH DIFFERENT PROVIDERS
# ============================================

# Example 1: OpenAI (GPT-5 root, GPT-4o-mini sub)
def example_openai():
    config = RLMConfig(
        root_model="gpt-5",
        sub_model="gpt-4o-mini",
        max_cost=5.0
    )

    rlm = ProductionRLM(
        root_provider=OpenAIProvider("gpt-5"),
        sub_provider=OpenAIProvider("gpt-4o-mini"),
        config=config
    )

    return rlm.run(huge_document, "Summarize all key findings")


# Example 2: Claude (Sonnet root, Haiku sub)
def example_claude():
    config = RLMConfig(
        root_model="claude-3-5-sonnet",
        sub_model="claude-3-haiku",
        max_cost=5.0
    )

    rlm = ProductionRLM(
        root_provider=AnthropicProvider("claude-3-5-sonnet-20241022"),
        sub_provider=AnthropicProvider("claude-3-haiku-20240307"),
        config=config
    )

    return rlm.run(huge_document, "Find all security vulnerabilities")


# Example 3: Fully Local with Ollama (FREE!)
def example_local():
    config = RLMConfig(
        root_model="llama3.2:70b",
        sub_model="llama3.2:8b",
        max_cost=1000.0  # Irrelevant for local
    )

    rlm = ProductionRLM(
        root_provider=OllamaProvider("llama3.2:70b"),
        sub_provider=OllamaProvider("llama3.2:8b"),
        config=config
    )

    return rlm.run(huge_document, "Analyze the codebase structure")


# Example 4: Hybrid (Cloud root, Local sub for cost savings)
def example_hybrid():
    config = RLMConfig(
        root_model="gpt-4o",
        sub_model="llama3.2:8b",
        max_cost=2.0
    )

    rlm = ProductionRLM(
        root_provider=OpenAIProvider("gpt-4o"),
        sub_provider=OllamaProvider("llama3.2:8b"),  # Free sub-calls!
        config=config
    )

    return rlm.run(huge_document, "Deep analysis with unlimited sub-calls")

4. Use Cases: Where RLM Shines

4.1 Codebase Analysis

# Analyze entire repository (10M+ tokens)
codebase = load_repository("./my_project")

result = rlm.run(codebase, """
Find all:
1. Security vulnerabilities (SQL injection, XSS, etc.)
2. Code duplication across files
3. Circular dependencies
4. Dead code
""")

Why RLM wins: Traditional tools analyze file-by-file. RLM tracks cross-file patterns like:

  • Data flowing from UserInput.java β†’ Database.java β†’ API.java
  • Circular imports spanning 5+ files
  • Duplicated logic with slightly different variable names
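
Note that `load_repository` is not defined in this article; a minimal sketch, assuming you simply want source files concatenated into one string with path markers so the root LLM can slice per file:

from pathlib import Path

def load_repository(root: str, extensions=(".py", ".js", ".ts", ".java")) -> str:
    """Concatenate source files into one string with per-file markers."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            parts.append(f"\n--- FILE: {path} ---\n{path.read_text(errors='ignore')}")
    return "".join(parts)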

4.2 Legal Document Analysis

# Analyze 500-page contract
contract = load_pdf("merger_agreement.pdf")

result = rlm.run(contract, """
1. List all parties and their obligations
2. Find conflicting clauses
3. Identify unusual terms compared to standard M&A agreements
4. Extract all deadlines and penalties
""")

4.3 Research Paper Synthesis

# Synthesize 100 papers on a topic
papers = "\n\n---PAPER---\n\n".join(load_papers("machine_learning_2024/"))

result = rlm.run(papers, """
Create a literature review covering:
1. Main research themes
2. Contradicting findings
3. Methodological trends
4. Research gaps
""")

4.4 Multi-Turn Conversation Analysis

# Analyze year of customer support conversations
conversations = load_conversations("support_2024.json")

result = rlm.run(conversations, """
Identify:
1. Most common issues
2. Escalation patterns
3. Resolution success rates by category
4. Customer sentiment progression
""")

5. Model Comparison: Which LLM for RLM? (January 2026)

5.1 Current Model Landscape

| Model | Release | Context | Specialty |
| --- | --- | --- | --- |
| GPT-5.2 | Dec 2025 | 400K | Best reasoning, 6.2% hallucination |
| Claude Opus 4.5 | Nov 2025 | 200K | Coding, creative writing |
| Gemini 3 Pro | Dec 2025 | 1M | 100% AIME 2025, long context |
| Gemini 3 Flash | Dec 2025 | 1M | 78% SWE-bench, fast |
| Qwen3-235B | Apr 2025 | 128K | Open-source flagship |
| Llama 4 Scout | Jan 2026 | 10M | Open, MoE, multimodal |
| Mistral Large 3 | Dec 2025 | 128K | 92% of GPT-5.2, cheap |
| DeepSeek V3.2 | Dec 2025 | 128K | Open-source, 685B params |

5.2 Performance Comparison for RLM

| Model | Code Gen | Sub-call Efficiency | Cost | Best For |
| --- | --- | --- | --- | --- |
| GPT-5.2 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | $$$$ | Complex reasoning, research |
| Claude Opus 4.5 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | $$$$ | Code-heavy, creative |
| Claude Sonnet 4.5 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | $$$ | Production workhorse |
| Gemini 3 Pro | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | $$$ | Native 1M context tasks |
| Gemini 3 Flash | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | $$ | Speed + value |
| Qwen3-235B | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | $ | Open-source, self-hosted |
| Llama 4 Scout | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | FREE | 10M native (!), local |
| Mistral Large 3 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | $$ | Cost-effective quality |
| DeepSeek V3.2 | ⭐⭐⭐⭐ | ⭐⭐⭐ | $ | Research, open weights |

5.3 Recommended Configurations (2026)
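
The snippets below reference GeminiProvider and DeepSeekProvider, which were not implemented in Section 3.2. Both would simply follow the LLMProvider interface; DeepSeek exposes an OpenAI-compatible API, so the QwenProvider pattern with a different base_url works there. A hypothetical GeminiProvider sketch (SDK usage and pricing figures are assumptions; verify against the current google-generativeai docs):

class GeminiProvider(LLMProvider):
    """Hypothetical Gemini provider via the google-generativeai SDK."""

    def __init__(self, model: str = "gemini-3-flash"):
        import google.generativeai as genai
        genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
        self.model = genai.GenerativeModel(model)

    def generate(self, prompt: str, max_tokens: int = 4000) -> str:
        response = self.model.generate_content(
            prompt,
            generation_config={"max_output_tokens": max_tokens},
        )
        return response.text

    def get_cost(self, input_tokens: int, output_tokens: int) -> float:
        # Placeholder rates: substitute the real per-token pricing for your tier
        return (input_tokens * 0.0003 + output_tokens * 0.0012) / 1000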

πŸ’° Budget-Conscious:

root = GeminiProvider("gemini-3-flash")    # Fast, cheap, 1M context
sub = OllamaProvider("llama4-scout:8b")    # Free local
# Total: ~$0.05 per 10M token analysis

πŸ† Maximum Quality:

root = OpenAIProvider("gpt-5.2")           # Best reasoning
sub = AnthropicProvider("claude-haiku-4.5") # Fast, accurate
# Total: ~$2-4 per 10M token analysis

πŸ”’ Privacy-First (100% Local):

root = OllamaProvider("llama4-scout:70b")  # 10M native context!
sub = OllamaProvider("qwen3:7b")           # Fast inference
# Total: $0 + electricity
# Note: Llama 4 Scout has 10M context β€” RLM optional!

🏒 Enterprise (Claude):

root = AnthropicProvider("claude-opus-4.5")   # Best code gen
sub = AnthropicProvider("claude-haiku-4.5")   # Very fast
# Total: ~$1-2 per 10M token analysis

⚑ Speed-Optimized:

root = GeminiProvider("gemini-3-flash")    # Fast + smart
sub = GeminiProvider("gemini-3-flash")     # Same for consistency
# Total: ~$0.30 per 10M, fastest option

πŸ”¬ Research (Open-Source Only):

root = DeepSeekProvider("deepseek-v3.2")   # 685B, open weights
sub = QwenProvider("qwen3-32b")            # Strong, open
# Total: Self-hosted cost only

6. Advanced: Optimization Techniques

6.1 Async Sub-calls (10x Speed)

import asyncio

async def parallel_llm_query(prompts: list) -> list:
    """Execute sub-calls in parallel."""
    tasks = [sub_provider.agenerate(p) for p in prompts]
    return await asyncio.gather(*tasks)

# In REPL code:
# chunks = split_context(context, 100000)
# results = await parallel_llm_query([f"Analyze: {c}" for c in chunks])
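
Note that `agenerate` is not part of the LLMProvider interface from Section 3.2, which is synchronous. One way to get the same parallelism without changing the providers is to fan the blocking calls out to worker threads; a sketch, assuming a provider instance (`sub`) and a `chunks` list like those in the surrounding examples:

import asyncio

async def parallel_llm_query_threaded(provider, prompts: list) -> list:
    """Run blocking provider.generate() calls concurrently in worker threads."""
    tasks = [asyncio.to_thread(provider.generate, p, 2000) for p in prompts]
    return await asyncio.gather(*tasks)

# results = asyncio.run(parallel_llm_query_threaded(sub, [f"Analyze: {c}" for c in chunks]))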

6.2 Smart Chunking

def smart_chunk(text: str, target_size: int = 100000) -> list:
    """Chunk by semantic boundaries, not arbitrary cuts."""

    # Try to split by major sections
    if "\n## " in text:  # Markdown headers
        return text.split("\n## ")
    elif "\n\n\n" in text:  # Paragraph breaks
        return text.split("\n\n\n")
    else:
        # Fallback to sentence boundaries
        import nltk  # requires a one-time nltk.download("punkt")
        sentences = nltk.sent_tokenize(text)
        chunks, current = [], ""
        for s in sentences:
            if len(current) + len(s) > target_size:
                chunks.append(current)
                current = s
            else:
                current += " " + s
        if current:
            chunks.append(current)
        return chunks
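
Inside the REPL, the root model would typically combine this with sub-calls; a short usage sketch (variable names assumed from earlier sections):

chunks = smart_chunk(context, target_size=100_000)
summaries = [llm_query(f"Summarize the key points:\n{c}") for c in chunks]
print(f"{len(chunks)} chunks -> {sum(len(s) for s in summaries):,} chars of summaries")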

6.3 Caching for Repeated Patterns

from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_llm_query(prompt: str) -> str:
    return sub_provider.generate(prompt)

def llm_query(prompt: str) -> str:
    # lru_cache already keys on the full prompt string, so no separate hash is needed
    return cached_llm_query(prompt)

6.4 Progressive Refinement

# First pass: cheap model, broad strokes
coarse_result = rlm_with_cheap_models.run(context, query)

# Second pass: expensive model, focused analysis
refined_query = f"""
Based on this initial analysis:
{coarse_result}

Now provide a detailed, accurate answer to: {query}
"""
final_result = rlm_with_expensive_models.run(relevant_sections, refined_query)
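
The two RLM instances and `relevant_sections` above are placeholders. A sketch of how the two passes could be wired up with the ProductionRLM class from Section 3.2; the model choices are illustrative, not recommendations:

# Pass 1: cheap and broad
rlm_with_cheap_models = ProductionRLM(
    root_provider=OpenAIProvider("gpt-4o-mini"),
    sub_provider=OllamaProvider("llama3.2:8b"),
    config=RLMConfig(root_model="gpt-4o-mini", sub_model="llama3.2:8b", max_cost=0.5),
)

# Pass 2: expensive and focused, pointed only at the sections pass 1 flagged
rlm_with_expensive_models = ProductionRLM(
    root_provider=OpenAIProvider("gpt-5.2"),
    sub_provider=OpenAIProvider("gpt-4o-mini"),
    config=RLMConfig(root_model="gpt-5.2", sub_model="gpt-4o-mini", max_cost=5.0),
)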

7. Security Considerations

7.1 REPL Sandboxing (CRITICAL)

# ❌ NEVER do this in production
exec(llm_generated_code)  # RCE vulnerability!

# βœ… Use restricted execution
BLOCKED = ['os', 'subprocess', 'sys', 'socket', 'eval', 'exec', '__import__']

def safe_exec(code: str, namespace: dict):
    for blocked in BLOCKED:
        if blocked in code:
            raise SecurityError(f"Blocked: {blocked}")

    # Use RestrictedPython or similar
    exec(code, {"__builtins__": SAFE_BUILTINS}, namespace)

7.2 Recursion Limits

class RecursionGuard:
    def __init__(self, max_depth=2, max_calls=100, max_cost=10.0):
        self.max_depth = max_depth
        self.max_calls = max_calls
        self.max_cost = max_cost
        self.current_depth = 0
        self.total_calls = 0
        self.total_cost = 0.0

    def check(self, cost: float):
        self.total_calls += 1
        self.total_cost += cost

        if self.current_depth > self.max_depth:
            raise RecursionError("Max depth exceeded")
        if self.total_calls > self.max_calls:
            raise RuntimeError("Max sub-calls exceeded")
        if self.total_cost > self.max_cost:
            raise RuntimeError("Budget exceeded")
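
A minimal usage sketch, wiring the guard into a sub-call wrapper like the `llm_query` closure in `ProductionRLM._safe_execute` (the `sub` provider name is assumed from Section 3.2):

guard = RecursionGuard(max_depth=2, max_calls=100, max_cost=10.0)

def guarded_llm_query(prompt: str) -> str:
    # Rough pre-call cost estimate (~4 chars per token, 500 output tokens assumed)
    estimated_cost = sub.get_cost(len(prompt) // 4, 500)
    guard.check(estimated_cost)  # raises if call count or budget limits are exceeded
    return sub.generate(prompt, max_tokens=2000)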

7.3 Context Integrity

def verify_context_integrity(original: str, current: str) -> bool:
    """Detect if context was manipulated."""
    import hashlib
    original_hash = hashlib.sha256(original.encode()).hexdigest()
    current_hash = hashlib.sha256(current.encode()).hexdigest()
    return original_hash == current_hash

8. Future Directions

8.1 Trained RLMs

Current RLMs use general-purpose LLMs. Future work:

  • RLM-specific training: Models trained to operate as RLMs
  • Better REPL awareness: Understanding of variable state
  • Efficient recursion: Knowing when (not) to sub-call

8.2 Deeper Recursion

The paper uses depth=1 (root → sub). Future work:

  • Depth=2+ for hierarchical analysis
  • Self-modifying recursion strategies
  • Meta-RLMs that optimize their own chunking

8.3 Multi-Modal RLMs

Apply RLM paradigm to:

  • Images: 1000 images as "context variable"
  • Video: Frame-by-frame semantic analysis
  • Audio: Transcript + waveform analysis

🏁 Conclusion

RLM is not just an optimization β€” it's a paradigm shift:

| Before RLM | After RLM |
| --- | --- |
| Context limit: 1M tokens | Scales to 10M+ |
| Cost: $15-30 per large analysis | Cost: $1-3 (80-90% reduction) |
| Complex tasks: <1% accuracy | Complex tasks: 58%+ accuracy |
| Cross-document patterns: missed | Cross-document patterns: detected |

Start today:

  1. Clone the implementation above
  2. Try with your own massive documents
  3. Experiment with different LLM combinations
  4. Share your results!

πŸ“š Resources


Got questions? Drop a comment below! πŸ‘‡

If this helped you β€” ❀️ and follow for more AI deep dives!
