
Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context

March 7, 2026
作者: Keivan Alizadeh, Parshin Shojaee, Minsik Cho, Mehrdad Farajtabar
cs.AI

Abstract

Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use information spread across long contexts. Recent work such as Recursive Language Models (RLM) approaches this challenge agentically, decomposing long contexts into recursive sub-calls through programmatic interaction at inference time. While promising, the success of RLM depends critically on how these context-interaction programs are selected, a question that has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware self-reflection. SRLM leverages three intrinsic signals: self-consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model's internal uncertainty, which the model uses to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models show that SRLM consistently outperforms state-of-the-art baselines, yielding up to a 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of RLM's performance: a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. For context lengths within the model's window, RLMs with recursion often degrade performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. Finally, RLM is less effective on semantically intensive tasks, where heuristic program search is insufficient and broader contextual understanding is required; self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.
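To make the idea concrete, the three intrinsic signals could be combined into a single score for ranking candidate context-interaction programs along the following lines. This is a minimal illustrative sketch, not the authors' implementation: the function names, the linear weighting, and the length-penalty form are all assumptions.

```python
# Illustrative sketch of uncertainty-aware program scoring (all names,
# weights, and formulas here are assumptions, not the paper's code).
from collections import Counter

def self_consistency(answers):
    """Fraction of sampled answers that agree with the majority answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def length_signal(reasoning_tokens, scale=1000):
    """Treat longer reasoning chains as a sign of higher uncertainty
    (maps token count to a confidence-like value in (0, 1])."""
    return 1.0 / (1.0 + reasoning_tokens / scale)

def score_program(answers, reasoning_tokens, verbalized_conf,
                  weights=(0.5, 0.2, 0.3)):
    """Linearly combine the three signals into one score in [0, 1].
    `verbalized_conf` is the model's stated confidence in [0, 1]."""
    w_sc, w_len, w_conf = weights
    return (w_sc * self_consistency(answers)
            + w_len * length_signal(reasoning_tokens)
            + w_conf * verbalized_conf)

# Pick the candidate context-interaction program with the highest score.
# Candidate names and numbers below are made up for illustration.
candidates = {
    "chunk_and_summarize": (["A", "A", "B", "A"], 800, 0.9),
    "grep_then_read":      (["C", "D", "C", "B"], 2400, 0.6),
}
best = max(candidates, key=lambda name: score_program(*candidates[name]))
```

Under these toy numbers, the first candidate wins on all three signals (more consistent answers, shorter reasoning, higher stated confidence), so the search would expand it rather than the second.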