Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context
March 7, 2026
Authors: Keivan Alizadeh, Parshin Shojaee, Minsik Cho, Mehrdad Farajtabar
cs.AI
Abstract
Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use information across long contexts. Recent works such as Recursive Language Models (RLM) approach this challenge in an agentic way, decomposing long contexts into recursive sub-calls through programmatic interaction at inference time. While promising, the success of RLM critically depends on how these context-interaction programs are selected, a question that has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware self-reflection. SRLM leverages three intrinsic signals: self-consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model's internal uncertainty, and the model uses them to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM: a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. We find that for context lengths within the model's window, RLMs with recursion often degrade performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective on semantically intensive tasks, where heuristic program search is insufficient and broader contextual understanding is required, while self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.
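To make the selection mechanism concrete, here is a minimal sketch of how the three intrinsic uncertainty signals named in the abstract (self-consistency, reasoning length, and verbalized confidence) could be combined to rank candidate context-interaction programs. The scoring formula, equal weighting, the `max_reasoning_tokens` normalizer, and the candidate names are all illustrative assumptions, not the paper's actual method; model calls are stubbed out as pre-sampled tuples.

```python
import statistics
from collections import Counter

def self_consistency(answers):
    """Fraction of sampled answers that agree with the majority answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

def score_candidate(samples, max_reasoning_tokens=512):
    """Combine three intrinsic uncertainty signals into one score.

    Each sample is (answer, reasoning_tokens, verbalized_confidence in [0, 1]),
    i.e. the result of running one candidate program several times.
    Higher score = lower estimated uncertainty.
    """
    answers = [s[0] for s in samples]
    consistency = self_consistency(answers)
    # Shorter reasoning is treated as a sign of lower uncertainty (assumption).
    avg_len = statistics.mean(s[1] for s in samples)
    length_signal = 1.0 - min(avg_len / max_reasoning_tokens, 1.0)
    confidence = statistics.mean(s[2] for s in samples)
    # Equal weighting of the three signals is an arbitrary choice here.
    return (consistency + length_signal + confidence) / 3.0

def select_program(candidates):
    """Pick the candidate program whose samples show the least uncertainty."""
    return max(candidates, key=lambda name: score_candidate(candidates[name]))

# Toy example: "chunked_scan" answers consistently, briefly, and confidently;
# "full_dump" is inconsistent, verbose, and unsure.
candidates = {
    "chunked_scan": [("42", 120, 0.90), ("42", 100, 0.85), ("42", 130, 0.90)],
    "full_dump":    [("42", 480, 0.50), ("17", 500, 0.40), ("99", 450, 0.45)],
}
print(select_program(candidates))  # → chunked_scan
```

In this toy setup the consistent, concise, confident candidate wins; the point is only that the three signals can be read off ordinary sampled outputs without any recursion machinery, which is the property the abstract's ablation relies on.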