R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
October 9, 2025
作者: Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai
cs.AI
Abstract
Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1,
DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought
(CoT). However, existing benchmarks mainly focus on immediate, single-horizon
tasks, failing to adequately evaluate models' ability to understand and respond
to complex, long-horizon scenarios. To address this incomplete evaluation of
Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to
stimulate long-horizon reasoning behaviors in LRMs through query composition.
Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising
complex multi-step reasoning tasks with interdependent problems that span long
reasoning horizons. Through comprehensive evaluation of LRMs using the
R-HORIZON benchmark, we find that even the most advanced LRMs suffer
significant performance degradation. Our analysis reveals that LRMs exhibit
limited effective reasoning length and struggle to allocate thinking budget
across multiple problems appropriately. Recognizing these limitations, we use
R-HORIZON to construct long-horizon reasoning data for reinforcement learning
with verified rewards (RLVR). Compared to training with single-horizon data,
RLVR with R-HORIZON not only substantially improves performance on the
multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning
tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as
a scalable, controllable, and low-cost paradigm for enhancing and evaluating
the long-horizon reasoning capabilities of LRMs.
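The core idea of query composition can be sketched as follows. This is a hypothetical illustration, not the paper's actual implementation: the function name, the `{prev_answer}` placeholder convention, and the toy problems are all assumptions. The sketch shows how k independent problems can be chained so that each problem's answer becomes an input to the next, yielding a single long-horizon task whose final answer serves as a verifiable reward target for RLVR.

```python
# Hypothetical sketch of R-HORIZON-style query composition (illustrative
# names; the paper's exact construction may differ). Chaining problems so
# each answer parameterizes the next forces long-horizon reasoning.

def compose_queries(problems, link_token="{prev_answer}"):
    """Chain problems whose statements may reference the previous answer.

    Each problem is a dict with a 'statement' (optionally containing
    link_token) and a callable 'solve' mapping the injected value to its
    answer. Returns the composed prompt and the final gold answer, which
    can serve as the verifiable target for RLVR.
    """
    prompt_parts = []
    prev = None
    for i, p in enumerate(problems):
        stmt = p["statement"]
        if prev is not None:
            # Inject the previous problem's answer into this statement.
            stmt = stmt.replace(link_token, str(prev))
        prompt_parts.append(f"Problem {i + 1}: {stmt}")
        prev = p["solve"](prev)
    return "\n".join(prompt_parts), prev

# Example: two toy arithmetic problems chained into one query.
chain = [
    {"statement": "Compute 3 * 4.", "solve": lambda _: 3 * 4},
    {"statement": "Add 5 to {prev_answer}.", "solve": lambda x: x + 5},
]
prompt, gold = compose_queries(chain)
# gold == 17; the composed prompt embeds 12 into the second statement
```

Because only the final answer needs to be checked, the composition is controllable (chain length sets the reasoning horizon) and cheap to scale from existing single-horizon datasets, matching the paradigm the abstract describes.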