

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

October 9, 2025
作者: Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai
cs.AI

Abstract

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks and fail to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs on the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs have a limited effective reasoning length and struggle to allocate their thinking budget appropriately across multiple problems. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on multi-horizon reasoning tasks but also improves accuracy on standard reasoning tasks, with a gain of 7.5 points on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
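The abstract's two key mechanisms, query composition and verified rewards, can be made concrete with a short sketch. The Python below is a minimal illustration under our own assumptions, not the paper's exact construction: the `Problem` dataclass, the `compose_query` and `verified_reward` names, and the offset-injection trick for wiring one problem's answer into the next are all illustrative choices. It chains k problems with integer answers so that each later problem depends on the previous answer, and scores a response by checking every chained sub-answer.

```python
import re
from dataclasses import dataclass

@dataclass
class Problem:
    question: str  # statement containing at least one numeric constant
    answer: int    # verifiable gold answer

def compose_query(problems: list[Problem]) -> tuple[str, list[int]]:
    """Chain k problems into one long-horizon query: a numeric constant in
    problem i is replaced by a reference to problem i-1's answer, so the
    problems can only be solved sequentially."""
    parts = [f"Problem 1: {problems[0].question}"]
    gold = [problems[0].answer]
    for i, p in enumerate(problems[1:], start=2):
        m = re.search(r"\d+", p.question)
        if m is None:
            raise ValueError("each chained problem needs a numeric constant")
        original = int(m.group())
        # Offset chosen so the injected expression still evaluates to the
        # original constant; gold answers therefore stay unchanged.
        offset = original - problems[i - 2].answer
        op = "plus" if offset >= 0 else "minus"
        rewritten = (p.question[:m.start()]
                     + f"(the answer to Problem {i - 1} {op} {abs(offset)})"
                     + p.question[m.end():])
        parts.append(f"Problem {i}: {rewritten}")
        gold.append(p.answer)
    query = "\n".join(parts) + "\nSolve all problems in order and report every answer."
    return query, gold

def verified_reward(model_answers: list[int], gold: list[int]) -> float:
    """RLVR-style verified reward: fraction of chained sub-answers that are
    exactly correct (one simple choice; weighting or all-or-nothing gating
    are equally plausible variants)."""
    correct = sum(a == g for a, g in zip(model_answers, gold))
    return correct / len(gold)

# Toy usage: a 2-horizon composition.
probs = [
    Problem("What is 12 + 30?", 42),
    Problem("A tank holds 42 liters and leaks 7 liters per hour; "
            "after how many hours is it empty?", 6),
]
query, gold = compose_query(probs)  # gold == [42, 6]
print(query)
print(verified_reward([42, 6], gold))  # 1.0
```

Because the chain is built mechanically from existing verified problems, the horizon length k is a free parameter and no new annotation is needed, which is what makes the paradigm scalable, controllable, and low-cost in the sense the abstract claims.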