ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping
October 9, 2025
Authors: Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, Nanyun Peng
cs.AI
Abstract
Recent advances in multimodal large reasoning models (MLRMs) have
substantially improved their ability to solve complex textual and visual tasks.
However, these models tend to overthink simple problems, producing
unnecessarily lengthy reasoning traces, while under-exploring challenging
ones, leading to missed solutions. To address this imbalance, we propose ARES,
a unified open-source framework for adaptive reasoning that dynamically
allocates exploration effort based on task difficulty. Our approach is
motivated by two key empirical findings: (i) while single-token entropy is
noisy, high window-entropy (HWE) tokens (token-level entropies averaged under a
sliding window) can reliably capture reasoning-critical moments; and (ii)
reducing HWE usage benefits easy problems, while increasing it is essential for
solving hard ones. Building on these insights, ARES introduces a two-stage
training pipeline. In the Adaptive Cold-Start stage, we curate multimodal and
textual data paired with reasoning traces of length proportional to problem
difficulty, equipping the model with initial difficulty awareness. In the
second stage, we develop Adaptive Entropy Policy Optimization (AEPO), which
uses HWE tokens as exploration triggers to decide when to explore, and a
hierarchical entropy reward with dynamic KL control to decide how much to
explore. Extensive experiments demonstrate that ARES achieves superior
performance and reasoning efficiency across diverse mathematical, logical, and
multimodal benchmarks, while narrowing the gap to leading commercial systems
at significantly lower inference cost.
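
To make the notion of window entropy concrete, the following is a minimal sketch of how token-level entropies might be smoothed with a sliding window and then thresholded to flag HWE positions. The abstract does not specify the window shape or the selection rule, so the trailing window, the fixed threshold, and the function names (token_entropy, window_entropy, hwe_positions) are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def window_entropy(entropies: Sequence[float], window: int) -> List[float]:
    """Mean entropy over a trailing window ending at each position.

    Assumption: a trailing window that is shorter at the start of the
    sequence. The paper may use a different window definition.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    averaged: List[float] = []
    running = 0.0
    for i, h in enumerate(entropies):
        running += h
        if i >= window:
            running -= entropies[i - window]  # drop the value leaving the window
        averaged.append(running / min(i + 1, window))
    return averaged


def hwe_positions(entropies: Sequence[float], window: int, threshold: float) -> List[int]:
    """Indices whose windowed entropy exceeds a threshold.

    Assumption: a fixed threshold stands in for whatever selection
    rule the authors actually use.
    """
    smoothed = window_entropy(entropies, window)
    return [i for i, w in enumerate(smoothed) if w > threshold]


# Hypothetical per-token entropies: one isolated spike (index 2)
# followed later by a sustained high-entropy run (indices 5 to 7).
entropies = [0.1, 0.1, 2.0, 0.1, 0.1, 1.5, 1.6, 1.7, 0.1]
flagged = hwe_positions(entropies, window=3, threshold=1.0)
print(flagged)  # [6, 7, 8]
```

In this toy input the isolated spike at index 2 is averaged away, while the sustained run is flagged, which is the intuition behind preferring windowed entropy over noisy single-token entropy. Because the window here is trailing, index 8 is flagged even though its own entropy is low; a centered window would shift the flagged span earlier.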