Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Abstract

Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking: they produce unnecessarily long reasoning chains regardless of the task. We attribute this issue to Reasoning Path Redundancy in visual reasoning, meaning that many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. AVR further enables a model to choose dynamically among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50-90% while maintaining overall accuracy, particularly on perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.
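
To make the training idea concrete, the following is a minimal sketch of how a format-aware reward could interact with group-relative advantages. It is not the FS-GRPO objective itself, which the abstract does not specify. The within-group standardization follows standard Group Relative Policy Optimization; the bonus values in `FORMAT_BONUS`, the format keys, and the function names are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev

# Hypothetical bonuses (assumption, not from the paper): shorter formats
# earn a larger bonus, but only when the answer is correct.
FORMAT_BONUS = {"full": 0.0, "perception_only": 0.25, "direct_answer": 0.5}


def reward(is_correct: bool, response_format: str) -> float:
    """Correctness reward plus an efficiency bonus applied only to correct answers."""
    if not is_correct:
        return 0.0
    return 1.0 + FORMAT_BONUS[response_format]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same question: (is_correct, response_format).
samples = [
    (True, "full"),
    (True, "direct_answer"),
    (False, "direct_answer"),
    (True, "perception_only"),
]
rewards = [reward(correct, fmt) for correct, fmt in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumed bonuses, the correct Direct Answer sample receives the largest advantage and the incorrect one the smallest, so a shorter format is favored only when it does not cost correctness.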