

Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

November 27, 2025
Authors: Yifan Du, Kun Zhou, Yingqian Min, Yue Ling, Wayne Xin Zhao, Youbin Wu
cs.AI

Abstract

We study how different Chain-of-Thought (CoT) designs affect the acquisition of generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as "think with image", has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To evaluate this systematically, we focus on a controlled maze-solving benchmark where the reasoning rules are fully visual, difficulty can be tuned by grid size, and all intermediate steps can be generated automatically. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; that concise CoT containing only the essential grounding steps outperforms longer traces; and, strikingly, that CoT retaining only the minimal grounding results generalizes best across maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a "short is long" effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning.
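
To make the benchmark setup concrete, below is a minimal sketch (not the authors' released code) of how maze samples and their CoT supervision could be generated automatically. The function names (`make_maze`, `solve`, `grounding_cot`, `minimal_cot`), the random-grid construction, and the BFS solver are all illustrative assumptions standing in for the paper's actual generator; they show how a verbose coordinate trace and the minimal "grounding results only" target can be derived from the same solved maze.

```python
# Minimal sketch (assumed, not the authors' code): generate a random maze,
# solve it with BFS, and emit two hypothetical CoT supervision formats.
import random
from collections import deque

def make_maze(n, seed=0):
    """Random n x n grid; True = open cell. Start and goal are kept open."""
    rng = random.Random(seed)
    grid = [[rng.random() > 0.3 for _ in range(n)] for _ in range(n)]
    grid[0][0] = grid[n - 1][n - 1] = True
    return grid

def solve(grid):
    """BFS from (0, 0) to (n-1, n-1); returns the coordinate path or None."""
    n = len(grid)
    start, goal = (0, 0), (n - 1, n - 1)
    prev = {start: None}
    q = deque([start])
    while q:
        r, c = q.popleft()
        if (r, c) == goal:
            path, cur = [], goal
            while cur is not None:          # walk parents back to the start
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and grid[nr][nc] and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                q.append((nr, nc))
    return None

def grounding_cot(path):
    """Verbose Grounding-CoT-style trace: one step per move, with coordinates."""
    return "\n".join(f"Step {i}: move to {p}" for i, p in enumerate(path[1:], 1))

def minimal_cot(path):
    """Minimal variant: only the grounding result, i.e. the coordinate list."""
    return "Path: " + " -> ".join(str(p) for p in path)

if __name__ == "__main__":
    for n in (5, 7):                        # difficulty scales with grid size
        path, seed = None, 0
        while path is None:                 # resample until the maze is solvable
            path = solve(make_maze(n, seed))
            seed += 1
        print(f"--- {n}x{n} maze ---")
        print(minimal_cot(path))
```

Because the solver yields the full coordinate path, both the verbose trace and the minimal "results only" target come from the same sample, which is what makes a controlled comparison across CoT formats possible.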