Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale

November 7, 2025
Authors: David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu, Prithviraj Ammanabrolu, Hyunwoo Kim, Yuan-Hong Liao, Yejin Choi
cs.AI

Abstract

Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity, yielding over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the rich and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that a Qwen2.5-VL-7B model finetuned on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench, and MMStar-V. Perhaps most surprisingly, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU). Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL's performance while reducing compute demands, and (iii) careful SFT on high-quality data can substantially improve out-of-domain, cross-modality transfer.
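
To make the two-stage trace synthesis described in the abstract concrete, here is a minimal Python sketch of the general idea: a VLM first produces grounded visual observations, a text-only reasoning LLM then expands them into a long chain-of-thought, and the trace is kept only if its final answer agrees with the reference. The model names, prompts, the `generate` helper, and the answer-matching filter are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of a VLM + reasoning-LLM trace synthesis pipeline.
from dataclasses import dataclass


@dataclass
class Example:
    image_path: str
    question: str
    reference_answer: str


def generate(model: str, prompt: str, image_path: str | None = None) -> str:
    """Placeholder for an inference call to a hosted VLM or LLM (assumed API)."""
    raise NotImplementedError("plug in your own inference backend here")


def synthesize_trace(ex: Example,
                     vlm: str = "some-vlm",
                     reasoner: str = "some-reasoning-llm") -> dict | None:
    # Stage 1: the VLM extracts grounded, vision-centric observations from the image.
    observations = generate(
        vlm,
        f"List the visual evidence needed to answer: {ex.question}",
        image_path=ex.image_path,
    )
    # Stage 2: a text-only reasoning LLM turns the observations into a long,
    # non-linear chain-of-thought that ends in a final answer.
    trace = generate(
        reasoner,
        "Using only these observations, think step by step (revisit and verify as needed), "
        f"then answer.\nObservations: {observations}\nQuestion: {ex.question}",
    )
    # Keep the trace only if its final answer agrees with the reference,
    # so downstream SFT sees verified chains.
    final_answer = trace.splitlines()[-1].strip() if trace else ""
    if ex.reference_answer.lower() in final_answer.lower():
        return {"question": ex.question, "image": ex.image_path, "cot": trace}
    return None
```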
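
Findings (i) and (ii) contrast SFT followed by staged offline RL with fully online RL. The sketch below is a hypothetical rendering of such a staged recipe, assuming DPO-style preference optimization; every helper (`run_sft`, `sample_responses`, `score`, `run_dpo`) and the number of stages are placeholders rather than details from the paper.

```python
# Illustrative sketch (not the paper's training code) of SFT followed by staged offline RL.

def run_sft(ckpt, data):
    """Placeholder: supervised finetuning on chain-of-thought traces (assumed helper)."""
    raise NotImplementedError


def sample_responses(ckpt, prompt, n=8):
    """Placeholder: sample n candidate responses from the current checkpoint."""
    raise NotImplementedError


def score(response):
    """Placeholder: verifier or reward model used to rank candidates (assumed helper)."""
    raise NotImplementedError


def run_dpo(ckpt, pairs):
    """Placeholder: one offline preference-optimization pass (e.g., DPO) over the pairs."""
    raise NotImplementedError


def post_train(base_ckpt, sft_data, prompts, stages=2):
    # SFT first: the abstract argues that SFT on high-quality, non-linear reasoning
    # traces is what makes subsequent RL effective.
    ckpt = run_sft(base_ckpt, sft_data)

    # Staged offline RL: each stage re-samples responses from the latest checkpoint,
    # builds chosen/rejected pairs, and trains offline, rather than updating the
    # policy after every rollout as online RL would.
    for _ in range(stages):
        pairs = []
        for prompt in prompts:
            ranked = sorted(sample_responses(ckpt, prompt), key=score, reverse=True)
            pairs.append({"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]})
        ckpt = run_dpo(ckpt, pairs)
    return ckpt
```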