MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
January 29, 2026
Authors: Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, Lijun Wu
cs.AI
Abstract
Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack the consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is constructed via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B versions. Our models establish new state-of-the-art results for their size classes. Notably, MMFineReason-4B surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. Crucially, we uncover a "less is more" phenomenon via our difficulty-aware filtering strategy: a subset of just 7% (123K samples) achieves performance comparable to the full dataset. Furthermore, we reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.
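
The abstract does not specify how the difficulty-aware filtering stage scores samples; the sketch below is a minimal, illustrative interpretation only, assuming difficulty is estimated from a baseline model's failure rate over several sampled attempts. The names Sample, estimate_difficulty, difficulty_aware_filter, and the grade_attempt callable are hypothetical and introduced here for illustration, not from the paper.

from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Sample:
    question: str
    image_path: str
    answer: str
    cot: str  # distilled chain-of-thought rationale

def estimate_difficulty(sample: Sample,
                        grade_attempt: Callable[[Sample], bool],
                        n_attempts: int = 8) -> float:
    # Difficulty proxy: fraction of baseline attempts that fail to match the
    # reference answer (0.0 = trivially easy, 1.0 = never solved).
    failures = sum(1 for _ in range(n_attempts) if not grade_attempt(sample))
    return failures / n_attempts

def difficulty_aware_filter(samples: Iterable[Sample],
                            grade_attempt: Callable[[Sample], bool],
                            low: float = 0.25,
                            high: float = 0.875) -> List[Sample]:
    # Keep a medium-to-hard difficulty band; drop trivial items and extreme
    # cases that are likely mislabeled or unsolvable from the image alone.
    return [s for s in samples
            if low <= estimate_difficulty(s, grade_attempt) <= high]

Under this reading, the reported 7% (123K-sample) subset would correspond to the samples whose estimated difficulty falls inside the retained band; the actual thresholds and grading procedure used by the authors may differ.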