

MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

January 29, 2026
作者: Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, Lijun Wu
cs.AI

Abstract

Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack the consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is built via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B. Our models establish new state-of-the-art results for their size class: MMFineReason-4B successfully surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. Crucially, we uncover a "less is more" phenomenon via our difficulty-aware filtering strategy: a subset of just 7% (123K samples) achieves performance comparable to the full dataset. Finally, we reveal a synergistic effect in which reasoning-oriented data composition simultaneously boosts general capabilities.
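
The abstract does not specify how difficulty is scored in stage (3). As one plausible reading of "difficulty-aware selection," the sketch below filters samples by the teacher model's pass rate over several distilled rollouts; the `Sample` structure, the `low`/`high` thresholds, and the pass-rate heuristic are all illustrative assumptions, not the authors' published method.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One multimodal reasoning item with teacher-distilled rollouts (hypothetical schema)."""
    question: str
    image_path: str
    answer: str
    rationales: list[str] = field(default_factory=list)      # CoT traces from the teacher VLM
    correct_flags: list[bool] = field(default_factory=list)  # did each trace reach `answer`?


def pass_rate(sample: Sample) -> float:
    """Fraction of teacher rollouts that solve the item correctly."""
    if not sample.correct_flags:
        return 0.0
    return sum(sample.correct_flags) / len(sample.correct_flags)


def difficulty_aware_filter(samples: list[Sample],
                            low: float = 0.1,
                            high: float = 0.9) -> list[Sample]:
    """Keep items the teacher solves sometimes but not always.

    Near-perfect pass rates suggest trivially easy items; near-zero
    pass rates often indicate noisy labels or unsolvable items.
    Thresholds are assumed values, not from the paper.
    """
    kept: list[Sample] = []
    for s in samples:
        p = pass_rate(s)
        if low <= p <= high:
            # Retain only rationales that end in the correct answer.
            s.rationales = [r for r, ok in zip(s.rationales, s.correct_flags) if ok]
            kept.append(s)
    return kept
```

Under this heuristic, both trivially easy items and likely-mislabeled or unsolvable ones are discarded, which is one common way a small, hard-but-solvable subset such as the reported 7% (123K-sample) split could be carved out.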