

DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

December 30, 2025
Authors: Zefeng He, Xiaoye Qu, Yafu Li, Tong Zhu, Siyuan Huang, Yu Cheng
cs.AI

Abstract

While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance on complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm and revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models, including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.
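
To make the "reasoning as image-to-image generation" formulation concrete, the minimal sketch below uses an off-the-shelf Hugging Face diffusers img2img pipeline as a stand-in for the DiffThinker model (whose interface is not specified in the abstract): the task state is rendered as an input image, and the diffusion model is conditioned to generate an image depicting the solved configuration. The checkpoint, prompt, and file names are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of image-to-image reasoning with a generic diffusion pipeline.
# The checkpoint, prompt wording, and file names are hypothetical placeholders.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load a standard img2img diffusion pipeline (placeholder checkpoint).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# The task state (e.g., a maze, board, or layout) is rendered as an image.
task_image = Image.open("task_state.png").convert("RGB").resize((512, 512))

# A "reasoning step" is a conditional generation: the model is asked to
# produce an image showing the solved configuration of the input task.
result = pipe(
    prompt="the solved configuration of the puzzle shown in the input image",
    image=task_image,
    strength=0.7,
    guidance_scale=7.5,
)
result.images[0].save("predicted_solution.png")
```

In this view, logical and spatial constraints are enforced directly in pixel space by the generative model rather than through a text-based chain of thought, which is the property the abstract credits for improved logical consistency and spatial precision on vision-centric tasks.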