

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

October 30, 2025
Authors: Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng
cs.AI

Abstract

Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.