ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
October 30, 2025
Authors: Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng
cs.AI
Abstract
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (an average improvement of 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
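
To make the notion of an "interleaved reasoning trace" concrete, here is a minimal sketch of what one training record might look like. The schema (field names, the text/image step types, the file names) is a hypothetical illustration, not the paper's actual data format; it only shows the core idea of alternating textual and visual thoughts that each advance the reasoning, with image steps produced by concretely manipulating the visual input.

```python
# Hypothetical record format for one interleaved text-image reasoning trace.
# All field names and file names are illustrative assumptions, not the
# paper's actual schema.
trace = {
    "question": "Which path leads out of the maze?",
    "input_image": "maze_001.png",
    "steps": [
        {"type": "text",  "content": "First, mark the entrance and the exit."},
        {"type": "image", "content": "maze_001_marked.png"},  # edited visual thought
        {"type": "text",  "content": "Trace the only corridor connecting them."},
        {"type": "image", "content": "maze_001_path.png"},    # progressive manipulation
        {"type": "text",  "content": "The highlighted route is answer B."},
    ],
    "answer": "B",
}

def render_trace(t):
    """Print a trace so the alternating text/image thoughts are easy to inspect."""
    print(f"Q: {t['question']} (image: {t['input_image']})")
    for i, step in enumerate(t["steps"], 1):
        tag = "TXT" if step["type"] == "text" else "IMG"
        print(f"  step {i} [{tag}] {step['content']}")
    print(f"A: {t['answer']}")

render_trace(trace)
```

Note how each image step is a new, edited artifact rather than a restatement of the text step before it, reflecting the paper's principle that the two modalities should be complementary rather than isomorphic.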
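
The abstract also credits better test-time scaling to diversified multimodal thoughts. A generic way to exploit such diversity is self-consistency-style voting over sampled reasoning paths; the sketch below illustrates that baseline under stated assumptions. The `generate(question, image, mode)` callable and the mode names are hypothetical stand-ins for the model's decoding interface, and the voting scheme is a standard baseline, not the authors' exact procedure.

```python
from collections import Counter
import random

def scaled_answer(generate, question, image, n_samples=8):
    """Sample several reasoning traces and majority-vote over final answers.

    `generate` is an assumed callable returning (reasoning_trace, final_answer);
    alternating the reasoning mode is one simple way to diversify samples.
    """
    answers = []
    for i in range(n_samples):
        mode = "interleaved" if i % 2 == 0 else "text-only"  # hypothetical modes
        _trace, answer = generate(question, image, mode=mode)
        answers.append(answer)
    # Return the most common final answer across the sampled traces.
    return Counter(answers).most_common(1)[0][0]

# Toy demonstration with a stubbed generator that answers "B" most of the time.
def _stub_generate(question, image, mode):
    return ([], "B" if random.random() < 0.7 else "C")

print(scaled_answer(_stub_generate, "Which path exits the maze?", "maze.png"))
```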