ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Abstract

Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
