

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

February 12, 2026
Authors: Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan, Ziqi Huang, Animesh Sinha, Xiaoliang Dai, Jialiang Wang, Zecheng He, Jianwei Yang, Chunyuan Li, Junzhe Sun, Chu Wang, Serena Yeung-Levy, Felix Juefei-Xu
cs.AI

Abstract

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
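The sequential reason-verify-refine loop described above can be sketched minimally as follows. This is an illustrative toy, not the UniT implementation: the `ToyModel` class, the `generate`/`verify`/`refine` method names, and the quality-threshold acceptance criterion are all hypothetical stand-ins for the unified model's actual generation, self-verification, and refinement steps.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    ok: bool  # whether the verifier accepts the current output


class ToyModel:
    """Hypothetical stand-in for a unified multimodal model."""

    def generate(self, instruction):
        # Initial single-pass output (round 0).
        return {"draft": instruction, "quality": 0}

    def verify(self, instruction, output):
        # Illustrative acceptance criterion: quality above a threshold.
        return Feedback(ok=output["quality"] >= 2)

    def refine(self, instruction, output):
        # Each refinement round improves the output incrementally.
        return {**output, "quality": output["quality"] + 1}


def sequential_tts(model, instruction, max_rounds=4):
    """One sequential chain: generate once, then verify/refine per round.

    Unlike parallel sampling (N independent drafts, pick the best),
    each round conditions on the previous output and its critique.
    """
    output = model.generate(instruction)
    for _ in range(max_rounds):
        if model.verify(instruction, output).ok:
            break
        output = model.refine(instruction, output)
    return output


result = sequential_tts(ToyModel(), "a cat to the left of a red cube")
# The toy verifier accepts after two refinement rounds.
```

The contrast with parallel sampling is that compute here is spent on dependent rounds, each informed by verification of the last, rather than on independent candidates scored after the fact.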