

UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

January 6, 2026
Authors: Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, Feng Zhao
cs.AI

Abstract

While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark built on a text-to-image-to-text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, while also delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
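The abstract gives no implementation details, so the following is only a minimal Python sketch, under our own assumptions, of the workflow it describes: a single model playing the Proposer, Solver, and Judge roles during self-play, plus a UniCycle-style text-to-image-to-text reconstruction check. Every class and method name here (UnifiedModel, propose, generate_image, describe, judge, self_play_round, unicycle_score), as well as the candidate count and top-k filtering, is a hypothetical placeholder rather than the paper's actual interface or training recipe.

```python
"""Illustrative sketch only: all names below are hypothetical stand-ins for the
Proposer/Solver/Judge self-play loop and the UniCycle reconstruction check
described in the abstract, not the paper's published API."""

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Interaction:
    prompt: str     # instruction produced by the Proposer role
    image: bytes    # synthesis produced by the Solver role
    caption: str    # re-description of the image by the same model
    score: float    # faithfulness score assigned by the Judge role


class UnifiedModel:
    """Placeholder for one UMM checkpoint that plays all three roles."""

    def propose(self, seed: str) -> str:
        """Proposer: expand a seed topic into a detailed T2I instruction."""
        raise NotImplementedError  # would call the model's text head

    def generate_image(self, prompt: str) -> bytes:
        """Solver: synthesize an image from the proposed instruction."""
        raise NotImplementedError  # would call the model's image head

    def describe(self, image: bytes) -> str:
        """Comprehension pass, used by the Judge and by UniCycle."""
        raise NotImplementedError

    def judge(self, prompt: str, image: bytes) -> float:
        """Judge: score prompt-image faithfulness with the model's own understanding."""
        raise NotImplementedError


def self_play_round(model: UnifiedModel, seeds: List[str],
                    samples_per_prompt: int = 4,
                    keep_top_k: int = 1) -> List[Interaction]:
    """One self-play round (Proposer -> Solver -> Judge) with no external data.

    Only the highest-scoring interactions per seed are kept; these become the
    self-generated supervision used to further fine-tune the same model.
    """
    kept: List[Interaction] = []
    for seed in seeds:
        prompt = model.propose(seed)
        candidates = []
        for _ in range(samples_per_prompt):  # sample several images per prompt
            image = model.generate_image(prompt)
            candidates.append(Interaction(
                prompt=prompt,
                image=image,
                caption=model.describe(image),
                score=model.judge(prompt, image),
            ))
        candidates.sort(key=lambda it: it.score, reverse=True)
        kept.extend(candidates[:keep_top_k])
    return kept


def unicycle_score(model: UnifiedModel, text: str,
                   similarity: Callable[[str, str], float]) -> Tuple[str, float]:
    """UniCycle-style check: text -> image -> text, then compare the two texts.

    `similarity` is any text-similarity callable (e.g. an embedding cosine);
    the benchmark's actual metric is not specified in the abstract.
    """
    image = model.generate_image(text)
    reconstruction = model.describe(image)
    return reconstruction, similarity(text, reconstruction)
```

On this reading, the Judge simply reuses the model's comprehension ability to score its own generations, which is one plausible interpretation of distilling latent understanding into explicit generative signals; the paper's actual training objective, filtering strategy, and UniCycle metric may differ.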