
UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

January 6, 2026
Authors: Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, Feng Zhao
cs.AI

Abstract

While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, while also delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results show that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
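
The self-play pipeline described in the abstract can be pictured as a single loop in which one model alternates between the Proposer, Solver, and Judge roles and keeps only its most self-consistent outputs. The Python sketch below is a hypothetical illustration of that idea, not the authors' released code: the `umm` object and its `generate_text`, `generate_image`, and `score_consistency` methods are assumed interfaces, and the top-k filtering heuristic is an assumption as well.

```python
# Minimal sketch of one Proposer -> Solver -> Judge self-play round.
# All names and method signatures here are hypothetical, not the paper's API.
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Interaction:
    prompt: str    # instruction written by the Proposer
    image: Any     # synthesis produced by the Solver
    caption: str   # Judge's description of the image, closing the loop back to text
    score: float   # Judge's consistency score between prompt and caption


def self_play_round(umm: Any, seed_topics: List[str], keep_top_k: int = 64) -> List[Interaction]:
    """One round of self-generated supervision: a single UMM plays all three roles."""
    pool: List[Interaction] = []
    for topic in seed_topics:
        # Proposer: the UMM writes a detailed text-to-image instruction.
        prompt = umm.generate_text(f"Write a precise image-generation prompt about: {topic}")
        # Solver: the same UMM synthesizes an image from its own instruction.
        image = umm.generate_image(prompt)
        # Judge: the UMM captions the image and scores prompt/caption agreement,
        # mirroring the Text -> Image -> Text reconstruction used by UniCycle.
        caption = umm.generate_text("Describe this image in detail.", image=image)
        score = umm.score_consistency(prompt, caption)
        pool.append(Interaction(prompt, image, caption, score))
    # Keep only the most self-consistent interactions as training signal
    # for the next round of refinement.
    pool.sort(key=lambda it: it.score, reverse=True)
    return pool[:keep_top_k]
```

The retained interactions would then serve as the self-generated supervision for fine-tuning, with no external data or teacher model involved.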