SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
May 12, 2026
Authors: Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin
cs.AI
Abstract
Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. To this end, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional and knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without thinking modes. Beyond performance, we detail model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models no longer translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.