
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

May 12, 2026
Authors: Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin
cs.AI

Abstract

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We release two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without thinking patterns. Beyond performance, we detail the model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap in which models do not translate between modalities but think and act across them natively. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
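The abstract does not disclose the internals of NEO-unify, but the "-MoT" suffix in the model names plausibly refers to a Mixture-of-Transformers layout, in which all modalities share global self-attention over one interleaved sequence while each modality routes through its own feed-forward weights. The sketch below is a minimal, hypothetical illustration of that general idea, not the released architecture; the MoTBlock class, the modality-id routing, and all dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """One transformer block in a Mixture-of-Transformers style:
    attention is shared across all tokens, while each modality is
    routed through its own feed-forward network (FFN)."""

    def __init__(self, dim: int, n_heads: int, n_modalities: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN per modality (e.g. 0 = text tokens, 1 = image tokens).
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_modalities)
        )

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # Global attention: every token attends over the full interleaved
        # sequence, so understanding and generation share one context.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Modality-specific FFN: send each token to the FFN of its modality.
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, ffn in enumerate(self.ffns):
            mask = modality_ids == m  # (batch, seq) boolean mask
            if mask.any():
                out[mask] = ffn(h[mask])
        return x + out

# Usage: an interleaved sequence of 6 text tokens followed by 10 image tokens.
block = MoTBlock(dim=256, n_heads=8)
x = torch.randn(1, 16, 256)
modality_ids = torch.tensor([[0] * 6 + [1] * 10])
y = block(x, modality_ids)
print(y.shape)  # torch.Size([1, 16, 256])
```

Under this reading, the shared attention is what lets a single model treat understanding and generation as "synergistic views of a single underlying process", while the per-modality FFNs give each modality dedicated capacity without a cascaded pipeline.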