SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
May 12, 2026
Authors: Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin
cs.AI
Abstract
Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. To this end, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional and knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without thinking modes. Beyond performance, we detail model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models no longer translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.