
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

May 12, 2026
Authors: Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin
cs.AI

Abstract

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We release two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without thinking patterns. Beyond performance, we detail the model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap in which models do not translate between modalities but think and act across them natively. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
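The abstract does not disclose the internals of NEO-unify, but the "-MoT" suffix in the model names plausibly refers to a Mixture-of-Transformers layout, in which all modalities share global self-attention over one interleaved sequence while each modality routes through its own feed-forward weights. The sketch below is a minimal, hypothetical illustration of that general idea, not the released architecture; the MoTBlock class, the modality-id routing, and all dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """One transformer block in a Mixture-of-Transformers style:
    attention is shared across all tokens, while each modality is
    routed through its own feed-forward network (FFN)."""

    def __init__(self, dim: int, n_heads: int, n_modalities: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN per modality (e.g. 0 = text tokens, 1 = image tokens).
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_modalities)
        )

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # Global attention: every token attends over the full interleaved
        # sequence, so understanding and generation share one context.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Modality-specific FFN: send each token to the FFN of its modality.
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, ffn in enumerate(self.ffns):
            mask = modality_ids == m  # (batch, seq) boolean mask
            if mask.any():
                out[mask] = ffn(h[mask])
        return x + out

# Usage: an interleaved sequence of 6 text tokens followed by 10 image tokens.
block = MoTBlock(dim=256, n_heads=8)
x = torch.randn(1, 16, 256)
modality_ids = torch.tensor([[0] * 6 + [1] * 10])
y = block(x, modality_ids)
print(y.shape)  # torch.Size([1, 16, 256])
```

Under this reading, the shared attention is what lets a single model treat understanding and generation as "synergistic views of a single underlying process", while the per-modality FFNs give each modality dedicated capacity without a cascaded pipeline.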