LongCat-Next: Lexicalizing Modalities as Discrete Tokens

March 29, 2026
Authors: Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan Bai, Yan Feng, Yanjie Li, Yao Qiu, Yerui Sun, Yifan Lu, Ying Luo, Yipeng Mei, Yitian Chen, Yuchen Xie, Yufang Liu, Yufei Chen, Yulei Qian, Yuqi Peng, Zhihang Yu, Zhixiong Han, Changran Wang, Chen Chen, Dian Zheng, Fengjiao Chen, Ge Yang, Haowei Guo, Haozhe Wang, Hongyu Li, Huicheng Jiang, Jiale Hong, Jialv Zou, Jiamu Li, Jianping Lin, Jiaxing Liu, Jie Yang, Jing Jin, Jun Kuang, Juncheng She, Kunming Luo, Kuofeng Gao, Lin Qiu, Linsen Guo, Mianqiu Huang, Qi Li, Qian Wang, Rumei Li, Siyu Ren, Wei Wang, Wenlong He, Xi Chen, Xiao Liu, Xiaoyu Li, Xu Huang, Xuanyu Zhu, Xuezhi Cao, Yaoming Zhu, Yifei Cao, Yimeng Jia, Yizhen Jiang, Yufei Gao, Zeyang Hu, Zhenlong Yuan, Zijian Zhang, Ziwen Wang
cs.AI

Abstract

The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As a step toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
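The abstract's central idea is that once every modality is lexicalized into a shared discrete vocabulary, one ordinary next-token-prediction objective can cover text, vision, and audio alike. The following is a minimal, self-contained sketch of that idea, not the authors' released implementation: the vocabulary size, model dimensions, and random placeholder token ids are illustrative assumptions, and in the real system the non-text ids would come from dNaViT-style visual and audio tokenizers.

```python
# Minimal sketch (assumed, not LongCat-Next's actual code) of a single
# autoregressive objective over an interleaved sequence of discrete ids
# drawn from one shared vocabulary covering text, visual, and audio codes.
import torch
import torch.nn as nn

SHARED_VOCAB_SIZE = 65_536   # hypothetical: text + visual + audio codes
D_MODEL = 512                # hypothetical model width

class TinyMultimodalDecoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int) -> None:
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position attends only to earlier tokens.
        seq_len = token_ids.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        hidden = self.blocks(self.embed(token_ids), mask=mask)
        return self.lm_head(hidden)

# Placeholder interleaved sequence; a real sequence would mix ids emitted
# by the text tokenizer and by the visual/audio tokenizers.
tokens = torch.randint(0, SHARED_VOCAB_SIZE, (2, 128))
model = TinyMultimodalDecoder(SHARED_VOCAB_SIZE, D_MODEL)
logits = model(tokens[:, :-1])
# Standard next-token prediction loss, identical regardless of modality.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, SHARED_VOCAB_SIZE), tokens[:, 1:].reshape(-1)
)
loss.backward()
```

The point of the sketch is that no modality-specific head or loss appears anywhere: once signals are discrete tokens in a shared space, understanding and generation reduce to the same NTP training loop.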