

LongCat-Next: Lexicalizing Modalities as Discrete Tokens

March 29, 2026
作者: Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan Bai, Yan Feng, Yanjie Li, Yao Qiu, Yerui Sun, Yifan Lu, Ying Luo, Yipeng Mei, Yitian Chen, Yuchen Xie, Yufang Liu, Yufei Chen, Yulei Qian, Yuqi Peng, Zhihang Yu, Zhixiong Han, Changran Wang, Chen Chen, Dian Zheng, Fengjiao Chen, Ge Yang, Haowei Guo, Haozhe Wang, Hongyu Li, Huicheng Jiang, Jiale Hong, Jialv Zou, Jiamu Li, Jianping Lin, Jiaxing Liu, Jie Yang, Jing Jin, Jun Kuang, Juncheng She, Kunming Luo, Kuofeng Gao, Lin Qiu, Linsen Guo, Mianqiu Huang, Qi Li, Qian Wang, Rumei Li, Siyu Ren, Wei Wang, Wenlong He, Xi Chen, Xiao Liu, Xiaoyu Li, Xu Huang, Xuanyu Zhu, Xuezhi Cao, Yaoming Zhu, Yifei Cao, Yimeng Jia, Yizhen Jiang, Yufei Gao, Zeyang Hu, Zhenlong Yuan, Zijian Zhang, Ziwen Wang
cs.AI

Abstract

The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
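To make the core idea concrete, below is a minimal sketch of what a shared discrete token space with a single NTP objective could look like. It is an illustration only, not LongCat-Next's actual architecture or tokenizer: the vocabulary sizes, the tiny transformer, and the interleaved text/image/audio layout are all assumptions chosen for the example, and dNaViT's hierarchical tokenization is abstracted away into precomputed code IDs.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary layout: text, visual, and audio tokens share one
# discrete ID space, so one embedding table and one NTP loss cover all
# modalities. Sizes are illustrative, not LongCat-Next's actual config.
TEXT_VOCAB, VISION_CODES, AUDIO_CODES = 32000, 8192, 4096
VOCAB = TEXT_VOCAB + VISION_CODES + AUDIO_CODES

class TinyUnifiedLM(nn.Module):
    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):
        # A causal mask (True = blocked) enforces next-token prediction.
        T = ids.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.blocks(self.embed(ids), mask=mask)
        return self.head(h)

# An interleaved sequence: text IDs, then image code IDs offset into the
# shared space, then audio code IDs -- one stream, one objective.
ids = torch.cat([
    torch.randint(0, TEXT_VOCAB, (1, 16)),                          # text
    torch.randint(TEXT_VOCAB, TEXT_VOCAB + VISION_CODES, (1, 64)),  # image codes
    torch.randint(TEXT_VOCAB + VISION_CODES, VOCAB, (1, 32)),       # audio codes
], dim=1)

model = TinyUnifiedLM()
logits = model(ids)
# A single cross-entropy NTP loss, regardless of which modality each token encodes.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB), ids[:, 1:].reshape(-1)
)
print(loss.item())
```

The point of the sketch is the absence of modality-specific machinery: once every modality is lexicalized into the shared discrete vocabulary, the model, the mask, and the loss are identical for text, vision, and audio.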