

LongCat-Next: Lexicalizing Modalities as Discrete Tokens

March 29, 2026
作者: Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan Bai, Yan Feng, Yanjie Li, Yao Qiu, Yerui Sun, Yifan Lu, Ying Luo, Yipeng Mei, Yitian Chen, Yuchen Xie, Yufang Liu, Yufei Chen, Yulei Qian, Yuqi Peng, Zhihang Yu, Zhixiong Han, Changran Wang, Chen Chen, Dian Zheng, Fengjiao Chen, Ge Yang, Haowei Guo, Haozhe Wang, Hongyu Li, Huicheng Jiang, Jiale Hong, Jialv Zou, Jiamu Li, Jianping Lin, Jiaxing Liu, Jie Yang, Jing Jin, Jun Kuang, Juncheng She, Kunming Luo, Kuofeng Gao, Lin Qiu, Linsen Guo, Mianqiu Huang, Qi Li, Qian Wang, Rumei Li, Siyu Ren, Wei Wang, Wenlong He, Xi Chen, Xiao Liu, Xiaoyu Li, Xu Huang, Xuanyu Zhu, Xuezhi Cao, Yaoming Zhu, Yifei Cao, Yimeng Jia, Yizhen Jiang, Yufei Gao, Zeyang Hu, Zhenlong Yuan, Zijian Zhang, Ziwen Wang
cs.AI

Abstract

The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
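To make the core idea concrete, below is a minimal sketch of what a shared discrete token space with a single NTP objective could look like. It is an illustration only, not LongCat-Next's actual architecture or tokenizer: the vocabulary sizes, the tiny transformer, and the interleaved text/image/audio layout are all assumptions chosen for the example, and dNaViT's hierarchical tokenization is abstracted away into precomputed code IDs.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary layout: text, visual, and audio tokens share one
# discrete ID space, so one embedding table and one NTP loss cover all
# modalities. Sizes are illustrative, not LongCat-Next's actual config.
TEXT_VOCAB, VISION_CODES, AUDIO_CODES = 32000, 8192, 4096
VOCAB = TEXT_VOCAB + VISION_CODES + AUDIO_CODES

class TinyUnifiedLM(nn.Module):
    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):
        # A causal mask (True = blocked) enforces next-token prediction.
        T = ids.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.blocks(self.embed(ids), mask=mask)
        return self.head(h)

# An interleaved sequence: text IDs, then image code IDs offset into the
# shared space, then audio code IDs -- one stream, one objective.
ids = torch.cat([
    torch.randint(0, TEXT_VOCAB, (1, 16)),                          # text
    torch.randint(TEXT_VOCAB, TEXT_VOCAB + VISION_CODES, (1, 64)),  # image codes
    torch.randint(TEXT_VOCAB + VISION_CODES, VOCAB, (1, 32)),       # audio codes
], dim=1)

model = TinyUnifiedLM()
logits = model(ids)
# A single cross-entropy NTP loss, regardless of which modality each token encodes.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB), ids[:, 1:].reshape(-1)
)
print(loss.item())
```

The point of the sketch is the absence of modality-specific machinery: once every modality is lexicalized into the shared discrete vocabulary, the model, the mask, and the loss are identical for text, vision, and audio.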