LongCat-Next: モダリティを離散トークンとして語彙化する

要旨

主流のNext-Token Prediction（NTP）パラダイムは、離散的な自己回帰モデリングを通じて大規模言語モデルの成功を牽引してきました。しかし、現在のマルチモーダルシステムは依然として言語中心であり、非言語モダリティを外部付属物として扱うことが多く、断片化されたアーキテクチャと最適ではない統合を招いています。この制限を超えるため、我々はDiscrete Native Autoregressive（DiNA）を提案します。これはマルチモーダル情報を共有離散空間内で表現し、モダリティ間で一貫した原理的な自己回帰モデリングを可能にする統一フレームワークです。中核となる革新が、任意解像度でトークン化とデトークン化を実行するDiscrete Native Any-resolution Visual Transformer（dNaViT）です。これは連続的な視覚信号を階層的な離散トークンに変換します。この基盤に立脚して、我々はLongCat-Nextを開発しました。これはテキスト、視覚、音声を単一の自己回帰目的の下で処理し、モダリティ特有の設計を最小化したネイティブマルチモーダルモデルです。産業強度の基盤モデルとして、単一フレームワーク内で「見る」「描く」「話す」能力に優れ、多様なマルチモーダルベンチマークで強力な性能を達成します。特にLongCat-Nextは、理解タスクにおける離散視覚モデリングの長年の性能限界に挑み、理解と生成の対立を効果的に調和させる統一的なアプローチを提供します。ネイティブマルチモーダリティへの試みとして、LongCat-Nextとそのトークナイザーをオープンソース化し、コミュニティのさらなる研究発展を促進することを期待します。GitHub: https://github.com/meituan-longcat/LongCat-Next

English

The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling a consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source the LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next

LongCat-Next: モダリティを離散トークンとして語彙化する

LongCat-Next: Lexicalizing Modalities as Discrete Tokens

要旨

Support