LongCat-Next: 양상을 이산 토큰으로 어휘화하기

초록

현재 널리 사용되는 다음 토큰 예측(NTP) 패러다임은 이산 자기회귀 모델링을 통해 대규모 언어 모델의 성공을 견인해왔습니다. 그러나 현대의 다중모달 시스템은 여전히 언어 중심적이며, 비언어적 모달리티를 외부 부가물로 취급함으로써 파편화된 아키텍처와 최적화되지 않은 통합을 초래하는 경우가 많습니다. 이러한 한계를 극복하기 위해 우리는 다중모달 정보를 공유된 이산 공간 내에서 표현함으로써 모달리티 간 일관되고 원칙적인 자기회귀 모델링을 가능하게 하는 통합 프레임워크인 Discrete Native Autoregressive(DiNA)를 소개합니다. 핵심 혁신은 임의의 해상도에서 토큰화와 역토큰화를 수행하여 연속적인 시각 신호를 계층적 이산 토큰으로 변환하는 Discrete Native Any-resolution Visual Transformer(dNaViT)입니다. 이를 기반으로 우리는 최소한의 모달리티 특화 설계로 텍스트, 시각, 오디오를 단일 자기회귀 목표 하에 처리하는 네이티브 다중모달 모델인 LongCat-Next를 개발했습니다. 산업 수준의 파운데이션 모델로서 이는 단일 프레임워크 내에서 보기, 그리기, 말하기 작업에서 탁월한 성능을 발휘하며 다양한 다중모달 벤치마크에서 강력한 성과를 기록합니다. 특히 LongCat-Next는 이해 작업에서 이산 시각 모델링의 오랜 성능 한계를 해결하고, 이해와 생성 간의 갈등을 효과적으로 조화시키는 통합 접근법을 제공합니다. 네이티브 다중모달리티를 위한 시도로, 우리는 LongCat-Next와 해당 토크나이저를 오픈소스로 공개하여 커뮤니티의 추가 연구 개발을 촉진하고자 합니다. GitHub: https://github.com/meituan-longcat/LongCat-Next

English

The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling a consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source the LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next

LongCat-Next: 양상을 이산 토큰으로 어휘화하기

LongCat-Next: Lexicalizing Modalities as Discrete Tokens

초록

Support