MaineCoon: Toward a Real-Time Audio-Visual Social World Model

Abstract

As a growing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds have become important, yet previous studies have largely overlooked them. In this work, we define the position of social world models and build a prototype model as a first step towards this goal. While previous world models successfully simulate physical environments or game-world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap, we present MaineCoon, the first real-time audio-visual autoregressive model with 22B parameters, capable of real-time streaming generation and sub-second interaction on a single GPU at a record-breaking frame rate of up to 47.5 FPS. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework, which supports generation at the thousand-second scale or longer while mitigating drift through agentic cache management and prompt planning. Together, these innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points to the paradigm shift needed for next-generation AI-native social platforms.
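
To put the reported figures in perspective, the following back-of-the-envelope sketch converts the stated peak frame rate into an average per-frame time budget and a frame count for a thousand-second stream. It uses only the 47.5 FPS and thousand-second figures quoted in the abstract. The function names are illustrative and not part of MaineCoon, and the calculation assumes the peak rate is sustained for the whole stream, which the abstract does not claim.

```python
# Illustrative arithmetic only; these helpers are hypothetical and are not
# taken from the MaineCoon codebase.

PEAK_FPS = 47.5  # peak frame rate reported in the abstract


def frame_budget_ms(fps: float) -> float:
    """Average wall-clock time available per generated frame, in milliseconds."""
    return 1000.0 / fps


def frames_for_duration(fps: float, seconds: float) -> int:
    """Number of frames produced if the given rate is sustained for `seconds`."""
    return round(fps * seconds)


if __name__ == "__main__":
    # Assumes the peak rate holds throughout, which is an idealization.
    print(f"{frame_budget_ms(PEAK_FPS):.2f} ms per frame")
    print(frames_for_duration(PEAK_FPS, 1000), "frames in 1000 s")
```

Under these assumptions, each frame has roughly 21 ms of compute on average, and a thousand-second stream corresponds to tens of thousands of frames, which gives a sense of the horizon over which drift has to be controlled.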