MaineCoon: Toward a Real-Time Audio-Visual Social World Model

Abstract

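As most of the world's audio-visual content is consumed on social platforms for interactive social purposes, video generation models designed for social worlds are important, yet past research has generally overlooked them. This work defines the position of social world models and builds a prototype model as a first step toward that goal. Previous world models have successfully simulated physical environments or the exploration of game worlds, but they remain fundamentally disconnected from human-centered social dynamics. To fill this gap, we propose MaineCoon, the first real-time audio-visual autoregressive model. It has 22 billion parameters, supports real-time streaming generation and sub-second interaction, and reaches a record-breaking frame rate of up to 47.5 FPS on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model optimized specifically for social-interactive applications. To achieve efficient and stable training, we introduce several new techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework, which supports generation at the thousand-second scale or longer and mitigates drift through agentic cache management and prompt planning. These innovations significantly accelerate training while also optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art performance benchmark for high-quality, low-latency, long-horizon audio-visual autoregressive models, but also points to the paradigm shift needed for the next generation of AI-native social platforms.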

English

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important, yet previous studies have largely overlooked them. In this work, we define the position of social world models and build a prototype model as a first step towards this goal. While previous world models successfully simulate physical environments or game-world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap, we present MaineCoon, the first real-time audio-visual autoregressive model. It has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework, which supports generation at the thousand-second scale or longer while mitigating drift through agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points to the paradigm shift needed for next-generation AI-native social platforms.