MaineCoon: Towards Real-Time Audio-Visual Social World Models

Abstract

As a growing share of global video content is consumed on social platforms for social interaction, video generation models built for social worlds are important, yet they have been largely overlooked by previous studies. In this work, we define the role of social world models and build a prototype model as a first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap, we present MaineCoon, the first real-time audio-visual autoregressive model. It has 22B parameters and supports real-time streaming generation and sub-second interaction, reaching a record frame rate of up to 47.5 FPS on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework, which supports generation at the thousand-second scale or longer while mitigating drift through agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points to the paradigm shift needed for next-generation AI-native social platforms.