MaineCoon: Toward Real-Time Audio-Visual Social World Models

Abstract

As a growing share of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds have become important, yet previous studies have largely overlooked them. In this work, we define the position of social world models and build a prototype model as a first step toward this goal. Previous world models successfully simulate physical environments or the exploration of game worlds, but they remain fundamentally detached from human-centric social dynamics. To bridge this gap, we present MaineCoon, the first real-time audio-visual autoregressive model: it has 22B parameters, supports real-time streaming generation and sub-second interaction, and reaches a record-breaking frame rate of up to 47.5 FPS on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework, which supports generation at the thousand-second scale or longer while mitigating drift through agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, long-horizon audio-visual autoregressive models, but also points toward the paradigm shift needed for next-generation AI-native social platforms.