MaineCoon: Pursuing a Real-Time Audio-Visual Social World Model

Abstract

As a growing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important, yet previous studies have largely overlooked them. In this work, we define the position of social world models and build a prototype model as a first step towards this goal. While previous world models successfully simulate physical environments or the exploration of game worlds, they remain fundamentally detached from human-centric social dynamics. To bridge this gap, we present MaineCoon, the first real-time audio-visual autoregressive model with 22B parameters that is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework, which supports generation at the scale of thousands of seconds or longer while mitigating drift through agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state of the art (SOTA) for high-quality, low-latency, long-horizon audio-visual autoregressive models, but also points to the paradigm shift needed for next-generation AI-native social platforms.
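To put the reported figures in perspective, the short Python sketch below works out what a frame rate of 47.5 FPS implies for the per-frame time budget and for the number of frames in a thousand-second stream. It is only illustrative arithmetic: the helper names are hypothetical, and it assumes the peak rate is sustained for the whole stream, whereas the abstract states 47.5 FPS only as an upper bound.

```python
def frame_budget_ms(fps: float) -> float:
    """Wall-clock time available per frame, in milliseconds, to sustain `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def frames_for_duration(seconds: float, fps: float) -> int:
    """Number of frames produced when streaming for `seconds` at a constant `fps`."""
    return round(seconds * fps)


# Peak frame rate reported in the abstract; treating it as sustained is an assumption.
PEAK_FPS = 47.5

budget = frame_budget_ms(PEAK_FPS)
frames = frames_for_duration(1000, PEAK_FPS)
print(f"{budget:.2f} ms per frame, {frames} frames in 1000 s")
# prints "21.05 ms per frame, 47500 frames in 1000 s"
```

Under that assumption, each frame together with its audio must be produced in roughly 21 ms, and a thousand-second session spans about 47,500 frames, which suggests why drift mitigation matters at this horizon.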