Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Abstract

World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single control signal. However, many generated environments require multi-agent interaction, in which multiple players, robots, or embodied agents act simultaneously within a shared space. Scaling world models to such settings requires a principled multi-agent design: agents should be independently controllable and permutation-symmetric, and the model should support efficient inference while maintaining consistency across time and perspectives. In this paper, we present Gamma-World, a generative multi-agent world model for interactive simulation. It introduces Simplex Rotary Agent Encoding, a parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while keeping all agents permutation-equivalent, enabling scalable agent identity without learned per-slot identifiers or a fixed agent ordering. To avoid dense all-to-all attention across agents, we further propose Sparse Hub Attention, in which learnable hub tokens mediate token interactions across agents, reducing the cross-agent attention cost from quadratic to linear in the number of agents. For real-time rollout, we distill a full-context diffusion teacher into a causal student that generates temporal blocks sequentially with KV caching, enabling action-responsive generation at 24 FPS. Experiments in multiplayer virtual environments show that our model improves video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines, while generalizing from two to four players without additional training.
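
The geometric property behind the simplex encoding can be illustrated with a minimal sketch. It covers only the regular-simplex construction, not the encoding itself: the abstract does not specify how the vertices are mapped onto rotary angles. The function names (simplex_vertices, pairwise_dots) are hypothetical and chosen for illustration. The sketch builds N unit vectors whose pairwise inner products are all equal to -1/(N-1), so every agent is distinct and relabeling the agents leaves the geometry unchanged.

```python
import math
from itertools import combinations


def simplex_vertices(num_agents):
    """Return num_agents unit vectors forming a regular simplex centered at the origin.

    Illustrative construction (an assumption, not taken from the paper):
    take the standard basis of R^n, subtract its centroid, and normalize.
    """
    n = num_agents
    if n < 2:
        raise ValueError("need at least two agents")
    # The vector e_i - centroid has squared length 1 - 1/n.
    norm = math.sqrt(1.0 - 1.0 / n)
    return [
        [((1.0 if j == i else 0.0) - 1.0 / n) / norm for j in range(n)]
        for i in range(n)
    ]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_dots(vertices):
    """Inner products of all unordered vertex pairs."""
    return [dot(u, v) for u, v in combinations(vertices, 2)]


if __name__ == "__main__":
    for n in (2, 3, 4):
        dots = pairwise_dots(simplex_vertices(n))
        # Every pair of agents has the same inner product, -1/(n-1),
        # so no agent ordering is privileged.
        assert all(math.isclose(d, -1.0 / (n - 1)) for d in dots)
        print(n, "agents: all pairwise inner products equal")
```

Because all pairs are equidistant, the same construction applies to any number of agents without learned per-agent parameters, which is consistent with the abstract's description of the encoding as parameter-free.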