Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Abstract

World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single control signal. However, many generated environments require multi-agent interaction: multiple players, robots, or embodied agents act simultaneously within a shared space. Extending world models to such settings requires a principled multi-agent design: agents should remain independently controllable and permutation-symmetric, and the model should support efficient inference while maintaining consistency across time and perspectives. In this paper, we present Gamma-World, a generative multi-agent world model for interactive simulation. It introduces Simplex Rotary Agent Encoding, a parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while making all agents permutation-equivalent, enabling scalable agent identity without learned per-slot identities or a fixed agent ordering. To avoid dense all-to-all attention across agents, we further propose Sparse Hub Attention, in which learnable hub tokens mediate token interaction across agents, reducing cross-agent attention cost from quadratic to linear in the number of agents. For real-time rollout, we distill a full-context diffusion teacher into a causal student that generates temporal blocks sequentially with KV caching, enabling action-responsive generation at 24 FPS. Experiments in multiplayer virtual environments show that our model improves video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines, while generalizing from two to four players without additional training.
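To make the regular-simplex idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes one particular construction (centered standard basis vectors rescaled to a common length) and a plain pairwise 2D rotation in the style of RoPE. The function names, the `scale` parameter, and the mapping from simplex coordinates to rotary phases are all hypothetical. The sketch only illustrates the property the abstract relies on: every pair of agents sits at the same distance in phase space, so no agent ordering is privileged.

```python
import math


def simplex_phases(num_agents, scale=1.0):
    """Return one phase vector per agent, placed at regular-simplex vertices.

    Assumed construction (not taken from the paper): take the standard basis
    vectors of R^n, subtract their centroid, and rescale to length `scale`.
    """
    if num_agents < 2:
        raise ValueError("need at least two agents")
    n = num_agents
    norm = math.sqrt((n - 1) / n)  # length of a centered basis vector
    return [
        [scale * ((1.0 if j == i else 0.0) - 1.0 / n) / norm for j in range(n)]
        for i in range(n)
    ]


def rotate_pairs(vec, phases):
    """Rotate each consecutive (even, odd) pair of `vec` by the matching phase."""
    if len(vec) != 2 * len(phases):
        raise ValueError("vec must hold one 2D pair per phase")
    out = []
    for k, theta in enumerate(phases):
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def distance(a, b):
    """Euclidean distance between two phase vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

With four agents, `simplex_phases(4)` yields four unit-length phase vectors whose pairwise distances are all equal. Rotating the same feature vector with two different agents' phases and taking the inner product then gives the same score for every agent pair under this construction, which is one way to read "permutation-equivalent". Because the vertices are generated from the agent count alone, moving from two to four agents needs no new parameters in this sketch.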