Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents

Abstract

We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including OS, web, and simulation games. Game-TARS is pre-trained on over 500B tokens of diverse trajectories and multimodal data. Key techniques include a decaying continual loss that reduces causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth against inference cost. Experiments show that Game-TARS achieves roughly twice the success rate of the previous state-of-the-art model on open-world Minecraft tasks, approaches the generalization of novice human players on unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet on FPS benchmarks. Training-time and test-time scaling results confirm that the unified action space sustains improvements when scaled to cross-game and multimodal data. Our results demonstrate that simple, scalable action representations combined with large-scale pre-training provide a promising path toward generalist agents with broad computer-use abilities.

