Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents
October 27, 2025
Authors: Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, Wanjun Zhong, Zili Li, Yu Wang, Yu Miao, Bo Zhou, Yuanfan Li, Hao Wang, Zhongkai Zhao, Faming Wu, Zhengxuan Jiang, Weihao Tan, Heyuan Yao, Shi Yan, Xiangyang Li, Yitao Liang, Yujia Qin, Guang Shi
cs.AI
Abstract
We present Game-TARS, a generalist game agent trained with a unified,
scalable action space anchored to human-aligned native keyboard-mouse inputs.
Unlike API- or GUI-based approaches, this paradigm enables large-scale
continual pre-training across heterogeneous domains, including OS, web, and
simulation games. Game-TARS is pre-trained on over 500B tokens with diverse
trajectories and multimodal data. Key techniques include a decaying continual
loss to reduce causal confusion and an efficient Sparse-Thinking strategy that
balances reasoning depth and inference cost. Experiments show that Game-TARS
achieves about twice the success rate of the previous state-of-the-art model on
open-world Minecraft tasks, approaches the generalization ability of novice
humans in unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and
Claude-4-Sonnet on FPS benchmarks. Training-time and test-time scaling results
confirm that
the unified action space sustains improvements when scaled to cross-game and
multimodal data. Our results demonstrate that simple, scalable action
representations combined with large-scale pre-training provide a promising path
toward generalist agents with broad computer-use abilities.
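
The abstract names a decaying continual loss but does not give its form. As a rough illustration only, the sketch below assumes one plausible reading: when the same keyboard-mouse action repeats over consecutive steps, each repeat contributes less to the training loss, so the model is not rewarded for simply copying its previous action. The function names, the string encoding of actions, and the decay factor `gamma` are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def decaying_weights(actions: Sequence[str], gamma: float = 0.5) -> List[float]:
    """Return one loss weight per step (hypothetical scheme, not the paper's).

    A step whose action differs from the previous one gets weight 1.0.
    Each consecutive repeat of the same action multiplies the weight by gamma.
    """
    weights: List[float] = []
    prev = object()  # sentinel that never equals a real action
    w = 1.0
    for action in actions:
        if action == prev:
            w *= gamma  # repeated action: decay its contribution
        else:
            w = 1.0     # new action: reset to full weight
        weights.append(w)
        prev = action
    return weights


def weighted_loss(step_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-step losses, normalized by the total weight."""
    if len(step_losses) != len(weights) or not weights:
        raise ValueError("step_losses and weights must be non-empty and equal length")
    total = sum(w * l for w, l in zip(weights, step_losses))
    return total / sum(weights)


# Example: holding "W" for three frames, clicking, then pressing "W" again.
example = decaying_weights(["W", "W", "W", "click", "W"], gamma=0.5)
```

Under this assumed scheme, long runs of an identical action (for example, holding a movement key) are progressively down-weighted, while each change of action always receives full weight.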