Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents

Abstract

We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including operating systems, the web, and simulation games. Game-TARS is pre-trained on over 500B tokens of diverse trajectories and multimodal data. Key techniques include a decaying continual loss that reduces causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth against inference cost. Experiments show that Game-TARS achieves about twice the success rate of the previous state-of-the-art model on open-world Minecraft tasks, approaches the generality of novice humans in unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet on FPS benchmarks. Training-time and test-time scaling results confirm that the unified action space sustains improvements when scaled to cross-game and multimodal data. Our results demonstrate that simple, scalable action representations combined with large-scale pre-training offer a promising path toward generalist agents with broad computer-use abilities.
