Tmax: A Simple Recipe for Terminal Agents

Abstract

Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little academic work has examined training these models with reinforcement learning (RL), likely because of difficult benchmarks, a lack of data, and a lack of simple baseline recipes. We present Tmax, the strongest open RL recipe for terminal agents to date, bringing open data recipes closer to the frontier. While simple, our recipe reaches 27% on Terminal-Bench 2.0 with a model of only 9B parameters, outperforming much larger models from prior work. Concretely, we generate data using a novel taxonomy that combines difficulty control, personas, and verifier diversification, which lets us cheaply generate large numbers of terminal environments for RL and supervised fine-tuning (SFT). We open-source our terminal dataset, which is over 2.5x larger than previously released terminal-agent datasets. We then train open-weight models with RL on this data, using a simple, outcome-only recipe. We release our data, models, and code as a strong baseline for future open academic work on terminal agents at https://github.com/hamishivi/tmax.