Tmax: A Simple Recipe for Terminal Agents

Abstract

Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little academic work has examined RL-based training of these models, likely because of difficult benchmarks, scarce data, and a lack of simple baseline recipes. We present Tmax, the strongest open RL recipe for terminal agents to date, which brings open data recipes closer to the frontier. While simple, our recipe achieves 27% on Terminal-Bench 2.0 with only 9B parameters, outperforming much larger models from prior work. Concretely, we generate data using a novel taxonomy that combines difficulty control, personas, and verifier diversification, allowing us to cheaply generate large numbers of terminal environments for RL and SFT training. We open-source our terminal dataset, which is over 2.5x larger than previously released terminal-agent datasets. We then train open-weight models with RL on our data, using a simple, outcome-only recipe. We release our data, models, and code as a strong baseline for future open academic work on terminal agents at https://github.com/hamishivi/tmax.