Tmax: A Simple Recipe for Terminal Agents

Abstract

Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite this prevalence, relatively little academic work has examined reinforcement learning (RL) training for these models, likely because benchmarks are difficult, data is scarce, and simple baseline recipes are lacking. We present Tmax, the strongest open RL recipe for terminal agents to date, which brings open data recipes closer to the frontier. Although simple, our recipe achieves 27% on Terminal-Bench 2.0 with only 9B parameters, outperforming much larger models from prior work. Concretely, we generate data using a novel taxonomy that combines difficulty control, personas, and verifier diversification, which lets us cheaply generate large numbers of terminal environments for RL and SFT training. We open-source the resulting terminal dataset, which is over 2.5x larger than previously released terminal-agent datasets. We then train open-weight models on this data with RL, using a simple, outcome-only recipe. We release our data, models, and code at https://github.com/hamishivi/tmax as a strong baseline for future open academic work on terminal agents.