On Data Engineering for Scaling LLM Terminal Capabilities

February 24, 2026
Authors: Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping
cs.AI

Abstract

Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long-context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset of terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.
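
The abstract names two task-construction modes in Terminal-Task-Gen (seed-based and skill-based) without detailing them. Below is a minimal, hypothetical Python sketch of what such a dual-mode generator could look like; the `TerminalTask` fields, the 50/50 mode mix, and every name (`task_from_seed`, `task_from_skill`, `generate_corpus`) are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch in the spirit of Terminal-Task-Gen.
# All names and structures here are assumptions for illustration only.
import json
import random
from dataclasses import dataclass


@dataclass
class TerminalTask:
    instruction: str      # natural-language task description
    setup_script: str     # shell commands that prepare the environment
    verifier_script: str  # shell commands whose exit code grades success


def task_from_seed(seed_task: TerminalTask, llm) -> TerminalTask:
    """Seed-based construction: mutate an existing task into a related variant."""
    prompt = (
        "Rewrite this terminal task into a new, related task with a "
        "verifiable success condition. Return the same JSON fields:\n"
        + json.dumps(seed_task.__dict__)
    )
    return TerminalTask(**json.loads(llm(prompt)))


def task_from_skill(skill: str, llm) -> TerminalTask:
    """Skill-based construction: synthesize a task targeting a named skill."""
    prompt = (
        f"Create a terminal task that exercises the skill '{skill}'. "
        "Return JSON with fields: instruction, setup_script, verifier_script."
    )
    return TerminalTask(**json.loads(llm(prompt)))


def generate_corpus(seeds, skills, llm, n: int) -> list[TerminalTask]:
    """Mix both modes, then keep only tasks with a non-empty verifier."""
    tasks = []
    for _ in range(n):
        if seeds and random.random() < 0.5:
            tasks.append(task_from_seed(random.choice(seeds), llm))
        else:
            tasks.append(task_from_skill(random.choice(skills), llm))
    return [t for t in tasks if t.verifier_script.strip()]
```

In this reading, seed-based generation mutates existing verified tasks while skill-based generation targets a named capability from a skill list; the final filter is a stand-in for the filtering strategies the abstract mentions, whose actual criteria are described in the paper rather than here.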