

On Data Engineering for Scaling LLM Terminal Capabilities

February 24, 2026
Authors: Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping
cs.AI

Abstract

Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long-context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B from 4.0% to 20.2%, and Nemotron-Terminal-32B from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.
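The abstract names two task-construction modes but does not describe their mechanics. As a rough illustration only, the Python sketch below shows one plausible shape for them: seed-based generation mutates an existing verified task, while skill-based generation instantiates fresh tasks from a skill taxonomy. Every identifier here (`TerminalTask`, `seed_based`, `skill_based`, `SKILLS`) and the task schema are assumptions made for illustration, not the paper's Terminal-Task-Gen implementation.

```python
# Hypothetical sketch (not the authors' code) of the two task-construction
# modes the abstract attributes to Terminal-Task-Gen. A terminal task here is
# an instruction, sandbox setup commands, and a machine-checkable success test.
import random
from dataclasses import dataclass

@dataclass
class TerminalTask:
    instruction: str        # natural-language goal shown to the agent
    setup_cmds: list[str]   # shell commands that prepare the sandbox
    check_cmd: str          # command whose zero exit status marks success

def seed_based(seed: TerminalTask, rng: random.Random) -> TerminalTask:
    """Seed-based mode: perturb an existing verified task, e.g. rename its target."""
    new_name = f"report_{rng.randint(0, 999)}.txt"
    sub = lambda s: s.replace("report.txt", new_name)
    return TerminalTask(sub(seed.instruction),
                        [sub(c) for c in seed.setup_cmds],
                        sub(seed.check_cmd))

# Toy skill taxonomy: skill name -> (goal, setup commands, success check).
SKILLS = {
    "grep": ("write every line of app.log containing 'ERROR' to errors.txt",
             ["printf 'ok\\nERROR boom\\n' > app.log"],
             "test -s errors.txt"),
    "archive": ("compress the logs/ directory into logs.tar.gz",
                ["mkdir -p logs", "touch logs/a.log"],
                "test -f logs.tar.gz"),
}

def skill_based(rng: random.Random) -> TerminalTask:
    """Skill-based mode: sample a skill and instantiate a task around it."""
    goal, setup, check = SKILLS[rng.choice(sorted(SKILLS))]
    return TerminalTask(f"In the sandbox, {goal}.", setup, check)

if __name__ == "__main__":
    rng = random.Random(0)
    seed = TerminalTask("Summarize disk usage into report.txt.",
                        ["df -h > /tmp/df.out"],
                        "test -s report.txt")
    print(seed_based(seed, rng))
    print(skill_based(rng))
```

A production pipeline would presumably use an LLM to propose instructions, setup scripts, and verifiers, then filter out tasks whose checks cannot be satisfied; the sketch only fixes the data model that makes each generated task automatically gradable.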