CLI-Universe: A Verifiable Task Synthesis Engine for Terminal Agents

Abstract

While recent LLM-based terminal agents have demonstrated promising capabilities, the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, which frequently yields ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To address this, we introduce CLI-Universe, a principled synthesis engine for constructing terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated as Dockerized environments and passed through a multi-stage executable verification pipeline that features rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, leaving only tasks that are genuine, verifiable, and non-trivially challenging. To validate the framework, we instantiate CLI-Universe-6K, a highly distilled dataset of 6,000 trajectories. Qwen3-32B fine-tuned on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0. This sets a new state of the art among models with at most 32B parameters trained on open-source data, and it outperforms several models an order of magnitude larger, demonstrating the data efficiency of structured, high-fidelity synthesis.
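To make two of the steps above concrete, here is a minimal sketch of taxonomy sampling and a fail-to-pass check. It is an illustration under stated assumptions, not the authors' implementation: the abstract names only the four taxonomy dimensions, so the values in `TAXONOMY`, the function names `sample_candidates` and `is_fail_to_pass`, and the representation of an environment state as a set of file names are all hypothetical.

```python
import itertools
import random

# Hypothetical taxonomy values; only the four dimension names come from the abstract.
TAXONOMY = {
    "domain": ["networking", "data-processing"],
    "skill_type": ["debugging", "configuration"],
    "capability": ["log-analysis", "dependency-resolution"],
    "engineering_pillar": ["reliability", "security"],
}


def sample_candidates(taxonomy, k, seed=0):
    """Sample k distinct combinations, taking one value per taxonomy dimension."""
    dims = list(taxonomy)
    combos = list(itertools.product(*(taxonomy[d] for d in dims)))
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [dict(zip(dims, combo)) for combo in rng.sample(combos, k)]


def is_fail_to_pass(run_tests, initial_state, solved_state):
    """Keep a task only if its tests fail before the fix and pass after it."""
    return (not run_tests(initial_state)) and run_tests(solved_state)


if __name__ == "__main__":
    for candidate in sample_candidates(TAXONOMY, k=4):
        print(candidate)

    # Toy test (assumption): the task counts as solved once app.conf exists.
    def toy_test(state):
        return "app.conf" in state

    print(is_fail_to_pass(toy_test, {"README"}, {"README", "app.conf"}))
```

In this sketch, a task whose tests already pass in the initial state, or still fail after the reference solution is applied, is rejected. This is the general idea behind fail-to-pass filtering; the actual pipeline runs its checks inside Dockerized environments.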