ECHO: Terminal Agents Learn World Models for Free

Abstract

CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream (stdout, errors, files, logs, and traces) records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring the environment responses already in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict the environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO roughly doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy, while GRPO alone barely changes it. Starting from base Qwen3-8B, ECHO matches the performance of expert SFT followed by GRPO on held-out terminal tasks without using expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen out-of-distribution (OOD) tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.
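
To make the shape of the hybrid objective concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a simplified REINFORCE-style surrogate in place of the full GRPO loss (no clipping, importance ratios, or KL term), per-token averaging, and a hypothetical weighting coefficient `env_weight`; none of these details are stated in the abstract. The names `Rollout`, `group_advantages`, and `hybrid_loss` are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rollout:
    # Log-probability the policy assigns to each token of the rollout,
    # all obtained from one forward pass over the full trajectory.
    logprobs: List[float]
    # True for tokens the policy emitted (commands),
    # False for tokens returned by the terminal (observations).
    is_action: List[bool]
    # Group-relative advantage derived from the outcome-level reward.
    advantage: float


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize outcome rewards within a group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def hybrid_loss(
    rollouts: List[Rollout], env_weight: float = 0.1
) -> Tuple[float, float, float]:
    """Return (total, policy_gradient_term, environment_cross_entropy_term).

    env_weight is a hypothetical coefficient chosen for illustration.
    """
    pg_sum, n_action = 0.0, 0
    ce_sum, n_env = 0.0, 0
    for rollout in rollouts:
        for logprob, is_action in zip(rollout.logprobs, rollout.is_action):
            if is_action:
                # Simplified policy-gradient surrogate on action tokens.
                pg_sum += -rollout.advantage * logprob
                n_action += 1
            else:
                # Cross-entropy (negative log-likelihood) on observation tokens.
                ce_sum += -logprob
                n_env += 1
    pg = pg_sum / max(n_action, 1)
    ce = ce_sum / max(n_env, 1)
    return pg + env_weight * ce, pg, ce


# A group in which every rollout failed: all rewards are equal,
# so every advantage is zero.
rewards = [0.0, 0.0]
advantages = group_advantages(rewards)
failed_group = [
    Rollout(logprobs=[-1.0, -2.0, -0.5], is_action=[True, False, False], advantage=a)
    for a in advantages
]
total, pg_term, env_term = hybrid_loss(failed_group)
print(total, pg_term, env_term)
```

In this toy group every rollout fails, so the policy-gradient term is zero, while the environment term is still non-zero. This is the sense in which terminal output can supply a training signal even when the outcome reward carries none.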