ECHO: Terminal Agents Learn World Models at No Extra Cost

Abstract

CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream (stdout, errors, files, logs, and traces) records the consequences. We argue that this stream is a supervision signal that standard agent RL discards: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring the environment responses already in the rollout. Failed rollouts therefore provide little policy-gradient signal, even though they contain rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict the environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO roughly doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy, while GRPO alone barely changes it. Starting from base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen out-of-distribution (OOD) tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.
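The hybrid objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the mixing weight, the normalization, or the exact GRPO variant, so `aux_weight`, the per-token averaging, and the function names `group_advantages` and `echo_loss` are all assumptions made for illustration. Clipping and KL terms of a full GRPO loss are omitted. The sketch only computes loss values from per-token log-probabilities that a single forward pass would already produce.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize outcome rewards within one group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def echo_loss(logprobs, advantages, action_mask, obs_mask, aux_weight=0.1):
    """Hybrid loss sketch: policy-gradient term on action tokens plus
    cross-entropy on environment observation tokens.

    logprobs:    (batch, seq) log-probability of each realized token under the policy
    advantages:  (batch,) one outcome-level advantage per rollout
    action_mask: (batch, seq) 1 where the token was emitted by the policy
    obs_mask:    (batch, seq) 1 where the token was returned by the terminal
    aux_weight:  hypothetical mixing coefficient (not specified in the abstract)
    """
    logprobs = np.asarray(logprobs, dtype=float)
    action_mask = np.asarray(action_mask, dtype=float)
    obs_mask = np.asarray(obs_mask, dtype=float)
    adv = np.asarray(advantages, dtype=float)[:, None]  # broadcast over the sequence axis

    # Guard against division by zero when a batch has no tokens of one kind.
    n_act = max(action_mask.sum(), 1.0)
    n_obs = max(obs_mask.sum(), 1.0)

    # Policy-gradient surrogate, restricted to action tokens.
    pg_loss = -(adv * logprobs * action_mask).sum() / n_act
    # Auxiliary term: negative log-likelihood of observation tokens.
    env_ce = -(logprobs * obs_mask).sum() / n_obs

    return pg_loss + aux_weight * env_ce, pg_loss, env_ce


# Toy group of two rollouts, four tokens each (values are made up).
logprobs = [[-1.0, -2.0, -0.5, -3.0],
            [-0.5, -1.0, -1.5, -2.0]]
action_mask = [[1, 1, 0, 0],
               [1, 0, 0, 1]]
obs_mask = [[0, 0, 1, 1],
            [0, 1, 1, 0]]

# Both rollouts fail: advantages are zero, so the policy-gradient term vanishes,
# while the observation cross-entropy still provides a training signal.
total, pg, ce = echo_loss(logprobs, group_advantages([0.0, 0.0]), action_mask, obs_mask)
```

In the all-failure group, the policy-gradient term is zero and the only contribution to the loss comes from the observation tokens, which is the situation the abstract highlights. Both terms read from the same `logprobs` array, mirroring the claim that no extra forward pass or rollout is needed.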