ECHO: Terminal Agents Learn World Models for Free

Abstract

CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream (stdout, errors, files, logs, and traces) records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring the environment responses already present in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict the environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO roughly doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy, while GRPO alone barely changes it. Starting from base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen out-of-distribution (OOD) tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.
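
The hybrid objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `aux_weight` coefficient and its default value, the unclipped policy-gradient surrogate, and the group-standardized advantages are all assumptions made for illustration. The sketch takes per-token log-probabilities from a single forward pass, applies an advantage-weighted loss to action tokens, and applies a plain cross-entropy (negative log-likelihood) loss to environment observation tokens.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style advantages: standardize outcome rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        # All rollouts got the same reward (e.g. all failed): no policy-gradient signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def hybrid_loss(rollouts, rewards, aux_weight=0.1):
    """Combine a policy-gradient term with an environment-token cross-entropy term.

    rollouts: one list per rollout of (logprob, is_action) pairs, where logprob is the
              policy's log-probability of the token that actually appeared, and
              is_action marks model-emitted tokens (False = terminal output).
    rewards:  one sparse outcome reward per rollout.
    aux_weight: hypothetical mixing coefficient for the auxiliary loss.
    Returns (total_loss, pg_loss, env_ce).
    """
    advantages = group_advantages(rewards)
    pg_terms, ce_terms = [], []
    for tokens, adv in zip(rollouts, advantages):
        for logprob, is_action in tokens:
            if is_action:
                # Simplified surrogate: no ratio clipping or KL penalty.
                pg_terms.append(-adv * logprob)
            else:
                # Negative log-likelihood of the observed environment token.
                ce_terms.append(-logprob)
    pg_loss = mean(pg_terms) if pg_terms else 0.0
    env_ce = mean(ce_terms) if ce_terms else 0.0
    return pg_loss + aux_weight * env_ce, pg_loss, env_ce


# Two failed rollouts: the policy-gradient term vanishes, the environment term does not.
failed_group = [
    [(-0.5, True), (-2.0, False)],
    [(-1.0, True), (-1.0, False)],
]
total, pg_loss, env_ce = hybrid_loss(failed_group, rewards=[0.0, 0.0])
print(total, pg_loss, env_ce)
```

In this toy group every rollout fails, so the standardized advantages are all zero and the action-token term contributes nothing, while the environment-token term still yields a nonzero loss. This is the sense in which terminal feedback provides supervision for all rollouts.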