Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Abstract

LLM agents are increasingly deployed as systems built around editable external harnesses (prompts, skills, memories, and tools) that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base task-solving capability predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which models actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful, persistent harness updates from execution evidence; and (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis yields two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains, and even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier models. We trace the low gains at the weak tier to two failure modes: weak-tier models may fail to activate the relevant harness artifacts, or may activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than in the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.
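One way to picture the distinction between the two capabilities is as the two marginals of a gain table indexed by (evolver, task-solving agent) pairs. The sketch below is an illustrative assumption, not the evaluation protocol of the paper or its repository: the function name `marginal_gains`, the dictionary layout, and the use of a simple mean are all hypothetical choices. It assumes a gain is the score with an updated harness minus the score of the same solver with the original harness.

```python
from statistics import mean


def marginal_gains(base, evolved):
    """Split harness gains into a per-evolver and a per-solver view.

    base:    dict mapping solver name -> score with the original harness.
    evolved: dict mapping (evolver name, solver name) -> score obtained by
             that solver with the harness updated by that evolver.
    Both layouts are hypothetical; scores can be in any consistent unit.
    """
    # Gain of each pair relative to the same solver's original-harness score.
    gains = {(e, s): score - base[s] for (e, s), score in evolved.items()}
    evolvers = sorted({e for e, _ in gains})
    solvers = sorted({s for _, s in gains})

    # Harness-updating view: average gain that one evolver's updates
    # produce across the solvers that used them.
    updating = {
        e: mean(gains[(e, s)] for s in solvers if (e, s) in gains)
        for e in evolvers
    }
    # Harness-benefit view: average gain that one solver obtains
    # across the evolvers whose updates it used.
    benefit = {
        s: mean(gains[(e, s)] for e in evolvers if (e, s) in gains)
        for s in solvers
    }
    return updating, benefit
```

Under this reading, a flat harness-updating trend would show up as similar values across the keys of `updating`, while a non-monotonic harness-benefit trend would show up as values in `benefit` that peak at a mid-tier solver.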