Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

Abstract

Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has focused predominantly on single-iteration transfer, we find that under multi-iteration experience learning, existing methods suffer progressive capability collapse rather than compounding improvement. We systematically examine this failure along three key dimensions of experience internalization: (1) Experience Granularity: We find that principle-level experience is more durable than instance-level experience, because it abstracts transferable strategies away from trajectory-specific details. (2) Experience Injection Pattern: Our analysis reveals that step-wise injection significantly outperforms global injection by aligning experience with intermediate decision states, a property that is critical for long-horizon tool use. (3) Internalization Regime: We demonstrate that off-policy context distillation on high-quality teacher trajectories provides a substantially more stable training signal than on-policy context distillation, which is inherently limited to local corrections on flawed, student-induced states. Together, these insights yield a simple yet robust recipe for stable and sustainable experience internalization, providing concrete guidance for engineering self-evolving and continually learning LLMs.
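
As a rough illustration of the injection-pattern distinction in (2), the following Python sketch contrasts two ways of placing experience into an agent's context. It is not the paper's implementation: the `Step` type, the keyword-overlap `select_relevant` heuristic, and the prompt layout are all hypothetical, chosen only to show global injection (the full experience stated up front, regardless of the current state) versus step-wise injection (experience re-selected for each intermediate decision state).

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One intermediate decision state in a tool-use trajectory (hypothetical type)."""
    observation: str


def select_relevant(principles: list[str], observation: str) -> list[str]:
    """Toy relevance filter (assumption): keep principles sharing a word with the observation."""
    obs_words = set(observation.lower().split())
    return [p for p in principles if obs_words & set(p.lower().split())]


def global_injection(task: str, principles: list[str], steps: list[Step]) -> list[str]:
    """Build one prompt per step; the same full experience block precedes the task every time."""
    header = "Experience:\n" + "\n".join(principles)
    return [f"{header}\nTask: {task}\nObservation: {s.observation}" for s in steps]


def stepwise_injection(task: str, principles: list[str], steps: list[Step]) -> list[str]:
    """Build one prompt per step; experience is re-selected for each decision state."""
    prompts = []
    for s in steps:
        relevant = select_relevant(principles, s.observation)
        hint = "\n".join(relevant) if relevant else "(none)"
        prompts.append(
            f"Task: {task}\nObservation: {s.observation}\nExperience for this step:\n{hint}"
        )
    return prompts
```

In this toy setup, the global prompts carry every principle at every step, while the step-wise prompts carry only the principles that match the current observation. A real system would presumably replace the keyword filter with a learned or retrieval-based selector.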