Self-Improving LLM Agents at Test-Time
October 9, 2025
Authors: Emre Can Acikgoz, Cheng Qian, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur
cs.AI
Abstract
One paradigm of language model (LM) fine-tuning relies on creating large
training datasets, under the assumption that high quantity and diversity will
enable models to generalize to novel tasks after post-training. In practice,
gathering large sets of data is inefficient, and training on them is
prohibitively expensive; worse, there is no guarantee that the resulting model
will handle complex scenarios or generalize better. Moreover, existing
techniques rarely assess whether a training sample provides novel information
or is redundant with the knowledge already acquired by the model, resulting in
unnecessary costs. In this work, we explore a new test-time self-improvement
method to create more effective and generalizable agentic LMs on-the-fly. The
proposed algorithm can be summarized in three steps: (i) it first identifies
the samples that the model struggles with (self-awareness), (ii) it then
generates similar examples from the detected uncertain samples (self-data
augmentation), and (iii) it uses these newly generated samples for test-time fine-tuning
(self-improvement). We study two variants of this approach: Test-Time
Self-Improvement (TT-SI), where the same model generates additional training
examples from its own uncertain cases and then learns from them, and Test-Time
Distillation (TT-D), where a stronger model generates similar examples for the
uncertain cases, enabling the student model to adapt using
distilled supervision. Empirical evaluations across different agent benchmarks
demonstrate that TT-SI improves performance by +5.48% absolute accuracy on
average across all benchmarks and surpasses other standard learning
methods while using 68x fewer training samples. Our findings highlight the
promise of TT-SI, demonstrating the potential of self-improvement algorithms at
test time as a new paradigm for building more capable agents toward
self-evolution.
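
Below is a minimal sketch of the three-step loop described in the abstract, included only as an illustration. The `AgentModel` interface, the uncertainty threshold `tau`, the augmentation count `k`, and the suggestions of mean token entropy and lightweight adapter updates are assumptions for this sketch, not details taken from the paper.

```python
# Illustrative sketch of test-time self-improvement (TT-SI).
# All names and thresholds here are hypothetical, not the paper's actual interface.
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass
class Sample:
    prompt: str
    answer: Optional[str] = None


class AgentModel(Protocol):
    """Hypothetical interface for the agentic LM being adapted."""

    def uncertainty(self, sample: Sample) -> float:
        """Return an uncertainty score for the sample (e.g., mean token entropy)."""
        ...

    def generate_similar(self, sample: Sample, k: int) -> List[Sample]:
        """Self-data augmentation: synthesize k examples similar to the uncertain one."""
        ...

    def fine_tune(self, samples: List[Sample]) -> None:
        """Test-time fine-tuning: a few gradient updates, e.g., on a lightweight adapter."""
        ...

    def solve(self, sample: Sample) -> str:
        """Produce the final answer for the test sample."""
        ...


def test_time_self_improve(model: AgentModel, test_sample: Sample,
                           tau: float = 0.5, k: int = 8) -> str:
    """Run one TT-SI episode: detect uncertainty, augment, adapt, then answer."""
    # (i) Self-awareness: only adapt when the model is uncertain about this sample.
    if model.uncertainty(test_sample) > tau:
        # (ii) Self-data augmentation: the same model generates k similar examples.
        synthetic = model.generate_similar(test_sample, k=k)
        # (iii) Self-improvement: fine-tune on the generated examples at test time.
        model.fine_tune(synthetic)
    return model.solve(test_sample)
```

In this sketch, replacing `generate_similar` with a call to a stronger teacher model would correspond to the TT-D variant, where the student adapts on distilled supervision instead of its own generations.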