Self-Improving LLM Agents at Test-Time
October 9, 2025
Authors: Emre Can Acikgoz, Cheng Qian, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur
cs.AI
Abstract
One paradigm of language model (LM) fine-tuning relies on creating large
training datasets, under the assumption that high quantity and diversity will
enable models to generalize to novel tasks after post-training. In practice,
gathering large sets of data is inefficient, and training on them is
prohibitively expensive; worse, there is no guarantee that the resulting model
will handle complex scenarios or generalize better. Moreover, existing
techniques rarely assess whether a training sample provides novel information
or is redundant with the knowledge already acquired by the model, resulting in
unnecessary costs. In this work, we explore a new test-time self-improvement
method to create more effective and generalizable agentic LMs on the fly. The
proposed algorithm can be summarized in three steps: (i) it first identifies
the samples that the model struggles with (self-awareness), (ii) it then generates
similar examples from the detected uncertain samples (self-data augmentation), and
(iii) it uses these newly generated samples for test-time fine-tuning
(self-improvement). We study two variants of this approach: Test-Time
Self-Improvement (TT-SI), where the same model generates additional training
examples from its own uncertain cases and then learns from them, and Test-Time
Distillation (TT-D), where a stronger model generates similar examples for the
uncertain cases, enabling the student model to adapt using distilled
supervision. Empirical evaluations across different agent benchmarks
demonstrate that TT-SI improves performance by +5.48% absolute accuracy on
average across all benchmarks and surpasses other standard learning methods
while using 68x fewer training samples. Our findings highlight the
promise of TT-SI, demonstrating the potential of self-improvement algorithms at
test time as a new paradigm for building more capable, self-evolving agents.
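
The abstract describes the algorithm only at a high level, so the sketch below is a minimal illustration of the three-step loop it names, not the paper's implementation. Every name in it (tt_self_improve, uncertainty_score, augment, fine_tune, the 0.5 threshold, num_augmentations) is a hypothetical placeholder standing in for whatever uncertainty estimator, data generator, and test-time fine-tuning routine the paper actually uses.

```python
from typing import Callable, List, Tuple

def tt_self_improve(
    test_inputs: List[str],
    uncertainty_score: Callable[[str], float],              # step (i)
    augment: Callable[[str, int], List[Tuple[str, str]]],   # step (ii)
    fine_tune: Callable[[List[Tuple[str, str]]], None],     # step (iii)
    threshold: float = 0.5,          # hypothetical uncertainty cutoff
    num_augmentations: int = 4,      # hypothetical samples per uncertain case
) -> List[str]:
    """One pass of the three-step test-time self-improvement loop (sketch)."""
    # (i) Self-awareness: keep only the test inputs the model is uncertain about.
    uncertain = [x for x in test_inputs if uncertainty_score(x) > threshold]

    # (ii) Self-data augmentation: synthesize similar (input, target) pairs for
    # each uncertain case; `augment` may call the model itself (TT-SI) or a
    # stronger teacher model (TT-D).
    synthetic: List[Tuple[str, str]] = []
    for x in uncertain:
        synthetic.extend(augment(x, num_augmentations))

    # (iii) Self-improvement: fine-tune on the synthetic samples at test time,
    # before answering the uncertain inputs.
    if synthetic:
        fine_tune(synthetic)
    return uncertain
```

In this sketch, moving from the TT-SI variant to TT-D amounts to swapping the `augment` callable from one backed by the model itself to one backed by a stronger teacher model; the rest of the loop is unchanged.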