Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs
September 27, 2025
Authors: Chenxing Wei, Hong Wang, Ying He, Fei Yu, Yao Shu
cs.AI
Abstract
Large Language Models (LLMs) employ multi-turn interaction as a fundamental
paradigm for completing complex tasks. However, their performance often
degrades in extended interactions, as they are typically trained on static,
single-turn data, which hinders their ability to adapt to real-time user
feedback. To address this limitation, we first propose a new paradigm:
Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), which utilizes
user feedback from the ongoing interaction as a reward signal to estimate a
latent optimal policy aligned with user preferences, then updates a small
subset of parameters to steer the model toward this policy, ultimately enabling
efficient in-conversation self-correction. We then introduce Optimum-Referenced
One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM.
ROSA guides the model parameters toward a theoretical optimal policy in a
single, efficient update step, avoiding costly iterative gradient-based
optimization and minimizing computational overhead. We provide a rigorous
theoretical analysis guaranteeing that ROSA's policy converges to the user's
preferences as the number of interactions increases. Extensive experiments on
challenging benchmarks demonstrate that ROSA achieves significant improvements
in both task effectiveness and efficiency.
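The abstract does not specify the exact update rule, so the following is only a minimal illustrative sketch of the T2PAM idea, not the authors' ROSA implementation. It assumes (hypothetically) that the policy is a softmax over a small set of candidate responses scored by a linear head, that this head plays the role of the "small subset of parameters" adapted at test time, that user feedback arrives as a scalar reward per candidate, and that the "latent optimal policy" is approximated by the standard KL-regularized optimum pi*(y) ∝ pi(y) exp(r(y)/beta). The adapter then takes a single step toward that target rather than running iterative optimization.

```python
# Minimal illustrative sketch of test-time policy adaptation (T2PAM-style).
# NOT the authors' ROSA implementation; all names and the setup are hypothetical:
# a softmax policy over candidate responses, a linear scoring head as the small
# adapted parameter subset, and scalar user-feedback rewards for the current turn.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class OneStepAdapter:
    def __init__(self, dim, beta=1.0, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=dim)  # lightweight adapted parameters
        self.beta = beta                          # KL-regularization strength
        self.lr = lr                              # step size of the single update

    def policy(self, feats):
        # feats: (num_candidates, dim) feature matrix for candidate responses
        return softmax(feats @ self.w)

    def adapt(self, feats, rewards):
        """One update per turn, steering the policy toward the KL-regularized
        optimum pi*(y) ∝ pi(y) * exp(r(y) / beta), used here as the reference
        'latent optimal policy' estimated from user feedback."""
        pi = self.policy(feats)
        target = pi * np.exp(np.asarray(rewards, dtype=float) / self.beta)
        target = target / target.sum()
        # Gradient of the cross-entropy between the target and the current
        # softmax policy w.r.t. the linear weights; a single step, no iterative
        # gradient-based optimization loop.
        grad = feats.T @ (pi - target)
        self.w -= self.lr * grad
        return self.policy(feats)

# Toy usage: three candidate responses; user feedback favors the third one.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
rewards = [0.0, 0.2, 1.0]  # hypothetical scalar feedback from the current turn
adapter = OneStepAdapter(dim=2)
print("before:", adapter.policy(feats).round(3))
print("after :", adapter.adapt(feats, rewards).round(3))
```

Repeating the single-step adaptation across turns is what the paper's convergence claim concerns; in this toy setting, each call to `adapt` shifts probability mass toward the candidates the user's feedback rewards.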