Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs
September 27, 2025
Authors: Chenxing Wei, Hong Wang, Ying He, Fei Yu, Yao Shu
cs.AI
Abstract
Large Language Models (LLMs) employ multi-turn interaction as a fundamental
paradigm for completing complex tasks. However, their performance often
degrades in extended interactions, as they are typically trained on static,
single-turn data, which hinders their ability to adapt to real-time user
feedback. To address this limitation, we first propose a new paradigm:
Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), which utilizes
user feedback from the ongoing interaction as a reward signal to estimate a
latent optimal policy aligned with user preferences, then updates a small
subset of parameters to steer the model toward this policy, ultimately enabling
efficient in-conversation self-correction. We then introduce Optimum-Referenced
One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM.
ROSA guides the model parameters toward a theoretical optimal policy in a
single, efficient update step, avoiding costly iterative gradient-based
optimization and minimizing computational overhead. We provide a rigorous
theoretical analysis guaranteeing that ROSA's policy converges to the user's
preferences as the number of interactions increases. Extensive experiments on
challenging benchmarks demonstrate that ROSA achieves significant improvements
in both task effectiveness and efficiency.
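The abstract does not specify the exact update rule, so the following is only a minimal illustrative sketch of the T2PAM idea, not the authors' ROSA implementation. It assumes (hypothetically) that the policy is a softmax over a small set of candidate responses scored by a linear head, that this head plays the role of the "small subset of parameters" adapted at test time, that user feedback arrives as a scalar reward per candidate, and that the "latent optimal policy" is approximated by the standard KL-regularized optimum pi*(y) ∝ pi(y) exp(r(y)/beta). The adapter then takes a single step toward that target rather than running iterative optimization.

```python
# Minimal illustrative sketch of test-time policy adaptation (T2PAM-style).
# NOT the authors' ROSA implementation; all names and the setup are hypothetical:
# a softmax policy over candidate responses, a linear scoring head as the small
# adapted parameter subset, and scalar user-feedback rewards for the current turn.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class OneStepAdapter:
    def __init__(self, dim, beta=1.0, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=dim)  # lightweight adapted parameters
        self.beta = beta                          # KL-regularization strength
        self.lr = lr                              # step size of the single update

    def policy(self, feats):
        # feats: (num_candidates, dim) feature matrix for candidate responses
        return softmax(feats @ self.w)

    def adapt(self, feats, rewards):
        """One update per turn, steering the policy toward the KL-regularized
        optimum pi*(y) ∝ pi(y) * exp(r(y) / beta), used here as the reference
        'latent optimal policy' estimated from user feedback."""
        pi = self.policy(feats)
        target = pi * np.exp(np.asarray(rewards, dtype=float) / self.beta)
        target = target / target.sum()
        # Gradient of the cross-entropy between the target and the current
        # softmax policy w.r.t. the linear weights; a single step, no iterative
        # gradient-based optimization loop.
        grad = feats.T @ (pi - target)
        self.w -= self.lr * grad
        return self.policy(feats)

# Toy usage: three candidate responses; user feedback favors the third one.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
rewards = [0.0, 0.2, 1.0]  # hypothetical scalar feedback from the current turn
adapter = OneStepAdapter(dim=2)
print("before:", adapter.policy(feats).round(3))
print("after :", adapter.adapt(feats, rewards).round(3))
```

Repeating the single-step adaptation across turns is what the paper's convergence claim concerns; in this toy setting, each call to `adapt` shifts probability mass toward the candidates the user's feedback rewards.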