

Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

September 27, 2025
Authors: Chenxing Wei, Hong Wang, Ying He, Fei Yu, Yao Shu
cs.AI

Abstract

Large Language Models (LLMs) employ multi-turn interaction as a fundamental paradigm for completing complex tasks. However, their performance often degrades in extended interactions, as they are typically trained on static, single-turn data, which hinders their ability to adapt to real-time user feedback. To address this limitation, we first propose a new paradigm: Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), which utilizes user feedback from the ongoing interaction as a reward signal to estimate a latent optimal policy aligned with user preferences, then updates a small subset of parameters to steer the model toward this policy, ultimately enabling efficient in-conversation self-correction. We then introduce Optimum-Referenced One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM. ROSA guides the model parameters toward a theoretically optimal policy in a single, efficient update step, avoiding costly iterative gradient-based optimization and minimizing computational overhead. We provide a rigorous theoretical analysis guaranteeing that ROSA's policy converges to the user's preferences as the number of interactions increases. Extensive experiments on challenging benchmarks demonstrate that ROSA achieves significant improvements in both task effectiveness and efficiency.
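As a rough illustration of the T2PAM loop the abstract describes, the sketch below maps per-turn user feedback to a scalar reward and applies a single update to a small low-rank adapter while the base model stays frozen. This is a minimal sketch under stated assumptions: the adapter shape, the activation-matching loss, the reward scaling, and the single reward-weighted step are placeholders, not the closed-form ROSA update, which the abstract does not specify.

```python
# Illustrative T2PAM-style test-time update. Assumptions (not from the paper):
# a low-rank adapter as the "small subset of parameters", an activation-matching
# loss as the adaptation target, and a single reward-weighted step as a stand-in
# for ROSA's one-step rule.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankAdapter(nn.Module):
    """Small trainable parameter subset applied on top of frozen hidden states."""

    def __init__(self, hidden_size: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(hidden_size, rank, bias=False)
        self.up = nn.Linear(rank, hidden_size, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.down(h))


def one_step_adapt(adapter: LowRankAdapter,
                   hidden_states: torch.Tensor,  # frozen-model activations for the last turn
                   target_states: torch.Tensor,  # activations for the behavior the user asked for
                   reward: float,                # user feedback mapped to [-1, 1]
                   step_size: float = 0.5) -> None:
    """One update of the adapter only: move toward the target when feedback is
    positive, away when it is negative, then stop (no iterative optimization loop)."""
    loss = reward * F.mse_loss(adapter(hidden_states), target_states)
    loss.backward()
    with torch.no_grad():
        for p in adapter.parameters():
            if p.grad is not None:
                p -= step_size * p.grad
                p.grad = None


if __name__ == "__main__":
    # Toy usage with random tensors standing in for real LLM activations.
    torch.manual_seed(0)
    adapter = LowRankAdapter(hidden_size=16)
    h, target = torch.randn(4, 16), torch.randn(4, 16)
    one_step_adapt(adapter, h, target, reward=1.0)
    print("adapter updated in a single step")
```

In this toy setup, only the adapter's parameters change per turn, mirroring the abstract's point that a small subset of parameters is updated once per feedback signal rather than through repeated gradient-based optimization.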