Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

Abstract

Large Language Models (LLMs) employ multi-turn interaction as a fundamental paradigm for completing complex tasks. However, because they are typically trained on static, single-turn data, their performance often degrades in extended interactions, and their ability to adapt to real-time user feedback is limited. To address this limitation, we first propose a new paradigm: Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM). T2PAM uses user feedback from the ongoing interaction as a reward signal to estimate a latent optimal policy aligned with the user's preferences, then updates a small subset of the model's parameters to steer the model toward this policy, enabling efficient in-conversation self-correction. We then introduce Optimum-Referenced One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM. ROSA guides the model parameters toward the theoretical optimal policy in a single, efficient update step, avoiding costly iterative gradient-based optimization and minimizing computational overhead. We provide a rigorous theoretical analysis guaranteeing that ROSA's policy converges to the user's preferences as the number of interactions increases. Extensive experiments on challenging benchmarks demonstrate that ROSA achieves significant improvements in both task effectiveness and efficiency.
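The abstract does not state ROSA's update rule, so the following is only a toy illustration of the general idea of moving a policy toward a feedback-derived target in one step. It is not the paper's algorithm. It assumes a softmax policy over three candidate responses, a scalar reward per response taken from user feedback, and the standard reward-tilted target pi*(y) proportional to pi(y) * exp(r(y) / beta) from KL-regularized reward maximization. All function names and the parameter `beta` are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward_tilted_target(policy, rewards, beta):
    """Assumed target: pi*(y) proportional to pi(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(policy, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def one_step_update(logits, rewards, beta):
    """Shift each logit by r / beta.

    For a plain softmax policy this single shift reaches the assumed
    target exactly, so no iterative gradient loop is needed in this toy case.
    """
    return [z + r / beta for z, r in zip(logits, rewards)]

# Hypothetical feedback: the user disliked response 0 and liked response 2.
logits = [0.0, 0.0, 0.0]
rewards = [-1.0, 0.0, 1.0]
beta = 1.0

before = softmax(logits)
target = reward_tilted_target(before, rewards, beta)
after = softmax(one_step_update(logits, rewards, beta))
```

In this toy setting the updated policy coincides with the target and shifts probability mass toward the response the user preferred. A real LLM policy is not a free table of logits, which is why the abstract speaks of updating a small subset of parameters rather than the output distribution directly.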
