Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs

Abstract

Large Language Models (LLMs) rely on multi-turn interaction as a fundamental paradigm for completing complex tasks. However, because they are typically trained on static, single-turn data, they have limited ability to adapt to real-time user feedback, and their performance often degrades over extended interactions. To address this limitation, we first propose a new paradigm, Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM). T2PAM uses user feedback from the ongoing interaction as a reward signal to estimate a latent optimal policy aligned with user preferences, then updates a small subset of parameters to steer the model toward this policy, enabling efficient in-conversation self-correction. We then introduce Optimum-Referenced One-Step Adaptation (ROSA), a lightweight algorithm that operationalizes T2PAM. ROSA guides the model parameters toward a theoretical optimal policy in a single, efficient update step, avoiding costly iterative gradient-based optimization and minimizing computational overhead. We provide a rigorous theoretical analysis guaranteeing that ROSA's policy converges to the user's preferences as the number of interactions increases. Extensive experiments on challenging benchmarks demonstrate that ROSA achieves significant improvements in both task effectiveness and efficiency.
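To make the idea of steering a policy toward a feedback-derived target in one step more concrete, the following toy sketch may help. It is not the ROSA algorithm, whose update rule the abstract does not specify. It assumes a categorical policy over three hypothetical candidate responses, scalar feedback rewards, and the standard KL-regularized closed-form target, in which the target probability is proportional to the current probability times exp(reward / beta). The names `target_policy`, `one_step_adapt`, and `beta` are illustrative only.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_policy(probs, rewards, beta):
    """Assumed target: proportional to probs[i] * exp(rewards[i] / beta).

    This is the closed-form maximizer of expected reward under a KL penalty
    of strength beta toward the current policy. Using it here is an
    assumption made for illustration, not a detail taken from the paper.
    """
    weights = [p * math.exp(r / beta) for p, r in zip(probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def one_step_adapt(logits, rewards, beta):
    """Single, non-iterative update: shift each logit by reward / beta.

    For a softmax policy this lands exactly on the assumed target above,
    so no gradient loop is needed in this toy setting.
    """
    return [z + r / beta for z, r in zip(logits, rewards)]


# Hypothetical setup: three candidate responses. The user disliked the
# first (reward -1), liked the second (+1), and gave no signal on the third.
logits = [2.0, 1.0, 0.0]
rewards = [-1.0, 1.0, 0.0]
beta = 1.0

before = softmax(logits)
after = softmax(one_step_adapt(logits, rewards, beta))
target = target_policy(before, rewards, beta)

print("before:", [round(p, 3) for p in before])
print("after: ", [round(p, 3) for p in after])
print("target:", [round(p, 3) for p in target])
```

In this toy setting the adapted policy coincides with the assumed target after one update, and probability mass moves from the disliked response to the liked one. A real LLM policy is defined over token sequences and only a small subset of parameters is updated, so an actual one-step update would generally approximate the target rather than match it exactly.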
