

GameTalk: Training LLMs for Strategic Conversation

January 22, 2026
Authors: Victor Conchello Vendrell, Max Ruiz Luyten, Mihaela van der Schaar
cs.AI

Abstract

Strategic decision-making in multi-agent settings is a key challenge for large language models (LLMs), particularly when coordination and negotiation must unfold over extended conversations. While recent work has explored the use of LLMs in isolated decision tasks, little attention has been given to optimizing long-term objectives through dialogue. We introduce GameTalk, a framework for training LLMs to make strategic decisions via multi-turn interactions. Unlike prior work that focuses on single-turn objectives or static action prediction, we train LLMs to optimize a global objective across full conversations. We achieve this by adapting fine-tuning methods like GRPO, DPO, and STaR to incorporate reward signals that depend on the entire interaction. We evaluate this approach on a suite of increasingly complex games, designed to stress different aspects of reasoning, coordination, and opponent modeling. Our results show that GameTalk significantly outperforms untrained models, especially under reward shaping, with DPO consistently yielding the strongest gains. These findings position conversational fine-tuning as a promising path for LLMs to reason, negotiate, and act in interactive environments.
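To make the "reward signals that depend on the entire interaction" idea concrete, here is a minimal sketch (not the authors' released code) of one way the DPO variant could be set up: roll out several complete conversations for the same game prompt, score each full transcript with the game's global objective, and pair the highest-scoring transcript (chosen) against the lowest (rejected) as a preference example. The `Episode` dataclass and `build_dpo_pairs` helper are hypothetical names introduced purely for illustration.

```python
# Sketch only: conversation-level preference pairs for DPO-style fine-tuning.
# The reward attaches to the whole multi-turn transcript, not a single turn.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Episode:
    prompt: str          # shared game / scenario prompt
    transcript: str      # full multi-turn dialogue produced by the policy
    total_reward: float  # global objective scored at the end of the game

def build_dpo_pairs(episodes: list[Episode]) -> list[dict]:
    """Group rollouts by prompt and pair the best-scoring conversation
    (chosen) against the worst (rejected) for preference fine-tuning."""
    by_prompt: dict[str, list[Episode]] = defaultdict(list)
    for ep in episodes:
        by_prompt[ep.prompt].append(ep)

    pairs = []
    for prompt, group in by_prompt.items():
        if len(group) < 2:
            continue
        group.sort(key=lambda e: e.total_reward, reverse=True)
        chosen, rejected = group[0], group[-1]
        if chosen.total_reward > rejected.total_reward:  # skip exact ties
            pairs.append({
                "prompt": prompt,
                "chosen": chosen.transcript,
                "rejected": rejected.transcript,
            })
    return pairs

# Example: two rollouts of the same negotiation game with different outcomes.
rollouts = [
    Episode("Negotiate a resource split.", "A: ... B: ... (deal reached)", 1.0),
    Episode("Negotiate a resource split.", "A: ... B: ... (no deal)", 0.0),
]
print(build_dpo_pairs(rollouts))
```

The same conversation-level scoring could, under the paper's framing, also feed group-relative advantages for GRPO or filter transcripts for STaR-style self-training; the abstract additionally notes that reward shaping over these interaction-level scores is where the largest gains were observed.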