Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback

Abstract

We study whether multiple large language models (LLMs) can autonomously improve each other in a negotiation game by playing, reflecting, and criticizing. We are interested in this question because, if LLMs can improve each other, it implies the possibility of creating strong AI agents with minimal human intervention. We ask two LLMs to negotiate with each other, playing the roles of a buyer and a seller, respectively. They aim to reach a deal, with the buyer targeting a lower price and the seller a higher one. A third language model, playing the critic, provides feedback to a player to improve that player's negotiation strategy. We let the two agents play multiple rounds, using previous negotiation history and AI feedback as in-context demonstrations to improve the model's negotiation strategy iteratively. We use different LLMs (GPT and Claude) for different roles and use the deal price as the evaluation metric. Our experiments reveal multiple intriguing findings: (1) Only a subset of the language models we consider can self-play and improve the deal price from AI feedback; weaker models either do not understand the game's rules or cannot incorporate AI feedback for further improvement. (2) Models' abilities to learn from the feedback differ when playing different roles. For example, it is harder for Claude-instant to improve as the buyer than as the seller. (3) When unrolling the game to multiple rounds, stronger agents can consistently improve their performance by meaningfully using previous experiences and iterative AI feedback, yet have a higher risk of breaking the deal. We hope our work provides insightful initial explorations of having models autonomously improve each other with game playing and AI feedback.
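
The multi-round procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the buyer, seller, and critic are LLMs in the paper, whereas here they are replaced by hypothetical rule-based stand-ins (`toy_seller`, `toy_buyer`, `toy_critic`) with made-up prices so the loop runs offline. The sketch only shows the control flow: play a round, obtain feedback from the critic, append the history and feedback to the player's context, and play again.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: in the paper these roles are played by LLMs
# (GPT and Claude); here they are plain functions so the sketch runs offline.
Transcript = List[Tuple[float, float]]      # (ask, bid) for each turn
Seller = Callable[[int], float]             # turn -> asking price
Buyer = Callable[[List[str], int], float]   # (in-context notes, turn) -> bid
Critic = Callable[[Transcript], str]        # transcript -> feedback text


def toy_seller(turn: int) -> float:
    """Start at 20 and concede 1 per turn, never below 12 (made-up numbers)."""
    return max(12.0, 20.0 - turn)


def toy_buyer(context: List[str], turn: int) -> float:
    """Stand-in for in-context learning: each feedback note lowers the bids."""
    n_feedback = sum(1 for note in context if note.startswith("feedback:"))
    return 10.0 - 2.0 * n_feedback + turn


def toy_critic(transcript: Transcript) -> str:
    """Stand-in critic that always returns the same advice."""
    return "open lower and concede more slowly"


def play_round(
    seller: Seller, buyer: Buyer, buyer_context: List[str], max_turns: int = 8
) -> Tuple[Optional[float], Transcript]:
    """Exchange offers until the bid meets the ask; None means no deal."""
    transcript: Transcript = []
    for turn in range(max_turns):
        ask = seller(turn)
        bid = buyer(buyer_context, turn)
        transcript.append((ask, bid))
        if bid >= ask:
            return ask, transcript
    return None, transcript


def run_self_play(rounds: int = 4) -> List[Optional[float]]:
    """Play several rounds, feeding history and feedback back to the buyer."""
    buyer_context: List[str] = []
    prices: List[Optional[float]] = []
    for _ in range(rounds):
        price, transcript = play_round(toy_seller, toy_buyer, buyer_context)
        prices.append(price)
        # Earlier rounds and critic feedback become in-context demonstrations.
        buyer_context.append(f"history: {transcript}")
        buyer_context.append(f"feedback: {toy_critic(transcript)}")
    return prices
```

With these toy numbers, `run_self_play()` lowers the buyer's deal price over the first three rounds and then fails to reach a deal in the fourth. That outcome is an artifact of the stand-in rules, not a result from the paper, though it mirrors the trade-off the abstract mentions between improving the price and risking the deal.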
