GLEE: A Unified Framework and Benchmark for Language-based Economic Environments

Abstract

Large Language Models (LLMs) show significant potential in economic and strategic interactions, where communication via natural language is often prevalent. This raises key questions: Do LLMs behave rationally? Can they mimic human behavior? Do they tend to reach efficient and fair outcomes? What is the role of natural language in strategic interaction? How do characteristics of the economic environment influence these dynamics? These questions are crucial for understanding the economic and societal implications of integrating LLM-based agents into real-world data-driven systems, such as online retail platforms and recommender systems. While the ML community has been exploring the potential of LLMs in such multi-agent setups, varying assumptions, design choices, and evaluation criteria across studies make it difficult to draw robust and meaningful conclusions. To address this, we introduce a benchmark for standardizing research on two-player, sequential, language-based games. Inspired by the economic literature, we define three base families of games with consistent parameterization, degrees of freedom, and economic measures to evaluate agents' performance (self-gain) as well as the game outcome (efficiency and fairness). We develop an open-source framework for interaction simulation and analysis, and use it to collect a dataset of LLM vs. LLM interactions across numerous game configurations and an additional dataset of human vs. LLM interactions. Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents to that of human players in various economic contexts; (ii) evaluate agents on both individual and collective performance measures; and (iii) quantify the effect of the economic characteristics of the environments on the behavior of agents.

