GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
October 7, 2024
Authors: Eilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, Moshe Tennenholtz
cs.AI
Abstract
Large Language Models (LLMs) show significant potential in economic and
strategic interactions, where communication via natural language is often
prevalent. This raises key questions: Do LLMs behave rationally? Can they mimic
human behavior? Do they tend to reach an efficient and fair outcome? What is
the role of natural language in the strategic interaction? How do
characteristics of the economic environment influence these dynamics? These
questions become crucial concerning the economic and societal implications of
integrating LLM-based agents into real-world data-driven systems, such as
online retail platforms and recommender systems. While the ML community has
been exploring the potential of LLMs in such multi-agent setups, varying
assumptions, design choices and evaluation criteria across studies make it
difficult to draw robust and meaningful conclusions. To address this, we
introduce a benchmark for standardizing research on two-player, sequential,
language-based games. Inspired by the economic literature, we define three base
families of games with consistent parameterization, degrees of freedom and
economic measures to evaluate agents' performance (self-gain), as well as the
game outcome (efficiency and fairness). We develop an open-source framework for
interaction simulation and analysis, and utilize it to collect a dataset of LLM
vs. LLM interactions across numerous game configurations and an additional
dataset of human vs. LLM interactions. Through extensive experimentation, we
demonstrate how our framework and dataset can be used to: (i) compare the
behavior of LLM-based agents to human players in various economic contexts;
(ii) evaluate agents in both individual and collective performance measures;
and (iii) quantify the effect of the economic characteristics of the
environments on the behavior of agents.