
Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

October 31, 2023
Authors: Tian Liang, Zhiwei He, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi, Xing Wang
cs.AI

Abstract

The automatic evaluation of LLM-based agent intelligence is critical in developing advanced LLM-based agents. Although considerable effort has been devoted to developing human-annotated evaluation datasets, such as AlpacaEval, existing techniques are costly, time-consuming, and lack adaptability. In this paper, inspired by the popular language game "Who is Spy", we propose to use the word guessing game to assess the intelligence performance of LLMs. Given a word, the LLM is asked to describe the word and to determine its identity (spy or not) based on its own and other players' descriptions. Ideally, an advanced agent should be able to describe a given word accurately in an aggressive description while maximizing confusion in a conservative description, enhancing its participation in the game. To this end, we first develop DEEP to evaluate LLMs' expression and disguising abilities. DEEP requires the LLM to describe a word in both aggressive and conservative modes. We then introduce SpyGame, an interactive multi-agent framework designed to assess LLMs' intelligence through participation in a competitive language-based board game. By incorporating multi-agent interaction, SpyGame requires the target LLM to possess linguistic skills and strategic thinking, providing a more comprehensive evaluation of LLMs' human-like cognitive abilities and adaptability in complex communication situations. The proposed evaluation framework is easy to implement. We collected words from multiple sources, domains, and languages, and conducted experiments with the proposed framework. Extensive experiments demonstrate that the proposed DEEP and SpyGame effectively evaluate the capabilities of various LLMs, capturing their ability to adapt to novel situations and engage in strategic communication.
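
To make the two protocols concrete, here is a minimal Python sketch of a DEEP-style evaluation. The `llm` and `judge` callables, the prompt wording, and the exact-match judging rule are illustrative assumptions, not the paper's exact setup:

```python
# Minimal sketch of the DEEP protocol: the target LLM describes a word in
# an aggressive and a conservative mode, and a judge model tries to recover
# the word from each description. A strong agent yields a guessable
# aggressive description and a hard-to-guess conservative one.
from typing import Callable

def deep_scores(word: str,
                llm: Callable[[str], str],
                judge: Callable[[str], str]) -> dict:
    # Prompt wording here is an assumption for illustration.
    prompts = {
        "aggressive": (f"Describe the word '{word}' as clearly as possible, "
                       "without saying the word itself."),
        "conservative": (f"Describe the word '{word}' vaguely, so that it is "
                         "hard to guess, without saying the word itself."),
    }
    results = {}
    for mode, prompt in prompts.items():
        description = llm(prompt)
        guess = judge("A word was described as follows: "
                      f"{description}\nGuess the word. Answer with a single word.")
        # Exact-match judging is a simplification; True means "guessed".
        results[mode] = guess.strip().strip(".").lower() == word.lower()
    return results
```

And a similarly hedged sketch of one SpyGame-style round, where one agent secretly receives a different word, every agent describes its own word, and all agents then vote on who the spy is. Each agent is modeled as a prompt-to-string callable; the prompts and the vote parsing are again assumptions:

```python
import random
from collections import Counter

def play_round(agents: list, civilian_word: str, spy_word: str) -> bool:
    """Return True if the spy is voted out. Each agent maps a prompt to a reply."""
    spy = random.randrange(len(agents))
    words = [spy_word if i == spy else civilian_word for i in range(len(agents))]

    # Describing phase: each player describes its own secret word.
    descriptions = [
        agent(f"Your secret word is '{w}'. Describe it in one sentence "
              "without saying the word itself.")
        for agent, w in zip(agents, words)
    ]
    transcript = "\n".join(f"Player {i}: {d}" for i, d in enumerate(descriptions))

    # Voting phase: each player accuses whoever seems to hold a different word.
    votes = []
    for i, agent in enumerate(agents):
        answer = agent(f"{transcript}\nYou are Player {i}. One player received "
                       "a different word. Reply with the number of the player "
                       "you suspect.")
        digits = [int(tok) for tok in answer.split() if tok.isdigit()]
        if digits:
            votes.append(digits[0] % len(agents))

    accused, _ = Counter(votes).most_common(1)[0] if votes else (None, 0)
    return accused == spy
```
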