Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models
October 31, 2023
Authors: Tian Liang, Zhiwei He, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi, Xing Wang
cs.AI
Abstract
The automatic evaluation of LLM-based agent intelligence is critical to developing advanced LLM-based agents. Although considerable effort has been devoted to developing human-annotated evaluation datasets, such as AlpacaEval, existing techniques are costly, time-consuming, and lack adaptability. In this paper, inspired by the popular language game "Who is the Spy", we propose using word guessing games to assess the intelligence of LLMs. Given a word, the LLM is asked to describe the word and to determine its identity (spy or not) based on its own and other players' descriptions. Ideally, an advanced agent should be able to describe a given word accurately in an aggressive description while maximizing confusion in a conservative description, enhancing its chances in the game. To this end, we first develop DEEP to evaluate LLMs' expression and disguising abilities; DEEP requires the LLM to describe a word in both aggressive and conservative modes. We then introduce SpyGame, an interactive multi-agent framework designed to assess LLMs' intelligence through participation in a competitive language-based board game. By incorporating multi-agent interaction, SpyGame requires the target LLM to exhibit both linguistic skill and strategic thinking, providing a more comprehensive evaluation of LLMs' human-like cognitive abilities and adaptability in complex communication situations. The proposed evaluation framework is straightforward to implement. We collected words from multiple sources, domains, and languages and conducted experiments with the proposed framework. Extensive experiments demonstrate that DEEP and SpyGame effectively evaluate the capabilities of various LLMs, capturing their ability to adapt to novel situations and engage in strategic communication.