Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

Abstract

Automatic evaluation of the intelligence of LLM-based agents is critical to developing advanced LLM-based agents. Although considerable effort has gone into human-annotated evaluation datasets such as AlpacaEval, existing techniques are costly, time-consuming, and lack adaptability. In this paper, inspired by the popular language game "Who is Spy", we propose using a word guessing game to assess the intelligence of LLMs. Given a word, the LLM is asked to describe it and to determine its identity (spy or not) based on its own and the other players' descriptions. Ideally, an advanced agent should be able to describe a given word accurately in an aggressive description while maximizing confusion in a conservative description, which strengthens its participation in the game. To this end, we first develop DEEP to evaluate LLMs' expression and disguising abilities. DEEP requires the target LLM to describe a word in aggressive and conservative modes. We then introduce SpyGame, an interactive multi-agent framework designed to assess LLMs' intelligence through participation in a competitive language-based board game. By incorporating multi-agent interaction, SpyGame requires the target LLM to possess both linguistic skills and strategic thinking, providing a more comprehensive evaluation of LLMs' human-like cognitive abilities and adaptability in complex communication situations. The proposed evaluation framework is easy to implement. We collected words from multiple sources, domains, and languages and used the framework to conduct experiments. Extensive experiments demonstrate that DEEP and SpyGame effectively evaluate the capabilities of various LLMs, capturing their ability to adapt to novel situations and engage in strategic communication.
