Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

Abstract

The automatic evaluation of LLM-based agent intelligence is critical to developing advanced LLM-based agents. Although considerable effort has been devoted to building human-annotated evaluation datasets such as AlpacaEval, existing techniques are costly, time-consuming, and lack adaptability. In this paper, inspired by the popular language game "Who is Spy", we propose using a word guessing game to assess the intelligence of LLMs. Given a word, the LLM is asked to describe the word and to determine its own identity (spy or not) based on its own and the other players' descriptions. Ideally, an advanced agent should be able to describe a given word accurately in an aggressive description while maximizing confusion in a conservative description, thereby strengthening its participation in the game. To this end, we first develop DEEP to evaluate LLMs' expression and disguising abilities. DEEP requires an LLM to describe a word in both aggressive and conservative modes. We then introduce SpyGame, an interactive multi-agent framework designed to assess LLMs' intelligence through participation in a competitive language-based board game. By incorporating multi-agent interaction, SpyGame requires the target LLM to possess both linguistic skills and strategic thinking, providing a more comprehensive evaluation of LLMs' human-like cognitive abilities and adaptability in complex communication situations. The proposed evaluation framework is easy to implement. We collected words from multiple sources, domains, and languages and used the proposed framework to conduct experiments. Extensive experiments demonstrate that DEEP and SpyGame effectively evaluate the capabilities of various LLMs, capturing their ability to adapt to novel situations and engage in strategic communication.
