
Evaluating Language Models' Evaluations of Games

October 13, 2025
Authors: Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths
cs.AI

Abstract

Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have historically focused on problem solving, for example by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluations of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more closely aligned with people in their evaluations of games than non-reasoning language models are. However, we observe a non-monotonic relationship: as models approach game-theoretic optimality, their fit to human data weakens. We also observe more "jaggedness" across models when assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing language and reasoning models with more resource-rational meta-reasoning.
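To make the paradigm concrete, one simple way to quantify how well a model's game evaluations fit human data is a rank correlation between per-game scores. The sketch below is a minimal, hypothetical illustration of that idea: the variable names and data values are assumptions for demonstration, not the paper's actual formalism, dataset, or analysis.

```python
# Hypothetical sketch: comparing a model's game evaluations against human
# judgments, in the spirit of the paper's paradigm. All names and values
# here are illustrative assumptions, not the authors' actual data.
from scipy.stats import spearmanr

# Assumed ratings for a handful of novel board games on a shared scale
# (e.g., "funness" from 1 to 7). In the paper, human judgments come from
# 450+ ratings over 100+ games.
human_scores = [6.1, 3.4, 5.2, 2.8, 4.9]   # mean human rating per game
model_scores = [5.8, 2.9, 5.5, 3.6, 4.1]   # one model's rating per game

# Rank correlation asks whether the model orders the games the same way
# people do, without assuming the two rating scales are calibrated.
rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

Rank correlation deliberately ignores calibration differences between the model's and people's rating scales; the paper's formalism and analyses may use different or additional measures of fit.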