AFFORDANCE20Q: Evaluating Affordance Reasoning Grounded in Physical Properties

Abstract

Affordance reasoning, the inference of an object's action possibilities from its physical properties (e.g., shape and material), is fundamental to human physical understanding and increasingly critical for Large Language Models (LLMs). However, existing affordance benchmarks largely expose explicit object identities in the evaluation setup, allowing models to rely on memorized object-affordance mappings rather than reasoning over physical properties. To address this gap, we introduce Affordance20Q, a novel affordance reasoning benchmark formulated as a 20-Questions game that does not expose the object's identity. In each game, the model identifies a hidden object's affordance from a candidate set by asking yes/no questions about its physical properties. Affordance20Q comprises 1,009 games over 454 objects and 59 affordances, all manually filtered, refined, and annotated. We conduct comprehensive experiments with 15 state-of-the-art LLMs and find that they trail human performance by a substantial margin (~20 points). A KL-based information-gain (IG) analysis further shows that models fail to ask discriminating questions as the game progresses. To close the gap, we develop KB-Anchored Rule Induction (KARI), an LLM-based pipeline that generates affordance rules grounded in evidence from knowledge bases (KBs). KARI improves open-source LLMs by up to 15.2 points, although the limited coverage of KBs constrains further gains. We release all our code and data at https://github.com/1171-jpg/Affordance20Q.git.
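
To make the idea of a KL-based information gain concrete, the following is a minimal sketch of one way such a quantity could be computed for a single yes/no question. It is an illustration under stated assumptions, not the paper's definition: the abstract does not specify the formula, so the sketch assumes a belief distribution over candidate affordances, a Bayes update after each answer, and IG measured as KL(posterior || prior) in bits. The function names, the affordance labels, and the likelihood values are all hypothetical.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("answer is impossible under every candidate")
    return {k: v / total for k, v in weights.items()}


def posterior(prior, likelihood_yes, answer):
    """Bayes update of the belief over candidate affordances.

    likelihood_yes[a] is an assumed P(answer == yes | affordance a);
    `answer` is True for "yes" and False for "no".
    """
    unnormalized = {}
    for affordance, p in prior.items():
        p_yes = likelihood_yes[affordance]
        unnormalized[affordance] = p * (p_yes if answer else 1.0 - p_yes)
    return normalize(unnormalized)


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p == 0 contribute nothing."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)


def information_gain(prior, likelihood_yes, answer):
    """IG of one answered question, taken here as KL(posterior || prior)."""
    return kl_divergence(posterior(prior, likelihood_yes, answer), prior)


# Hypothetical candidate set with a uniform prior.
prior = {"cut": 0.25, "contain": 0.25, "pour": 0.25, "write": 0.25}

# A discriminating question (e.g. "Does it have a sharp edge?"),
# assumed to be answered "yes" only for the "cut" affordance.
sharp_edge = {"cut": 1.0, "contain": 0.0, "pour": 0.0, "write": 0.0}

# A non-discriminating question whose answer is equally likely everywhere.
uninformative = {"cut": 0.5, "contain": 0.5, "pour": 0.5, "write": 0.5}

ig_sharp = information_gain(prior, sharp_edge, answer=True)
ig_flat = information_gain(prior, uninformative, answer=True)
```

Under these assumptions, the sharp-edge question collapses a uniform belief over four candidates onto one, which yields 2 bits, while the uninformative question leaves the belief unchanged and yields 0 bits. A model that keeps asking questions of the second kind would show the flat IG profile that the abstract attributes to current LLMs.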