AFFORDANCE20Q: Evaluating Affordance Reasoning from Physical Properties

Abstract

Affordance reasoning, the inference of an object's action possibilities from its physical properties (e.g., shape and material), is fundamental to human physical understanding and increasingly critical for Large Language Models (LLMs). However, existing affordance benchmarks largely expose explicit object identities in the evaluation setup, allowing models to rely on memorized object-affordance mappings rather than reasoning over physical properties. To address this gap, we introduce Affordance20Q, a novel affordance reasoning benchmark formulated as a 20-Questions game that never exposes the object's identity. In each game, the model identifies a hidden object's affordance from a candidate set by asking yes/no questions about its physical properties. Affordance20Q comprises 1,009 games over 454 objects and 59 affordances, all manually filtered, refined, and annotated. We conduct comprehensive experiments with 15 state-of-the-art LLMs and find a substantial gap of roughly 20 points relative to human performance. A KL-based information-gain (IG) analysis further shows that models fail to ask discriminating questions as the game progresses. To close the gap, we develop KB-Anchored Rule Induction (KARI), an LLM-based pipeline that generates affordance rules grounded in evidence from knowledge bases (KBs). KARI improves open-source LLMs by up to 15.2 points, while the limited coverage of KBs hinders further gains. We release all our code and data at https://github.com/1171-jpg/Affordance20Q.git.
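
To make the KL-based information-gain analysis concrete, the following is a minimal sketch of how the gain from a single yes/no answer could be scored. The abstract does not define IG precisely, so the formulation here is an assumption: IG is taken as KL(posterior || prior) over the candidate affordances, with a uniform prior and answers that deterministically rule candidates in or out. The function names and the example candidate labels are hypothetical and are not taken from the released code.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) in bits for two distributions over the same keys."""
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)


def update_belief(prior, consistent):
    """Zero out candidates that contradict the answer, then renormalize.

    `consistent` is the set of candidates still compatible with the answer
    (an assumed, deterministic answer model).
    """
    mass = sum(prior[c] for c in consistent)
    if mass == 0:
        raise ValueError("the answer rules out every candidate")
    return {c: (prior[c] / mass if c in consistent else 0.0) for c in prior}


def information_gain(prior, consistent):
    """Score one answered question as KL(posterior || prior)."""
    posterior = update_belief(prior, consistent)
    return kl_divergence(posterior, prior)


# Hypothetical candidate set with a uniform prior.
candidates = ["cut", "pierce", "contain", "pour"]
prior = {c: 1 / len(candidates) for c in candidates}

# A question whose answer keeps half of the candidates is discriminating.
print(information_gain(prior, {"cut", "pierce"}))  # 1.0 bit

# A question whose answer keeps every candidate carries no information.
print(information_gain(prior, set(candidates)))  # 0.0 bits
```

Under these assumptions, a model that keeps asking questions with near-zero gain late in a game would show the flat IG curve that the analysis describes; the benchmark's actual scoring may differ in both the answer model and the prior.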