Affordance20Q: Evaluating Affordance Reasoning from Physical Properties

Abstract

Affordance reasoning, the inference of an object's action possibilities from its physical properties (e.g., shape and material), is fundamental to human physical understanding and increasingly critical for Large Language Models (LLMs). However, existing affordance benchmarks largely expose explicit object identities in the evaluation setup, allowing models to rely on memorized object-affordance mappings rather than reasoning over physical properties. To address this limitation, we introduce Affordance20Q, a novel affordance reasoning benchmark formulated as a 20-Questions game that never exposes the object's identity. In each game, the model must identify a hidden object's affordance from a candidate set by asking yes/no questions about the object's physical properties. Affordance20Q comprises 1,009 games over 454 objects and 59 affordances, all manually filtered, refined, and annotated. We conduct comprehensive experiments with 15 state-of-the-art LLMs and find a substantial gap of roughly 20 points relative to human performance. A KL-based information-gain (IG) analysis further shows that models fail to ask discriminating questions as the game progresses. To narrow this performance gap, we develop KB-Anchored Rule Induction (KARI), an LLM-based pipeline that generates affordance rules grounded in evidence from knowledge bases (KBs). KARI improves open-source LLMs by up to 15.2 points, although the limited coverage of KBs hinders further gains. We release all our code and data at https://github.com/1171-jpg/Affordance20Q.git.
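
The abstract does not specify how the KL-based information gain is computed. As a minimal sketch of one common formulation, the code below assumes a belief distribution over candidate affordances that is updated by Bayes' rule after each yes/no answer, with the IG of a question taken as the KL divergence from the prior belief to the posterior belief. The function names, the candidate affordances, and the likelihood values are all hypothetical and are not taken from the paper or its repository.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; p and q are dicts over the same candidate set.

    Assumes q[a] > 0 wherever p[a] > 0, which holds when p is a Bayes
    update of q.
    """
    return sum(pa * math.log2(pa / q[a]) for a, pa in p.items() if pa > 0)

def posterior(prior, likelihood_yes, answer):
    """Bayes update of the belief over affordances after one yes/no answer.

    likelihood_yes[a] is an assumed probability of hearing "yes" when the
    hidden object's affordance is a (illustrative, not from the paper).
    """
    unnormalized = {
        a: prior[a] * (likelihood_yes[a] if answer else 1.0 - likelihood_yes[a])
        for a in prior
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("answer is impossible under the current belief")
    return {a: w / total for a, w in unnormalized.items()}

def information_gain(prior, likelihood_yes, answer):
    """IG of a question, measured as KL(posterior || prior) in bits."""
    return kl_divergence(posterior(prior, likelihood_yes, answer), prior)

# Hypothetical game state: uniform belief over four candidate affordances.
prior = {"cut": 0.25, "contain": 0.25, "pour": 0.25, "roll": 0.25}

# "Does it have a sharp edge?" separates "cut" from the other candidates.
sharp_edge = {"cut": 1.0, "contain": 0.0, "pour": 0.0, "roll": 0.0}
ig_discriminating = information_gain(prior, sharp_edge, answer=True)

# A question that every candidate answers "yes" to equally often.
uninformative = {"cut": 0.5, "contain": 0.5, "pour": 0.5, "roll": 0.5}
ig_uninformative = information_gain(prior, uninformative, answer=True)
```

Under these assumptions, a question that isolates a single candidate from a uniform belief over four yields 2 bits, while a question whose answer is equally likely under every candidate yields 0 bits. The latter is the kind of non-discriminating question that the IG analysis is meant to expose.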