CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

Abstract

Recent advances in large language models have led to strong performance on reasoning and environment-interaction tasks, yet their capacity for creative problem solving remains underexplored. We study this capability through the lens of creative tool use, where a model repurposes available objects by reasoning about their affordances and attributes rather than relying on canonical usage. As a first step, we introduce CreativityBench, a benchmark for evaluating affordance-based creativity in LLMs. To this end, we build a large-scale affordance knowledge base (KB) with 4K entities and 150K+ affordance annotations that explicitly links objects, parts, attributes, and actionable uses. Building on this KB, we generate 14K grounded tasks that require identifying non-obvious yet physically plausible solutions under constraints. Evaluations across 10 state-of-the-art LLMs, including closed and open-source models, show that models can often select a plausible object but fail to identify the correct parts, their affordances, and the underlying physical mechanism needed to solve the task, leading to a significant drop in performance. Furthermore, improvements from model scaling quickly saturate, strong general reasoning does not reliably translate to creative affordance discovery, and common inference-time strategies such as Chain-of-Thought yield limited gains. These results suggest that creative tool use remains a major challenge for current models, and that CreativityBench provides a useful testbed for studying this missing dimension of intelligence, with potential implications for planning and reasoning modules in future agents.
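
To make the KB structure described in the abstract more concrete, the following is a minimal sketch of how objects, parts, attributes, and actionable uses might be linked, and of how an answer could be checked level by level (object, then part, then affordance). Everything here is an assumption for illustration: the class names `Entity` and `Part`, the helpers `find_candidates` and `score_answer`, the level-wise scoring rule, and the toy data are hypothetical and are not taken from the CreativityBench release.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    # Hypothetical schema: a part carries physical attributes and the
    # actionable uses (affordances) those attributes make possible.
    name: str
    attributes: set[str]
    affordances: set[str]


@dataclass
class Entity:
    # Hypothetical schema: an object in the KB is a named list of parts.
    name: str
    parts: list[Part] = field(default_factory=list)


def find_candidates(kb, needed_affordance):
    """Return (entity, part) name pairs whose part offers the needed affordance."""
    return [
        (entity.name, part.name)
        for entity in kb
        for part in entity.parts
        if needed_affordance in part.affordances
    ]


def score_answer(predicted, gold):
    """Score an answer level by level: object, then part, then affordance.

    Assumed rule (not from the paper): a level counts as correct only if
    every coarser level is also correct.
    """
    scores = {}
    correct_so_far = True
    for level in ("object", "part", "affordance"):
        correct_so_far = correct_so_far and predicted.get(level) == gold[level]
        scores[level] = correct_so_far
    return scores


# Toy data invented for illustration only.
kb = [
    Entity("credit card", [Part("edge", {"rigid", "thin", "flat"}, {"scrape", "pry"})]),
    Entity("shoelace", [Part("cord", {"flexible", "long"}, {"tie", "pull"})]),
]

candidates = find_candidates(kb, "scrape")
gold = {"object": "credit card", "part": "edge", "affordance": "scrape"}
predicted = {"object": "credit card", "part": "magnetic stripe", "affordance": "scrape"}
result = score_answer(predicted, gold)
```

In this toy case the predicted answer names the right object but the wrong part, which mirrors the failure pattern the abstract reports: object selection succeeds while part-level and affordance-level grounding do not.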
