CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

Abstract

Recent advances in large language models have led to strong performance on reasoning and environment-interaction tasks, yet their capacity for creative problem-solving remains underexplored. We study this capability through the lens of creative tool use, where a model repurposes available objects by reasoning about their affordances and attributes rather than relying on canonical usage. As a first step, we introduce CreativityBench, a benchmark for evaluating affordance-based creativity in LLMs. To this end, we build a large-scale affordance knowledge base (KB) with 4K entities and 150K+ affordance annotations, explicitly linking objects, parts, attributes, and actionable uses. Building on this KB, we generate 14K grounded tasks that require identifying non-obvious yet physically plausible solutions under constraints. Evaluations across 10 state-of-the-art LLMs, including closed and open-source models, show that models can often select a plausible object, but fail to identify the correct parts, their affordances, and the underlying physical mechanism needed to solve the task, leading to a significant drop in performance. Furthermore, improvements from model scaling quickly saturate, strong general reasoning does not reliably translate to creative affordance discovery, and common inference-time strategies such as Chain-of-Thought yield limited gains. These results suggest that creative tool use remains a major challenge for current models, and that CreativityBench provides a useful testbed for studying this missing dimension of intelligence, with potential implications for planning and reasoning modules in future agents.
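
To make the idea of a KB that links objects, parts, attributes, and actionable uses more concrete, the following is a minimal sketch of how such a structure could be queried for a non-canonical tool. It is not the benchmark's actual schema or data: the class names (`Entity`, `Part`), the function `find_repurposable`, the affordance labels, and the three toy entries are all assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """A part of an object, with its attributes and the uses it affords (hypothetical schema)."""
    name: str
    attributes: frozenset[str]
    affordances: frozenset[str]


@dataclass(frozen=True)
class Entity:
    """An object with one canonical use and a tuple of parts (hypothetical schema)."""
    name: str
    canonical_use: str
    parts: tuple[Part, ...]


def find_repurposable(kb, needed_affordance, unavailable=frozenset()):
    """Return (entity, part) name pairs that afford the needed use non-canonically.

    Entities listed in `unavailable` model a task constraint such as
    "the usual tool is missing". Entities whose canonical use already is the
    needed affordance are skipped, since using them is not a repurposing.
    """
    matches = []
    for entity in kb:
        if entity.name in unavailable:
            continue
        if entity.canonical_use == needed_affordance:
            continue
        for part in entity.parts:
            if needed_affordance in part.affordances:
                matches.append((entity.name, part.name))
    return matches


# Toy entries made up for this sketch; they are not taken from CreativityBench.
TOY_KB = [
    Entity("screwdriver", "turn_screw",
           (Part("tip", frozenset({"rigid", "flat"}), frozenset({"turn_screw", "pry"})),)),
    Entity("coin", "pay",
           (Part("edge", frozenset({"rigid", "thin"}), frozenset({"turn_screw", "scrape"})),)),
    Entity("sponge", "wipe",
           (Part("body", frozenset({"soft", "porous"}), frozenset({"absorb"})),)),
]

if __name__ == "__main__":
    # The screwdriver is unavailable, so only the coin's edge qualifies.
    print(find_repurposable(TOY_KB, "turn_screw", unavailable={"screwdriver"}))
```

In this toy setup, answering correctly requires naming not just the object (the coin) but the specific part (its edge) and the attributes that make the use physically plausible, which mirrors the distinction the abstract draws between selecting a plausible object and identifying the right part and mechanism.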
