Advancing Creative Physical Intelligence in Large Multimodal Models

Abstract

Large multimodal models (LMMs) have advanced rapidly in perception and reasoning, but it remains unclear whether these capabilities generalize beyond pattern recognition to discovering visually grounded solutions in open-ended environments. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, yet it remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image together with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not because they lack generative capability, but because they fail to sustain grounded exploration: they overlook relevant entities, under-examine critical parts, or hallucinate attributes that are not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization (DPO), we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, along with substantial reductions in hallucination and grounding-related errors.
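For readers unfamiliar with Direct Preference Optimization, the following is a minimal sketch of the standard per-pair DPO objective, under the assumption that the "chosen" response is a visually grounded attribute-affordance explanation and the "rejected" response is a hallucinated one. It is not the authors' implementation: the function names, the value of beta, and the log-probability values are all hypothetical and chosen only for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical value; the abstract does not state one
) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # How much more likely each response is under the policy than the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


# Hypothetical log-probabilities for one (grounded, hallucinated) pair.
untrained = dpo_loss(-14.0, -13.0, -14.0, -13.0)  # policy equals reference
improved = dpo_loss(-12.0, -15.0, -14.0, -13.0)   # policy now favors grounded answer
```

When the policy matches the reference model, the margin is zero and the loss equals log 2. As the policy shifts probability toward the grounded response and away from the hallucinated one, the loss falls below that value, which is the behavior the preference-learning framing relies on.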