Advancing Creative Physical Intelligence in Large Multimodal Models

Abstract

Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize beyond pattern recognition to discovering visually grounded solutions in open-ended environments. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not for lack of generative capability, but because they fail to sustain grounded exploration: they overlook relevant entities, under-examine critical parts, or hallucinate attributes that are not present in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization (DPO), we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.
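For readers unfamiliar with the preference objective mentioned above, the following is a minimal sketch of the standard per-pair Direct Preference Optimization loss, assuming the "chosen" response is a visually grounded attribute-affordance rationale and the "rejected" response is a hallucinated one. The abstract does not specify how preference pairs are constructed or which hyperparameters are used, so the function name `dpo_loss`, the argument names, the default `beta`, and the example log-probabilities are all hypothetical and purely illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities: the policy already favors the grounded
# (chosen) rationale more strongly than the reference model does.
loss = dpo_loss(policy_chosen_logp=-12.0, policy_rejected_logp=-15.0,
                ref_chosen_logp=-13.0, ref_rejected_logp=-13.5)
print(f"DPO loss: {loss:.4f}")
```

When the policy and the reference model assign identical log-probabilities, the loss equals log 2; it falls below that value as the policy shifts probability mass toward the grounded response relative to the reference.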