Advancing Creative Physical Intelligence in Large Multimodal Models

Abstract

Large multimodal models (LMMs) have advanced rapidly in perception and reasoning; however, it remains unclear whether these capabilities generalize beyond pattern recognition to discovering visually grounded solutions in open-ended environments. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not for lack of generative capability, but because they do not sustain grounded exploration. Models frequently overlook relevant entities, under-examine critical parts, or hallucinate attributes that are not present in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization (DPO), we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.
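
To make the preference-learning framing concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair, where the "chosen" response stands for visually grounded attribute-affordance reasoning and the "rejected" response stands for a hallucinated alternative. The function name `dpo_loss`, the `beta` default, and the log-probability values are illustrative assumptions; the abstract does not specify the training setup, so this should not be read as the authors' implementation.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy and a
    frozen reference model. beta (assumed value) scales the implicit reward.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), evaluated in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
grounded_pair = dpo_loss(-4.0, -9.0, -6.0, -6.0)      # policy prefers grounded answer
hallucinated_pair = dpo_loss(-9.0, -4.0, -6.0, -6.0)  # policy prefers hallucination
print(grounded_pair < hallucinated_pair)
```

The loss equals log 2 when the policy and reference agree on both responses, falls below it as the policy shifts probability toward the grounded response, and rises above it when the policy favors the hallucinated one.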