Advancing Creative Physical Intelligence in Large Multimodal Models

Abstract

Large multimodal models (LMMs) have advanced rapidly in perception and reasoning, yet it remains unclear whether these capabilities generalize beyond pattern recognition to discovering visually grounded solutions in open-ended environments. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not for lack of generative capability, but because they fail to sustain grounded exploration: they overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by these failure modes, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization (DPO), we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, along with substantially fewer hallucination and grounding-related errors.
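To make the preference-learning framing concrete, the following is a minimal sketch of the standard per-pair DPO loss, not the authors' implementation. It assumes the "chosen" response is a visually grounded attribute-affordance rationale and the "rejected" response is a hallucinated alternative, and that summed sequence log-probabilities under the trained policy and a frozen reference model are already available. The function name, argument names, and the default beta value are hypothetical.

```python
import math


def dpo_loss(policy_logp_chosen: float,
             policy_logp_rejected: float,
             ref_logp_chosen: float,
             ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Assumption: "chosen" is the grounded rationale and "rejected" is the
    hallucinated one; inputs are summed token log-probabilities.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy favors the grounded response more than the reference does.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss falls below log(2) once the policy raises the grounded response relative to the hallucinated one by more than the reference model does, which is the behavior the alignment objective rewards.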