AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

Abstract

Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or handle fine-grained affordance reasoning in isolated steps, and so lack coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task, Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage that prompts the MLLM to select the most informative viewpoint for the instruction, then proceeds with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance using only 3D point cloud input and MLLMs, demonstrating strong generalization and physically grounded reasoning.
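As an illustration only, the structured triplet and the projection of 3D element candidates into a rendered view might be sketched as follows. The class and function names, the use of a 3D centroid as a stand-in for spatial location, the motion-type labels, and the simple pinhole camera model are all assumptions made for this sketch; none of them are taken from the AffordBot implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


@dataclass(frozen=True)
class AffordanceTriplet:
    """One prediction per referenced affordance element (hypothetical field names)."""
    location: Vec3     # assumed stand-in for spatial location: a 3D centroid
    motion_type: str   # assumed labels, e.g. "translation" or "rotation"
    motion_axis: Vec3  # direction of the motion in scene coordinates


def project_point(point_cam: Vec3, focal: float,
                  width: int, height: int) -> Optional[Pixel]:
    """Pinhole projection of a camera-frame point (z forward) to pixel coordinates.

    Returns None when the point is behind the camera or outside the image.
    """
    x, y, z = point_cam
    if z <= 0.0:
        return None
    u = focal * x / z + width / 2.0
    v = focal * y / z + height / 2.0
    if 0.0 <= u < width and 0.0 <= v < height:
        return (u, v)
    return None


def visible_candidates(candidates: Dict[int, Vec3], focal: float,
                       width: int, height: int) -> Dict[int, Pixel]:
    """Map candidate id -> pixel position for candidates visible in one view."""
    visible = {}
    for cand_id, center_cam in candidates.items():
        pixel = project_point(center_cam, focal, width, height)
        if pixel is not None:
            visible[cand_id] = pixel
    return visible
```

In a sketch like this, the per-view dictionaries of visible candidates are the kind of information a viewpoint-selection step could compare before choosing which rendered image to reason over.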
