Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Abstract

This paper explores agentic 3D spatial understanding, i.e., MLLM agents that perform 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences in 3D scenarios, so the agentic paradigm yields only marginal gains over non-agentic strategies. We show that 3D spatial reasoning tasks are heterogeneous across scenes, yet these agents apply one uniform tool-use strategy to every scene instead of selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory in a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, and failed ones are attached to the skill as lessons. During training, when a similar scene recurs, the corresponding skill is injected to guide the agent; the resulting trajectories, through their successes and failures, further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training on skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.
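The record-distill-inject loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the class names (`Trajectory`, `Skill`, `SceneMemory`), the string scene labels, and the tool names are all hypothetical, and the distillation step is replaced here by a simple frequency count over successful tool sequences, whereas the abstract does not specify how distillation is actually performed.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    scene: str      # scene label from a hypothetical scene identifier
    tools: tuple    # ordered tool calls the agent made
    success: bool   # whether the final answer was judged correct


@dataclass
class Skill:
    scene: str
    recommended_tools: tuple = ()
    lessons: list = field(default_factory=list)  # tool sequences that failed


class SceneMemory:
    """Toy scene memory: stores trajectories per scene and re-distills a skill."""

    def __init__(self):
        self.trajectories = defaultdict(list)
        self.skills = {}

    def record(self, traj):
        # Every new trajectory updates the memory and refreshes the skill,
        # which is the co-evolution loop in miniature.
        self.trajectories[traj.scene].append(traj)
        self.skills[traj.scene] = self._distill(traj.scene)

    def _distill(self, scene):
        trajs = self.trajectories[scene]
        # Stand-in for distillation (assumption): keep the most frequent
        # successful tool sequence for this scene.
        wins = Counter(t.tools for t in trajs if t.success)
        best = wins.most_common(1)[0][0] if wins else ()
        # Failed sequences are attached as lessons, without repeats.
        lessons = []
        for t in trajs:
            if not t.success and t.tools not in lessons:
                lessons.append(t.tools)
        return Skill(scene, best, lessons)

    def retrieve(self, scene):
        # Returns the skill to inject when a similar scene recurs, or None.
        return self.skills.get(scene)


if __name__ == "__main__":
    memory = SceneMemory()
    memory.record(Trajectory("room_size", ("depth_estimator",), False))
    memory.record(Trajectory("room_size", ("depth_estimator", "point_cloud"), True))
    print(memory.retrieve("room_size"))
```

In this sketch scene similarity is reduced to exact label matching; a more faithful version would presumably compare scenes with some learned or prompted similarity measure, which the abstract leaves unspecified.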