Spatial Knowledge Graph-Guided Multimodal Synthesis

Abstract

Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, spatial perception remains a notable limitation. Multimodal data synthesis offers a promising way to address this, yet ensuring that synthesized data adhere to spatial common sense is a non-trivial task. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs and grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, and then uses it to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.
