

Spatial Knowledge Graph-Guided Multimodal Synthesis

May 28, 2025
Authors: Yida Xue, Zhen Bi, Jinnan Yang, Jungang Lou, Huajun Chen, Ningyu Zhang
cs.AI

Abstract

Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, which is subsequently utilized to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.
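The knowledge-to-data idea described above can be illustrated with a toy sketch: encode pairwise spatial relations (direction and distance) as a small graph, then convert each edge into a question/answer pair whose answer is guaranteed consistent with the graph. All names and the QA templates below are hypothetical; the paper's actual SKG construction and multimodal synthesis pipeline may differ substantially.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpatialRelation:
    source: str      # object the relation is stated from
    target: str      # object the relation points to
    direction: str   # e.g. "left of", "behind"
    distance: str    # coarse bucket, e.g. "near", "far from"

def build_skg(relations):
    """Index relations by source object (a toy adjacency list)."""
    graph = {}
    for r in relations:
        graph.setdefault(r.source, []).append(r)
    return graph

def synthesize_qa(graph):
    """Turn every edge into a QA pair consistent with the graph,
    so synthesized data cannot contradict spatial common sense."""
    qa_pairs = []
    for edges in graph.values():
        for r in edges:
            q = f"Where is the {r.source} relative to the {r.target}?"
            a = f"The {r.source} is {r.direction} the {r.target}, {r.distance} it."
            qa_pairs.append((q, a))
    return qa_pairs

skg = build_skg([
    SpatialRelation("cup", "laptop", "left of", "near"),
    SpatialRelation("lamp", "sofa", "behind", "far from"),
])
for q, a in synthesize_qa(skg):
    print(q, "->", a)
```

Because every answer is derived mechanically from a graph edge, the spatial consistency of the synthesized text is enforced by construction rather than checked after the fact, which is the core appeal of graph-guided synthesis.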

