ChatPaper.ai

Spatial Knowledge Graph-Guided Multimodal Synthesis

May 28, 2025
Authors: Yida Xue, Zhen Bi, Jinnan Yang, Jungang Lou, Huajun Chen, Ningyu Zhang
cs.AI

Abstract

Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, which is subsequently utilized to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.
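As an illustrative sketch only (the paper does not publish this code), the core idea of a Spatial Knowledge Graph can be pictured as extracting direction and distance relations between objects in a scene as (subject, relation, object, distance) triples. The function name, the toy scene, and the dominant-axis heuristic for choosing a direction label are all assumptions for illustration, not the authors' actual construction:

```python
import math

def build_skg(objects):
    """Build a toy spatial knowledge graph as (subject, relation, object, distance) triples.

    `objects` maps a name to an (x, y) image-plane center, with x growing
    rightward and y growing downward, as in standard image coordinates.
    """
    triples = []
    names = sorted(objects)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ax, ay = objects[a]
            bx, by = objects[b]
            dx, dy = bx - ax, by - ay
            # Direction: label by the dominant axis between the two centers.
            if abs(dx) >= abs(dy):
                rel = "left_of" if dx > 0 else "right_of"
            else:
                rel = "above" if dy > 0 else "below"
            # Distance: Euclidean distance between centers, one decimal place.
            dist = round(math.hypot(dx, dy), 1)
            triples.append((a, rel, b, dist))
    return triples

# Hypothetical scene: object centers in pixel coordinates.
scene = {"cat": (40, 120), "lamp": (200, 30), "sofa": (60, 125)}
for triple in build_skg(scene):
    print(triple)
```

Triples like these could then condition a generator so that synthesized captions or question–answer pairs stay consistent with the scene's spatial common sense, which is the knowledge-to-data direction the abstract describes.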

