Mario: Multimodal Graph Reasoning with Large Language Models
March 5, 2026
Authors: Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan
cs.AI
Abstract
Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning over such heterogeneous multimodal signals while preserving graph topology raises two key challenges: weak cross-modal consistency and heterogeneous modality preference. To address them, we propose Mario, a unified framework that tackles both challenges simultaneously and enables effective LLM-based reasoning over MMGs. Mario consists of two stages. The first is a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. The second is a modality-adaptive graph instruction tuning mechanism that organizes the aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
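To make the two stages more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes one plausible reading of the abstract: a symmetric InfoNCE-style text-image loss in which a node's positives are itself and its graph neighbors, and a linear softmax router that scores three hypothetical modality views (text-only, image-only, text plus image). The function names, the temperature value, the choice of positives, and the set of views are all illustrative assumptions. Only the forward pass is shown, with no training loop.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(z, axis=-1):
    """Numerically stable log-softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def graph_guided_contrastive_loss(text_emb, image_emb, adj, temperature=0.1):
    """Symmetric InfoNCE-style loss with topology-defined positives.

    Assumption (not from the paper): for node i, the positives are the
    embeddings of node i itself and of its graph neighbors in the other
    modality; all remaining nodes act as negatives.
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(image_emb)
    logits = t @ v.T / temperature  # (n, n) text-to-image similarities

    # Symmetric positive mask: self-pairs plus undirected edges.
    pos = ((adj + adj.T + np.eye(len(adj))) > 0).astype(float)

    def one_direction(lg):
        log_prob = log_softmax(lg, axis=1)
        # Average log-likelihood over each node's positives.
        return -(pos * log_prob).sum(axis=1) / pos.sum(axis=1)

    text_to_image = one_direction(logits).mean()
    image_to_text = one_direction(logits.T).mean()
    return 0.5 * (text_to_image + image_to_text)


def route_modalities(node_feats, router_weights):
    """Linear softmax router over hypothetical modality views.

    Returns per-node probabilities over the views and the index of the
    highest-scoring view for each node.
    """
    scores = node_feats @ router_weights  # (n, n_views)
    probs = np.exp(log_softmax(scores, axis=1))
    return probs, probs.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4, 8
    text_emb = rng.normal(size=(n, d))
    image_emb = rng.normal(size=(n, d))
    # Toy graph with two edges: 0-1 and 2-3.
    adj = np.array([[0, 1, 0, 0],
                    [1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]])
    print("loss:", graph_guided_contrastive_loss(text_emb, image_emb, adj))

    # Three hypothetical views: text-only, image-only, text+image.
    router_weights = rng.normal(size=(d, 3))
    probs, choice = route_modalities(text_emb, router_weights)
    print("chosen view per node:", choice)
```

In this sketch the adjacency matrix only decides which pairs count as positives. A graph-conditioned encoder as described in the abstract would presumably also use topology when computing the features themselves, which is omitted here.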