MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
June 20, 2025
Authors: Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal
cs.AI
Abstract
Combining pre-trained expert models offers substantial potential for scalable
multimodal reasoning, but building a unified framework remains challenging due
to the increasing diversity of input modalities and task complexity. For
instance, medical diagnosis requires precise reasoning over structured clinical
tables, while financial forecasting depends on interpreting plot-based data to
make informed predictions. To tackle this challenge, we introduce MEXA, a
training-free framework that performs modality- and task-aware aggregation of
multiple expert models to enable effective multimodal reasoning across diverse
and distinct domains. MEXA dynamically selects expert models based on the input
modality and the task-specific reasoning demands (i.e., skills). Each expert
model, specialized in a modality-task pair, generates interpretable textual
reasoning outputs. MEXA then aggregates and reasons over these outputs using a
Large Reasoning Model (LRM) to produce the final answer. This modular design
allows flexible and transparent multimodal reasoning across diverse domains
without additional training overhead. We extensively evaluate our approach on
diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D
Understanding, and Medical QA. MEXA consistently delivers performance
improvements over strong multimodal baselines, highlighting the effectiveness
and broad applicability of our expert-driven selection and aggregation in
diverse multimodal reasoning tasks.
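Since the abstract describes MEXA as a training-free, modular pipeline (modality- and task-aware expert selection, per-expert textual reasoning, then aggregation by a Large Reasoning Model), a minimal sketch of that flow might look like the following. All names here (EXPERTS, select_experts, call_lrm, and the specific modality/skill keys) are illustrative assumptions, not the authors' implementation.

from typing import Callable, Dict, List, Tuple

# Hypothetical registry mapping (modality, skill) pairs to expert callables
# that return free-form textual reasoning about the input. Real experts would
# be pretrained models (video QA, ASR, 3D captioning, table readers, ...).
ExpertFn = Callable[[dict], str]
EXPERTS: Dict[Tuple[str, str], ExpertFn] = {
    ("video", "temporal_reasoning"): lambda x: "The clip shows ...",
    ("audio", "speech_understanding"): lambda x: "The speaker says ...",
    ("3d", "spatial_grounding"): lambda x: "The chair is left of the table ...",
}


def call_lrm(prompt: str) -> str:
    """Placeholder for a call to a Large Reasoning Model (LRM)."""
    return "final answer (stub)"


def select_experts(modalities: List[str], skills: List[str]) -> List[ExpertFn]:
    """Modality- and task-aware selection: keep experts matching the query."""
    return [fn for (m, s), fn in EXPERTS.items() if m in modalities and s in skills]


def mexa_answer(question: str, inputs: dict,
                modalities: List[str], skills: List[str]) -> str:
    # 1) Dynamically select experts for the input modalities and required skills.
    experts = select_experts(modalities, skills)
    # 2) Each selected expert emits interpretable textual reasoning.
    evidence = [fn(inputs) for fn in experts]
    # 3) Aggregate the textual evidence with an LRM to produce the final answer.
    prompt = (
        f"Question: {question}\n"
        + "\n".join(f"Expert evidence {i + 1}: {e}" for i, e in enumerate(evidence))
        + "\nUse the evidence above to answer the question."
    )
    return call_lrm(prompt)

Because selection and aggregation operate only on textual expert outputs, a pipeline of this shape needs no additional training when experts are added or swapped, which is the flexibility the abstract emphasizes.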