MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
June 20, 2025
Authors: Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal
cs.AI
Abstract
Combining pre-trained expert models offers substantial potential for scalable
multimodal reasoning, but building a unified framework remains challenging due
to the increasing diversity of input modalities and task complexity. For
instance, medical diagnosis requires precise reasoning over structured clinical
tables, while financial forecasting depends on interpreting plot-based data to
make informed predictions. To tackle this challenge, we introduce MEXA, a
training-free framework that performs modality- and task-aware aggregation of
multiple expert models to enable effective multimodal reasoning across diverse
and distinct domains. MEXA dynamically selects expert models based on the input
modality and the task-specific reasoning demands (i.e., skills). Each expert
model, specialized in a modality-task pair, generates interpretable textual
reasoning outputs. MEXA then aggregates and reasons over these outputs using a
Large Reasoning Model (LRM) to produce the final answer. This modular design
allows flexible and transparent multimodal reasoning across diverse domains
without additional training overhead. We extensively evaluate our approach on
diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D
Understanding, and Medical QA. MEXA consistently delivers performance
improvements over strong multimodal baselines, highlighting the effectiveness
and broad applicability of our expert-driven selection and aggregation in
diverse multimodal reasoning tasks.
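Since the abstract describes MEXA as a training-free, modular pipeline (modality- and task-aware expert selection, per-expert textual reasoning, then aggregation by a Large Reasoning Model), a minimal sketch of that flow might look like the following. All names here (EXPERTS, select_experts, call_lrm, and the specific modality/skill keys) are illustrative assumptions, not the authors' implementation.

from typing import Callable, Dict, List, Tuple

# Hypothetical registry mapping (modality, skill) pairs to expert callables
# that return free-form textual reasoning about the input. Real experts would
# be pretrained models (video QA, ASR, 3D captioning, table readers, ...).
ExpertFn = Callable[[dict], str]
EXPERTS: Dict[Tuple[str, str], ExpertFn] = {
    ("video", "temporal_reasoning"): lambda x: "The clip shows ...",
    ("audio", "speech_understanding"): lambda x: "The speaker says ...",
    ("3d", "spatial_grounding"): lambda x: "The chair is left of the table ...",
}


def call_lrm(prompt: str) -> str:
    """Placeholder for a call to a Large Reasoning Model (LRM)."""
    return "final answer (stub)"


def select_experts(modalities: List[str], skills: List[str]) -> List[ExpertFn]:
    """Modality- and task-aware selection: keep experts matching the query."""
    return [fn for (m, s), fn in EXPERTS.items() if m in modalities and s in skills]


def mexa_answer(question: str, inputs: dict,
                modalities: List[str], skills: List[str]) -> str:
    # 1) Dynamically select experts for the input modalities and required skills.
    experts = select_experts(modalities, skills)
    # 2) Each selected expert emits interpretable textual reasoning.
    evidence = [fn(inputs) for fn in experts]
    # 3) Aggregate the textual evidence with an LRM to produce the final answer.
    prompt = (
        f"Question: {question}\n"
        + "\n".join(f"Expert evidence {i + 1}: {e}" for i, e in enumerate(evidence))
        + "\nUse the evidence above to answer the question."
    )
    return call_lrm(prompt)

Because selection and aggregation operate only on textual expert outputs, a pipeline of this shape needs no additional training when experts are added or swapped, which is the flexibility the abstract emphasizes.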