MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

June 20, 2025
Authors: Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal
cs.AI

Abstract

Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality-task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation across diverse multimodal reasoning tasks.
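
As a rough illustration of the pipeline the abstract describes (modality- and task-aware expert selection, textual reasoning from each expert, then LRM-based aggregation), the Python sketch below shows one way such a training-free loop could be wired together. All names here (`Expert`, `select_experts`, `answer`, the callables for experts and the LRM) are hypothetical illustrations, not the authors' released API.

```python
# Minimal, hypothetical sketch of a MEXA-style pipeline:
# 1) pick experts matching the input modality and required skill,
# 2) collect each expert's textual reasoning,
# 3) let a large reasoning model (LRM) aggregate the texts into a final answer.
# Class/function names are illustrative assumptions, not the paper's code.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Expert:
    name: str
    modality: str                     # e.g. "video", "audio", "3d", "table"
    skill: str                        # e.g. "temporal_reasoning", "medical_qa"
    run: Callable[[Dict, str], str]   # (raw_input, question) -> textual reasoning


def select_experts(experts: List[Expert], modality: str, skill: str) -> List[Expert]:
    """Modality- and task-aware selection: keep only experts matching the input."""
    return [e for e in experts if e.modality == modality and e.skill == skill]


def answer(question: str, raw_input: Dict, modality: str, skill: str,
           experts: List[Expert], lrm: Callable[[str], str]) -> str:
    """Training-free aggregation: experts emit text, the LRM reasons over it."""
    chosen = select_experts(experts, modality, skill)
    reports = [f"[{e.name}] {e.run(raw_input, question)}" for e in chosen]
    prompt = (
        "Question: " + question + "\n\n"
        "Expert reasoning:\n" + "\n".join(reports) + "\n\n"
        "Aggregate the expert reasoning above and give the final answer."
    )
    return lrm(prompt)
```

In this sketch both the per-modality experts and the aggregating LRM are plain callables supplied by the user, which mirrors the modular, training-free design claimed in the abstract: swapping in a new expert or a different reasoning model requires no retraining of the framework itself.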