MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

Abstract

Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality-task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
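
The select-then-aggregate flow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Expert` record, the set-based lookup standing in for the dynamic selection step, the prompt layout, and the stub experts and reasoner. In a real system the experts would be pre-trained models and the reasoner would be an LRM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Expert:
    """Hypothetical expert specialized in one (modality, skill) pair."""
    name: str
    modality: str
    skill: str
    run: Callable[[str], str]  # maps a raw input to a textual reasoning output


def select_experts(experts: List[Expert], modalities: Set[str], skills: Set[str]) -> List[Expert]:
    """Keep experts whose modality is present and whose skill is required.

    Assumption: a simple set lookup stands in for the selection step.
    """
    return [e for e in experts if e.modality in modalities and e.skill in skills]


def build_prompt(question: str, outputs: List[Tuple[str, str]]) -> str:
    """Concatenate the experts' textual outputs into one prompt for the reasoner."""
    lines = [f"[{name}] {text}" for name, text in outputs]
    return "\n".join([f"Question: {question}", *lines])


def answer(
    question: str,
    inputs: Dict[str, str],
    skills: Set[str],
    experts: List[Expert],
    reasoner: Callable[[str], str],
) -> str:
    """Select experts, run them on their modality's input, then call the reasoner."""
    chosen = select_experts(experts, set(inputs), skills)
    outputs = [(e.name, e.run(inputs[e.modality])) for e in chosen]
    return reasoner(build_prompt(question, outputs))


# Stub experts: each returns a placeholder string instead of calling a model.
EXPERTS = [
    Expert("video_caption", "video", "temporal", lambda x: f"video summary of {x}"),
    Expert("audio_transcript", "audio", "speech", lambda x: f"transcript of {x}"),
    Expert("table_reader", "table", "numeric", lambda x: f"table facts from {x}"),
]


def echo_reasoner(prompt: str) -> str:
    """Stand-in for the LRM: returns the prompt it received."""
    return prompt


result = answer(
    "What happens after the alarm?",
    {"video": "clip.mp4", "audio": "clip.wav"},
    {"temporal", "speech"},
    EXPERTS,
    echo_reasoner,
)
```

Because every expert emits plain text, the aggregation step reduces to prompt construction, and adding an expert amounts to appending one record to the list, with no retraining involved.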
