MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

Abstract

Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging because input modalities are increasingly diverse and tasks increasingly complex. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality-task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently improves performance over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
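The select-then-aggregate flow described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `Query` type, the `(modality, skill)` registry lookup, the stub experts, and the string-joining aggregator that stands in for the LRM. In particular, the abstract does not say how experts are selected, so the dictionary lookup below is only an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: an expert maps (input payload, question) to textual reasoning.
Expert = Callable[[str, str], str]
Key = Tuple[str, str]  # (modality, skill)


@dataclass(frozen=True)
class Query:
    question: str
    inputs: Dict[str, str]   # modality name -> payload (placeholder strings here)
    skills: Tuple[str, ...]  # task-specific reasoning demands


def select_experts(query: Query, registry: Dict[Key, Expert]) -> List[Tuple[Key, Expert]]:
    """Return every registered expert whose (modality, skill) pair matches the query."""
    selected = []
    for modality in query.inputs:
        for skill in query.skills:
            expert = registry.get((modality, skill))
            if expert is not None:
                selected.append(((modality, skill), expert))
    return selected


def answer(query: Query, registry: Dict[Key, Expert],
           aggregator: Callable[[str, List[str]], str]) -> str:
    """Run the selected experts, then hand their textual outputs to the aggregator."""
    reports = [
        f"[{modality}/{skill}] {expert(query.inputs[modality], query.question)}"
        for (modality, skill), expert in select_experts(query, registry)
    ]
    return aggregator(query.question, reports)


# Stub experts (assumptions): real experts would be pre-trained models.
def video_temporal_expert(payload: str, question: str) -> str:
    return f"video events in {payload}"


def audio_speech_expert(payload: str, question: str) -> str:
    return f"transcript of {payload}"


def stub_aggregator(question: str, reports: List[str]) -> str:
    # Stand-in for the LRM: concatenates the reports instead of reasoning over them.
    return f"Q: {question} | " + " ; ".join(reports)


REGISTRY: Dict[Key, Expert] = {
    ("video", "temporal"): video_temporal_expert,
    ("audio", "speech"): audio_speech_expert,
}
```

Because the experts exchange plain text, adding a new modality or skill in this sketch only requires registering another function under a new `(modality, skill)` key; nothing is retrained.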
