
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

February 27, 2025
Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou
cs.AI

Abstract

In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the powerful reasoning capabilities of large language models (LLMs), deterring LMMs' performance on challenging downstream tasks. This weakness has recently been mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides the rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of a multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge this gap, we propose a novel and efficient method, "Re-Routing in Test-Time" (R2-T2), that locally optimizes the vector of routing weights at test time by moving it toward the vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.
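To make the neighborhood-based idea concrete, here is a minimal sketch of one plausible re-routing step: shift a test sample's routing-weight vector toward a kernel-weighted average of the routing weights of its nearest correctly predicted reference samples. This is only an illustration of the mechanism the abstract describes; the function name, the cosine-distance/Gaussian-kernel choices, and parameters such as `k`, `step_size`, and `temperature` are assumptions, not the paper's actual strategies or implementation.

```python
import numpy as np

def re_route_test_time(test_emb, test_routing, ref_embs, ref_routings,
                       k=5, step_size=0.5, temperature=0.1):
    """Sketch of neighborhood-based test-time re-routing (illustrative only).

    Moves the test sample's routing-weight vector toward the routing weights
    of nearby reference samples that were predicted correctly. All argument
    names and defaults are hypothetical, not taken from the paper.
    """
    # Cosine distances from the test sample to each correctly predicted reference.
    ref_norm = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    test_norm = test_emb / np.linalg.norm(test_emb)
    dists = 1.0 - ref_norm @ test_norm

    # Keep the k nearest neighbors and weight them with a kernel over distance.
    nn = np.argsort(dists)[:k]
    kernel = np.exp(-dists[nn] / temperature)
    kernel /= kernel.sum()

    # Kernel-weighted average of the neighbors' routing-weight vectors.
    target = kernel @ ref_routings[nn]

    # Interpolate the original routing weights toward the neighborhood target,
    # then renormalize so the weights still sum to 1 over the experts.
    new_routing = (1.0 - step_size) * test_routing + step_size * target
    return new_routing / new_routing.sum()
```

In this sketch, `ref_embs` and `ref_routings` would come from a held-out set of correctly predicted samples, and only the routing vector is updated; no base-model parameters change, consistent with the training-free setting described in the abstract.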

