Is Extending Modality The Right Path Towards Omni-Modality?
June 2, 2025
Authors: Tinghui Zhu, Kai Zhang, Muhao Chen, Yu Su
cs.AI
Abstract
Omni-modal language models (OLMs) aim to integrate and reason over diverse
input modalities--such as text, images, video, and audio--while maintaining
strong language capabilities. Despite recent advancements, existing models,
especially open-source ones, remain far from true omni-modality, struggling to
generalize beyond the specific modality pairs they are trained on or to achieve
strong performance when processing multi-modal inputs. We study the effect of
extending modality, the dominant technique for training multimodal models,
where an off-the-shelf language model is fine-tuned on target-domain and
language data. Specifically, we investigate three key questions: (1) Does
modality extension compromise core language abilities? (2) Can model merging
effectively integrate independently fine-tuned modality-specific models to
achieve omni-modality? (3) Does omni-modality extension lead to better
knowledge sharing and generalization compared to sequential extension? Through
extensive experiments, we analyze these trade-offs and provide insights into
the feasibility of achieving true omni-modality using current approaches.
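As an illustration of what "model merging" in question (2) can mean in practice, the sketch below averages the parameters of independently fine-tuned, architecture-compatible checkpoints (e.g., modality-specific models derived from the same base language model). This is a minimal, generic weight-averaging sketch, not the paper's specific procedure; the function name, checkpoint paths, and uniform weighting are illustrative assumptions.

```python
# Minimal sketch of parameter-space model merging via weight averaging.
# Assumes all checkpoints share the same architecture and parameter names;
# paths and weights below are hypothetical, not from the paper.
import torch


def merge_state_dicts(state_dicts, weights=None):
    """Return a weighted average of several compatible state dicts."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged


# Usage (hypothetical checkpoint files for two modality-specific models):
# vision_sd = torch.load("vision_model.pt", map_location="cpu")
# audio_sd = torch.load("audio_model.pt", map_location="cpu")
# merged_sd = merge_state_dicts([vision_sd, audio_sd])
```

More elaborate merging schemes (e.g., per-layer or task-vector weighting) follow the same pattern of combining parameters from separately fine-tuned models without further joint training.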