
Is Extending Modality The Right Path Towards Omni-Modality?

June 2, 2025
Authors: Tinghui Zhu, Kai Zhang, Muhao Chen, Yu Su
cs.AI

Abstract

Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities--such as text, images, video, and audio--while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.
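To make research question (2) concrete, the sketch below shows one common form of model merging: parameter-wise linear interpolation of two independently fine-tuned checkpoints that share the same base architecture. This is a minimal illustration assuming PyTorch; the checkpoint file names and the interpolation weight `alpha` are hypothetical and this is not necessarily the merging procedure evaluated in the paper.

```python
# Minimal sketch of merging two modality-specific fine-tuned models by
# linearly interpolating their weights. Assumes both checkpoints come from
# the same base language model, so their state dicts share parameter names.
import torch


def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Return a parameter-wise interpolation: alpha * A + (1 - alpha) * B."""
    merged = {}
    for name, param_a in sd_a.items():
        param_b = sd_b[name]
        merged[name] = alpha * param_a + (1.0 - alpha) * param_b
    return merged


if __name__ == "__main__":
    # Hypothetical checkpoints: the same base LM fine-tuned separately on
    # vision-text and audio-text data (paths are illustrative only).
    vision_sd = torch.load("vision_finetuned.pt", map_location="cpu")
    audio_sd = torch.load("audio_finetuned.pt", map_location="cpu")

    merged_sd = merge_state_dicts(vision_sd, audio_sd, alpha=0.5)
    torch.save(merged_sd, "merged_omni.pt")
```

Whether such a weight-space merge preserves both modalities' skills, rather than averaging them away, is exactly the kind of trade-off the paper's experiments examine.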