Is Extending Modality The Right Path Towards Omni-Modality?
June 2, 2025
Authors: Tinghui Zhu, Kai Zhang, Muhao Chen, Yu Su
cs.AI
Abstract
Omni-modal language models (OLMs) aim to integrate and reason over diverse
input modalities--such as text, images, video, and audio--while maintaining
strong language capabilities. Despite recent advancements, existing models,
especially open-source ones, remain far from true omni-modality, struggling to
generalize beyond the specific modality pairs they are trained on or to achieve
strong performance when processing multi-modal inputs. We study the effect of
extending modality, the dominant technique for training multimodal models,
where an off-the-shelf language model is fine-tuned on target-domain and
language data. Specifically, we investigate three key questions: (1) Does
modality extension compromise core language abilities? (2) Can model merging
effectively integrate independently fine-tuned modality-specific models to
achieve omni-modality? (3) Does omni-modality extension lead to better
knowledge sharing and generalization compared to sequential extension? Through
extensive experiments, we analyze these trade-offs and provide insights into
the feasibility of achieving true omni-modality using current approaches.
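As an illustration of what "model merging" in question (2) can mean in practice, the sketch below averages the parameters of independently fine-tuned, architecture-compatible checkpoints (e.g., modality-specific models derived from the same base language model). This is a minimal, generic weight-averaging sketch, not the paper's specific procedure; the function name, checkpoint paths, and uniform weighting are illustrative assumptions.

```python
# Minimal sketch of parameter-space model merging via weight averaging.
# Assumes all checkpoints share the same architecture and parameter names;
# paths and weights below are hypothetical, not from the paper.
import torch


def merge_state_dicts(state_dicts, weights=None):
    """Return a weighted average of several compatible state dicts."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged


# Usage (hypothetical checkpoint files for two modality-specific models):
# vision_sd = torch.load("vision_model.pt", map_location="cpu")
# audio_sd = torch.load("audio_model.pt", map_location="cpu")
# merged_sd = merge_state_dicts([vision_sd, audio_sd])
```

More elaborate merging schemes (e.g., per-layer or task-vector weighting) follow the same pattern of combining parameters from separately fine-tuned models without further joint training.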