
Is Extending Modality The Right Path Towards Omni-Modality?

June 2, 2025
Authors: Tinghui Zhu, Kai Zhang, Muhao Chen, Yu Su
cs.AI

Abstract

Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities--such as text, images, video, and audio--while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.
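The second question concerns weight-space model merging of independently fine-tuned, modality-specific models. As a rough, non-authoritative sketch of the general idea (not the specific merging method evaluated in the paper), the snippet below linearly interpolates the shared backbone parameters of two checkpoints derived from the same base language model; the file names, the `alpha` mixing weight, and the handling of non-overlapping keys are illustrative assumptions.

```python
# Minimal sketch of naive weight-space merging (simple averaging), assuming two
# modality-specific checkpoints that share the same base LM architecture.
# File names and the `alpha` mixing weight are illustrative, not from the paper.
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two state dicts where keys and shapes match."""
    merged = {}
    for key, tensor_a in sd_a.items():
        if key in sd_b and tensor_a.shape == sd_b[key].shape:
            # Shared backbone parameters: interpolate between the two models.
            merged[key] = alpha * tensor_a + (1.0 - alpha) * sd_b[key]
        else:
            # Modality-specific parameters (e.g., an extra encoder or projector):
            # keep them from the model that owns them.
            merged[key] = tensor_a
    # Parameters present only in model B are carried over unchanged.
    for key, tensor_b in sd_b.items():
        if key not in merged:
            merged[key] = tensor_b
    return merged

# Hypothetical usage: merge a vision-tuned and an audio-tuned variant of the same base LM.
vision_sd = torch.load("lm_vision_ft.pt", map_location="cpu")
audio_sd = torch.load("lm_audio_ft.pt", map_location="cpu")
torch.save(merge_state_dicts(vision_sd, audio_sd, alpha=0.5), "lm_merged.pt")
```

More elaborate merging schemes (e.g., task-vector arithmetic or Fisher-weighted averaging) follow the same basic pattern of combining checkpoints directly in parameter space rather than through additional training.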

