
Is Extending Modality The Right Path Towards Omni-Modality?

June 2, 2025
Authors: Tinghui Zhu, Kai Zhang, Muhao Chen, Yu Su
cs.AI

Abstract

Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities--such as text, images, video, and audio--while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.
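The second question concerns weight-space model merging of independently fine-tuned, modality-specific models. As a rough, non-authoritative sketch of the general idea (not the specific merging method evaluated in the paper), the snippet below linearly interpolates the shared backbone parameters of two checkpoints derived from the same base language model; the file names, the `alpha` mixing weight, and the handling of non-overlapping keys are illustrative assumptions.

```python
# Minimal sketch of naive weight-space merging (simple averaging), assuming two
# modality-specific checkpoints that share the same base LM architecture.
# File names and the `alpha` mixing weight are illustrative, not from the paper.
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two state dicts where keys and shapes match."""
    merged = {}
    for key, tensor_a in sd_a.items():
        if key in sd_b and tensor_a.shape == sd_b[key].shape:
            # Shared backbone parameters: interpolate between the two models.
            merged[key] = alpha * tensor_a + (1.0 - alpha) * sd_b[key]
        else:
            # Modality-specific parameters (e.g., an extra encoder or projector):
            # keep them from the model that owns them.
            merged[key] = tensor_a
    # Parameters present only in model B are carried over unchanged.
    for key, tensor_b in sd_b.items():
        if key not in merged:
            merged[key] = tensor_b
    return merged

# Hypothetical usage: merge a vision-tuned and an audio-tuned variant of the same base LM.
vision_sd = torch.load("lm_vision_ft.pt", map_location="cpu")
audio_sd = torch.load("lm_audio_ft.pt", map_location="cpu")
torch.save(merge_state_dicts(vision_sd, audio_sd, alpha=0.5), "lm_merged.pt")
```

More elaborate merging schemes (e.g., task-vector arithmetic or Fisher-weighted averaging) follow the same basic pattern of combining checkpoints directly in parameter space rather than through additional training.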

