X-Fusion: Introducing New Modality to Frozen Large Language Models

April 29, 2025
Authors: Sicheng Mo, Thao Nguyen, Xun Huang, Siddharth Srinivasan Iyer, Yijun Li, Yuchen Liu, Abhishek Tandon, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, Yuheng Li
cs.AI

Abstract

We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
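The abstract describes the dual-tower design only at a high level. Below is a minimal, illustrative PyTorch sketch of one dual-tower transformer layer consistent with that description: a frozen text tower and a trainable vision tower with modality-specific weights, both attending over the joint token sequence. All class names, dimensions, and layer choices here are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DualTowerBlock(nn.Module):
    """One layer of a hypothetical dual-tower stack: a frozen text
    tower (standing in for a pretrained LLM layer) and a trainable
    vision tower, each with its own modality-specific weights."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Text tower layer: placeholder for one pretrained LLM layer.
        self.text_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        for p in self.text_layer.parameters():
            p.requires_grad = False  # LLM parameters stay frozen

        # Vision tower layer: modality-specific, trainable weights.
        self.vision_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)

    def forward(self, text_h: torch.Tensor, vision_h: torch.Tensor):
        # Both towers attend over the concatenated multimodal sequence,
        # so vision tokens can condition on text tokens and vice versa.
        joint = torch.cat([text_h, vision_h], dim=1)
        n_text = text_h.size(1)
        # Each modality's updated states come from its own tower.
        text_out = self.text_layer(joint)[:, :n_text]
        vision_out = self.vision_layer(joint)[:, n_text:]
        return text_out, vision_out


# Usage: pass 8 text tokens and 16 vision tokens through one block.
block = DualTowerBlock()
text_h = torch.randn(1, 8, 256)
vision_h = torch.randn(1, 16, 256)
text_out, vision_out = block(text_h, vision_h)
print(text_out.shape, vision_out.shape)  # (1, 8, 256) (1, 16, 256)
```

Because gradients flow only through the vision tower, training such a block updates the vision-specific weights while leaving the language model's behavior on pure-text inputs unchanged, which is the property the abstract highlights.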