X-Fusion: Introducing New Modality to Frozen Large Language Models

April 29, 2025
Authors: Sicheng Mo, Thao Nguyen, Xun Huang, Siddharth Srinivasan Iyer, Yijun Li, Yuchen Liu, Abhishek Tandon, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, Yuheng Li
cs.AI

Abstract

We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
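The dual-tower idea described above can be illustrated with a toy sketch: text tokens pass through frozen pretrained weights (preserving language ability), while image tokens pass through a separate, trainable vision tower. This is a hypothetical simplification for intuition only; the names (`W_text`, `W_vision`, `dual_tower_layer`) and the per-token routing are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Frozen text-tower weights: a stand-in for one pretrained LLM layer.
W_text = rng.standard_normal((d, d))
W_text.flags.writeable = False  # never updated during training

# Trainable vision-tower weights, newly initialized for the added modality.
W_vision = rng.standard_normal((d, d))

def dual_tower_layer(h, modality):
    """Route each token through its modality-specific weights.

    h:        (seq, d) hidden states
    modality: (seq,) array, 0 = text token, 1 = image token
    """
    out = np.empty_like(h)
    is_text = modality == 0
    out[is_text] = h[is_text] @ W_text      # frozen path keeps language ability
    out[~is_text] = h[~is_text] @ W_vision  # trainable path learns vision
    return out

h = rng.standard_normal((5, d))
modality = np.array([0, 0, 1, 1, 0])
y = dual_tower_layer(h, modality)
```

In a real model each tower would be a full transformer stack and gradients would flow only into the vision-specific parameters; the point of the sketch is just that the two token streams see disjoint weight sets within the same sequence.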
