

VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

August 13, 2025
Authors: Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, Furu Wei
cs.AI

Abstract

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
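The abstract describes a task vector-based merge: the weight delta between a coding LLM and its shared base model is added to the language-model weights of the vision-language backbone. The sketch below illustrates that general idea in PyTorch; the function name, the `alpha` scaling factor, and the assumption that all three checkpoints share identical parameter keys are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def merge_coder_into_vlm(vlm_lm_state, base_lm_state, coder_lm_state, alpha=0.5):
    """Sketch of task-vector merging (hypothetical helper, not the paper's code).

    The coder's task vector (coder weights minus shared base LLM weights) is
    scaled by `alpha` and added to the VLM's language-model weights. All three
    state dicts are assumed to come from the same underlying architecture.
    """
    merged = {}
    for name, vlm_param in vlm_lm_state.items():
        if name in coder_lm_state and name in base_lm_state:
            task_vector = coder_lm_state[name] - base_lm_state[name]
            merged[name] = vlm_param + alpha * task_vector
        else:
            # Vision-specific or unmatched parameters are left untouched.
            merged[name] = vlm_param.clone()
    return merged
```

In this formulation, `alpha` trades off coding ability against preservation of the backbone's visual understanding; the merged state dict would then be loaded back into the vision-language model for further multimodal training.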