VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models
August 13, 2025
Authors: Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, Furu Wei
cs.AI
Abstract
Multimodal large language models (MLLMs) have significantly advanced the
integration of visual and textual understanding. However, their ability to
generate code from multimodal inputs remains limited. In this work, we
introduce VisCodex, a unified framework that seamlessly merges vision and
coding language models to empower MLLMs with strong multimodal code generation
abilities. Leveraging a task vector-based model merging technique, we integrate
a state-of-the-art coding LLM into a strong vision-language backbone, while
preserving both visual comprehension and advanced coding skills. To support
training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a
large-scale and diverse collection of 598k samples, including high-quality HTML
code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic
problems. Furthermore, we propose InfiBench-V, a novel and challenging
benchmark specifically designed to assess models on visually rich, real-world
programming questions that demand a nuanced understanding of both textual and
visual contexts. Extensive experiments show that VisCodex achieves
state-of-the-art performance among open-source MLLMs and approaches proprietary
models like GPT-4o, highlighting the effectiveness of our model merging
strategy and new datasets.
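The abstract does not spell out the exact merging recipe; the following is a minimal sketch of task vector-based model merging in the style of task arithmetic, assuming the merge operates directly on shared parameter names in two state dicts. The weight names, shapes, and the scaling coefficient are illustrative assumptions, not details from the paper.

# Minimal sketch of task vector-based model merging (task arithmetic style).
# This is an illustrative assumption, not VisCodex's exact procedure.
import torch

def task_vector(finetuned: dict, base: dict) -> dict:
    # Task vector = fine-tuned weights minus the shared base weights.
    return {k: finetuned[k] - base[k] for k in base}

def merge(base: dict, task_vectors: list, alphas: list) -> dict:
    # Add scaled task vectors (e.g., a coding LLM's delta) back onto the base model.
    merged = {k: v.clone() for k, v in base.items()}
    for tv, alpha in zip(task_vectors, alphas):
        for k in merged:
            merged[k] += alpha * tv[k]
    return merged

# Toy usage with dummy weights (hypothetical parameter names and shapes).
base = {"layer.weight": torch.zeros(2, 2)}
coder = {"layer.weight": torch.ones(2, 2)}  # stands in for a fine-tuned coding LLM
tv_code = task_vector(coder, base)
merged = merge(base, [tv_code], alphas=[0.5])
print(merged["layer.weight"])

In practice such a merge would be applied over the full state dicts of the vision-language backbone and the coding LLM where parameters are shared; the scaling coefficient controls how much coding specialization is injected while preserving visual understanding.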