VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

Abstract

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
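
The abstract names a task vector-based merging technique but does not give its formula. The sketch below therefore shows only generic task-vector arithmetic, under the assumption that the vision-language model and the coding model were fine-tuned from the same base checkpoint and share parameter names. The function names, the `weight` parameter, and the toy parameter dictionaries are hypothetical illustrations and are not taken from VisCodex.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
# (Assumption: real checkpoints would be tensors; lists keep this sketch dependency-free.)
Params = Dict[str, List[float]]


def task_vector(finetuned: Params, base: Params) -> Params:
    """Return the per-parameter difference between a fine-tuned model and its base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge_models(base: Params, vision_ft: Params, code_ft: Params,
                 weight: float = 0.5) -> Params:
    """Add a weighted mix of two task vectors back onto the shared base.

    `weight` scales the vision task vector and `1 - weight` scales the coding
    task vector. This weighting scheme is an assumption made for illustration.
    """
    tau_vision = task_vector(vision_ft, base)
    tau_code = task_vector(code_ft, base)
    return {
        name: [
            b + weight * v + (1.0 - weight) * c
            for b, v, c in zip(base[name], tau_vision[name], tau_code[name])
        ]
        for name in base
    }


# Hypothetical two-weight "model" to show the arithmetic.
base = {"layer.w": [1.0, 2.0]}
vision_ft = {"layer.w": [1.5, 2.0]}   # fine-tuning moved the first weight
code_ft = {"layer.w": [1.0, 3.0]}     # fine-tuning moved the second weight

merged = merge_models(base, vision_ft, code_ft, weight=0.5)
print(merged)
```

In this toy case each fine-tuned model changed a different weight, so the merged parameters carry half of each change. In practice the mixing weight would presumably be tuned, and merging would apply only to the components that both models share.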
