VisCodex: 視覚モデルとコーディングモデルの統合によるマルチモーダルコード生成

要旨

マルチモーダル大規模言語モデル（MLLMs）は、視覚的およびテキスト的な理解の統合を大幅に進展させてきました。しかし、マルチモーダル入力からコードを生成する能力は依然として限られています。本研究では、視覚とコーディング言語モデルをシームレスに統合し、MLLMsに強力なマルチモーダルコード生成能力を付与する統一フレームワーク「VisCodex」を紹介します。タスクベクトルベースのモデル統合技術を活用し、最先端のコーディングLLMを強力な視覚言語バックボーンに統合することで、視覚的理解と高度なコーディングスキルの両方を維持します。トレーニングと評価を支援するため、59万8千のサンプルを含む大規模で多様な「Multimodal Coding Dataset（MCD）」を導入します。これには、高品質なHTMLコード、チャート画像とコードのペア、画像拡張されたStackOverflowのQA、およびアルゴリズム問題が含まれます。さらに、テキストと視覚的コンテキストの微妙な理解を必要とする、視覚的にリッチな現実世界のプログラミング問題に特化した新規で挑戦的なベンチマーク「InfiBench-V」を提案します。広範な実験により、VisCodexがオープンソースのMLLMsの中で最先端の性能を達成し、GPT-4oのようなプロプライエタリモデルに近づくことが示され、我々のモデル統合戦略と新しいデータセットの有効性が強調されています。

English

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.

VisCodex: 視覚モデルとコーディングモデルの統合によるマルチモーダルコード生成

VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

要旨

Support