Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

Abstract

While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, data semantics, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings, and it lets us compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions: multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move the field from single-output imitation toward evidence-grounded executable systems. An ongoing project and resources are available on GitHub at https://github.com/xjywhu/Awesome-Multimodal-LLM-for-Code.