Symbolic Graphics Programming with Large Language Models

Abstract

Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. Because the model must write a program whose rendered image matches the description, the task also serves as a lens into how LLMs understand the visual world. Among the various kinds of SGPs, we focus on scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we find that frontier proprietary models substantially outperform open-source models, and that performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement learning (RL) approach with verifiable rewards, in which a format-validity gate ensures renderable SVG and a cross-modal reward aligns the text with the rendered image via strong vision encoders (e.g., SigLIP for text-image alignment and DINO for image-image alignment). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
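As a rough illustration of how a format-validity gate can be combined with a cross-modal similarity reward, the sketch below may help. It is not the authors' implementation and rests on several assumptions: the gate is approximated by checking that the output is well-formed XML with an `<svg>` root (a real gate would presumably attempt an actual render), the reward for gated-out programs is assumed to be 0, and `embed_rendered_image` is a hypothetical callable standing in for rendering the SVG and encoding it with a vision encoder such as SigLIP or DINO.

```python
import math
import xml.etree.ElementTree as ET
from typing import Callable, Sequence

SVG_NS = "{http://www.w3.org/2000/svg}"


def is_renderable_svg(program: str) -> bool:
    """Approximate validity gate: well-formed XML whose root element is <svg>.

    Assumption: this is only a cheap stand-in for trying to render the program.
    """
    try:
        root = ET.fromstring(program)
    except ET.ParseError:
        return False
    return root.tag in ("svg", SVG_NS + "svg")


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gated_reward(
    program: str,
    text_embedding: Sequence[float],
    embed_rendered_image: Callable[[str], Sequence[float]],
) -> float:
    """Return 0.0 for invalid programs, else text-image embedding similarity.

    `embed_rendered_image` is hypothetical: it should render the SVG and
    return an embedding in the same space as `text_embedding`.
    """
    if not is_renderable_svg(program):
        return 0.0  # assumed penalty value; the abstract does not specify it
    return cosine(text_embedding, embed_rendered_image(program))
```

In this sketch the gate runs before any encoder call, so malformed outputs never reach the comparatively expensive rendering and embedding step.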
