Symbolic Graphics Programming with Large Language Models

September 5, 2025
作者: Yamei Chen, Haoquan Zhang, Yangyi Huang, Zeju Qiu, Kaipeng Zhang, Yandong Wen, Weiyang Liu
cs.AI

Abstract

Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world, by prompting them to generate images rendered from SGPs. Among the various kinds of SGPs, this paper focuses on scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we find that frontier proprietary models substantially outperform open-source models, and that performance correlates well with general coding capability. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement-learning (RL) approach with verifiable rewards, in which a format-validity gate ensures the SVG is renderable and a cross-modal reward aligns the text with the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze the training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
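The reward structure described in the abstract — a format-validity gate followed by a cross-modal similarity score — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the validity check here is only an XML well-formedness test (a real pipeline would attempt an actual render), and the similarity function is injected as a callable standing in for a SigLIP text-image score.

```python
import xml.etree.ElementTree as ET

def is_renderable_svg(svg_text: str) -> bool:
    """Format-validity gate (simplified): the program must parse as XML
    with an <svg> root. A real gate would also attempt to rasterize it."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return False
    # Namespaced roots look like '{http://www.w3.org/2000/svg}svg'.
    return root.tag.endswith("svg")

def sgp_reward(svg_text: str, caption: str, similarity) -> float:
    """Gated verifiable reward: unrenderable programs score 0; otherwise
    return a cross-modal similarity between the caption and the rendered
    image (here abstracted as `similarity(svg_text, caption)`)."""
    if not is_renderable_svg(svg_text):
        return 0.0
    return similarity(svg_text, caption)
```

In an RL loop, the gate prevents the policy from collecting reward for malformed output, while the similarity term supplies the dense semantic signal.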
September 8, 2025