Symbolic Graphics Programming with Large Language Models
September 5, 2025
Authors: Yamei Chen, Haoquan Zhang, Yangyi Huang, Zeju Qiu, Kaipeng Zhang, Yandong Wen, Weiyang Liu
cs.AI
Abstract
Large language models (LLMs) excel at program synthesis, yet their ability to
produce symbolic graphics programs (SGPs) that render into precise visual
content remains underexplored. We study symbolic graphics programming, where
the goal is to generate an SGP from a natural-language description. This task
also serves as a lens into how LLMs understand the visual world by prompting
them to generate images rendered from SGPs. Among various SGPs, our paper
focuses on scalable vector graphics (SVGs). We begin by examining the extent to
which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a
comprehensive benchmark covering object fidelity, scene fidelity, and
compositionality (attribute binding, spatial relations, numeracy). On
SGP-GenBench, we discover that frontier proprietary models substantially
outperform open-source models, and performance correlates well with general
coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to
generate SGPs. We propose a reinforcement learning (RL) approach with
verifiable rewards, where a format-validity gate ensures the SVG is renderable, and a
cross-modal reward aligns text and the rendered image via strong vision
encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to
Qwen-2.5-7B, our method substantially improves SVG generation quality and
semantics, achieving performance on par with frontier systems. We further
analyze training dynamics, showing that RL induces (i) finer decomposition of
objects into controllable primitives and (ii) contextual details that improve
scene coherence. Our results demonstrate that symbolic graphics programming
offers a precise and interpretable lens on cross-modal grounding.