Symbolic Graphics Programming with Large Language Models

Abstract

Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world, since it prompts them to produce images by way of rendered SGPs. Among the various kinds of SGPs, we focus on scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, and numeracy). On SGP-GenBench, we find that frontier proprietary models substantially outperform open-source models, and that performance correlates well with general coding capability. Motivated by this gap, we aim to improve the ability of LLMs to generate SGPs. We propose an approach based on reinforcement learning (RL) with verifiable rewards, in which a format-validity gate ensures that the output is renderable SVG and a cross-modal reward aligns the text with the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
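The gated reward described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give the exact reward formula. The sketch assumes that a failed format check yields a reward of zero, that "renderable" can be approximated by "well-formed XML with an `<svg>` root", and that the cross-modal term is a cosine similarity between embeddings. The names `is_renderable_svg`, `gated_reward`, `embed_text`, and `embed_svg_image` are hypothetical; the two embedders are caller-supplied stand-ins for a text encoder and for a renderer followed by an image encoder (for example the two towers of a SigLIP-style model).

```python
import math
import xml.etree.ElementTree as ET
from typing import Callable, Sequence

SVG_NAMESPACE = "{http://www.w3.org/2000/svg}"

Vector = Sequence[float]


def is_renderable_svg(code: str) -> bool:
    """Cheap stand-in for a format-validity gate.

    Assumption: well-formed XML whose root element is <svg> counts as
    renderable. A real gate would more likely try to rasterize the file.
    """
    try:
        root = ET.fromstring(code)
    except ET.ParseError:
        return False
    return root.tag in ("svg", SVG_NAMESPACE + "svg")


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gated_reward(
    prompt: str,
    svg_code: str,
    embed_text: Callable[[str], Vector],
    embed_svg_image: Callable[[str], Vector],
) -> float:
    """Return 0.0 for invalid SVG, else a text-image similarity score.

    embed_text and embed_svg_image are hypothetical hooks: the second is
    expected to render the SVG and embed the resulting image.
    """
    if not is_renderable_svg(svg_code):
        return 0.0  # assumed gate behavior: no reward without valid output
    return cosine_similarity(embed_text(prompt), embed_svg_image(svg_code))
```

Passing the encoders in as callables keeps the gate logic testable without model weights. An image-image term (the role the abstract assigns to DINO) could be added in the same way, as a second similarity against a reference image.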
