Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Abstract

Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly well suited to this task because they capture both global semantics and fine-grained visual patterns while transferring knowledge across the vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to the original inputs, which provides evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG rollouts, which are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.
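
The rollout, render, and compare loop described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify the reward function or the renderer, so everything here is an assumption made for illustration. `toy_render` is a hypothetical stand-in for a real SVG rasterizer that only understands axis-aligned `<rect>` elements, `rendering_reward` is a hypothetical pixel-level reward (one minus mean squared error), and the rollouts are hard-coded strings rather than samples from a VLM.

```python
import re

SIZE = 8  # toy canvas: SIZE x SIZE grayscale pixels with values in [0, 1]

# Matches only the exact attribute order used in this sketch (an assumption).
RECT = re.compile(r'<rect x="(\d+)" y="(\d+)" width="(\d+)" height="(\d+)"\s*/>')


def toy_render(svg: str) -> list[list[float]]:
    """Hypothetical rasterizer: paints every <rect> as 1.0 on a 0.0 canvas."""
    canvas = [[0.0] * SIZE for _ in range(SIZE)]
    for match in RECT.finditer(svg):
        x, y, w, h = map(int, match.groups())
        for row in range(y, min(y + h, SIZE)):
            for col in range(x, min(x + w, SIZE)):
                canvas[row][col] = 1.0
    return canvas


def rendering_reward(target: list[list[float]], svg: str) -> float:
    """Assumed reward: 1 minus the mean squared pixel error of the rendering."""
    rendered = toy_render(svg)
    squared_error = sum(
        (t - r) ** 2
        for target_row, rendered_row in zip(target, rendered)
        for t, r in zip(target_row, rendered_row)
    )
    return 1.0 - squared_error / (SIZE * SIZE)


# The "input image" is itself produced by the toy renderer in this sketch.
target = toy_render('<svg><rect x="2" y="2" width="4" height="4"/></svg>')

# Hard-coded stand-ins for SVG rollouts sampled from a model.
rollouts = [
    '<svg><rect x="2" y="2" width="4" height="4"/></svg>',  # exact match
    '<svg><rect x="3" y="2" width="4" height="4"/></svg>',  # shifted by one pixel
    '<svg></svg>',                                          # empty drawing
]

rewards = [rendering_reward(target, svg) for svg in rollouts]
best_rollout = rollouts[rewards.index(max(rewards))]
```

In this toy setting the exact match receives the highest reward, the shifted rectangle a lower one, and the empty drawing the lowest. The point of the sketch is that the reward needs only rendered pixels, with no gradient flowing through the renderer, which is why such a signal suits RL rather than direct backpropagation.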

