Rendering-Aware Reinforcement Learning for Vector Graphics Generation
May 27, 2025
作者: Juan A. Rodriguez, Haotian Zhang, Abhay Puri, Aarash Feizi, Rishav Pramanik, Pascal Wichmann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli
cs.AI
Abstract
Scalable Vector Graphics (SVG) offer a powerful format for representing
visual designs as interpretable code. Recent advances in vision-language models
(VLMs) have enabled high-quality SVG generation by framing the problem as a
code generation task and leveraging large-scale pretraining. VLMs are
particularly suitable for this task as they capture both global semantics and
fine-grained visual patterns, while transferring knowledge across vision,
natural language, and code domains. However, existing VLM approaches often
struggle to produce faithful and efficient SVGs because they never observe the
rendered images during training. Although differentiable rendering for
autoregressive SVG code generation remains unavailable, rendered outputs can
still be compared to original inputs, enabling evaluative feedback suitable for
reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from
Rendering Feedback), an RL method that enhances SVG generation in
autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an
input image, the model generates SVG roll-outs that are rendered and compared
to the original image to compute a reward. This visual fidelity feedback guides
the model toward producing more accurate, efficient, and semantically coherent
SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common
failure modes and enabling precise, high-quality SVG generation with strong
structural understanding and generalization.
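As a rough illustration of the render-and-compare loop the abstract describes (a minimal sketch, not the authors' implementation), the snippet below rasterizes a candidate SVG string and scores it against the target image with a simple pixel-level similarity. The use of `cairosvg` for rendering and a negative pixel MSE as the reward are assumptions standing in for whatever renderer and visual-fidelity metric RLRF actually uses.

```python
# Sketch of a rendering-feedback reward for SVG roll-outs.
# Assumption: cairosvg handles rasterization; negative pixel MSE is a
# placeholder for the paper's actual fidelity reward.
import io

import cairosvg
import numpy as np
from PIL import Image


def render_svg(svg_code: str, size: int = 256) -> np.ndarray:
    """Rasterize SVG code to a normalized RGB array (the non-differentiable step)."""
    png_bytes = cairosvg.svg2png(
        bytestring=svg_code.encode("utf-8"),
        output_width=size,
        output_height=size,
    )
    img = Image.open(io.BytesIO(png_bytes)).convert("RGB")
    return np.asarray(img, dtype=np.float32) / 255.0


def fidelity_reward(svg_code: str, target: np.ndarray) -> float:
    """Reward = negative pixel MSE between the rendered SVG and the target image.

    SVG code that fails to render (e.g., malformed XML) receives a fixed
    penalty, so invalid roll-outs are discouraged rather than crashing training.
    """
    try:
        rendered = render_svg(svg_code, size=target.shape[0])
    except Exception:
        return -1.0  # invalid SVG: strong negative reward
    return -float(np.mean((rendered - target) ** 2))


# Usage: score each sampled roll-out from the VLM, then feed the rewards to
# an RL update (e.g., a policy-gradient step) on the model that emitted the SVG.
# rewards = [fidelity_reward(svg, target_image) for svg in rollouts]
```

Because the rendering step is not differentiable, the reward here is only ever used as a scalar training signal, which is exactly why the abstract frames the problem as RL rather than end-to-end supervised learning.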