Rendering-Aware Reinforcement Learning for Vector Graphics Generation
May 27, 2025
Authors: Juan A. Rodriguez, Haotian Zhang, Abhay Puri, Aarash Feizi, Rishav Pramanik, Pascal Wichmann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli
cs.AI
Abstract
Scalable Vector Graphics (SVG) offer a powerful format for representing
visual designs as interpretable code. Recent advances in vision-language models
(VLMs) have enabled high-quality SVG generation by framing the problem as a
code generation task and leveraging large-scale pretraining. VLMs are
particularly suitable for this task as they capture both global semantics and
fine-grained visual patterns, while transferring knowledge across vision,
natural language, and code domains. However, existing VLM approaches often
struggle to produce faithful and efficient SVGs because they never observe the
rendered images during training. Although differentiable rendering for
autoregressive SVG code generation remains unavailable, rendered outputs can
still be compared to original inputs, enabling evaluative feedback suitable for
reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from
Rendering Feedback), an RL method that enhances SVG generation in
autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an
input image, the model generates SVG roll-outs that are rendered and compared
to the original image to compute a reward. This visual fidelity feedback guides
the model toward producing more accurate, efficient, and semantically coherent
SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common
failure modes and enabling precise, high-quality SVG generation with strong
structural understanding and generalization.
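The reward loop described above — render each SVG roll-out, compare it to the original image, and score the result — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the renderer is abstracted away (both images are modeled as pixel arrays), and the specific reward terms here (negative pixel MSE for visual fidelity, a small length penalty for code efficiency) are hypothetical stand-ins for whatever fidelity and efficiency signals RLRF actually combines.

```python
import numpy as np

def fidelity_reward(rendered: np.ndarray, target: np.ndarray) -> float:
    """Negative mean squared pixel error; 0.0 for a perfect reconstruction
    (images assumed to have values in [0, 1])."""
    diff = rendered.astype(np.float64) - target.astype(np.float64)
    return -float(np.mean(diff ** 2))

def length_penalty(svg_code: str, scale: float = 1e-4) -> float:
    """Hypothetical efficiency term: penalize needlessly long SVG code."""
    return -scale * len(svg_code)

def rlrf_reward(rendered: np.ndarray, target: np.ndarray,
                svg_code: str, w_fid: float = 1.0, w_len: float = 1.0) -> float:
    """Combine visual fidelity and code efficiency into one scalar RL reward,
    computed per roll-out after rasterizing the generated SVG."""
    return w_fid * fidelity_reward(rendered, target) + w_len * length_penalty(svg_code)

# Toy usage: a faithful rendering earns a higher reward than a corrupted one.
target = np.zeros((8, 8, 3))
faithful = target.copy()          # perfect reconstruction
corrupted = target + 0.5          # uniformly wrong pixels
r_good = rlrf_reward(faithful, target, "<svg/>")
r_bad = rlrf_reward(corrupted, target, "<svg/>")
```

In practice the `rendered` array would come from rasterizing the model's SVG string with an off-the-shelf (non-differentiable) renderer; because the reward is only ever evaluated, not differentiated, this is exactly the setting where policy-gradient RL applies and differentiable rendering is not required.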