Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning
March 17, 2026
Authors: Haomin Wang, Qi Wei, Qianli Ma, Shengyuan Ding, Jinhui Yin, Kai Chen, Hongjie Zhang
cs.AI
Abstract
With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model's reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.
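The abstract describes aggregating DINO visual-feature, image-text similarity, format, and code-efficiency rewards under GRPO. As a minimal sketch of how such a combination and GRPO's group-relative normalization might look, the function names, weights, and normalization below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of a multi-reward combination for GRPO-style
# training, as described in the abstract. Weights and the [0, 1]
# reward ranges are assumptions for illustration only.

def combine_rewards(dino_sim: float,
                    image_text_sim: float,
                    format_ok: bool,
                    code_efficiency: float,
                    weights=(0.4, 0.3, 0.1, 0.2)) -> float:
    """Weighted sum of per-aspect rewards, each assumed to lie in [0, 1]."""
    components = [dino_sim,
                  image_text_sim,
                  1.0 if format_ok else 0.0,
                  code_efficiency]
    return sum(w * r for w, r in zip(weights, components))


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: normalize each sampled SVG's reward
    against the mean and std of its rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]
```

With the default weights summing to 1, a rollout that scores perfectly on every aspect receives a total reward of 1.0, and within each sampled group the advantages are zero-mean, so better-than-average SVG outputs are reinforced relative to their peers.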