DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Abstract

Recent approaches based on vision-language models (VLMs) have achieved impressive results on SVG generation. However, because they generate only text and receive no visual signal during decoding, they often struggle with complex semantics and produce SVGs that lack visual appeal or geometric coherence. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and the corresponding SVG tokens in an end-to-end manner. DuetSVG is trained jointly on image and SVG datasets. At inference, we apply a novel test-time scaling strategy that uses the model's own visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods across a wide range of applications, producing SVGs that are visually faithful, semantically aligned, and syntactically clean.