DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Abstract

Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and receive no visual signal during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and the corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that uses the model's own visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
