StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

Abstract

To leverage large language models (LLMs) for visual synthesis, traditional methods convert raster image information into discrete grid tokens through specialized visual modules, which disrupts the model's ability to capture the true semantic representation of visual scenes. This paper posits that an alternative image representation, vector graphics, can effectively overcome this limitation by enabling a more natural and semantically coherent segmentation of image information. We therefore introduce StrokeNUWA, a pioneering work that explores a better visual representation on vector graphics, "stroke tokens", which is inherently rich in visual semantics, naturally compatible with LLMs, and highly compressed. Equipped with stroke tokens, StrokeNUWA can significantly surpass traditional LLM-based and optimization-based methods across various metrics on the vector graphic generation task. In addition, StrokeNUWA achieves up to a 94x inference speedup over prior methods, with an SVG code compression ratio of 6.9%.
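
To make the reported 6.9% compression ratio concrete, the following minimal sketch works through the arithmetic. It assumes the ratio is the compressed sequence length divided by the original SVG code length, which the abstract does not state explicitly. The function names and the token counts in the example are hypothetical and are not taken from the paper.

```python
def compression_ratio(original_len: int, compressed_len: int) -> float:
    """Compressed length as a fraction of the original length.

    Assumption: this is the meaning of "compression ratio" here;
    the abstract does not define it.
    """
    if original_len <= 0:
        raise ValueError("original_len must be positive")
    if compressed_len < 0:
        raise ValueError("compressed_len must be non-negative")
    return compressed_len / original_len


def reduction_factor(ratio: float) -> float:
    """How many times shorter the compressed sequence is."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 / ratio


if __name__ == "__main__":
    # Hypothetical counts chosen only to reproduce a 6.9% ratio.
    ratio = compression_ratio(original_len=10_000, compressed_len=690)
    print(f"compression ratio: {ratio:.1%}")
    print(f"sequence is about {reduction_factor(ratio):.1f}x shorter")
```

Under this reading, a 6.9% ratio means the stroke-token sequence is roughly 14.5 times shorter than the SVG code it represents.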
