StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

Abstract

To leverage large language models (LLMs) for visual synthesis, traditional methods convert raster image information into discrete grid tokens through specialized visual modules, which disrupts the model's ability to capture the true semantic representation of visual scenes. This paper posits that an alternative image representation, vector graphics, can effectively overcome this limitation by enabling a more natural and semantically coherent segmentation of image information. We therefore introduce StrokeNUWA, a pioneering work that explores a better visual representation on vector graphics, "stroke tokens", which are inherently rich in visual semantics, naturally compatible with LLMs, and highly compressed. Equipped with stroke tokens, StrokeNUWA significantly surpasses traditional LLM-based and optimization-based methods across various metrics on the vector graphic generation task. In addition, StrokeNUWA achieves up to a 94x inference speedup over prior methods, with an exceptional SVG code compression ratio of 6.9%.

