StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

Abstract

To leverage LLMs for visual synthesis, traditional methods convert raster image information into discrete grid tokens through specialized visual modules, which disrupts the model's ability to capture the true semantic representation of visual scenes. This paper posits that an alternative representation of images, vector graphics, can effectively overcome this limitation by enabling a more natural and semantically coherent segmentation of the image information. We therefore introduce StrokeNUWA, a pioneering work exploring a better visual representation on vector graphics, "stroke tokens", which are inherently rich in visual semantics, naturally compatible with LLMs, and highly compressed. Equipped with stroke tokens, StrokeNUWA significantly surpasses traditional LLM-based and optimization-based methods across various metrics on the vector graphic generation task. In addition, StrokeNUWA achieves up to a 94x inference speedup over prior methods, with an SVG code compression ratio of 6.9%.
