TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
March 14, 2025
Authors: Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto
cs.AI
Abstract
With the rise of generative AI, synthesizing figures from text captions
becomes a compelling application. However, achieving high geometric precision
and editability requires representing figures as graphics programs in languages
like TikZ, and aligned training data (i.e., graphics programs with captions)
remains scarce. Meanwhile, large amounts of unaligned graphics programs and
captioned raster images are more readily available. We reconcile these
disparate data sources by presenting TikZero, which decouples graphics program
generation from text understanding by using image representations as an
intermediary bridge. It enables independent training on graphics programs and
captioned images and allows for zero-shot text-guided graphics program
synthesis during inference. We show that our method substantially outperforms
baselines that can only operate with caption-aligned graphics programs.
Furthermore, when leveraging caption-aligned graphics programs as a
complementary training signal, TikZero matches or exceeds the performance of
much larger models, including commercial systems like GPT-4o. Our code,
datasets, and select models are publicly available.