TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

Abstract

With the rise of generative AI, synthesizing figures from text captions has become a compelling application. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. This enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.