TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
March 14, 2025
Authors: Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto
cs.AI
Abstract
With the rise of generative AI, synthesizing figures from text captions
becomes a compelling application. However, achieving high geometric precision
and editability requires representing figures as graphics programs in languages
like TikZ, and aligned training data (i.e., graphics programs with captions)
remains scarce. Meanwhile, large amounts of unaligned graphics programs and
captioned raster images are more readily available. We reconcile these
disparate data sources by presenting TikZero, which decouples graphics program
generation from text understanding by using image representations as an
intermediary bridge. It enables independent training on graphics programs and
captioned images and allows for zero-shot text-guided graphics program
synthesis during inference. We show that our method substantially outperforms
baselines that can only operate with caption-aligned graphics programs.
Furthermore, when leveraging caption-aligned graphics programs as a
complementary training signal, TikZero matches or exceeds the performance of
much larger models, including commercial systems like GPT-4o. Our code,
datasets, and select models are publicly available.