TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

Abstract

With the rise of generative AI, synthesizing figures from text captions has become a compelling application. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. This decoupling enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.