Can Large Language Models Understand Symbolic Graphics Programs?

Abstract

Assessing the capabilities of large language models (LLMs) is often challenging, in part because it is hard to find tasks to which they have not been exposed during training. We take one step toward addressing this challenge by turning to a new task centered on symbolic graphics programs, a popular representation of graphics content that generates visual data procedurally. LLMs have shown promise in program synthesis, but do they understand symbolic graphics programs? Unlike conventional programs, symbolic graphics programs can be translated into graphics content. Here, we characterize an LLM's understanding of symbolic programs in terms of its ability to answer questions about the graphics content. This task is challenging because the questions are difficult to answer from the symbolic programs alone, yet they are easy to answer from the corresponding graphics content, as we verify through a human experiment. To understand symbolic programs, LLMs may need the ability to imagine how the corresponding graphics content would look without directly accessing the rendered visual content. We use this task to evaluate LLMs by creating a large benchmark for the semantic understanding of symbolic graphics programs. The benchmark is built via program-graphics correspondence and therefore requires minimal human effort. We evaluate current LLMs on our benchmark to provide a preliminary assessment of their ability to reason about visual scenes from programs. We find that this task distinguishes existing LLMs, and that models considered good at reasoning perform better. Lastly, we introduce Symbolic Instruction Tuning (SIT) to improve this ability. Specifically, we query GPT-4o with questions and images generated by symbolic programs. The resulting data are then used to finetune an LLM. We also find that SIT data can improve the general instruction-following ability of LLMs.
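To make the task concrete, the following minimal sketch shows what one benchmark-style item could look like: a program given as text, a question about what it would render, and a check of the model's reply. It assumes SVG as one example of a symbolic graphics program. The program itself, the `BenchmarkItem` fields, the prompt wording, and the helper names are all hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical SVG program; the shapes and coordinates are invented for illustration.
SVG_PROGRAM = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="50" r="40" fill="yellow"/>
  <circle cx="35" cy="40" r="5" fill="black"/>
  <circle cx="65" cy="40" r="5" fill="black"/>
  <path d="M 30 60 Q 50 80 70 60" stroke="black" fill="none"/>
</svg>"""


@dataclass
class BenchmarkItem:
    """One multiple-choice question about the rendered content (assumed layout)."""
    program: str
    question: str
    choices: List[str]
    answer_index: int


def build_prompt(item: BenchmarkItem) -> str:
    """Format a text-only prompt: the model sees the program, never the image."""
    lines = ["Here is a symbolic graphics program:", item.program, "", item.question]
    for letter, choice in zip("ABCD", item.choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def is_correct(item: BenchmarkItem, model_reply: str) -> bool:
    """Compare the first letter of the reply with the expected choice letter."""
    reply = model_reply.strip().upper()
    return reply[:1] == "ABCD"[item.answer_index]


ITEM = BenchmarkItem(
    program=SVG_PROGRAM,
    question="What does the rendered image most likely depict?",
    choices=["A house", "A smiling face", "A tree", "A car"],
    answer_index=1,
)
```

Passing `build_prompt(ITEM)` to a model and scoring its reply with `is_correct` mirrors the setting described above: a person looking at the rendered image would answer at once, while the model has to infer the scene from coordinates and path commands alone.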
