Can Large Language Models Understand Symbolic Graphics Programs?

August 15, 2024
作者: Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf
cs.AI

Abstract

Assessing the capabilities of large language models (LLMs) is often challenging, in part because it is hard to find tasks to which they have not been exposed during training. We take one step toward addressing this challenge by turning to a new task: focusing on symbolic graphics programs, which are a popular representation for graphics content that procedurally generates visual data. LLMs have shown exciting promise in program synthesis, but do they understand symbolic graphics programs? Unlike conventional programs, symbolic graphics programs can be translated into graphics content. Here, we characterize an LLM's understanding of symbolic programs in terms of its ability to answer questions related to the graphics content. This task is challenging because the questions are difficult to answer from the symbolic programs alone -- yet they would be easy to answer from the corresponding graphics content, as we verify through a human experiment. To understand symbolic programs, LLMs may need to possess the ability to imagine how the corresponding graphics content would look without directly accessing the rendered visual content. We use this task to evaluate LLMs by creating a large benchmark for the semantic understanding of symbolic graphics programs. This benchmark is built via program-graphics correspondence, hence requiring minimal human effort. We evaluate current LLMs on our benchmark to provide a preliminary assessment of their ability to reason about visual scenes from programs. We find that this task distinguishes existing LLMs, and that models considered good at reasoning perform better. Lastly, we introduce Symbolic Instruction Tuning (SIT) to improve this ability. Specifically, we query GPT-4o with questions and images generated by symbolic programs. Such data are then used to finetune an LLM. We also find that SIT data can improve the general instruction-following ability of LLMs.
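
To make the task concrete, here is a minimal Python sketch of the kind of program-question pairing the abstract describes: a tiny hand-written SVG "symbolic graphics program" paired with a semantic question that is trivial to answer from the rendered image but must be inferred from the program text alone. The SVG scene, the instruction format, and the helper names (`make_scene_program`, `make_sit_example`) are illustrative assumptions, not the paper's actual benchmark or SIT data format.

```python
# Minimal sketch (assumed format, not the paper's benchmark): build a small SVG
# symbolic graphics program and pair it with a question about the rendered scene.

def make_scene_program() -> str:
    """Procedurally generate a tiny SVG program describing a two-shape scene."""
    shapes = [
        '<circle cx="40" cy="40" r="20" fill="red"/>',                 # red circle on the left
        '<rect x="80" y="20" width="40" height="40" fill="blue"/>',    # blue square on the right
    ]
    body = "\n  ".join(shapes)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="140" height="80">\n'
        f"  {body}\n</svg>"
    )


def make_sit_example(program: str) -> dict:
    """Wrap the program and a question/answer pair in a hypothetical instruction-tuning record."""
    return {
        "instruction": "Answer the question about the image produced by this SVG program.",
        "input": program + "\n\nQuestion: Which shape is to the left of the blue square?",
        # Easy to read off the rendering; from the code alone it requires reasoning
        # about the x-coordinates (cx=40 vs. x=80) to "imagine" the spatial layout.
        "output": "The red circle.",
    }


if __name__ == "__main__":
    example = make_sit_example(make_scene_program())
    print(example["input"])
    print("Expected answer:", example["output"])
```

The point of the example is the asymmetry the abstract highlights: a human looking at the rendered SVG answers the question instantly, whereas an LLM given only the program text must infer the spatial arrangement from coordinates, which is what the benchmark probes and what SIT-style data aims to improve.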
