Can Large Language Models Understand Symbolic Graphics Programs?
August 15, 2024
Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf
cs.AI
Abstract
Assessing the capabilities of large language models (LLMs) is often
challenging, in part, because it is hard to find tasks to which they have not
been exposed during training. We take one step to address this challenge by
turning to a new task: focusing on symbolic graphics programs, which are a
popular representation for graphics content that procedurally generates visual
data. LLMs have shown exciting promise towards program synthesis, but do they
understand symbolic graphics programs? Unlike conventional programs, symbolic
graphics programs can be translated to graphics content. Here, we characterize
an LLM's understanding of symbolic programs in terms of their ability to answer
questions related to the graphics content. This task is challenging as the
questions are difficult to answer from the symbolic programs alone -- yet, they
would be easy to answer from the corresponding graphics content as we verify
through a human experiment. To understand symbolic programs, LLMs may need to
possess the ability to imagine how the corresponding graphics content would
look without directly accessing the rendered visual content. We use this task
to evaluate LLMs by creating a large benchmark for the semantic understanding
of symbolic graphics programs. This benchmark is built via program-graphics
correspondence, hence requiring minimal human effort. We evaluate current LLMs
on our benchmark to provide a preliminary assessment of their ability to
reason about visual scenes from programs. We find that this task distinguishes
existing LLMs, and models considered good at reasoning perform better. Lastly,
we introduce Symbolic Instruction Tuning (SIT) to improve this ability.
Specifically, we query GPT-4o with questions and images generated by symbolic
programs. Such data are then used to finetune an LLM. We also find that SIT
data can improve the general instruction-following ability of LLMs.
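As a rough illustration of the data-generation idea described in the abstract (not the authors' actual pipeline), the sketch below renders a toy SVG program, asks GPT-4o a question about the rendered image through the OpenAI API, and pairs the answer back with the program text to form a (program, question, answer) record of the kind that could be used for symbolic instruction tuning. The use of cairosvg as the renderer, the example program, and all function names are illustrative assumptions.

```python
# Minimal sketch of SIT-style data generation, under the assumptions above:
# render a symbolic graphics program, query a vision-language model about the
# rendered image, and attach the answer to the program text (not the pixels).
import base64
import json

import cairosvg                # assumed renderer for SVG programs
from openai import OpenAI     # assumed OpenAI-style client

# A toy symbolic graphics program (hypothetical example, not from the benchmark).
SVG_PROGRAM = """<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64">
  <circle cx="32" cy="32" r="20" fill="red"/>
  <rect x="24" y="24" width="16" height="16" fill="blue"/>
</svg>"""

QUESTION = "What shape is drawn on top of the circle, and what color is it?"


def render_to_base64_png(svg_text: str) -> str:
    """Rasterize the symbolic program so a vision model can see the graphics content."""
    png_bytes = cairosvg.svg2png(bytestring=svg_text.encode("utf-8"))
    return base64.b64encode(png_bytes).decode("utf-8")


def make_sit_example(client: OpenAI, program: str, question: str) -> dict:
    """Ask GPT-4o about the rendered image, then pair the answer with the program text."""
    image_b64 = render_to_base64_png(program)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    answer = response.choices[0].message.content
    # The finetuned LLM would see only the symbolic program, never the image.
    return {"program": program, "question": question, "answer": answer}


if __name__ == "__main__":
    example = make_sit_example(OpenAI(), SVG_PROGRAM, QUESTION)
    print(json.dumps(example, indent=2))
```

Records of this form (program text as input, question and answer as the instruction target) could then be used to finetune an LLM, which is the general shape of the Symbolic Instruction Tuning recipe the abstract describes.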