Can Large Language Models Understand Symbolic Graphics Programs?

Abstract

Assessing the capabilities of large language models (LLMs) is often challenging, in part because it is hard to find tasks to which they have not been exposed during training. We take one step toward addressing this challenge by turning to a new task: understanding symbolic graphics programs, a popular representation of graphics content that procedurally generates visual data. LLMs have shown exciting promise in program synthesis, but do they understand symbolic graphics programs? Unlike conventional programs, symbolic graphics programs can be translated into graphics content. Here, we characterize an LLM's understanding of symbolic programs in terms of its ability to answer questions about the graphics content. This task is challenging because the questions are difficult to answer from the symbolic programs alone, yet they are easy to answer from the corresponding graphics content, as we verify through a human experiment. To understand symbolic programs, LLMs may need the ability to imagine how the corresponding graphics content would look without directly accessing the rendered visual content. We use this task to evaluate LLMs by creating a large benchmark for the semantic understanding of symbolic graphics programs. The benchmark is built via program-graphics correspondence and hence requires minimal human effort. We evaluate current LLMs on our benchmark to provide a preliminary assessment of their ability to reason about visual scenes from programs. We find that this task distinguishes existing LLMs, and that models considered good at reasoning perform better. Lastly, we introduce Symbolic Instruction Tuning (SIT) to improve this ability. Specifically, we query GPT-4o with questions and images generated by symbolic programs. The resulting data are then used to finetune an LLM. We also find that SIT data can improve the general instruction-following ability of LLMs.
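
The evaluation setup described above can be illustrated with a minimal sketch. It is not the authors' benchmark code: the `Item` structure, the use of SVG as the program format, the multiple-choice layout, and the `model` callable are all assumptions made for illustration. The key point it shows is that the model receives only the program text, never the rendered image.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Item:
    """One hypothetical benchmark item: a program plus a question about its rendering."""
    program: str              # symbolic graphics program (SVG assumed here)
    question: str             # question about the rendered content
    choices: Tuple[str, ...]  # multiple-choice options
    answer: int               # index of the correct choice


def build_prompt(item: Item) -> str:
    """Format the prompt from the program text only; no rendered image is included."""
    options = [f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(item.choices)]
    return (
        f"Program:\n{item.program}\n\n"
        f"Question: {item.question}\n" + "\n".join(options)
    )


def accuracy(items: Sequence[Item], model: Callable[[str], int]) -> float:
    """Fraction of items where the model's chosen index matches the answer."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(1 for it in items if model(build_prompt(it)) == it.answer)
    return correct / len(items)


# Two toy items, invented for illustration only.
ITEMS = [
    Item(
        program='<svg xmlns="http://www.w3.org/2000/svg"><circle cx="50" cy="50" r="40" fill="red"/></svg>',
        question="What shape is drawn?",
        choices=("circle", "square", "triangle"),
        answer=0,
    ),
    Item(
        program='<svg xmlns="http://www.w3.org/2000/svg"><rect width="80" height="80" fill="blue"/></svg>',
        question="What color is the shape?",
        choices=("red", "blue", "green"),
        answer=1,
    ),
]
```

In practice `model` would wrap a call to an LLM and parse its reply into a choice index; a trivial stand-in such as `lambda prompt: 0` is enough to exercise the scoring loop.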
