GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs

Abstract

Recent work has shown that Large Language Models (LLMs) can empower traditional neuro-symbolic models: their programming capabilities translate language into module descriptions, yielding strong visual reasoning results while preserving the model's transparency and efficiency. However, these models usually generate the entire code snippet from scratch for each new instance of a task, which is highly inefficient. We propose generative neuro-symbolic visual reasoning by growing and reusing modules. Specifically, our model consists of three stages: module initialization, module generation, and module execution. First, given a vision-language task, we use LLMs to examine whether established modules can be reused and grown to handle the new task. If not, we initialize a new module needed by the task and specify its inputs and outputs. The new module is then created by querying LLMs to generate code snippets that match these requirements. To gauge the new module's ability, we treat few-shot training examples as test cases and check whether the module passes them. If it does, the new module is added to the module library for future reuse. Finally, we evaluate our model on the test set by executing the parsed programs with the newly made visual modules. We find that the proposed model has several advantages. First, it performs competitively on standard tasks such as visual question answering and referring expression comprehension. Second, the modules learned from one task can be seamlessly transferred to new tasks. Last but not least, it can adapt to new visual reasoning tasks by observing a few training examples and reusing modules.
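
The three stages described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `grow_library`, `passes_tests`, `execute`, and `fake_llm` are hypothetical, the LLM query is replaced by a stub, the pass-rate `threshold` is an assumption, and a parsed program is simplified to a linear chain of module names.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a module is any callable; a test case is (input, expected output).
Module = Callable[[object], object]
TestCase = Tuple[object, object]


def passes_tests(module: Module, cases: List[TestCase], threshold: float = 1.0) -> bool:
    """Return True if the module answers at least `threshold` of the cases correctly.

    The threshold is an assumption made for this sketch.
    """
    if not cases:
        return False
    correct = 0
    for x, expected in cases:
        try:
            if module(x) == expected:
                correct += 1
        except Exception:
            pass  # a crashing candidate simply fails this case
    return correct / len(cases) >= threshold


def grow_library(
    library: Dict[str, Module],
    name: str,
    propose: Callable[[str], Module],
    cases: List[TestCase],
) -> bool:
    """Reuse `name` if present; otherwise generate it and keep it only if it passes the tests."""
    if name in library:  # initialization: an established module already covers the task
        return True
    candidate = propose(name)  # generation: stand-in for querying an LLM for code
    if passes_tests(candidate, cases):
        library[name] = candidate  # stored for future reuse
        return True
    return False


def execute(library: Dict[str, Module], program: List[str], x: object) -> object:
    """Execution: run a parsed program, simplified here to a chain of module names."""
    for step in program:
        x = library[step](x)
    return x


# Toy usage with arithmetic "modules" standing in for visual ones.
library: Dict[str, Module] = {"double": lambda v: v * 2}


def fake_llm(name: str) -> Module:
    """Stub for an LLM: knows how to write `add_one`, returns a useless module otherwise."""
    known = {"add_one": lambda v: v + 1}
    return known.get(name, lambda v: None)


added = grow_library(library, "add_one", fake_llm, [(1, 2), (5, 6)])
rejected = grow_library(library, "bad", fake_llm, [(1, 2)])
result = execute(library, ["double", "add_one"], 3)
```

In this toy run, `add_one` passes its two test cases and joins the library, while the candidate for `bad` fails its test case and is discarded, so later programs can only call modules that passed their checks.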
