

GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs

November 8, 2023
Authors: Zhenfang Chen, Rui Sun, Wenjun Liu, Yining Hong, Chuang Gan
cs.AI

Abstract

Recent works have shown that Large Language Models (LLMs) can empower traditional neuro-symbolic models via programming capabilities that translate language into module descriptions, achieving strong visual reasoning results while maintaining the model's transparency and efficiency. However, these models usually regenerate the entire code snippet from scratch for each new instance of a task, which is extremely inefficient. We propose generative neuro-symbolic visual reasoning by growing and reusing modules. Specifically, our model consists of three distinct stages: module initialization, module generation, and module execution. First, given a vision-language task, we adopt LLMs to examine whether the established modules can be reused and grown to handle this new task. If not, we initialize a new module required by the task and specify its inputs and outputs. After that, the new module is created by querying LLMs to generate code snippets that match the requirements. To assess the new module's capability, we treat the few-shot training examples as test cases and check whether the new module passes them. If so, the new module is added to the module library for future reuse. Finally, we evaluate the model on the test set by executing the parsed programs with the newly built visual modules. We find that the proposed model has several advantages. First, it performs competitively on standard tasks such as visual question answering and referring expression comprehension; second, modules learned on one task can be seamlessly transferred to new tasks; last but not least, it can adapt to new visual reasoning tasks by observing a few training examples and reusing modules.
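To make the three-stage pipeline concrete, below is a minimal Python sketch of the grow-and-reuse loop the abstract describes. It is illustrative only: the helpers query_llm, parse_program, and execute_program, along with the Module and ModuleLibrary structures, are hypothetical placeholders, not the authors' actual implementation.

```python
# Minimal sketch of the grow-and-reuse pipeline. Hypothetical names:
# query_llm, parse_program, and execute_program are assumed helpers
# supplied by the caller, not the paper's actual API.
from dataclasses import dataclass, field


@dataclass
class Module:
    name: str
    signature: str  # declared inputs and outputs of the module
    code: str       # LLM-generated implementation


@dataclass
class ModuleLibrary:
    modules: dict = field(default_factory=dict)

    def add(self, module: Module) -> None:
        self.modules[module.name] = module


def handle_task(task, train_examples, library,
                query_llm, parse_program, execute_program):
    """Run the three stages: initialization, generation, execution."""
    # Stage 1: module initialization. Ask the LLM whether the existing
    # modules suffice; if not, it returns a spec for a new module.
    decision = query_llm(
        f"Task: {task['instruction']}\n"
        f"Available modules: {list(library.modules)}\n"
        "Can these be reused? If not, specify a new module's inputs/outputs."
    )

    new_module = None
    if not decision["reusable"]:
        # Stage 2: module generation. Query the LLM for a code snippet
        # matching the declared inputs and outputs.
        spec = decision["new_module_spec"]
        new_module = Module(spec["name"], spec["signature"],
                            query_llm(f"Implement this module:\n{spec}"))

        # Treat the few-shot training examples as test cases; keep the
        # module in the library only if it passes all of them.
        program = parse_program(task, library, new_module)
        if all(execute_program(program, ex["inputs"]) == ex["label"]
               for ex in train_examples):
            library.add(new_module)

    # Stage 3: module execution. Run the parsed program on the test inputs.
    program = parse_program(task, library, new_module)
    return [execute_program(program, inputs) for inputs in task["test_inputs"]]
```

In this sketch the library persists across calls, which is what would let modules learned on one task be reused on later ones, matching the transfer behavior claimed in the abstract.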