GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs
November 8, 2023
Authors: Zhenfang Chen, Rui Sun, Wenjun Liu, Yining Hong, Chuang Gan
cs.AI
Abstract
Recent works have shown that Large Language Models (LLMs) could empower
traditional neuro-symbolic models via programming capabilities to translate
language into module descriptions, thus achieving strong visual reasoning
results while maintaining the model's transparency and efficiency. However,
these models usually exhaustively generate the entire code snippet for each
new instance of a task, which is extremely inefficient. We propose generative
neuro-symbolic visual reasoning by growing and reusing modules. Specifically,
our model consists of three unique stages: module initialization, module
generation, and module execution. First, given a vision-language task, we use
LLMs to examine whether we can reuse and extend established modules to
handle this new task. If not, we initialize a new module needed by the task and
specify the inputs and outputs of this new module. After that, the new module
is created by querying LLMs to generate corresponding code snippets that match
the requirements. To get a better sense of the new module's ability, we treat
the few-shot training examples as test cases and check whether the new module
passes them. If it does, the new module is added to the module library
for future reuse. Finally, we evaluate the performance of our model on the
test set by executing the parsed programs with the newly created visual modules
to obtain the results. We find that the proposed model possesses several advantages.
First, it performs competitively on standard tasks like visual question
answering and referring expression comprehension; second, the modules learned
from one task can be seamlessly transferred to new tasks; and last but not least,
it can adapt to new visual reasoning tasks by observing a few training
examples and reusing modules.
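
To make the three-stage loop concrete, below is a minimal Python sketch of how such a pipeline could be organized. It is not the paper's implementation: every name here (ModuleLibrary, query_llm, passes_test_cases, run_example, execute_program, handle_new_task, and the NEW_MODULE reply marker) is a hypothetical placeholder, and the LLM call and program executor are stubbed out.

```python
"""Minimal sketch of the three stages described in the abstract.
All names are hypothetical placeholders, not the paper's actual code."""
from dataclasses import dataclass, field
from typing import Dict, List


def query_llm(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    raise NotImplementedError("plug in an LLM client here")


@dataclass
class ModuleLibrary:
    """Stores reusable visual modules (here simply as source-code strings)."""
    modules: Dict[str, str] = field(default_factory=dict)

    def add(self, name: str, code: str) -> None:
        self.modules[name] = code


def run_example(module_code: str, example: dict) -> bool:
    """Placeholder: run the module on one example and compare to its label."""
    raise NotImplementedError


def passes_test_cases(module_code: str, examples: List[dict]) -> bool:
    """Treat the few-shot training examples as test cases for the new module."""
    return all(run_example(module_code, ex) for ex in examples)


def execute_program(program: str, library: ModuleLibrary):
    """Placeholder: execute a parsed program over the module library."""
    raise NotImplementedError


def handle_new_task(task: str, few_shot: List[dict], library: ModuleLibrary):
    # Stage 1: module initialization -- ask the LLM whether the existing
    # modules can be reused or extended; if not, it specifies a new module's
    # name, inputs, and outputs.
    spec = query_llm(
        f"Task: {task}\nExisting modules: {list(library.modules)}\n"
        "Reuse them if possible; otherwise specify a new module "
        "(name, inputs, outputs)."
    )

    # Stage 2: module generation -- ask the LLM for code matching the spec,
    # validate it on the few-shot examples, and keep it only if it passes.
    if spec.startswith("NEW_MODULE"):          # hypothetical reply format
        name = spec.split()[1]
        code = query_llm(f"Write Python code for this module spec:\n{spec}")
        if passes_test_cases(code, few_shot):
            library.add(name, code)            # grown module is kept for reuse

    # Stage 3: module execution -- parse the task instance into a program
    # over the module library and execute it to obtain the result.
    program = query_llm(
        f"Write a program for: {task}\nAvailable modules: {list(library.modules)}"
    )
    return execute_program(program, library)
```

The design point this sketch tries to capture is that the module library persists across tasks, so a module grown for one task can be reused by later programs instead of regenerating the entire code snippet for every new instance.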