GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs

Abstract

Recent work has shown that Large Language Models (LLMs) can empower traditional neuro-symbolic models: their programming capabilities translate language into module descriptions, yielding strong visual reasoning results while preserving the model's transparency and efficiency. However, these models usually generate the entire code snippet from scratch for each new instance of a task, which is extremely inefficient. We propose GENOME, generative neuro-symbolic visual reasoning by growing and reusing modules. Specifically, our model consists of three stages: module initialization, module generation, and module execution. First, given a vision-language task, we use LLMs to examine whether established modules can be reused and grown to handle the new task. If not, we initialize a new module needed by the task and specify its inputs and outputs. Next, the new module is created by querying LLMs to generate code snippets that match these requirements. To gauge the new module's ability, we treat few-shot training examples as test cases and check whether the module passes them. If it does, the module is added to the module library for future reuse. Finally, we evaluate the model on the test set by executing the parsed programs with the newly made visual modules to obtain the results. We find that the proposed model has several advantages. First, it performs competitively on standard tasks such as visual question answering and referring expression comprehension. Second, the modules learned from one task can be seamlessly transferred to new tasks. Finally, it can adapt to new visual reasoning tasks by observing a few training examples and reusing modules.
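The three stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `grow_library`, `run_program`, and `stub_generator` are hypothetical, the LLM is replaced by a stub that returns fixed source code, and modules are assumed to be plain Python callables operating on a toy list of detected objects.

```python
from typing import Callable, Dict, List, Tuple

# Assumed representation: a module is a Python callable, and a test case
# pairs positional inputs with the expected output.
Module = Callable[..., object]
TestCase = Tuple[tuple, object]


def grow_library(
    library: Dict[str, Module],
    name: str,
    signature: str,
    propose_code: Callable[[str, str], str],
    test_cases: List[TestCase],
) -> bool:
    """Make module `name` available in `library`; return True on success."""
    # Stage 1 (initialization): reuse an established module if one exists.
    if name in library:
        return True

    # Stage 2 (generation): request source code for the new module.
    # `propose_code` stands in for an LLM query (hypothetical interface).
    source = propose_code(name, signature)
    namespace: Dict[str, object] = {}
    try:
        # Illustration only: real generated code should run in a sandbox.
        exec(source, namespace)
        candidate = namespace[name]
        # Few-shot training examples serve as test cases for the candidate.
        passed = all(candidate(*args) == expected for args, expected in test_cases)
    except Exception:
        passed = False

    # Only modules that pass every test case join the library for reuse.
    if passed:
        library[name] = candidate
    return passed


def run_program(library: Dict[str, Module], steps: List[str], initial: object) -> object:
    """Stage 3 (execution): apply the named modules in sequence."""
    value = initial
    for step in steps:
        value = library[step](value)
    return value


def stub_generator(name: str, signature: str) -> str:
    # Stand-in for the LLM: always returns source for a counting module.
    return f"def {name}(objects):\n    return len(objects)\n"


# Toy library with one established module.
library: Dict[str, Module] = {
    "filter_red": lambda objects: [o for o in objects if o["color"] == "red"],
}
cases: List[TestCase] = [(([{"color": "red"}, {"color": "blue"}],), 2), (([],), 0)]
added = grow_library(library, "count", "count(objects) -> int", stub_generator, cases)

scene = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]
answer = run_program(library, ["filter_red", "count"], scene)
print(added, answer)  # True 2
```

In this sketch the established `filter_red` module is reused unchanged, while `count` is generated once, checked against the test cases, and then kept in the library so that later programs can call it without regenerating code.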
