Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Abstract

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM, and we find that LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
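
The modular idea in the abstract (independent vision modules describe an image in text, and an off-the-shelf LLM reasons over that text) can be illustrated with a minimal sketch. Everything named here is a hypothetical assumption made for illustration: the `VisionDescriptions` container, the `build_prompt` function, the choice of tags, attributes, and captions as module outputs, and the prompt layout. It is not the API of the linked repository, and it stops at prompt construction without calling any model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisionDescriptions:
    """Hypothetical container for the textual outputs of independent vision modules."""
    tags: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)
    captions: List[str] = field(default_factory=list)


def build_prompt(desc: VisionDescriptions, question: str) -> str:
    """Serialize the module outputs into a plain-text prompt for a text-only LLM.

    The section names and layout are assumptions, not the authors' format.
    """
    sections = [
        ("Tags", desc.tags),
        ("Attributes", desc.attributes),
        ("Captions", desc.captions),
    ]
    lines = []
    for name, items in sections:
        if items:  # skip modules that produced no output
            lines.append(f"{name}: " + "; ".join(items))
    lines.append(f"Question: {question}")
    lines.append("Short answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    desc = VisionDescriptions(tags=["dog"], captions=["a dog on grass"])
    print(build_prompt(desc, "What animal is shown?"))
```

Because the language model only ever sees text in this sketch, any text-only LLM could in principle consume the resulting prompt, which is the property the abstract highlights when it says no multimodal training is required.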
