Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Abstract

We propose LENS, a modular approach to tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over the outputs of a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach in pure computer vision settings, such as zero- and few-shot object recognition, as well as on vision-and-language problems. LENS can be applied to any off-the-shelf LLM, and we find that LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
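
The modular idea described in the abstract can be illustrated with a minimal sketch: vision modules emit text, and a text-only LLM reasons over that text. Everything named here is hypothetical and is not taken from the LENS codebase: the `VisualDescription` container, its `tags`, `attributes`, and `captions` fields, the prompt layout, and the `llm` callable are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VisualDescription:
    """Text produced by hypothetical vision modules for one image (assumed fields)."""
    tags: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)
    captions: List[str] = field(default_factory=list)


def build_prompt(desc: VisualDescription, question: str) -> str:
    """Flatten the module outputs into a plain-text prompt for a text-only LLM."""
    lines = [
        "Tags: " + ", ".join(desc.tags),
        "Attributes: " + ", ".join(desc.attributes),
        "Captions: " + " | ".join(desc.captions),
        "Question: " + question,
        "Short answer:",
    ]
    return "\n".join(lines)


def answer(desc: VisualDescription, question: str, llm: Callable[[str], str]) -> str:
    """Ask any text-in, text-out LLM callable; no multimodal training is involved."""
    return llm(build_prompt(desc, question)).strip()
```

Because `llm` is just a callable from string to string, any off-the-shelf model wrapper could be passed in without changing the rest of the sketch.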
