Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Abstract

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM, and we find that LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
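The core idea, a language model reasoning over text produced by independent vision modules, can be illustrated with a minimal sketch. This is not the authors' implementation: the module categories (tags, attributes, captions), the `VisionOutputs` class, the `build_prompt` function, and the prompt layout are all assumptions chosen for illustration, and the vision modules themselves are replaced by hard-coded strings.

```python
from dataclasses import dataclass, field


@dataclass
class VisionOutputs:
    """Text produced by hypothetical, independent vision modules (assumed names)."""
    tags: list[str] = field(default_factory=list)
    attributes: list[str] = field(default_factory=list)
    captions: list[str] = field(default_factory=list)


def build_prompt(outputs: VisionOutputs, question: str) -> str:
    """Serialize vision-module outputs into a plain-text prompt for an LLM."""
    sections = [
        ("Tags", outputs.tags),
        ("Attributes", outputs.attributes),
        ("Captions", outputs.captions),
    ]
    lines = []
    for name, items in sections:
        if items:  # skip modules that returned nothing
            lines.append(f"{name}: " + "; ".join(items))
    lines.append(f"Question: {question}")
    lines.append("Short answer:")
    return "\n".join(lines)


# Hard-coded stand-ins for what real vision modules might return.
example = VisionOutputs(
    tags=["dog", "frisbee"],
    captions=["a dog jumping to catch a frisbee in a park"],
)
prompt = build_prompt(example, "What is the dog doing?")
print(prompt)
```

The resulting string could then be passed to any off-the-shelf text-only LLM, which is why a setup of this kind would need no multimodal training: the image reaches the language model only as natural language.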
