Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Abstract

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over the outputs of a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM, and we find that LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
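The modular idea described above can be illustrated with a minimal sketch: independent vision modules each turn an image into text, and a language model then reasons over that text alone. Everything in the block is an assumption made for illustration, including the names `build_prompt` and `answer`, the prompt layout, and the stub modules and stub LLM. It is not the API of the released code at https://github.com/ContextualAI/lens.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (not from the paper or its code release):
# a vision module maps an image path to short text descriptions,
# and a language model maps a text prompt to a text completion.
VisionModule = Callable[[str], List[str]]
LanguageModel = Callable[[str], str]


def build_prompt(descriptions: Dict[str, List[str]], question: str) -> str:
    """Flatten the text outputs of all vision modules into one prompt."""
    lines = [f"{name}: {', '.join(items)}" for name, items in descriptions.items()]
    lines.append(f"Question: {question}")
    lines.append("Short answer:")
    return "\n".join(lines)


def answer(
    image_path: str,
    modules: Dict[str, VisionModule],
    llm: LanguageModel,
    question: str,
) -> str:
    """Run every vision module independently, then let the LLM reason over text only."""
    descriptions = {name: module(image_path) for name, module in modules.items()}
    return llm(build_prompt(descriptions, question))


# Stand-ins for real components, so the sketch runs without any model weights.
def stub_tagger(image_path: str) -> List[str]:
    return ["dog", "grass"]


def stub_captioner(image_path: str) -> List[str]:
    return ["a dog running on grass"]


def stub_llm(prompt: str) -> str:
    # A real LLM would generate text; this stub only checks for a keyword.
    return "dog" if "dog" in prompt else "unknown"


if __name__ == "__main__":
    modules = {"Tags": stub_tagger, "Captions": stub_captioner}
    print(answer("example.jpg", modules, stub_llm, "What animal is shown?"))
```

Because the language model only ever sees text, swapping in a different off-the-shelf LLM in this sketch means passing a different `llm` callable, with no multimodal training step involved.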
