Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Abstract

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM, and we find that LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
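The modular idea described in the abstract can be illustrated with a minimal sketch: independent vision modules emit text, and a text-only LLM reasons over that text. Everything named here is an assumption made for illustration, not the API of the released code: the `VisualDescription` schema, the choice of tags, attributes, and captions as module outputs, the prompt layout, and the `llm` callable are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VisualDescription:
    """Text emitted by independent vision modules for one image.

    Hypothetical schema: the field names are illustrative assumptions.
    """
    tags: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)
    captions: List[str] = field(default_factory=list)


def build_prompt(description: VisualDescription, question: str) -> str:
    """Serialize module outputs and the question into one text-only prompt."""
    lines = [
        "Tags: " + ", ".join(description.tags),
        "Attributes: " + ", ".join(description.attributes),
        "Captions:",
    ]
    lines += [f"- {caption}" for caption in description.captions]
    lines += [f"Question: {question}", "Short answer:"]
    return "\n".join(lines)


def answer(
    description: VisualDescription,
    question: str,
    llm: Callable[[str], str],
) -> str:
    """Pass the assembled prompt to any text-only LLM (assumed: str -> str)."""
    return llm(build_prompt(description, question))


if __name__ == "__main__":
    # Stub standing in for a real LLM call; it only echoes a fixed reply.
    def stub_llm(prompt: str) -> str:
        return "dog"

    desc = VisualDescription(
        tags=["dog", "grass"],
        attributes=["brown fur"],
        captions=["A dog running on grass."],
    )
    print(build_prompt(desc, "What animal is shown?"))
    print(answer(desc, "What animal is shown?", stub_llm))
```

Because the LLM only ever sees text, swapping in a different language model means passing a different `llm` callable, which mirrors the claim that the approach needs no multimodal training.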
