VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
March 1, 2024
Authors: Xiangxiang Chu, Jianlin Su, Bo Zhang, Chunhua Shen
cs.AI
Abstract
Large language models are built on top of a transformer-based architecture to
process textual inputs. For example, LLaMA stands out among many
open-source implementations. Can the same transformer be used to process 2D
images? In this paper, we answer this question by unveiling a LLaMA-like vision
transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored
for this purpose. VisionLLaMA is a unified and generic modelling framework for
solving most vision tasks. We extensively evaluate its effectiveness using
typical pre-training paradigms across a good portion of downstream tasks in
image perception and especially image generation. In many cases, VisionLLaMA has
exhibited substantial gains over the previous state-of-the-art vision
transformers. We believe that VisionLLaMA can serve as a strong new baseline
model for vision generation and understanding. Our code will be released at
https://github.com/Meituan-AutoML/VisionLLaMA.
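
To make "a LLaMA-like vision transformer" concrete, the sketch below shows how a standard LLaMA-style block (RMSNorm pre-normalization and a SwiGLU feed-forward) could be applied to a sequence of image patches. This is an illustrative assumption based on the well-known LLaMA recipe, not the paper's actual implementation; in particular, the abstract does not specify VisionLLaMA's positional scheme, so any adaptation of rotary embeddings to 2D is omitted here, and the class and parameter names are hypothetical.

```python
# Minimal sketch of a LLaMA-style block over image patches (illustrative only;
# not the released VisionLLaMA code). Positional handling is intentionally omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Root-mean-square normalization, as used in LLaMA (no mean subtraction).
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


class LLaMALikeBlock(nn.Module):
    """One pre-norm transformer block with self-attention and a SwiGLU MLP."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # Self-attention over the patch sequence with a residual connection.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # SwiGLU feed-forward: silu(gate) * up, projected back down.
        h = self.norm2(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))


# Usage: treat a 224x224 image as a sequence of 16x16 patch tokens.
patchify = nn.Conv2d(3, 768, kernel_size=16, stride=16)
tokens = patchify(torch.randn(1, 3, 224, 224)).flatten(2).transpose(1, 2)  # (1, 196, 768)
block = LLaMALikeBlock(dim=768, num_heads=12)
print(block(tokens).shape)  # torch.Size([1, 196, 768])
```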