VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
March 1, 2024
Authors: Xiangxiang Chu, Jianlin Su, Bo Zhang, Chunhua Shen
cs.AI
Abstract
Large language models are built on top of a transformer-based architecture to
process textual inputs. For example, LLaMA stands out among the many
open-source implementations. Can the same transformer be used to process 2D
images? In this paper, we answer this question by unveiling a LLaMA-like vision
transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored
for this purpose. VisionLLaMA is a unified and generic modelling framework for
solving most vision tasks. We extensively evaluate its effectiveness using
typical pre-training paradigms across a broad range of downstream tasks in
image perception and, especially, image generation. In many cases, VisionLLaMA
has exhibited substantial gains over previous state-of-the-art vision
transformers. We believe that VisionLLaMA can serve as a strong new baseline
model for vision generation and understanding. Our code will be released at
https://github.com/Meituan-AutoML/VisionLLaMA.
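
To make the idea of a LLaMA-like vision transformer concrete, below is a minimal, hypothetical sketch of a single LLaMA-style block applied to a grid of image patch tokens. It assumes the standard LLaMA components (RMSNorm, a SwiGLU feed-forward, and rotary position embeddings extended axially to two dimensions); all class and function names are illustrative and are not taken from the paper or its released code.

# Hypothetical sketch of a LLaMA-style transformer block on 2D patch tokens.
# Components (RMSNorm, SwiGLU, axial 2D rotary embedding) follow the generic
# LLaMA recipe; they are illustrative assumptions, not the authors' exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight


def rope_2d(q, k, h, w):
    # Axial 2D rotary embedding: rotate the first half of each head dimension
    # by the token's row index and the second half by its column index.
    b, heads, n, d = q.shape
    assert n == h * w and d % 4 == 0
    half = d // 2
    ys = torch.arange(h, device=q.device).repeat_interleave(w)  # row index per token
    xs = torch.arange(w, device=q.device).repeat(h)             # column index per token

    def rotate(t, pos, dim):
        freqs = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=t.device) / dim))
        ang = pos[:, None] * freqs[None, :]                     # (n, dim/2)
        cos, sin = ang.cos(), ang.sin()
        t1, t2 = t[..., 0::2], t[..., 1::2]
        return torch.stack([t1 * cos - t2 * sin, t1 * sin + t2 * cos], dim=-1).flatten(-2)

    def apply(t):
        return torch.cat([rotate(t[..., :half], ys, half),
                          rotate(t[..., half:], xs, half)], dim=-1)

    return apply(q), apply(k)


class VisionLLaMABlock(nn.Module):
    def __init__(self, dim=384, heads=6, mlp_ratio=4.0):
        super().__init__()
        self.heads = heads
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        hidden = int(dim * mlp_ratio)
        self.w1 = nn.Linear(dim, hidden, bias=False)  # SwiGLU gate
        self.w2 = nn.Linear(dim, hidden, bias=False)  # SwiGLU value
        self.w3 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x, h, w):
        b, n, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        q, k = rope_2d(q, k, h, w)
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, n, d))
        y = self.norm2(x)
        return x + self.w3(F.silu(self.w1(y)) * self.w2(y))


if __name__ == "__main__":
    tokens = torch.randn(2, 14 * 14, 384)            # 2 images, 14x14 patch grid
    out = VisionLLaMABlock()(tokens, h=14, w=14)
    print(out.shape)                                  # torch.Size([2, 196, 384])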