Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
April 11, 2024
Authors: Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
cs.AI
Abstract
While Ferret seamlessly integrates regional understanding into the Large
Language Model (LLM) to facilitate its referring and grounding capability, it
has certain limitations: constrained by its pre-trained, fixed visual encoder,
it fails to perform well on broader tasks. In this work, we unveil Ferret-v2,
a significant upgrade to Ferret, with three key designs. (1) Any resolution
grounding and referring: A flexible approach that effortlessly handles higher
image resolution, improving the model's ability to process and understand
images in greater detail. (2) Multi-granularity visual encoding: By integrating
the additional DINOv2 encoder, the model learns better and diverse underlying
contexts for global and fine-grained visual information. (3) A three-stage
training paradigm: Besides image-caption alignment, an additional stage is
proposed for high-resolution dense alignment before the final instruction
tuning. Experiments show that Ferret-v2 provides substantial improvements over
Ferret and other state-of-the-art methods, thanks to its high-resolution
scaling and fine-grained visual processing.
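The multi-granularity encoding described in point (2) can be illustrated with a minimal sketch: a downsampled global view is passed through one encoder (CLIP-style), while the full-resolution image is split into a grid of sub-patches, each passed through a second encoder (DINOv2-style), and the resulting token sequences are concatenated. The encoders below are stand-ins (random projections), and all names, dimensions, and the tiling scheme are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

RNG = np.random.default_rng(0)
DIM = 64  # shared token dimension after projection (assumed)

def encode_view(image):
    """Stand-in for a ViT-style encoder: 16x16 average pooling followed
    by a random linear projection to DIM-dimensional tokens."""
    h, w, c = image.shape
    pooled = image.reshape(h // 16, 16, w // 16, 16, c).mean(axis=(1, 3))
    return pooled.reshape(-1, c) @ RNG.standard_normal((c, DIM))

def multi_granularity_tokens(image, grid=2, low_res=224):
    """Concatenate global tokens (low-res view, CLIP-like path) with
    local tokens from a grid of high-res sub-patches (DINOv2-like path)."""
    # Global path: naive strided downsample to a low-res view (illustrative).
    stride = image.shape[0] // low_res
    global_view = image[::stride, ::stride][:low_res, :low_res]
    tokens = [encode_view(global_view)]
    # Local path: split the full-resolution image into grid x grid tiles.
    ph, pw = image.shape[0] // grid, image.shape[1] // grid
    for i in range(grid):
        for j in range(grid):
            tile = image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            tokens.append(encode_view(tile))
    return np.concatenate(tokens, axis=0)

image = RNG.random((448, 448, 3))
tokens = multi_granularity_tokens(image)
# 196 global tokens + 4 tiles x 196 local tokens = 980 tokens of size 64
```

In Ferret-v2 the two paths use different pre-trained encoders, which is the point of "multi-granularity": the global path captures scene-level semantics while the high-resolution tiles preserve the fine detail needed for referring and grounding.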