

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

April 11, 2024
作者: Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
cs.AI

Abstract

While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained, fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: a flexible approach that effortlessly handles higher image resolutions, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: by integrating an additional DINOv2 encoder, the model learns richer and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
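The first two designs, any-resolution processing and multi-granularity encoding, can be sketched in a few lines. The code below is a minimal, illustrative toy, not the paper's implementation: `StubEncoder`, `split_any_resolution`, and `multi_granularity_encode` are hypothetical names, and frozen random projections stand in for the real CLIP/DINOv2 encoders. It only shows the data flow: tile a high-resolution image into fixed-size crops, encode a downsampled global view and each local crop, and fuse the two granularities per region.

```python
import numpy as np

def split_any_resolution(image, tile=224):
    """Split a high-resolution image (H, W, C) into fixed-size tiles,
    zero-padding the borders so every tile is exactly tile x tile.
    This mimics the 'any resolution' idea of processing local crops
    alongside a downsampled global view."""
    h, w, _ = image.shape
    padded = np.pad(image, ((0, (-h) % tile), (0, (-w) % tile), (0, 0)))
    return [padded[i:i + tile, j:j + tile]
            for i in range(0, padded.shape[0], tile)
            for j in range(0, padded.shape[1], tile)]

class StubEncoder:
    """Stand-in for a visual encoder (e.g. CLIP for global semantics,
    DINOv2 for fine-grained detail). Projects a flattened tile to a
    fixed-size embedding with a frozen random matrix."""
    def __init__(self, in_dim, out_dim, seed):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

    def __call__(self, tile):
        return tile.reshape(-1) @ self.w

def multi_granularity_encode(image, tile=224, dim=64):
    """Fuse one global embedding (coarse semantics) with per-tile local
    embeddings (fine detail) by concatenation: one token per region."""
    global_enc = StubEncoder(tile * tile * 3, dim, seed=0)
    local_enc = StubEncoder(tile * tile * 3, dim, seed=1)
    # Global view: naive stride-based downsample to tile x tile.
    h, w, _ = image.shape
    ys = np.linspace(0, h - 1, tile).astype(int)
    xs = np.linspace(0, w - 1, tile).astype(int)
    g = global_enc(image[np.ix_(ys, xs)])
    tokens = [np.concatenate([g, local_enc(t)])
              for t in split_any_resolution(image, tile)]
    return np.stack(tokens)  # shape: (num_tiles, 2 * dim)
```

In the actual model the fused visual tokens would then be projected into the LLM's embedding space for referring and grounding; here the point is only that each region token carries both global and fine-grained information.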