

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

April 11, 2024
作者: Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
cs.AI

Abstract

While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained, fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: a flexible approach that effortlessly handles higher image resolutions, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: by integrating an additional DINOv2 encoder, the model learns richer and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
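The first two designs, any-resolution processing and multi-granularity encoding, can be sketched in a few lines. The code below is a minimal, illustrative toy, not the paper's implementation: `StubEncoder`, `split_any_resolution`, and `multi_granularity_encode` are hypothetical names, and frozen random projections stand in for the real CLIP/DINOv2 encoders. It only shows the data flow: tile a high-resolution image into fixed-size crops, encode a downsampled global view and each local crop, and fuse the two granularities per region.

```python
import numpy as np

def split_any_resolution(image, tile=224):
    """Split a high-resolution image (H, W, C) into fixed-size tiles,
    zero-padding the borders so every tile is exactly tile x tile.
    This mimics the 'any resolution' idea of processing local crops
    alongside a downsampled global view."""
    h, w, _ = image.shape
    padded = np.pad(image, ((0, (-h) % tile), (0, (-w) % tile), (0, 0)))
    return [padded[i:i + tile, j:j + tile]
            for i in range(0, padded.shape[0], tile)
            for j in range(0, padded.shape[1], tile)]

class StubEncoder:
    """Stand-in for a visual encoder (e.g. CLIP for global semantics,
    DINOv2 for fine-grained detail). Projects a flattened tile to a
    fixed-size embedding with a frozen random matrix."""
    def __init__(self, in_dim, out_dim, seed):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

    def __call__(self, tile):
        return tile.reshape(-1) @ self.w

def multi_granularity_encode(image, tile=224, dim=64):
    """Fuse one global embedding (coarse semantics) with per-tile local
    embeddings (fine detail) by concatenation: one token per region."""
    global_enc = StubEncoder(tile * tile * 3, dim, seed=0)
    local_enc = StubEncoder(tile * tile * 3, dim, seed=1)
    # Global view: naive stride-based downsample to tile x tile.
    h, w, _ = image.shape
    ys = np.linspace(0, h - 1, tile).astype(int)
    xs = np.linspace(0, w - 1, tile).astype(int)
    g = global_enc(image[np.ix_(ys, xs)])
    tokens = [np.concatenate([g, local_enc(t)])
              for t in split_any_resolution(image, tile)]
    return np.stack(tokens)  # shape: (num_tiles, 2 * dim)
```

In the actual model the fused visual tokens would then be projected into the LLM's embedding space for referring and grounding; here the point is only that each region token carries both global and fine-grained information.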