Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Abstract

Ferret seamlessly integrates regional understanding into a Large Language Model (LLM) to support referring and grounding, but it has certain limitations: it is constrained by its pre-trained, fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: a flexible approach that handles higher image resolutions, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: by integrating an additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
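
As a rough illustration of the any-resolution idea in design (1), the sketch below plans how a high-resolution image could be divided into fixed-size tiles that a fixed-resolution encoder processes one at a time. It is not taken from the Ferret-v2 paper: the function name `plan_tiles`, the 336-pixel tile size, and the tile budget are all assumptions chosen for illustration.

```python
from math import ceil
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def plan_tiles(
    width: int,
    height: int,
    tile: int = 336,      # assumed encoder input size (hypothetical)
    max_tiles: int = 9,   # assumed budget on the number of tiles (hypothetical)
) -> Tuple[Tuple[int, int], List[Box]]:
    """Choose a (cols, rows) grid for an image and return the tile boxes.

    The image is assumed to be resized to (cols * tile, rows * tile)
    beforehand, so every returned box is exactly tile x tile pixels.
    """
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height and tile must be positive")
    if max_tiles < 1:
        raise ValueError("max_tiles must be at least 1")

    cols = max(1, ceil(width / tile))
    rows = max(1, ceil(height / tile))

    # Shrink the grid until it fits the budget, dropping a column or a row
    # from whichever grid dimension is currently larger.
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1

    # Enumerate tiles row by row, left to right.
    boxes = [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]
    return (cols, rows), boxes
```

In a setup like this, each tile would be encoded separately while a downsized copy of the whole image supplies global context. How the two sets of features are combined is a separate design choice that this sketch does not cover.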
