OtterHD: A High-Resolution Multi-modality Model

Abstract

In this paper, we present OtterHD-8B, an innovative multimodal model evolved from Fuyu-8B, specifically engineered to interpret high-resolution visual inputs with granular precision. Unlike conventional models that are constrained by fixed-size vision encoders, OtterHD-8B boasts the ability to handle flexible input dimensions, ensuring its versatility across various inference requirements. Alongside this model, we introduce MagnifierBench, an evaluation framework designed to scrutinize models' ability to discern minute details and spatial relationships of small objects. Our comparative analysis reveals that while current leading models falter on this benchmark, OtterHD-8B, particularly when directly processing high-resolution inputs, outperforms its counterparts by a substantial margin. The findings illuminate the structural variances in visual information processing among different models and the influence that the vision encoders' pre-training resolution disparities have on model effectiveness within such benchmarks. Our study highlights the critical role of flexibility and high-resolution input capabilities in large multimodal models and also exemplifies the potential inherent in the Fuyu architecture's simplicity for handling complex visual data.

