OtterHD: A High-Resolution Multi-modality Model

Abstract

We present OtterHD-8B, a multimodal model evolved from Fuyu-8B and engineered to interpret high-resolution visual inputs with fine-grained precision. Unlike conventional models, which are constrained by fixed-size vision encoders, OtterHD-8B accepts flexible input dimensions, making it versatile across a range of inference requirements. Alongside the model, we introduce MagnifierBench, an evaluation framework designed to test a model's ability to discern the fine details and spatial relationships of small objects. Our comparative analysis shows that current leading models falter on this benchmark, whereas OtterHD-8B, particularly when processing high-resolution inputs directly, outperforms its counterparts by a substantial margin. The findings shed light on the structural differences in how models process visual information, and on how differences in the pre-training resolution of vision encoders affect model effectiveness on such benchmarks. Our study highlights the critical role of flexibility and high-resolution input capability in large multimodal models, and illustrates the potential of the Fuyu architecture's simplicity for handling complex visual data.

