OtterHD: A High-Resolution Multi-modality Model

Abstract

In this paper, we present OtterHD-8B, a multimodal model evolved from Fuyu-8B and specifically engineered to interpret high-resolution visual inputs with granular precision. Unlike conventional models, which are constrained by fixed-size vision encoders, OtterHD-8B can handle flexible input dimensions, which keeps it versatile across varied inference requirements. Alongside this model, we introduce MagnifierBench, an evaluation framework designed to test a model's ability to discern minute details and spatial relationships of small objects. Our comparative analysis shows that while current leading models falter on this benchmark, OtterHD-8B, particularly when processing high-resolution inputs directly, outperforms its counterparts by a substantial margin. The findings highlight the structural differences in how models process visual information, and the influence that differences in the vision encoders' pre-training resolution have on model effectiveness on such benchmarks. Our study underscores the critical role of flexibility and high-resolution input capability in large multimodal models, and illustrates the potential of the Fuyu architecture's simplicity for handling complex visual data.
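As a rough illustration of what "flexible input dimensions" can mean for a model without a fixed-size vision encoder, the sketch below counts how many patch tokens an image of a given size would produce if it were simply cut into square patches. It is a minimal sketch under stated assumptions, not OtterHD-8B's actual preprocessing: the function name `patch_grid` is hypothetical, and the default patch size of 30 pixels and the example resolutions are assumptions chosen for illustration.

```python
import math


def patch_grid(height: int, width: int, patch: int = 30) -> tuple[int, int, int]:
    """Return (rows, cols, n_tokens) for an image cut into square patches.

    Assumption (illustrative only): each patch becomes one visual token,
    and edge patches are padded, so partial patches still count.
    """
    if height <= 0 or width <= 0 or patch <= 0:
        raise ValueError("height, width and patch must be positive")
    rows = math.ceil(height / patch)  # patches along the vertical axis
    cols = math.ceil(width / patch)   # patches along the horizontal axis
    return rows, cols, rows * cols


if __name__ == "__main__":
    # The token count grows with resolution instead of being fixed in advance.
    for side in (448, 512, 1024):  # hypothetical input sizes
        rows, cols, n_tokens = patch_grid(side, side)
        print(f"{side}x{side}: {rows}x{cols} grid, {n_tokens} tokens")
```

Under these assumptions, a larger input yields a longer token sequence rather than being resized to one fixed shape, which is the trade-off a flexible-resolution model accepts: more detail at the cost of more tokens to process.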
