OtterHD: A High-Resolution Multi-modality Model
November 7, 2023
Authors: Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu
cs.AI
Abstract
In this paper, we present OtterHD-8B, an innovative multimodal model evolved
from Fuyu-8B, specifically engineered to interpret high-resolution visual
inputs with granular precision. Unlike conventional models that are constrained
by fixed-size vision encoders, OtterHD-8B boasts the ability to handle flexible
input dimensions, ensuring its versatility across various inference
requirements. Alongside this model, we introduce MagnifierBench, an evaluation
framework designed to scrutinize models' ability to discern minute details and
spatial relationships of small objects. Our comparative analysis reveals that
while current leading models falter on this benchmark, OtterHD-8B, particularly
when directly processing high-resolution inputs, outperforms its counterparts
by a substantial margin. The findings illuminate the structural variances in
visual information processing among different models and the influence that the
vision encoders' pre-training resolution disparities have on model
effectiveness within such benchmarks. Our study highlights the critical role of
flexibility and high-resolution input capabilities in large multimodal models
and also exemplifies the potential inherent in the Fuyu architecture's
simplicity for handling complex visual data.