OtterHD: A High-Resolution Multi-modality Model

Abstract

In this paper, we present OtterHD-8B, an innovative multimodal model evolved from Fuyu-8B, specifically engineered to interpret high-resolution visual inputs with granular precision. Unlike conventional models that are constrained by fixed-size vision encoders, OtterHD-8B can handle flexible input dimensions, which makes it versatile across various inference requirements. Alongside this model, we introduce MagnifierBench, an evaluation framework designed to scrutinize models' ability to discern minute details and spatial relationships of small objects. Our comparative analysis reveals that while current leading models falter on this benchmark, OtterHD-8B, particularly when directly processing high-resolution inputs, outperforms its counterparts by a substantial margin. The findings illuminate the structural differences in visual information processing among models and the influence that differences in the vision encoders' pre-training resolution have on model effectiveness within such benchmarks. Our study highlights the critical role of flexibility and high-resolution input capabilities in large multimodal models and also exemplifies the potential of the Fuyu architecture's simplicity for handling complex visual data.
