InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
April 9, 2024
作者: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
The Large Vision-Language Model (LVLM) field has seen significant
advancements, yet its progression has been hindered by challenges in
comprehending fine-grained visual content due to limited resolution. Recent
efforts have aimed to enhance the high-resolution understanding capabilities of
LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and
constrained to a relatively narrow resolution range. This paper presents
InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM
resolution capabilities up to 4K HD (3840 x 1600) and beyond. Concurrently,
considering that ultra-high resolution may not be necessary in all scenarios, it
supports a wide range of resolutions from 336 pixels up to the 4K standard,
significantly broadening its scope of applicability. Specifically, this
research advances the patch division paradigm by introducing a novel extension:
dynamic resolution with automatic patch configuration. It maintains the
training image aspect ratios while automatically varying patch counts and
configuring layouts based on a pre-trained Vision Transformer (ViT) (336 x
336), leading to dynamic training resolution from 336 pixels to 4K standard.
Our research demonstrates that scaling training resolution up to 4K HD leads to
consistent performance enhancements without hitting the ceiling of potential
improvements. InternLM-XComposer2-4KHD shows superb capability, matching or
even surpassing GPT-4V and Gemini Pro on 10 of the 16 benchmarks. The
InternLM-XComposer2-4KHD model series, with 7B parameters, is publicly available
at https://github.com/InternLM/InternLM-XComposer.
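The dynamic resolution scheme described in the abstract can be sketched as follows: given an input image and a patch budget, pick a grid of 336 x 336 ViT patches whose shape best preserves the image's aspect ratio. This is an illustrative reconstruction from the abstract alone; the function name, the patch budget, and the exact layout-selection rule are assumptions, not the authors' released implementation.

```python
import math

PATCH_SIZE = 336  # base ViT input size (336 x 336), per the abstract


def patch_layout(width, height, max_patches):
    """Choose a (cols, rows) grid of 336x336 patches that stays within the
    patch budget and best preserves the image's aspect ratio.

    Illustrative sketch only -- the paper's exact selection rule may differ.
    """
    target_ratio = width / height
    best_key, best_grid = None, None
    for rows in range(1, max_patches + 1):
        for cols in range(1, max_patches // rows + 1):
            # Log-ratio distance treats 2x-too-wide and 2x-too-tall equally.
            score = abs(math.log((cols / rows) / target_ratio))
            # Prefer the closest aspect ratio; break ties with more patches
            # (i.e. higher effective resolution).
            key = (score, -(cols * rows))
            if best_key is None or key < best_key:
                best_key, best_grid = key, (cols, rows)
    return best_grid


# Example: a 4K HD frame (3840 x 1600) under a hypothetical budget of 55 patches.
cols, rows = patch_layout(3840, 1600, 55)
# The image would then be resized (aspect-ratio preserved, padded as needed)
# to fill a (cols * 336) x (rows * 336) canvas before patch division.
resized = (cols * PATCH_SIZE, rows * PATCH_SIZE)
```

Because the grid adapts to both the budget and the aspect ratio, the same routine covers everything from a single 336-pixel patch up to 4K-scale inputs, which is the "automatic patch configuration" behavior the abstract describes.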