InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
April 9, 2024
Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
The Large Vision-Language Model (LVLM) field has seen significant
advancements, yet its progression has been hindered by challenges in
comprehending fine-grained visual content due to limited resolution. Recent
efforts have aimed to enhance the high-resolution understanding capabilities of
LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and
constrained to a relatively narrow resolution range. This paper presents
InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM
resolution capabilities up to 4K HD (3840 x 1600) and beyond. Concurrently,
considering that ultra-high resolution may not be necessary in all scenarios, it
supports a wide range of diverse resolutions from 336 pixels to 4K standard,
significantly broadening its scope of applicability. Specifically, this
research advances the patch division paradigm by introducing a novel extension:
dynamic resolution with automatic patch configuration. It maintains the
training image aspect ratios while automatically varying patch counts and
configuring layouts based on a pre-trained Vision Transformer (ViT) (336 x
336), leading to dynamic training resolution from 336 pixels to 4K standard.
Our research demonstrates that scaling training resolution up to 4K HD leads to
consistent performance enhancements without hitting the ceiling of potential
improvements. InternLM-XComposer2-4KHD shows superb capability, matching or
even surpassing GPT-4V and Gemini Pro on 10 of the 16 benchmarks. The
InternLM-XComposer2-4KHD model series, with 7B parameters, is publicly available
at https://github.com/InternLM/InternLM-XComposer.
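The dynamic-resolution idea described above — covering an image with a variable number of 336 x 336 ViT patches while preserving its aspect ratio — can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the `max_patches` budget and the shrink-until-it-fits loop are assumptions chosen for clarity.

```python
import math

def patch_layout(width, height, patch=336, max_patches=55):
    """Sketch of automatic patch configuration.

    Scales the image down (never up) until the grid of patch-sized
    tiles needed to cover it fits within the tile budget, preserving
    the aspect ratio throughout. Returns (rows, cols, scaled_size).
    Note: `max_patches=55` and the 0.95 shrink factor are illustrative
    assumptions, not values taken from the paper.
    """
    scale = 1.0
    while True:
        cols = math.ceil(width * scale / patch)   # tiles across
        rows = math.ceil(height * scale / patch)  # tiles down
        if rows * cols <= max_patches:
            return rows, cols, (round(width * scale), round(height * scale))
        scale *= 0.95  # shrink slightly and retry

# A 4K HD input (3840 x 1600) yields a multi-tile grid within budget,
# while a 336 x 336 input needs only a single tile.
print(patch_layout(3840, 1600))
print(patch_layout(336, 336))
```

Because the grid dimensions follow the (scaled) image dimensions independently, a wide image gets more columns than rows and a tall image the reverse, which is how the layout adapts to arbitrary aspect ratios between 336 pixels and the 4K standard.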