Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

September 19, 2024
Authors: Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
cs.AI

Abstract

Visual data comes in many forms, ranging from small icons of just a few pixels to long videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual inputs to a fixed resolution for the visual encoder and yield similar numbers of tokens for the LLM. This approach is suboptimal for multimodal understanding and inefficient for processing inputs with long and short visual content. To solve this problem, we propose Oryx, a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths through two core innovations: 1) a pre-trained OryxViT model that can encode images at any resolution into LLM-friendly visual representations; 2) a dynamic compressor module that supports 1x to 16x compression of visual tokens on request. These design features let Oryx accommodate extremely long visual contexts, such as videos, at lower resolution and high compression, while maintaining high recognition precision for tasks like document understanding at native resolution with no compression. Beyond the architectural improvements, enhanced data curation and specialized training on long-context retrieval and spatial-aware data help Oryx achieve strong capabilities in image, video, and 3D multimodal understanding simultaneously. Our work is open-sourced at https://github.com/Oryx-mllm/Oryx.
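The abstract describes the dynamic compressor only at a high level. The sketch below is a minimal, hypothetical PyTorch illustration of one plausible pooling-based realization: visual tokens laid out on an (H, W) grid are average-pooled by a per-axis ratio before a projection, so ratio = 1, 2, or 4 yields 1x, 4x, or 16x token compression. The class name DynamicCompressor, the grid_hw argument, and the pooling strategy are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of an on-demand visual token compressor.
# The paper only states that visual tokens can be compressed 1x-16x on
# request; average pooling over the token grid is one plausible mechanism.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCompressor(nn.Module):  # name is an assumption, not from the paper
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # keep the LLM-facing token dimension fixed

    def forward(self, tokens: torch.Tensor, grid_hw: tuple, ratio: int = 1) -> torch.Tensor:
        # tokens: (B, H*W, D) visual tokens from the encoder at native resolution
        # ratio: per-axis downsampling factor; ratio=4 -> 16x fewer tokens overall
        assert ratio in (1, 2, 4), "supports 1x, 4x, or 16x total compression"
        b, n, d = tokens.shape
        h, w = grid_hw
        if ratio > 1:
            x = tokens.transpose(1, 2).reshape(b, d, h, w)   # restore 2D token grid
            x = F.avg_pool2d(x, kernel_size=ratio)            # spatial pooling
            tokens = x.flatten(2).transpose(1, 2)             # (B, N/ratio^2, D)
        return self.proj(tokens)
```

Under this reading, a multi-hour video could be encoded with ratio=4 per frame to keep the LLM context short, while a document page would use ratio=1 to preserve fine detail, matching the on-demand trade-off the abstract describes.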