Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Abstract

Visual data comes in many forms, from small icons of just a few pixels to long videos spanning hours. Existing multimodal LLMs usually standardize these diverse visual inputs to a fixed resolution for the visual encoder and yield similar numbers of tokens for the LLM. This approach is suboptimal for multimodal understanding and inefficient for processing inputs with long and short visual content. To solve this problem, we propose Oryx, a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution that seamlessly and efficiently processes visual inputs of arbitrary spatial size and temporal length through two core innovations: 1) a pre-trained OryxViT model that can encode images at any resolution into LLM-friendly visual representations; 2) a dynamic compressor module that supports 1x to 16x compression of visual tokens on request. These design features enable Oryx to accommodate extremely long visual contexts, such as videos, at lower resolution and high compression, while maintaining high recognition precision for tasks like document understanding at native resolution with no compression. Beyond the architectural improvements, enhanced data curation and specialized training on long-context retrieval and spatial-aware data help Oryx achieve strong capabilities in image, video, and 3D multimodal understanding simultaneously. Our work is open-sourced at https://github.com/Oryx-mllm/Oryx.
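To make the trade-off between resolution and compression concrete, the following is a minimal sketch of the token arithmetic, not the Oryx implementation. The patch size of 14, the restriction to square pooling windows, the use of plain average pooling, and the function names `visual_token_count` and `average_pool_tokens` are all assumptions introduced for illustration; the abstract only states that compression ranges from 1x to 16x.

```python
import math


def visual_token_count(height, width, patch_size=14, compression=1):
    """Estimate the number of visual tokens for one image or video frame.

    ASSUMPTIONS (not from the abstract): patch_size=14 and compression
    realised by pooling a square window of neighbouring tokens.
    """
    side = math.isqrt(compression)
    if side * side != compression:
        raise ValueError("compression must be a perfect square, e.g. 1, 4, 16")
    # Native resolution: one token per patch, rounding partial patches up.
    grid_h = math.ceil(height / patch_size)
    grid_w = math.ceil(width / patch_size)
    # Pooling side x side tokens into one shrinks each axis by `side`.
    return math.ceil(grid_h / side) * math.ceil(grid_w / side)


def average_pool_tokens(grid, side):
    """Average-pool a 2D grid of scalar features with a side x side window.

    Stand-in for a learned compressor; windows at the border may be smaller.
    """
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows, side):
        out_row = []
        for c in range(0, cols, side):
            window = [
                grid[i][j]
                for i in range(r, min(r + side, rows))
                for j in range(c, min(c + side, cols))
            ]
            out_row.append(sum(window) / len(window))
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    for ratio in (1, 4, 16):
        print(ratio, visual_token_count(448, 448, compression=ratio))
```

Under these assumptions, a 448x448 frame costs 1024 tokens uncompressed and 64 tokens at 16x, so sixteen compressed video frames occupy the same budget as one uncompressed image. This is the kind of on-demand trade-off the abstract describes.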
