HYDRA-X: A Native Unified Multimodal Model with a Holistic Visual Tokenizer

Abstract

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) because they map diverse visual inputs into a single representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design addresses two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. For the first challenge, comprehensive ablations yield two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. For the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement to the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, which substantially improves editing consistency and accelerates convergence. Instantiated as a 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.
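To make the contrast in finding (1) concrete, the following is a minimal sketch of one plausible reading of frame-level causal temporal attention: tokens attend freely within their own frame and to all earlier frames, but never to later frames. The abstract does not specify the exact masking scheme, so the function names `frame_causal_mask` and `full_spatiotemporal_mask`, the flattened frame-major token layout, and the toy sizes are all assumptions made for illustration, not the authors' implementation.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Hypothetical helper: mask[q][k] is True when query token q may attend to key token k.

    Tokens are assumed to be flattened frame by frame, so token i belongs to
    frame i // tokens_per_frame. A query sees every token in its own frame
    and in all earlier frames, and nothing from later frames.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        mask.append([(k // tokens_per_frame) <= q_frame for k in range(total)])
    return mask


def full_spatiotemporal_mask(num_frames, tokens_per_frame):
    """Hypothetical helper: every token may attend to every other token."""
    total = num_frames * tokens_per_frame
    return [[True] * total for _ in range(total)]


def count_allowed(mask):
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


# Toy example (assumed sizes): 3 frames with 2 tokens each.
causal = frame_causal_mask(3, 2)
full = full_spatiotemporal_mask(3, 2)
for row in causal:
    print("".join("x" if allowed else "." for allowed in row))
print(count_allowed(causal), "of", count_allowed(full), "pairs allowed")
```

Under this reading, the mask is block lower-triangular at frame granularity: the first frame is encoded exactly as a standalone image would be, which is one way a single ViT could treat images and videos uniformly.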