HYDRA-X: A Native Unified Multimodal Model with a Holistic Visual Tokenizer

Abstract

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) because they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To address the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement to the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the large language model (LLM), which substantially improves editing consistency and accelerates convergence. Instantiated as a 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.
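
To make the first finding concrete, the following is a minimal sketch of one possible reading of frame-level causal temporal attention: tokens are laid out frame by frame, and a token may attend to every token in its own frame and in earlier frames, but to none in later frames. The function names (`frame_causal_mask`, `count_allowed`) and the flat token layout are illustrative assumptions, not the HYDRA-X implementation, whose exact attention pattern is not specified in this abstract.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Build a boolean attention mask (hypothetical helper, not from HYDRA-X).

    mask[q][k] is True when query token q may attend to key token k.
    Assumption: tokens are ordered frame by frame, so token index // tokens_per_frame
    gives the frame index. A query sees its own frame and all earlier frames.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        # Allow keys whose frame index is not later than the query's frame.
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


def count_allowed(mask):
    """Count the (query, key) pairs that the mask permits."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    # Three frames with two tokens each: rows are queries, columns are keys.
    for row in frame_causal_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this reading, a clip of 3 frames with 2 tokens each permits 24 of the 36 query-key pairs that full spatiotemporal attention would allow, and a single image (one frame) reduces to ordinary full attention within that frame.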