HYDRA-X: 전체적 시각 토크나이저를 갖춘 네이티브 통합 멀티모달 모델

초록

전체론적 시각 토크나이저는 다양한 시각 입력을 통합된 표현 공간으로 매핑함으로써 통합 멀티모달 모델(UMM)의 핵심을 이룬다. 본 논문에서는 단일 Vision Transformer(ViT) 내에서 이미지와 비디오 토큰화를 통합한 최초의 UMM인 HYDRA-X를 제시한다. 우리의 설계는 두 가지 핵심 과제, 즉 (1) 네이티브 ViT에 시공간 재구성 능력을 효율적으로 주입하는 것과, (2) 잠재 공간에 이미지 및 비디오 수준의 의미 인식을 내장하는 것에 의해 추진된다. 첫 번째 과제를 해결하기 위해 포괄적인 절제 실험을 통해 두 가지 주요 발견을 확인하였다: (1) 프레임 수준의 인과적 시간적 어텐션만으로도 시각 재구성이 충분하며, 전체 시공간 어텐션은 오히려 이를 저하시킨다는 점, (2) 계층적 시간적 압축이 단일 단계 대안보다 현저히 우수하다는 점이다. 두 번째 과제를 해결하기 위해, 공동 이미지-비디오 교사 감독 하에 시간적으로 압축된 특징을 업샘플링하는 경량 압축 해제기를 제안하며, 이를 통해 컴팩트한 잠재 공간 내에서 상호 보완적인 의미 구조를 강제한다. 이 포괄적 토크나이저를 기반으로, 편집 파이프라인의 원칙적인 개선을 추가로 제안한다: 소스-타겟 상호작용은 LLM 내부의 의미 수준이 아닌 토크나이저 내부의 잠재 수준에서 이루어져야 하며, 이는 편집 일관성을 크게 향상시키고 수렴을 가속화한다. 7B 밀집 모델로 구현된 HYDRA-X는 이미지 및 비디오 이해와 생성 작업 전반에서 강력한 성능을 달성하며, 향후 통합 토크나이저 기반 UMM의 길을 닦는다.

English

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated at the 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.