HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Abstract

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) because they map diverse visual inputs into a single representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To address the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement to the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, which substantially improves editing consistency and accelerates convergence. Instantiated on a 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.
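
To make the first ablation finding concrete, the sketch below builds one plausible form of a frame-level causal attention mask: every token may attend to all tokens in its own frame and in earlier frames, but to none in later frames. This is an illustrative reading of the term, not the HYDRA-X implementation; the function name frame_causal_mask, the flattened frame-major token layout, and the parameters num_frames and tokens_per_frame are all assumptions introduced here.

```python
from typing import List


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Build a block-causal attention mask over a flattened video token sequence.

    Assumption (not from the paper): tokens are laid out frame by frame, so
    position p belongs to frame p // tokens_per_frame.

    mask[q][k] is True when query token q may attend to key token k, which
    here means k's frame index is less than or equal to q's frame index.
    """
    total = num_frames * tokens_per_frame
    # Frame index of every flattened token position.
    frame_of = [pos // tokens_per_frame for pos in range(total)]
    return [
        [frame_of[k] <= frame_of[q] for k in range(total)]
        for q in range(total)
    ]


if __name__ == "__main__":
    # Three frames of two tokens each; "x" marks an allowed query/key pair.
    for row in frame_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("x" if allowed else "." for allowed in row))
    # Prints:
    # xx....
    # xx....
    # xxxx..
    # xxxx..
    # xxxxxx
    # xxxxxx
```

Under this reading, a single image is simply the case num_frames=1, where the mask is all True and attention reduces to ordinary full spatial attention. That is one way a single ViT could process both images and videos. Full spatiotemporal attention would correspond to an all-True mask for any number of frames.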