AToken: A Unified Tokenizer for Vision
September 17, 2025
Authors: Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, Yinfei Yang
cs.AI
Abstract
We present AToken, the first unified visual tokenizer that achieves both
high-fidelity reconstruction and semantic understanding across images, videos,
and 3D assets. Unlike existing tokenizers that specialize in either
reconstruction or understanding for single modalities, AToken encodes these
diverse visual inputs into a shared 4D latent space, unifying both tasks and
modalities in a single framework. Specifically, we introduce a pure transformer
architecture with 4D rotary position embeddings to process visual inputs of
arbitrary resolutions and temporal durations. To ensure stable training, we
propose an adversarial-free training objective that combines perceptual and
Gram matrix losses, achieving state-of-the-art reconstruction quality. By
employing a progressive training curriculum, AToken gradually expands from
single images to videos and 3D, and supports both continuous and discrete latent
tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01
rFVD with 32.6% MSRVTT retrieval for videos, and 28.19 PSNR with 90.9%
classification accuracy for 3D. In downstream applications, AToken enables both
visual generation tasks (e.g., image generation with continuous and discrete
tokens, text-to-video generation, image-to-3D synthesis) and understanding
tasks (e.g., multimodal LLMs), achieving competitive performance across all
benchmarks. These results point toward next-generation multimodal AI
systems built upon unified visual tokenization.
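
The abstract describes a pure transformer with 4D rotary position embeddings over a shared 4D latent space. A common way to extend RoPE to multiple axes is to split each token's channel dimension into one group per axis and rotate each group by that axis's coordinate; the sketch below follows that pattern. The exact axis split, frequency base, and coordinate convention used by AToken are not stated in the abstract, so the `rope_1d`/`rope_4d` names, the equal four-way split, and the (t, x, y, z) ordering are illustrative assumptions.

```python
import torch


def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard rotary embedding along one axis.

    x: (n, d) vectors with d even; pos: (n,) integer positions along this axis.
    """
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = pos[:, None].float() * freqs[None, :]                     # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]        # paired channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin       # rotate each (x1, x2) pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def rope_4d(x: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """Apply RoPE over four axes by splitting channels into four groups.

    x: (n, d) query/key vectors, d divisible by 8;
    coords: (n, 4) integer (t, x, y, z) positions, one row per token.
    """
    d = x.shape[-1] // 4
    groups = [rope_1d(x[:, i * d:(i + 1) * d], coords[:, i]) for i in range(4)]
    return torch.cat(groups, dim=-1)
```

Images and videos would populate only a subset of the four axes (e.g., zero for the unused 3D coordinate), which is one way a single position scheme can cover all three modalities.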
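For the adversarial-free objective, the abstract says perceptual and Gram matrix losses replace an adversarial loss. Below is a minimal sketch of a Gram matrix loss computed on feature maps from a frozen perceptual backbone; the choice of backbone, the layers compared, and the loss weighting are assumptions, since the abstract does not specify them.

```python
import torch


def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    """Channel-by-channel correlation of spatial features.

    feats: (B, C, H, W) feature maps from a frozen backbone (e.g., a VGG layer).
    """
    b, c, h, w = feats.shape
    f = feats.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # (B, C, C), normalized


def gram_loss(feats_real: torch.Tensor, feats_recon: torch.Tensor) -> torch.Tensor:
    """Squared distance between Gram matrices of real and reconstructed features.

    Matching feature correlations penalizes texture and statistics mismatches
    without requiring an adversarial discriminator, which avoids GAN training
    instabilities.
    """
    return ((gram_matrix(feats_real) - gram_matrix(feats_recon)) ** 2).mean()
```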
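The abstract also states that AToken supports both continuous and discrete latent tokens but does not say which quantization scheme produces the discrete ones. As a generic illustration only, the sketch below shows nearest-neighbor vector quantization with a straight-through gradient, one standard way to derive discrete tokens from continuous latents; AToken's actual quantizer may differ.

```python
import torch


def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Map continuous latents to discrete token ids via nearest-neighbor lookup.

    z: (n, d) continuous latents; codebook: (K, d) learned code vectors.
    Returns quantized latents (with straight-through gradients) and token ids.
    """
    dists = torch.cdist(z, codebook)      # (n, K) pairwise L2 distances
    ids = dists.argmin(dim=-1)            # (n,) discrete token indices
    z_q = codebook[ids]                   # (n, d) quantized latents
    z_q = z + (z_q - z).detach()          # straight-through estimator
    return z_q, ids
```

With a scheme like this, the same encoder output can feed continuous-latent consumers (e.g., diffusion-style generators) directly or discrete-token consumers (e.g., autoregressive models) via the ids.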