AToken: A Unified Tokenizer for Vision
September 17, 2025
作者: Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, Yinfei Yang
cs.AI
Abstract
We present AToken, the first unified visual tokenizer that achieves both
high-fidelity reconstruction and semantic understanding across images, videos,
and 3D assets. Unlike existing tokenizers that specialize in either
reconstruction or understanding for single modalities, AToken encodes these
diverse visual inputs into a shared 4D latent space, unifying both tasks and
modalities in a single framework. Specifically, we introduce a pure transformer
architecture with 4D rotary position embeddings to process visual inputs of
arbitrary resolutions and temporal durations. To ensure stable training, we
propose an adversarial-free training objective that combines perceptual and
Gram matrix losses, achieving state-of-the-art reconstruction quality. By
employing a progressive training curriculum, AToken gradually expands from
single images to videos and 3D assets, and supports both continuous and discrete latent
tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01
rFVD with 32.6% MSRVTT retrieval for videos, and 28.19 dB PSNR with 90.9%
classification accuracy for 3D. In downstream applications, AToken enables both
visual generation tasks (e.g., image generation with continuous and discrete
tokens, text-to-video generation, image-to-3D synthesis) and understanding
tasks (e.g., multimodal LLMs), achieving competitive performance across all
benchmarks. These results shed light on the next-generation multimodal AI
systems built upon unified visual tokenization.
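To make the position-encoding idea concrete, below is a minimal PyTorch sketch of a 4D rotary position embedding: the head dimension is split into four groups, and standard 1D RoPE is applied per group using each token's 4D coordinate. The axis ordering (x, y, z, t), equal per-axis split, and frequency base are assumptions for illustration, not the paper's exact parameterization.

```python
# Minimal sketch of a 4D rotary position embedding (RoPE).
# Assumptions (not from the paper): axes ordered (x, y, z, t), equal
# per-axis head-dim split, frequency base 10000.
import torch

def rope_1d(q: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply standard 1D RoPE to q of shape (..., N, D) given positions (N,)."""
    d = q.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (D/2,)
    angles = pos[:, None] * freqs[None, :]                             # (N, D/2)
    cos, sin = angles.cos(), angles.sin()
    q1, q2 = q[..., 0::2], q[..., 1::2]
    out = torch.empty_like(q)
    out[..., 0::2] = q1 * cos - q2 * sin   # rotate each (even, odd) pair
    out[..., 1::2] = q1 * sin + q2 * cos
    return out

def rope_4d(q: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """q: (..., N, D) with D divisible by 8; coords: (N, 4) integer (x, y, z, t).
    Splits the head dimension into four equal groups, one per axis."""
    d = q.shape[-1] // 4
    parts = [rope_1d(q[..., i * d:(i + 1) * d], coords[:, i].float())
             for i in range(4)]
    return torch.cat(parts, dim=-1)

# Usage: 16 tokens with head dim 64 and random (x, y, z, t) grid coordinates.
q = torch.randn(2, 16, 64)             # (batch, tokens, head_dim)
coords = torch.randint(0, 8, (16, 4))  # per-token 4D position
q_rot = rope_4d(q, coords)
```

Because the rotation depends only on each token's own coordinates, the same encoder can in principle process inputs of arbitrary resolution and duration, which is the property the abstract highlights.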
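The adversarial-free objective can be sketched similarly. The snippet below combines a feature-matching (perceptual) term with a Gram-matrix term over frozen VGG-16 activations; the backbone choice, tap layers, and equal weighting are illustrative assumptions, since the abstract does not specify them.

```python
# Minimal sketch of a perceptual + Gram-matrix reconstruction loss.
# Assumptions (not from the paper): frozen VGG-16 features, taps at
# relu1_2/relu2_2/relu3_3/relu4_3, equal weighting of the two terms.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

_vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

_TAPS = (3, 8, 15, 22)  # indices of relu1_2, relu2_2, relu3_3, relu4_3

def _features(x: torch.Tensor) -> list[torch.Tensor]:
    """Collect activations at the assumed tap layers."""
    feats = []
    for i, layer in enumerate(_vgg):
        x = layer(x)
        if i in _TAPS:
            feats.append(x)
        if i == _TAPS[-1]:
            break
    return feats

def gram(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a (B, C, H, W) feature map, normalized by C*H*W."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def recon_loss(recon: torch.Tensor, target: torch.Tensor,
               gram_weight: float = 1.0) -> torch.Tensor:
    """Adversarial-free objective: feature-space L2 (perceptual term)
    plus Gram-matrix L2 (texture term), summed over tap layers."""
    loss = recon.new_zeros(())
    for fr, ft in zip(_features(recon), _features(target)):
        loss = loss + F.mse_loss(fr, ft)
        loss = loss + gram_weight * F.mse_loss(gram(fr), gram(ft))
    return loss

# Usage: compare a reconstruction against its target image batch.
x_hat, x = torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224)
print(recon_loss(x_hat, x))
```

Matching Gram matrices constrains feature correlations (texture statistics) rather than exact spatial layout, which is one way a reconstruction loss can recover sharpness without a GAN discriminator and the training instability it brings.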