AToken: A Unified Tokenizer for Vision

Abstract

We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers, which specialize in either reconstruction or understanding for a single modality, AToken encodes these diverse visual inputs into a shared 4D latent space, unifying both tasks and modalities in a single framework. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolution and temporal duration. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. Through a progressive training curriculum, AToken gradually expands from single images to videos and 3D, and it supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 32.6% MSRVTT retrieval for videos, and 28.19 PSNR with 90.9% classification accuracy for 3D. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving competitive performance across all benchmarks. These results shed light on next-generation multimodal AI systems built upon unified visual tokenization.
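
To make the Gram matrix term of the training objective concrete, the following is a minimal sketch of a generic Gram matrix loss, not the authors' implementation. It assumes each feature map has already been extracted and flattened to C channels by N positions; the function names `gram_matrix` and `gram_loss`, the division by N, and the mean-squared comparison are illustrative assumptions, since the abstract does not specify the feature extractor, normalization, or weighting.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def gram_matrix(features: Sequence[Sequence[float]]) -> Matrix:
    """Channel-by-channel correlations of a feature map.

    `features` holds C rows (channels), each with N values (flattened
    positions). Dividing by N is an assumed normalization.
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n for row_j in features]
        for row_i in features
    ]


def gram_loss(recon_feats: Sequence[Sequence[float]],
              target_feats: Sequence[Sequence[float]]) -> float:
    """Mean squared difference between the two Gram matrices (assumed form)."""
    g_recon = gram_matrix(recon_feats)
    g_target = gram_matrix(target_feats)
    c = len(g_recon)
    total = sum(
        (g_recon[i][j] - g_target[i][j]) ** 2
        for i in range(c)
        for j in range(c)
    )
    return total / (c * c)


if __name__ == "__main__":
    target = [[1.0, 2.0], [3.0, 4.0]]
    # Same values with the two positions swapped in every channel.
    shuffled = [[2.0, 1.0], [4.0, 3.0]]
    print(gram_matrix(target))
    print(gram_loss(shuffled, target))
```

Because the Gram matrix sums over positions, the loss compares channel co-activation statistics rather than exact spatial layout, which is why the position-swapped example yields zero loss. In practice such a term is paired with a pixel-level or perceptual loss that does constrain layout.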