CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
March 6, 2026
Authors: Yitong Chen, Zuxuan Wu, Xipeng Qiu, Yu-Gang Jiang
cs.AI
Abstract
Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization, REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR, and 0.674 SSIM with fewer training epochs, and that the AR model attains performance comparable to leading approaches.
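The abstract's core mechanism, selecting tokens over time intervals so that the decoder at a given flow time only conditions on a causal prefix of the 1D token sequence, can be sketched minimally. The interval-sampling scheme and the time-to-prefix mapping below are assumptions for illustration (the paper's actual scheduling, token counts, and function names may differ):

```python
import numpy as np

def sample_interval(rng):
    # Sample a MeanFlow-style time interval (r, t) with r <= t.
    # (Assumed uniform scheme; the paper may use a different sampler.)
    t = rng.uniform(0.0, 1.0)
    r = rng.uniform(0.0, t)
    return r, t

def causal_prefix(tokens, t):
    # Hypothetical interval-to-token mapping: at flow time t the decoder
    # may only condition on the first ceil(t * N) tokens. Earlier tokens
    # are therefore pushed to carry coarse, global content, which is what
    # makes the learned 1D representation causal.
    n = len(tokens)
    k = max(1, int(np.ceil(t * n)))
    return tokens[:k]

rng = np.random.default_rng(0)
tokens = list(range(32))            # stand-in for 32 learned 1D tokens
r, t = sample_interval(rng)
visible = causal_prefix(tokens, t)  # conditioning set for this interval
```

Under this mapping, one-step generation (t = 1) sees all tokens, while small t restricts the decoder to the leading tokens, matching the next-token-prediction pattern an AR model expects.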