CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
March 6, 2026
Authors: Yitong Chen, Zuxuan Wu, Xipeng Qiu, Yu-Gang Jiang
cs.AI
Abstract
Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization, REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR, and 0.674 SSIM with fewer training epochs, and that the AR model attains performance comparable to leading approaches.
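The abstract's core mechanism, selecting tokens over time intervals so that the decoder at a given flow time only conditions on a causal prefix of the 1D token sequence, can be sketched minimally. The interval-sampling scheme and the time-to-prefix mapping below are assumptions for illustration (the paper's actual scheduling, token counts, and function names may differ):

```python
import numpy as np

def sample_interval(rng):
    # Sample a MeanFlow-style time interval (r, t) with r <= t.
    # (Assumed uniform scheme; the paper may use a different sampler.)
    t = rng.uniform(0.0, 1.0)
    r = rng.uniform(0.0, t)
    return r, t

def causal_prefix(tokens, t):
    # Hypothetical interval-to-token mapping: at flow time t the decoder
    # may only condition on the first ceil(t * N) tokens. Earlier tokens
    # are therefore pushed to carry coarse, global content, which is what
    # makes the learned 1D representation causal.
    n = len(tokens)
    k = max(1, int(np.ceil(t * n)))
    return tokens[:k]

rng = np.random.default_rng(0)
tokens = list(range(32))            # stand-in for 32 learned 1D tokens
r, t = sample_interval(rng)
visible = causal_prefix(tokens, t)  # conditioning set for this interval
```

Under this mapping, one-step generation (t = 1) sees all tokens, while small t restricts the decoder to the leading tokens, matching the next-token-prediction pattern an AR model expects.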