CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Abstract

Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that are misaligned with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens sacrifices causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization, REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR, and 0.674 SSIM with fewer training epochs, and that the AR model attains performance comparable to leading approaches.
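The idea of selecting tokens over time intervals can be pictured with a small sketch. It is illustrative only and not the authors' implementation: the function names `select_token_interval` and `sampling_schedule`, the linear mapping from a time interval [r, t] in [0, 1] to a contiguous range of token indices, and the floor/ceil rounding are all assumptions made here. The abstract does not specify the selection rule.

```python
import math


def select_token_interval(num_tokens, r, t):
    """Return indices of the 1D tokens assumed to cover the time interval [r, t].

    Hypothetical rule (not from the paper): token i is tied to the time
    slice [i / num_tokens, (i + 1) / num_tokens], and an interval selects
    every token whose slice it touches, with at least one token returned.
    """
    if not 0.0 <= r <= t <= 1.0:
        raise ValueError("expected 0 <= r <= t <= 1")
    # Clamp the start so that r == 1.0 still maps to the last token.
    start = min(math.floor(r * num_tokens), num_tokens - 1)
    # Guarantee a non-empty selection when r == t.
    end = max(math.ceil(t * num_tokens), start + 1)
    return list(range(start, end))


def sampling_schedule(num_tokens, num_steps):
    """Split [0, 1] into equal sub-intervals and list the tokens tied to each.

    With num_steps == 1 the single interval [0, 1] uses every token
    (one-step generation); larger values give a multi-step schedule in
    which each step is conditioned on its own token range.
    """
    schedule = []
    for k in range(num_steps):
        r, t = k / num_steps, (k + 1) / num_steps
        schedule.append(((r, t), select_token_interval(num_tokens, r, t)))
    return schedule


if __name__ == "__main__":
    for (r, t), tokens in sampling_schedule(num_tokens=8, num_steps=4):
        print(f"interval [{r:.2f}, {t:.2f}] -> tokens {tokens}")
```

Under these assumptions, a one-step sampler conditions on all tokens at once, while a multi-step sampler walks through the token ranges in order. That is one way to read the claim that different token intervals capture different visual concepts.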
