CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
March 6, 2026
Authors: Yitong Chen, Zuxuan Wu, Xipeng Qiu, Yu-Gang Jiang
cs.AI
Abstract
Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization, REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR, and 0.674 SSIM with fewer training epochs, and that the AR model attains performance comparable to leading approaches.
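The abstract names two mechanisms without spelling them out: tying the visible token prefix to the MeanFlow time interval, and a REPA-A-style alignment between encoder features and a frozen vision foundation model. The exact formulations are not given here, so the following is only a minimal numpy sketch under stated assumptions: `tokens_for_interval` assumes the causal prefix length grows with the interval endpoint `t`, and `repa_align_loss` assumes REPA-A is a cosine-similarity alignment; both function names and the interval-to-prefix rule are hypothetical, not the paper's definitions.

```python
import numpy as np

def tokens_for_interval(tokens, r, t):
    """Hypothetical mapping from a MeanFlow time interval (r, t) to a causal
    token prefix: the endpoint t of the interval decides how many leading
    tokens the decoder may condition on, so conditioning is always causal.

    tokens: (n, d) array of 1D latent tokens; r, t: interval in [0, 1], r <= t.
    """
    n = tokens.shape[0]
    # Assumption: prefix length scales with t; at least one token is visible.
    k = max(1, int(np.ceil(t * n)))
    return tokens[:k]

def repa_align_loss(enc_feats, vfm_feats):
    """Hypothetical REPA-A-style regularizer: 1 minus the mean cosine
    similarity between encoder features and frozen VFM features
    (0 when the two feature sets are perfectly aligned)."""
    e = enc_feats / np.linalg.norm(enc_feats, axis=-1, keepdims=True)
    v = vfm_feats / np.linalg.norm(vfm_feats, axis=-1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(e * v, axis=-1)))
```

In this sketch, a training step would sample an interval `(r, t)`, decode the image from `tokens_for_interval(tokens, r, t)` under the MeanFlow objective, and add `repa_align_loss` as a regularizer; the real loss weighting and interval sampling distribution are details the abstract does not specify.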