Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
March 19, 2026
Authors: Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu
cs.AI
Abstract
Visual generation with discrete tokens has gained significant attention as it enables a unified token-prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation -- any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at T regardless of feature dimensionality, where T ≪ hwd. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
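The fine-grained masking described above can be illustrated with a toy sketch. This is not the authors' implementation; it only shows the core idea under stated assumptions: discrete tokens form an (h, w, d) grid, and each of the h·w·d scalar entries can be masked independently, rather than masking all d dimensions of a spatial position together. The `MASK` sentinel, the uniform mask ratio, and the vocabulary size are all hypothetical choices for illustration.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for a masked entry

def fine_grained_mask(tokens: np.ndarray, mask_ratio: float, rng) -> np.ndarray:
    """Mask a random subset of all h*w*d discrete entries independently.

    Unlike per-position masking (which hides an entire d-dim token at once),
    any single dimension at any spatial position may be hidden, so the model
    must predict it from partial observations within and across positions.
    """
    flat = tokens.reshape(-1).copy()
    n_mask = int(mask_ratio * flat.size)
    idx = rng.choice(flat.size, size=n_mask, replace=False)
    flat[idx] = MASK
    return flat.reshape(tokens.shape)

rng = np.random.default_rng(0)
# Toy discrete codes: 16x16 spatial grid, d=8 channels, vocab size 1024.
tokens = rng.integers(0, 1024, size=(16, 16, 8))
masked = fine_grained_mask(tokens, mask_ratio=0.5, rng=rng)
print((masked == MASK).sum())  # 1024 of the 16*16*8 = 2048 entries are masked
```

A generation loop would then iterate T such unmasking steps (with a schedule of decreasing mask ratios), which is why the step count stays fixed at T rather than growing with the feature dimensionality d.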